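This post lists the latest papers collected from arXiv each day. It is updated automatically at about 11:30 every morning and is organized into five areas: NLP, CV, ML, AI, and IR.

Note: to receive the daily paper list by email, leave your email address in the comments. The email is sent automatically at about 11:30 each day.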

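Overview (2024-07-03)

840 papers were updated today, including:

  • Natural language processing: 174 papers (Computation and Language (cs.CL))
  • Computer vision: 180 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial intelligence: 230 papers (Artificial Intelligence (cs.AI))
  • Machine learning: 277 papers (Machine Learning (cs.LG))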

Natural Language Processing

[NLP-0] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Link: https://arxiv.org/abs/2407.01527
Authors: Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu
Keywords: digest long-form texts, large language models, long-form texts, crucial competency, competency for large
Categories: Computation and Language (cs.CL)

Abstract:Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs; where multiple schools of efficiency-driven approaches – such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures – have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights – as well as a friendly workbench – for the future development of long context-capable LLMs. The source code will be available at this https URL

[NLP-1] Empowering 3D Visual Grounding with Reasoning Capabilities

Link: https://arxiv.org/abs/2407.01525
Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
Keywords: explicit textual descriptions, reason human intentions, Large Language Model, Multi-modal Large Language, implicit instructions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ECCV24. A comprehensive and hierarchical 3D reasoning grounding benchmark in the era of foundation models. Project page: this https URL

Abstract:Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs from five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back to the enhanced geometry and fine-grained details from the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.

[NLP-2] MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Link: https://arxiv.org/abs/2407.01523
Authors: Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun
Keywords: long-standing and practical, practical task, Recent Large Vision-Language, Large Vision-Language Models, single-page document understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: this https URL

[NLP-3] MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Link: https://arxiv.org/abs/2407.01509
Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
Keywords: large language models, evaluate multimodal large, multimodal large language, introduce MIA-Bench, language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models’ ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

[NLP-4] Self-Cognition in Large Language Models: An Exploratory Study

Link: https://arxiv.org/abs/2407.01505
Authors: Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Keywords: Large Language Models, Large Language, achieved remarkable success, Language Models, self-cognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2024 Large Language Models and Cognition Workshop

Abstract:While Large Language Models (LLMs) have achieved remarkable success across various applications, they also raise concerns regarding self-cognition. In this paper, we perform a pioneering study to explore self-cognition in LLMs. Specifically, we first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition and four well-designed principles to quantify LLMs’ self-cognition. Our study reveals that 4 of the 48 models on Chatbot Arena–specifically Command R, Claude3-Opus, Llama-3-70b-Instruct, and Reka-core–demonstrate some level of detectable self-cognition. We observe a positive correlation between model size, training data quality, and self-cognition level. Additionally, we also explore the utility and trustworthiness of LLM in the self-cognition state, revealing that the self-cognition state enhances some specific tasks such as creative writing and exaggeration. We believe that our work can serve as an inspiration for further research to study the self-cognition in LLMs.

[NLP-5] RegMix: Data Mixture as Regression for Language Model Pre-training

Link: https://arxiv.org/abs/2407.01492
Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
Keywords: large language model, language model pre-training, mixture remains unclear, effective mixture remains, data mixture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at this https URL.
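
The regression step described in the RegMix abstract can be sketched in a few lines. This is only an illustration, not the authors' code: the three domains, the synthetic losses, and the choice of ordinary least squares via `numpy.linalg.lstsq` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 data domains and 64 small proxy training runs.
n_domains, n_runs = 3, 64

# Each row is one training mixture: domain weights that sum to 1.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_runs)

# Stand-in for the measured validation loss of each proxy model.
# In practice these values come from actually training the small models.
assumed_domain_effect = np.array([2.0, 3.0, 2.5])
losses = mixtures @ assumed_domain_effect + rng.normal(0.0, 0.01, size=n_runs)

# Fit a linear regression that predicts loss from the mixture weights.
# No intercept is needed because the weights always sum to 1.
coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)

# Simulate many candidate mixtures and keep the one with the lowest predicted loss.
candidates = rng.dirichlet(np.ones(n_domains), size=10_000)
predicted = candidates @ coef
best_mixture = candidates[np.argmin(predicted)]

print("fitted coefficients:", coef)
print("predicted best mixture:", best_mixture)
```

In a real run, `best_mixture` would then be used to train the large model. The point of the sketch is that the search over mixtures happens on the cheap regression model rather than on additional training runs.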

[NLP-6] Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Link: https://arxiv.org/abs/2407.01491
Authors: Siwei Li, Yifan Yang, Yifei Shen, Fangyun Wei, Zongqing Lu, Lili Qiu, Yuqing Yang
Keywords: Efficient fine-tuning plays, Efficient fine-tuning, low-rank adaptation emerging, modern large models, fine-tuning plays
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA’s expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model’s ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. The extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be released at this https URL very soon.

[NLP-7] LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Link: https://arxiv.org/abs/2407.01490
Authors: Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker
Keywords: synthetic data, synthetic data raises, data, synthetic, widespread adoption
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to-date of how the source of synthetic data shapes models’ internal biases, calibration and generations’ textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear “neutral”, which invites the question whether this sensitivity can be exploited for good. Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvement in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse way of instructions, means this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.

[NLP-8] Agentless: Demystifying LLM-based Software Engineering Agents

Link: https://arxiv.org/abs/2407.01489
Authors: Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang
Keywords: including code synthesis, large language models, software development tasks, Recent advancements, software development
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic two-phase process of localization followed by repair, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (27.33%) and lowest cost ($0.34) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

[NLP-9] Tree Search for Language Model Agents

Link: https://arxiv.org/abs/2407.01476
Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
Keywords: Autonomous agents powered, Autonomous agents, perform decision-making tasks, demonstrated promise, search
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages. Models and code available at this https URL

Abstract:Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at this https URL.
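
The best-first tree search mentioned in the abstract can be illustrated on a toy problem. The sketch below is a generic best-first search using Python's `heapq`, not the paper's implementation: the integer "environment", the `expand`, `value`, and `is_goal` callbacks, and the `budget` parameter are hypothetical stand-ins for a web environment and a model-based value estimate.

```python
import heapq

def best_first_search(start, expand, value, is_goal, budget=50):
    """Repeatedly expand the highest-value frontier state until a goal or the budget is hit."""
    counter = 0  # unique tie-breaker so the heap never compares states directly
    frontier = [(-value(start), counter, start, [start])]
    best_score, best_path = value(start), [start]
    while frontier and budget > 0:
        neg_score, _, state, path = heapq.heappop(frontier)
        budget -= 1
        if -neg_score > best_score:
            best_score, best_path = -neg_score, path
        if is_goal(state):
            return path
        for child in expand(state):
            counter += 1
            heapq.heappush(frontier, (-value(child), counter, child, path + [child]))
    # Budget exhausted: fall back to the best-scoring path seen so far.
    return best_path

# Toy environment (an assumption for illustration): reach 10 from 1
# using the actions "add one" and "double".
path = best_first_search(
    start=1,
    expand=lambda n: [n + 1, n * 2],
    value=lambda n: -abs(10 - n),  # closer to 10 is better
    is_goal=lambda n: n == 10,
)
print(path)  # [1, 2, 4, 8, 9, 10]
```

The `budget` argument caps how many states are expanded, which loosely mirrors the abstract's observation that performance scales with test-time compute: a larger budget lets the search explore more of the tree.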

[NLP-10] DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Link: https://arxiv.org/abs/2407.01470
Authors: Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
Keywords: aligning large language, Reinforcement learning, large language models, human feedback, desired behaviors
Categories: Computation and Language (cs.CL)
Comments: Preprint. Code will be released after the review results

Abstract:Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks and provide a detailed analysis showcasing the effects of model merging, showing the great potential of facilitating model alignment.
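
The abstract does not say which merging rule DogeRM uses, so the sketch below assumes the simplest one: linear interpolation of parameters that the two models share. The function name `merge_parameters`, the parameter names, and the `alpha` weight are hypothetical and chosen only for illustration.

```python
import numpy as np

def merge_parameters(reward_params, domain_params, alpha=0.5):
    """Interpolate shared weights; keep reward-only weights (e.g. the reward head) unchanged."""
    merged = {}
    for name, w_reward in reward_params.items():
        w_domain = domain_params.get(name)
        if w_domain is None or w_domain.shape != w_reward.shape:
            # No matching counterpart in the domain model: keep the reward model's weights.
            merged[name] = w_reward.copy()
        else:
            merged[name] = (1.0 - alpha) * w_reward + alpha * w_domain
    return merged

# Toy parameters standing in for real model checkpoints.
reward = {
    "layer.weight": np.array([1.0, 3.0]),
    "reward_head.weight": np.array([0.5]),
}
domain = {"layer.weight": np.array([3.0, 5.0])}

merged = merge_parameters(reward, domain, alpha=0.5)
print(merged["layer.weight"])        # [2. 4.]
print(merged["reward_head.weight"])  # [0.5]
```

The appeal of merging in this setting is that no extra preference data is needed: domain knowledge enters through the weights of an already fine-tuned model rather than through new expert annotations.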

[NLP-11] Retrieval-augmented generation in multilingual settings

Link: https://arxiv.org/abs/2407.01463
Authors: Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina
Keywords: improving LLM factuality, large language models, studied in English-only, LLM factuality, English-only settings
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components and with which adjustments are needed to build a well-performing mRAG pipeline, that can be used as a strong baseline in future works. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for multilingual setting, to account for variations in spelling named entities. The main limitations to be addressed in future works include frequent code-switching in non-Latin alphabet languages, occasional fluency errors, wrong reading of the provided documents, or irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at this https URL.

[NLP-12] Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement

Link: https://arxiv.org/abs/2407.01461
Authors: Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
Keywords: large language models, helpful responses heavily, responses heavily relies, capacity of large, large language
Categories: Computation and Language (cs.CL)

Abstract:The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. However, these prompts often tend to be brief and vague, thereby significantly limiting the full potential of LLMs. Moreover, harmful prompts can be meticulously crafted and manipulated by adversaries to jailbreak LLMs, inducing them to produce potentially toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against harmful jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: this https URL .

[NLP-13] TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models' Theory-of-Mind

Link: https://arxiv.org/abs/2407.01455
Authors: Guiyang Hou, Wenqi Zhang, Yongliang Shen, Linjuan Wu, Weiming Lu
Keywords: Theory of Mind, Large Language Models, advanced Large Language, ToM, Large Language
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 6 figures, ACL 2024 (Findings)

Abstract:Theory of Mind (ToM), the cognitive ability to reason about the mental states of ourselves and others, is the foundation of social interaction. Although ToM comes naturally to humans, it poses a significant challenge to even the most advanced Large Language Models (LLMs). Due to the complex logical chains in ToM reasoning, especially in higher-order ToM questions, simply utilizing reasoning methods like Chain of Thought (CoT) will not improve the ToM capabilities of LLMs. We present TimeToM, which constructs a temporal space and uses it as the foundation to improve the ToM capabilities of LLMs in multiple scenarios. Specifically, within the temporal space, we construct a Temporal Belief State Chain (TBSC) for each character and, inspired by the cognition perspective of the social world model, we divide TBSC into self-world beliefs and social world beliefs, aligning with first-order ToM (first-order beliefs) and higher-order ToM (higher-order beliefs) questions, respectively. Moreover, we design a novel tool-belief solver that, by considering belief communication between characters in temporal space, can transform a character’s higher-order beliefs into another character’s first-order beliefs under the belief communication period. Experimental results indicate that TimeToM can dramatically improve the reasoning performance of LLMs on ToM questions while taking a big step towards coherent and robust ToM reasoning.

[NLP-14] ColPali: Efficient Document Retrieval with Vision Language Models

链接: https://arxiv.org/abs/2407.01449
作者: Manuel Faysse,Hugues Sibille,Tony Wu,Gautier Viaud,Céline Hudelot,Pierre Colombo
关键词: Retrieval Augmented Generation, document retrieval, visually rich structures, information through text, modern document retrieval
中文关键词: 检索增强生成、文档检索、视觉丰富的结构、文本信息、现代文档检索
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

Abstract:Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
摘要:文档是一种视觉丰富的结构,它通过文本以及表格、插图、页面布局或字体来传达信息。虽然现代文档检索系统在查询到文本匹配方面表现出很强的性能,但它们难以有效地利用视觉线索,这阻碍了它们在检索增强生成等实际文档检索应用中的性能。为了对当前系统的视觉丰富文档检索能力进行基准测试,我们引入了视觉文档检索基准ViDoRe,它由跨越多个领域、语言和设置的各种页面级检索任务组成。现代系统的固有缺陷促使我们引入一种新的检索模型架构ColPali,它利用最新的视觉语言模型的文档理解能力,仅从文档页面的图像生成高质量的上下文嵌入。与后期交互匹配机制相结合,ColPali大幅超越了现代文档检索流水线,同时速度快得多,并且可以端到端训练。
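
The "late interaction matching mechanism" mentioned in the abstract can be illustrated with a small sketch. It assumes a ColBERT-style MaxSim operator (the abstract does not spell out the exact scoring rule), and the function names and toy 2-d embeddings are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """MaxSim-style score (an assumption): for each query token embedding,
    take its best-matching page patch embedding, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page names sorted from highest to lowest score."""
    scored = [(late_interaction_score(query_vecs, vecs), name) for name, vecs in pages.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy embeddings (hypothetical values): two query tokens, two patches per page.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8]],
    "page_b": [[0.5, 0.5], [0.4, 0.4]],
}

print(rank_pages(query, pages))
```

Because every query token is matched against every patch independently, page embeddings can be computed offline and only the cheap max-and-sum step runs at query time.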

[NLP-15] Needle in the Haystack for Memory Based Large Language Models
[NLP-15] 基于记忆的大型语言模型的大海捞针

链接: https://arxiv.org/abs/2407.01437
作者: Subhajit Chaudhury,Soham Dan,Payel Das,Georgios Kollias,Elliot Nelson
关键词: augmented Large Language, Large Language Model, Large Language, memory augmented Large, augmented Large
中文关键词: 增强的大型语言、大型语言模型、大型语言、内存增强的大型、增强的大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages

Abstract:In this paper, we demonstrate the benefits of using memory augmented Large Language Model (LLM) architecture in improving the recall abilities of facts from a potentially long context. As a case study we test LARIMAR, a recently proposed LLM architecture which augments a LLM decoder with an external associative memory, on several long-context recall tasks, including passkey and needle-in-the-haystack tests. We demonstrate that the external memory can be adapted at test time to handle contexts much longer than those seen during training, while keeping readouts from the memory recognizable to the trained decoder and without increasing GPU memory footprint. Compared to alternative architectures for long-context recall tasks with models of a comparable parameter count, LARIMAR is able to maintain strong performance without any task-specific training.
摘要:在本文中,我们展示了使用记忆增强的大型语言模型(LLM)架构在提高对潜在长上下文中事实的回忆能力方面的好处。作为案例研究,我们在几项长上下文回忆任务(包括passkey测试和大海捞针测试)上测试了LARIMAR,这是一种最近提出的LLM架构,它用外部联想记忆增强了LLM解码器。我们证明,外部记忆可以在测试时进行调整,以处理比训练期间所见长得多的上下文,同时保持记忆的读出可被训练好的解码器识别,并且不会增加GPU的内存占用。与参数量相当的用于长上下文回忆任务的替代架构相比,LARIMAR能够在无需任何特定任务训练的情况下保持强劲的性能。

[NLP-16] A Global-Local Attention Mechanism for Relation Classification
[NLP-16] 关系分类的全局-局部注意力机制

链接: https://arxiv.org/abs/2407.01424
作者: Yiping Sun
关键词: involves identifying connections, Relation classification, involves identifying, crucial component, identifying connections
中文关键词: 涉及识别联系,关系分类,涉及识别,关键组件,识别联系
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: This paper has been accepted by the 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)

Abstract:Relation classification, a crucial component of relation extraction, involves identifying connections between two entities. Previous studies have predominantly focused on integrating the attention mechanism into relation classification at a global scale, overlooking the importance of the local context. To address this gap, this paper introduces a novel global-local attention mechanism for relation classification, which enhances global attention with a localized focus. Additionally, we propose innovative hard and soft localization mechanisms to identify potential keywords for local attention. By incorporating both hard and soft localization strategies, our approach offers a more nuanced and comprehensive understanding of the contextual cues that contribute to effective relation classification. Our experimental results on the SemEval-2010 Task 8 dataset highlight the superior performance of our method compared to previous attention-based approaches in relation classification.
摘要:关系分类是关系抽取的重要组成部分,涉及识别两个实体之间的联系。之前的研究主要集中在将注意力机制以全局方式整合到关系分类中,而忽视了局部上下文的重要性。为了弥补这一不足,本文引入了一种用于关系分类的新型全局-局部注意力机制,该机制通过局部聚焦来增强全局注意力。此外,我们还提出了创新的硬定位和软定位机制,用于识别局部注意力的潜在关键词。通过结合硬定位和软定位策略,我们的方法对有助于有效关系分类的上下文线索提供了更细致和全面的理解。我们在SemEval-2010 Task 8数据集上的实验结果表明,与之前基于注意力的关系分类方法相比,我们的方法具有更好的性能。

[NLP-17] HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into Multi-Task Transformers for Sequence Labelling
[NLP-17] HyperLoader:将基于超网络的LoRA和适配器层集成到多任务Transformer中以进行序列标注

链接: https://arxiv.org/abs/2407.01411
作者: Jesus-German Ortiz-Barajas,Helena Gomez-Adorno,Thamar Solorio
关键词: parameter-efficient fine-tuning methods, simple approach, multi-task setting, parameter-efficient fine-tuning, fine-tuning methods
中文关键词: 参数高效的微调方法、简单方法、多任务设置、参数高效的微调、微调方法
类目: Computation and Language (cs.CL)
备注:

Abstract:We present HyperLoader, a simple approach that combines different parameter-efficient fine-tuning methods in a multi-task setting. To achieve this goal, our model uses a hypernetwork to generate the weights of these modules based on the task, the transformer layer, and its position within this layer. Our method combines the benefits of multi-task learning by capturing the structure of all tasks while reducing the task interference problem by encapsulating the task-specific knowledge in the generated weights and the benefits of combining different parameter-efficient methods to outperform full fine-tuning. We provide empirical evidence that HyperLoader outperforms previous approaches in most datasets and obtains the best average performance across tasks in high-resource and low-resource scenarios.
摘要:我们介绍了HyperLoader,这是一种简单的方法,在多任务设置中结合了不同的参数高效微调方法。为了实现这一目标,我们的模型使用超网络根据任务、Transformer层及其在该层中的位置生成这些模块的权重。我们的方法结合了多任务学习的好处,通过捕获所有任务的结构,同时通过将任务特定知识封装在生成的权重中来减少任务干扰问题,以及结合不同参数高效方法以优于全量微调的好处。我们提供的经验证据表明,HyperLoader在大多数数据集中优于以前的方法,并在高资源和低资源场景中的任务中获得最佳平均性能。

[NLP-18] Dynamic Few-Shot Learning for Knowledge Graph Question Answering
[NLP-18] 面向知识图谱问答的动态少样本学习

链接: https://arxiv.org/abs/2407.01409
作者: Jacopo D’Abramo,Andrea Zugarini,Paolo Torroni
关键词: innovative Question Answering, Large language models, Knowledge Graphs, Question Answering, Answering over Knowledge
中文关键词: 创新的问题解答、大型语言模型、知识图、问题解答、知识解答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large language models present opportunities for innovative Question Answering over Knowledge Graphs (KGQA). However, they are not inherently designed for query generation. To bridge this gap, solutions have been proposed that rely on fine-tuning or ad-hoc architectures, achieving good results but limited out-of-domain distribution generalization. In this study, we introduce a novel approach called Dynamic Few-Shot Learning (DFSL). DFSL integrates the efficiency of in-context learning and semantic similarity and provides a generally applicable solution for KGQA with state-of-the-art performance. We run an extensive evaluation across multiple benchmark datasets and architecture configurations.
摘要:大型语言模型为创新的知识图谱问答(KGQA)提供了机会。然而,它们本质上并不是为查询生成而设计的。为了弥合这一差距,人们提出了依赖于微调或临时架构的解决方案,这些方案取得了良好的结果,但域外泛化能力有限。在这项研究中,我们引入了一种名为动态少样本学习(DFSL)的新颖方法。DFSL结合了上下文学习的效率和语义相似性,为KGQA提供了具有最先进性能的普遍适用的解决方案。我们在多个基准数据集和架构配置上进行了广泛的评估。
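
One way to read "in-context learning plus semantic similarity" is that the few-shot examples are chosen per question rather than fixed. The sketch below illustrates that reading only; it is not the authors' implementation. A bag-of-words cosine stands in for a real sentence-embedding model, and the example pool, the SPARQL-like strings, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, str]  # (natural-language question, target query)


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a semantic embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def select_examples(question: str, pool: List[Example], k: int) -> List[Example]:
    """Pick the k pool examples whose questions are most similar to the input."""
    q = bow(question)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)[:k]


def build_prompt(question: str, pool: List[Example], k: int = 2) -> str:
    """Assemble a few-shot prompt from the dynamically selected examples."""
    shots = [f"Question: {q}\nQuery: {s}" for q, s in select_examples(question, pool, k)]
    shots.append(f"Question: {question}\nQuery:")
    return "\n\n".join(shots)


# Hypothetical example pool with made-up query strings.
POOL = [
    ("who directed Inception", "SELECT ?d WHERE { :Inception :director ?d }"),
    ("what is the capital of France", "SELECT ?c WHERE { :France :capital ?c }"),
    ("who directed Titanic", "SELECT ?d WHERE { :Titanic :director ?d }"),
]

print(build_prompt("who directed Avatar", POOL))
```

The prompt would then be sent to an LLM; that call is omitted here because the abstract does not name a specific model or API.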

[NLP-19] Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters
[NLP-19] 通过适配器使用知识图将多语言LLM适应低资源语言

链接: https://arxiv.org/abs/2407.01406
作者: Daniil Gurgurov,Mareike Hartmann,Simon Ostermann
关键词: named entity recognition, Large Language Models, multilingual Large Language, multilingual Large, Large Language
中文关键词: 命名实体识别、大型语言模型、多语言大型语言、多语言大型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, KaLLM workshop

Abstract:This paper explores the integration of graph knowledge from linguistic ontologies into multilingual Large Language Models (LLMs) using adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER). Building upon successful parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we propose a similar approach for incorporating knowledge from multilingual graphs, connecting concepts in various languages with each other through linguistic relationships, into multilingual LLMs for LRLs. Specifically, we focus on eight LRLs – Maltese, Bulgarian, Indonesian, Nepali, Javanese, Uyghur, Tibetan, and Sinhala – and employ language-specific adapters fine-tuned on data extracted from the language-specific section of ConceptNet, aiming to enable knowledge transfer across the languages covered by the knowledge graph. We compare various fine-tuning objectives, including standard Masked Language Modeling (MLM), MLM with full-word masking, and MLM with targeted masking, to analyse their effectiveness in learning and integrating the extracted graph data. Through empirical evaluation on language-specific tasks, we assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER, providing insights into the potential benefits of adapting language models for low-resource scenarios.
摘要:本文探讨了使用适配器将语言本体中的图知识集成到多语言大型语言模型(LLM)中,以提高低资源语言(LRL)在情感分析(SA)和命名实体识别(NER)中的性能。在K-ADAPTER和MAD-X等成功的参数高效微调技术的基础上,我们提出了一种类似的方法,将来自多语言图(通过语言关系将不同语言中的概念相互连接)的知识整合到面向LRL的多语言LLM中。具体地说,我们专注于八种LRL–马耳他语、保加利亚语、印度尼西亚语、尼泊尔语、爪哇语、维吾尔语、藏语和僧伽罗语–并使用特定于语言的适配器,这些适配器在从ConceptNet的特定语言部分提取的数据上进行微调,旨在实现知识图谱涵盖的语言之间的知识迁移。我们比较了各种微调目标,包括标准掩码语言建模(MLM)、全词掩码的MLM和目标掩码的MLM,以分析它们在学习和整合提取的图数据方面的有效性。通过对特定语言任务的实证评估,我们评估了结构化图知识如何影响面向LRL的多语言LLM在SA和NER中的性能,从而深入了解为低资源场景适配语言模型的潜在益处。

[NLP-20] Optimization of Retrieval-Augmented Generation Context with Outlier Detection
[NLP-20] 利用离群点检测优化检索增强生成上下文

链接: https://arxiv.org/abs/2407.01403
作者: Vitaly Bulgakov
关键词: prompt context required, Large Language Model, retrieved LLM responses, question-answering systems, reduce the size
中文关键词: 需要提示上下文、大型语言模型、检索的LLM回复、问答系统、缩小规模
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Attempts to increase the number of retrieved chunked documents and thereby enlarge the context related to the query can significantly complicate the processing and decrease the performance of a Large Language Model (LLM) when generating responses to queries. It is well known that a large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features that utilize the distances of embedding vectors, retrieved from the vector database, to both the centroid and the query vectors. The methods were evaluated by comparing the similarities of the retrieved LLM responses to ground-truth answers obtained using the OpenAI GPT-4o model. It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
摘要:在本文中,我们重点研究了减少问答系统所需提示上下文的大小并提高其质量的方法。尝试增加检索到的分块文档的数量从而扩大与查询相关的上下文,可能会使处理显著复杂化,并降低大型语言模型(LLM)在生成查询响应时的性能。众所周知,响应查询而从数据库检索的大量文档可能包含不相关的信息,这通常会导致结果答案中出现幻觉。我们的目标是选择语义最相关的文档,将被丢弃的文档视为离群值。我们提出并评估了几种通过构造特征来识别离群值的方法,这些特征利用从向量数据库中检索到的嵌入向量到质心和查询向量的距离。通过比较检索到的LLM响应与使用OpenAI GPT-4o模型获得的标准答案(ground truth)的相似性来评估这些方法。研究发现,问题和答案的复杂性越高,改进幅度越大。
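
The idea of building features from the distances of each retrieved embedding to the centroid and to the query can be sketched as follows. The abstract mentions several detection methods without detailing them, so the specific feature (sum of the two Euclidean distances) and the mean-plus-z-standard-deviations threshold used here are assumptions made for illustration, as are the toy vectors.

```python
import math
from statistics import mean, pstdev
from typing import List

Vector = List[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of the retrieved document embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def filter_outliers(query_vec: Vector, doc_vecs: List[Vector], z: float = 1.0) -> List[int]:
    """Return indices of documents kept for the prompt context.

    Feature (assumed): distance to centroid + distance to query.
    Rule (assumed): drop a document if its feature exceeds mean + z * std.
    """
    c = centroid(doc_vecs)
    feats = [euclidean(d, c) + euclidean(d, query_vec) for d in doc_vecs]
    threshold = mean(feats) + z * pstdev(feats)
    return [i for i, f in enumerate(feats) if f <= threshold]


# Toy 2-d embeddings (hypothetical): three chunks near the query, one far away.
query = [0.0, 0.0]
docs = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]]

print(filter_outliers(query, docs))
```

Only the kept indices would be passed on as context, which shrinks the prompt while discarding the chunk least related to both the query and the other retrieved chunks.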

[NLP-21] Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing
[NLP-21] Gloss2Text:使用LLM和语义感知标签平滑的手语Gloss翻译

链接: https://arxiv.org/abs/2407.01394
作者: Pooya Fayyazsanavi,Antonios Anastasopoulos,Jana Košecká
关键词: spoken text presents, text presents unique, presents unique challenges, unique challenges owing, expression nuances
中文关键词: 口语文本呈现,文本呈现独特,呈现独特的挑战,独特的挑战,表达的细微差别
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function exploiting gloss translation ambiguities, significantly improving the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.
摘要:从视频到口语文本的手语翻译由于不同的语法、表达细微差别以及不同说话者和上下文之间视觉外观的高度差异而带来了独特的挑战。视频的中间gloss标注旨在指导翻译过程。在我们的工作中,我们重点关注Gloss2Text翻译阶段,并通过利用预训练的大型语言模型(LLM)、数据增强和利用gloss翻译歧义的新型标签平滑损失函数提出了几项改进,显著提高了最先进方法的性能。通过在PHOENIX Weather 2014T数据集上的广泛实验和消融研究,我们的方法超越了Gloss2Text翻译中的最新性能,表明其在解决手语翻译问题方面的功效,并为未来的研究和开发指出了有希望的方向。

[NLP-22] POLygraph: Polish Fake News Dataset
[NLP-22] POLygraph:波兰假新闻数据集

链接: https://arxiv.org/abs/2407.01393
作者: Daniel Dzienisiewicz,Filip Graliński,Piotr Jabłoński,Marek Kubis,Paweł Skórzewski,Piotr Wierzchoń
关键词: fake news detection, dataset, unique resource, detection in Polish, Polish
中文关键词: 假新闻检测、数据集、独特资源、波兰语检测、波兰语
类目: Computation and Language (cs.CL)
备注: 14 pages, 1 figure, accepted to the 14th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA’24)

Abstract:This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the “fake-or-not” dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the “fake-they-say” dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. Unlike existing datasets, POLygraph encompasses a variety of approaches from source literature, providing a comprehensive resource for fake news detection. The data was collected through manual annotation by expert and non-expert annotators. The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity. The tool and dataset are expected to benefit various entities, from public sector institutions to publishers and fact-checking organizations. Further dataset exploration will foster fake news detection and potentially stimulate the implementation of similar models in other languages. The paper focuses on the creation and composition of the dataset, so it does not include a detailed evaluation of the software tool for content authenticity analysis, which is planned at a later stage of the project.
摘要:本文介绍了POLygraph数据集,这是一个用于波兰语假新闻检测的独特资源。该数据集由一个跨学科团队创建,由两部分组成:包含11,360对新闻文章(通过其URL标识)及相应标签的“fake-or-not”数据集,以及包含5,082篇新闻文章(通过其URL标识)和评论这些文章的推文的“fake-they-say”数据集。与现有的数据集不同,POLygraph涵盖了来自原始文献的多种方法,为假新闻检测提供了全面的资源。这些数据是由专家和非专家标注员通过人工标注收集的。该项目还开发了一个软件工具,使用先进的机器学习技术来分析数据并判断内容的真实性。该工具和数据集预计将使各类实体受益,从公共部门机构到出版商和事实核查组织。对数据集的进一步探索将促进假新闻检测,并可能推动在其他语言中实施类似的模型。本文侧重于数据集的创建和组成,因此不包括对内容真实性分析软件工具的详细评估,该评估计划在项目的后期阶段进行。

[NLP-23] Free-text Rationale Generation under Readability Level Control
[NLP-23] 可读性水平控制下的自由文本理由生成

链接: https://arxiv.org/abs/2407.01384
作者: Yi-Sheng Hsu,Nils Feldhus,Sherzod Hakimov
关键词: Free-text rationales justify, justify model decisions, Free-text rationales, likable and accessible, accessible among approaches
中文关键词: 自由文本理由证明,证明模型决策,自由文本理由,可爱且易于理解,在方法中易于理解
类目: Computation and Language (cs.CL)
备注:

Abstract:Free-text rationales justify model decisions in natural language and thus become likable and accessible among approaches to explanation across many tasks. However, their effectiveness can be hindered by misinterpretation and hallucination. As a perturbation test, we investigate how large language models (LLMs) perform the task of natural language explanation (NLE) under the effects of readability level control, i.e., being prompted for a rationale targeting a specific expertise level, such as sixth grade or college. We find that explanations are adaptable to such instruction, but the requested readability is often misaligned with the measured text complexity according to traditional readability metrics. Furthermore, the quality assessment shows that LLMs’ ratings of rationales across text complexity exhibit a similar pattern of preference as observed in natural language generation (NLG). Finally, our human evaluation suggests a generally satisfactory impression on rationales at all readability levels, with high-school-level readability being most commonly perceived and favored.
摘要:自由文本理由用自然语言为模型决策提供依据,因此在许多任务的解释方法中受欢迎且易于理解。然而,它们的有效性可能会受到误解和幻觉的阻碍。作为一项扰动测试,我们考察了在可读性水平控制的影响下,大型语言模型(LLM)如何执行自然语言解释(NLE)任务,即提示模型针对特定专业水平(如六年级或大学)给出理由。我们发现,解释能够适应这样的指令,但根据传统的可读性度量,所要求的可读性往往与测得的文本复杂度不一致。此外,质量评估表明,LLM对不同文本复杂度的理由的评分显示出与自然语言生成(NLG)中观察到的相似的偏好模式。最后,我们的人工评估表明,人们对所有可读性水平的理由的印象总体上令人满意,其中高中水平的可读性最常被感知到,也最受青睐。

[NLP-24] Badllama 3: removing safety finetuning from Llama 3 in minutes
[NLP-24] Badllama 3:几分钟内去除Llama 3的安全微调

链接: https://arxiv.org/abs/2407.01376
作者: Dmitrii Volkov
关键词: extensive LLM safety, LLM safety fine-tuning, extensive LLM, model weights, LLM safety
中文关键词: 广泛的LLM安全性、LLM安全微调、广泛的LLM、模型权重、LLM安全性
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

Abstract:We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.
摘要:我们表明,当攻击者能够访问模型权重时,广泛的LLM安全微调很容易被颠覆。我们评估了三种最先进的微调方法(QLoRA、ReFT和Ortho),并展示了算法进步如何在减少FLOPs和优化开销的同时保持越狱性能不变。我们在单个GPU上用1分钟去除了Llama 3 8B的安全微调,用30分钟去除了Llama 3 70B的安全微调,并概述了进一步缩短这一时间的方法。

[NLP-25] Bridging the Gap: Transfer Learning from English PLMs to Malaysian English
[NLP-25] 弥合差距:从英语PLM到马来西亚英语的迁移学习

链接: https://arxiv.org/abs/2407.01374
作者: Mohan Raj Chanthran,Lay-Ki Soon,Huey Fang Ong,Bhawani Selvaretnam
关键词: Malaysian English, Standard English, addition to Standard, Malaysian English text, English
中文关键词: 马来西亚英语,标准英语,标准英语的补充,马来西亚英语文本,英语
类目: Computation and Language (cs.CL)
备注: Accepted in 9th Workshop on Representation Learning for NLP (Rep4NLP) at ACL 2024

Abstract:Malaysian English is a low resource creole language, where it carries the elements of Malay, Chinese, and Tamil languages, in addition to Standard English. Named Entity Recognition (NER) models underperform when capturing entities from Malaysian English text due to its distinctive morphosyntactic adaptations, semantic features and code-switching (mixing English and Malay). Considering these gaps, we introduce MENmBERT and MENBERT, a pre-trained language model with contextual understanding, specifically tailored for Malaysian English. We have fine-tuned MENmBERT and MENBERT using manually annotated entities and relations from the Malaysian English News Article (MEN) Dataset. This fine-tuning process allows the PLM to learn representations that capture the nuances of Malaysian English relevant for NER and RE tasks. MENmBERT achieved a 1.52% and 26.27% improvement on NER and RE tasks respectively compared to the bert-base-multilingual-cased model. Although the overall performance of NER does not have a significant improvement, our further analysis shows that there is a significant improvement when evaluated by the 12 entity labels. These findings suggest that pre-training language models on language-specific and geographically-focused corpora can be a promising approach for improving NER performance in low-resource settings. The dataset and code published in this paper provide valuable resources for NLP research work focusing on Malaysian English.
摘要:马来西亚英语是一种低资源的克里奥尔语言,除了标准英语外,它还包含马来语、华语和泰米尔语的元素。命名实体识别(NER)模型在从马来西亚英语文本中捕获实体时表现不佳,这是由于其独特的形态句法适应、语义特征和语码转换(混合英语和马来语)。考虑到这些差距,我们引入了MENmBERT和MENBERT,这是专门为马来西亚英语量身定做的具有上下文理解能力的预训练语言模型。我们使用马来西亚英语新闻文章(MEN)数据集中人工标注的实体和关系对MENmBERT和MENBERT进行了微调。这一微调过程使PLM能够学习到捕捉与NER和RE任务相关的马来西亚英语细微差别的表示。与bert-base-multilingual-cased模型相比,MENmBERT在NER和RE任务上分别提高了1.52%和26.27%。虽然NER的整体性能没有明显的提高,但我们进一步的分析表明,按12个实体标签分别评估时有显著的提高。这些发现表明,在特定语言和特定地域的语料库上预训练语言模型,可以成为在低资源环境下提高NER性能的一种很有前途的方法。本文发布的数据集和代码为以马来西亚英语为重点的自然语言处理研究工作提供了宝贵的资源。

[NLP-26] Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
[NLP-26] 干草堆摘要:对长上下文LLM和RAG系统的挑战

链接: https://arxiv.org/abs/2407.01370
作者: Philippe Laban,Alexander R. Fabbri,Caiming Xiong,Chien-Sheng Wu
关键词: capable of handling, handling millions, millions of input, input tokens, RAG systems
中文关键词: 能够处理数百万个输入、输入令牌、RAG系统
类目: Computation and Language (cs.CL)
备注:

Abstract:LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific insights repeat across documents. The “Summary of a Haystack” (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects - Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and corresponding 50 RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.
摘要:LLM和RAG系统现在能够处理数百万个甚至更多的输入token。然而,在长上下文任务上评估这类系统的输出质量仍然具有挑战性,因为像大海捞针(Needle-in-a-Haystack)这样的任务缺乏复杂性。在这项工作中,我们认为摘要可以在这样的评估中发挥核心作用。我们设计了一个流程来合成文档草堆(Haystack),确保特定的见解(insights)在文档之间重复出现。然后,“草堆摘要”(SummHay)任务要求系统处理草堆,并在给定查询的情况下生成摘要,以识别相关见解并准确引用源文档。由于我们精确地知道草堆摘要中应该出现哪些见解以及应该引用哪些文档,我们实现了一个高度可复现的自动评估,可以从覆盖率(Coverage)和引用(Citation)两个方面对摘要进行评分。我们在两个领域(对话、新闻)生成草堆,并对10个LLM和对应的50个RAG系统进行大规模评估。我们的发现表明,SummHay对于当前的系统来说是一个开放的挑战,因为即使是获得了文档相关性Oracle信号的系统,在联合得分上也比我们估计的人类表现(56%)落后10分以上。在没有检索器的情况下,像GPT-4o和Claude 3 Opus这样的长上下文LLM在SummHay上的得分低于20%。我们表明,SummHay也可以用于研究企业RAG系统和长上下文模型中的位置偏差。我们希望未来的系统能在SummHay上达到并超过人类的表现。

[NLP-27] Nullpointer at ArAIEval Shared Task: Arabic Propagandist Technique Detection with Token-to-Word Mapping in Sequence Tagging
[NLP-27] ArAIEval共享任务中的Nullpointer:在序列标注中使用token到单词映射的阿拉伯语宣传技术检测

链接: https://arxiv.org/abs/2407.01360
作者: Abrar Abir,Kemal Oflazer
关键词: ArAIEval shared task, propaganda technique detection, Arabic text, detection in Arabic, including tweets
中文关键词: ArAIEval共享任务、宣传技术检测、阿拉伯语文本、阿拉伯语检测,包括推文
类目: Computation and Language (cs.CL)
备注: To appear in proceedings of 2024 Arabic NLP Conference

Abstract:This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets & news paragraphs, from ArAIEval shared task 1. Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging. Experimental results show relying on the first token of the word for technique prediction produces the best performance. In addition, incorporating genre information as a feature further enhances the model’s performance. Our system achieved a score of 25.41, placing us 4th on the leaderboard. Subsequent post-submission improvements further raised our score to 26.68.
摘要:本文研究了ArAIEval共享任务1中阿拉伯语文本(包括推文和新闻段落)中宣传技术检测的优化。我们的方法涉及使用用于序列标注的神经网络分类器微调AraBERT v2模型。实验结果表明,依赖单词的第一个token进行技术预测可以产生最佳性能。此外,将体裁信息作为特征进一步增强了模型的性能。我们的系统获得了25.41分,在排行榜上排名第4。随后的提交后改进进一步将我们的分数提高到26.68。
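
The "first token of the word" strategy can be shown with a small sketch of token-to-word mapping. It assumes the tokenizer exposes, for every subword token, the index of the word it came from (with `None` for special tokens) and that word indices appear in increasing order; the function name, the label strings, and the sample alignment are hypothetical.

```python
from typing import List, Optional


def first_token_labels(word_ids: List[Optional[int]], token_labels: List[str]) -> List[str]:
    """Collapse per-token predictions to per-word predictions by keeping
    only the label predicted for the first subword token of each word."""
    word_labels: List[str] = []
    seen = set()
    for wid, label in zip(word_ids, token_labels):
        if wid is None or wid in seen:
            # Skip special tokens and non-initial subword pieces.
            continue
        seen.add(wid)
        word_labels.append(label)
    return word_labels


# Hypothetical alignment: [CLS], word 0 (2 pieces), word 1, word 2 (3 pieces), [SEP].
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
token_labels = ["O", "Loaded_Language", "O", "O", "Name_Calling", "O", "O", "O"]

print(first_token_labels(word_ids, token_labels))
```

Alternatives such as majority voting over all pieces of a word are possible; the abstract reports that the first-token choice performed best in the authors' experiments.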

[NLP-28] Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models
[NLP-28] 评估大型语言模型中基于知识的跨语言不一致性

链接: https://arxiv.org/abs/2407.01358
作者: Xiaolin Xing,Zhiwei He,Haoyu Xu,Xing Wang,Rui Wang,Yu Hong
关键词: Natural Language Processing, Large Language Models, observed in Large, Large Language, Natural Language
中文关键词: 自然语言处理,大型语言模型,在大型、大型语言、自然语言中观察
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper investigates the cross-lingual inconsistencies observed in Large Language Models (LLMs), such as ChatGPT, Llama, and Baichuan, which have shown exceptional performance in various Natural Language Processing (NLP) tasks. Despite their successes, these models often exhibit significant inconsistencies when processing the same concepts across different languages. This study focuses on three primary questions: the existence of cross-lingual inconsistencies in LLMs, the specific aspects in which these inconsistencies manifest, and the correlation between cross-lingual consistency and multilingual capabilities of LLMs. To address these questions, we propose an innovative evaluation method for Cross-lingual Semantic Consistency (xSC) using the LaBSE model. We further introduce metrics for Cross-lingual Accuracy Consistency (xAC) and Cross-lingual Timeliness Consistency (xTC) to comprehensively assess the models’ performance regarding semantic, accuracy, and timeliness inconsistencies. By harmonizing these metrics, we provide a holistic measurement of LLMs’ cross-lingual consistency. Our findings aim to enhance the understanding and improvement of multilingual capabilities and interpretability in LLMs, contributing to the development of more robust and reliable multilingual language models.
摘要:本文研究了ChatGPT、Llama和Baichuan等在各种自然语言处理(NLP)任务中表现出色的大型语言模型(LLM)中的跨语言不一致现象。尽管取得了成功,但这些模型在处理不同语言中的相同概念时往往表现出严重的不一致。本研究围绕三个主要问题:LLM中跨语言不一致的存在,这些不一致的具体表现方面,以及LLM的跨语言一致性与多语言能力之间的相关性。针对这些问题,我们提出了一种基于LaBSE模型的跨语言语义一致性(xSC)评估方法。我们进一步引入了跨语言准确性一致性(xAC)和跨语言时效性一致性(xTC)的度量,以综合评估模型在语义、准确性和时效性不一致方面的表现。通过综合这些指标,我们提供了一种对LLM跨语言一致性的整体度量。我们的发现旨在增进对LLM多语言能力和可解释性的理解和改进,有助于开发更健壮和可靠的多语言语言模型。

[NLP-29] Protecting Privacy in Classifiers by Token Manipulation
[NLP-29] 通过token操作保护分类器中的隐私

链接: https://arxiv.org/abs/2407.01334
作者: Re’em Harel,Yair Elboher,Yuval Pinter
关键词: remote service entails, service entails sending, entails sending private, sending private information, untrusted provider
中文关键词: 远程服务需要,服务需要发送,需要发送私人信息,发送私人信息,不受信任的提供商
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

Abstract:Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task, and can be reconstructed by a sophisticated attacker. In comparison, the contextualized manipulation provides an improvement in performance.
摘要:使用语言模型作为远程服务需要向不受信任的提供商发送私人信息。此外,潜在的窃听者可以拦截消息,从而暴露信息。在这项工作中,我们探索了在文本操作层面避免此类数据暴露的前景。我们专注于文本分类模型,考察各种token映射和上下文化操作函数,以了解是否可以在保持原始文本不可恢复的同时保持分类器的准确性。我们发现,尽管一些token映射函数易于实现且直接,但它们严重影响下游任务的性能,并且可以被老练的攻击者重建。相比之下,上下文化操作提供了性能改进。
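
A very simple token mapping function of the kind the abstract calls "easy and straightforward to implement" could be a secret one-to-one permutation of vocabulary ids applied on the client before text is sent. This is an illustrative assumption, not necessarily one of the mappings studied in the paper; the vocabulary size, seed, and function names are hypothetical.

```python
import random
from typing import Dict, List


def make_token_mapping(vocab_size: int, seed: int) -> Dict[int, int]:
    """Build a secret one-to-one mapping (permutation) over token ids."""
    ids = list(range(vocab_size))
    shuffled = ids[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(ids, shuffled))


def apply_mapping(token_ids: List[int], mapping: Dict[int, int]) -> List[int]:
    """Replace every token id by its mapped id."""
    return [mapping[t] for t in token_ids]


def invert(mapping: Dict[int, int]) -> Dict[int, int]:
    """Inverse mapping, available only to whoever holds the secret."""
    return {v: k for k, v in mapping.items()}


mapping = make_token_mapping(vocab_size=10, seed=0)
original = [3, 1, 4, 1, 5]
obfuscated = apply_mapping(original, mapping)

# Repeated tokens stay repeated after mapping, so frequency patterns leak.
print(obfuscated[1] == obfuscated[3])
```

Because a fixed permutation preserves token frequencies and repetition patterns, it is plausible that an attacker with enough traffic could recover much of the mapping, which is in line with the abstract's remark that such mappings can be reconstructed.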

[NLP-30] Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning
[NLP-30] 免费增加模型容量:参数高效微调的简单策略

链接: https://arxiv.org/abs/2407.01320
作者: Haobo Song,Hao Zhao,Soumajit Majumder,Tao Lin
关键词: large pre-trained foundation, Fine-tuning large pre-trained, pre-trained foundation models, large pre-trained, pre-trained foundation
中文关键词: 大型预训练基础,微调大型预训练、预训练的基础模型,大型预训练、预训练的基础
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICLR 2024. Code at this https URL

Abstract:Fine-tuning large pre-trained foundation models, such as the 175B GPT-3, has attracted more attention for downstream tasks recently. While parameter-efficient fine-tuning methods have been proposed and proven effective without retraining all model parameters, their performance is limited by the capacity of incremental modules, especially under constrained parameter budgets. To overcome this challenge, we propose CapaBoost, a simple yet effective strategy that enhances model capacity by leveraging low-rank updates through parallel weight modules in target layers. By applying static random masks to the shared weight matrix, CapaBoost constructs a diverse set of weight matrices, effectively increasing the rank of incremental weights without adding parameters. Notably, our approach can be seamlessly integrated into various existing parameter-efficient fine-tuning methods. We extensively validate the efficacy of CapaBoost through experiments on diverse downstream tasks, including natural language understanding, question answering, and image classification. Our results demonstrate significant improvements over baselines, without incurring additional computation or storage costs. Our code is available at this https URL.
摘要:微调大型预训练基础模型(如175B的GPT-3)最近在下游任务中吸引了更多的关注。虽然参数高效的微调方法已经被提出,并被证明无需重新训练所有模型参数即可有效,但它们的性能受到增量模块容量的限制,特别是在参数预算受限的情况下。为了克服这一挑战,我们提出了CapaBoost,这是一个简单但有效的策略,通过目标层中的并行权重模块利用低秩更新来增强模型容量。通过将静态随机掩码应用于共享权重矩阵,CapaBoost构建了一组多样的权重矩阵,在不增加参数的情况下有效地提高了增量权重的秩。值得注意的是,我们的方法可以无缝地集成到各种现有的参数高效微调方法中。我们通过在多种下游任务(包括自然语言理解、问答和图像分类)上的实验,广泛地验证了CapaBoost的有效性。我们的结果表明,在不产生额外计算或存储成本的情况下,与基线相比有了显著的改进。我们的代码位于此https URL。
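
The claim that masking a shared low-rank update can raise its rank without adding parameters can be checked on a tiny example. The sketch below is a toy illustration of that idea, not the CapaBoost implementation: it uses a rank-1 update built from two shared vectors, and the two masks are fixed by hand (instead of sampled randomly) so the result is deterministic. All names and values are hypothetical.

```python
from fractions import Fraction
from typing import List

Matrix = List[List[int]]


def outer_masked(b: List[int], a: List[int], mask_b: List[int], mask_a: List[int]) -> Matrix:
    """Rank-1 update (b * mask_b)(a * mask_a)^T built from the shared vectors."""
    return [[bi * mb * aj * ma for aj, ma in zip(a, mask_a)] for bi, mb in zip(b, mask_b)]


def add(m1: Matrix, m2: Matrix) -> Matrix:
    """Element-wise sum of two matrices (parallel modules are summed)."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def rank(matrix: Matrix) -> int:
    """Matrix rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Shared parameters: 8 numbers in total, with or without masking.
b = [1, 2, 3, 4]
a = [1, 1, 2, 2]
ones = [1, 1, 1, 1]

plain = outer_masked(b, a, ones, ones)  # ordinary rank-1 update
mask1, mask2 = [1, 1, 0, 0], [0, 0, 1, 1]  # static masks (hand-picked here)
boosted = add(outer_masked(b, a, mask1, mask1), outer_masked(b, a, mask2, mask2))

print(rank(plain), rank(boosted))
```

The masks are constants rather than trained weights, so the two parallel modules reuse the same `b` and `a`; only the rank of the summed update changes.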

[NLP-31] Language Portability Strategies for Open-domain Dialogue with Pre-trained Language Models from High to Low Resource Languages
[NLP-31] 使用从高资源语言到低资源语言的预训练语言模型进行开放领域对话的语言移植策略

链接: https://arxiv.org/abs/2407.01315
作者: Ahmed Njifenjou,Virgile Sucal,Bassam Jabaian,Fabrice Lefèvre
关键词: linguistic portability strategies, open-domain dialogue systems, pre-trained language models, large pre-trained language, paper we propose
中文关键词: 语言可移植性策略、开放领域对话系统、预训练语言模型、大型预训练语言,我们提出的论文
类目: Computation and Language (cs.CL)
备注: The 13th International Workshop on Spoken Dialogue Systems Technology (IWSDS '23)

点击查看摘要

Abstract:In this paper we propose a study of linguistic portability strategies of large pre-trained language models (PLMs) used for open-domain dialogue systems in a high-resource language for this task. In particular the target low-resource language (L_T) will be simulated with French, as it lacks task-specific resources and allows our human evaluation, while the source language (L_S) is English. For obvious reasons, recent works using such models for open-domain dialogue are mostly developed in English. Yet building specific PLMs for each possible target language supposes collecting new datasets and is costly. For this reason, trying to leverage all existing resources (PLMs and data) in both L_S and L_T, we wish to assess the performance achievable in L_T with different approaches. The first two approaches evaluate the usage of Neural Machine Translation (NMT) at different levels: TrainOnTarget, where a L_S dataset is translated before fine-tuning in L_T, and TestOnSource, where a L_S model is coupled with NMT modules during inference. Then, the advent of BLOOM [2], the world’s first open-access multilingual large PLM, allowed researchers to develop new approaches aiming to leverage not only the model’s full accessibility but also its multilingualism and translation abilities. In this context the task is learned in L_S first and adapted to L_T using the MAD-X Adapter architecture [16]. In the two sets of experiments models are evaluated in spoken dialogue conditions with humans and the strategies can be compared in terms of perceived interaction quality.
摘要:针对这一任务,我们提出了一种用于开放领域对话系统的大型预训练语言模型(PLM)的语言可移植策略的研究。特别是,目标低资源语言(L_T)将用法语模拟,因为它缺乏特定任务的资源,并且允许我们进行人工评估,而源语言(L_S)是英语。由于显而易见的原因,最近使用这种模式进行开放领域对话的著作大多是用英语开发的。然而,为每种可能的目标语言建立特定的PLM需要收集新的数据集,而且成本高昂。因此,尝试利用L_S和L_T中的所有现有资源(PLM和数据),我们希望以不同的方法评估L_T可以实现的性能。前两种方法在不同的水平上评估神经机器翻译的使用:TrainOnTarget,其中L_S的数据集在L_T中进行微调之前被翻译;以及TestOnSource,其中L_S模型与神经机器翻译模块在推理过程中耦合。然后,Bloom[2]的问世,世界上第一个开放获取的多语言大型PLM,允许研究人员开发新的方法,旨在不仅利用该模型的完全可访问性,而且还利用其多语言和翻译能力。在这种情况下,该任务首先在L_S中学习,并使用MAD-X适配器体系结构适应L_T[16]。在这两组实验中,在与人的口语对话条件下对模型进行了评估,并从感知交互质量的角度对两种策略进行了比较。

[NLP-32] Collaborative Performance Prediction for Large Language Models
[NLP-32] 大型语言模型的协同性能预测

链接: https://arxiv.org/abs/2407.01300
作者: Qiyuan Zhang,Fuyuan Lyu,Xue Liu,Chen Ma
关键词: NLP research, Comprehensively understanding, challenge in NLP, large language models, diverse downstream tasks
中文关键词: NLP研究、全面理解、NLP挑战、大型语言模型、多样化的下游任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. The pioneering scaling law on downstream works demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, they tend to overlook the similarities between model families and only consider design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect a collaborative data sourced from online platforms containing both historical performance and additional design factors. With the support of the collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.
摘要:全面理解和准确预测大型语言模型在不同下游任务中的性能已经成为自然语言处理研究中的一个关键挑战。关于下游任务缩放定律的开创性工作展示了模型族内部的内在相似性,并利用这些相似性进行性能预测。但是,它们往往忽略模型族之间的相似性,只考虑原始缩放定律中列出的设计因素。为了克服这些局限性,我们引入了一个新的框架,协作性能预测(CPP),它通过利用各种模型在下游任务上的历史性能以及模型和任务的其他设计因素来显著提高预测精度。我们还收集了来自在线平台的协作数据,其中包含历史性能和其他设计因素。在协作数据的支持下,CPP不仅在预测放大规模后的LLM的性能方面超过了传统的缩放定律,而且还有助于对因素重要性的详细分析,这是以前被忽视的领域。

[NLP-33] Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
[NLP-33] 具有混合适配器的轻量级零样本文本到语音

链接: https://arxiv.org/abs/2407.01291
作者: Kenichi Fujita,Takanori Ashihara,Marc Delcroix,Yusuke Ijima
关键词: demonstrated high fidelity, based on large-scale, demonstrated high, high fidelity, fidelity in reproducing
中文关键词: 表现出高保真度,在大规模的基础上,表现出高、高保真度、复制度
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 5 pages,3 figures, Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:The advancements in zero-shot text-to-speech (TTS) methods, based on large-scale models, have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of parameters at 1.9 times faster inference speed. Audio samples are available on our demo page (this https URL).
摘要:基于大规模模型的零样本文本转语音(TTS)方法的进步已经展现出再现说话人特征的高保真度。然而,这些模型对于实际日常使用来说太大了。我们提出了一种使用混合适配器(MoA)的轻量级零样本TTS方法。我们提出的方法将MoA模块整合到非自回归TTS模型的解码器和方差适配器中。这些模块根据说话人嵌入选择与说话人特征相关的适当适配器,从而增强了以零样本方式适应各种说话人的能力。我们的方法以最少的附加参数实现了高质量的语音合成。通过客观和主观评估,我们确认我们的方法以不到40%的参数、1.9倍的推理速度实现了比基线更好的性能。音频样本可在我们的演示页面(此https URL)上找到。

[NLP-34] We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
[NLP-34] We-Math:您的大型多模态模型能否实现类人的数学推理?

链接: https://arxiv.org/abs/2407.01284
作者: Runqi Qiao,Qiuna Tan,Guanting Dong,Minhui Wu,Chong Sun,Xiaoshuai Song,Zhuoma GongQue,Shanglin Lei,Zhe Wei,Miaoxuan Zhang,Runfeng Qiao,Yifan Zhang,Xiao Zong,Yida Xu,Muxi Diao,Zhimin Bao,Chen Li,Honggang Zhang
关键词: Large Multimodal Models, Multimodal Models, Large Multimodal, received widespread attention, Visual mathematical reasoning
中文关键词: 大型多模态模型,多模态模型,大型多模态,受到广泛关注,视觉数学推理
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: Work in progress

点击查看摘要

Abstract:Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at this https URL.
摘要:视觉数学推理作为一种基本的视觉推理能力,受到了大型多模态模型(LMM)领域的广泛关注。现有的基准,如MathVista和MathVerse,更多地关注以结果为导向的性能,而忽视了知识获取和泛化的基本原理。受类人数学推理的启发,我们引入了WE-MATH,这是第一个专门为探索端到端性能之外的问题解决原理而设计的基准。我们精心收集和归类了6.5K道视觉数学题,涵盖67个层次化的知识概念和5层知识粒度。我们根据所需的知识概念将复合问题分解为子问题,并引入了一种新的四维度量,即知识不足(IK)、泛化不足(IG)、完全掌握(CM)和死记硬背(RM),以分层地评估LMM推理过程中的内在问题。我们使用WE-MATH对现有LMM的视觉数学推理能力进行了全面的评估,发现求解步骤数与问题上的表现之间存在负相关关系。我们证实,通过知识增强策略可以有效地改善LMM的IK问题。更值得注意的是,GPT-4o的主要挑战已经显著地从IK过渡到IG,使其成为第一个迈向知识泛化阶段的LMM。相比之下,其他LMM表现出明显的死记硬背(RM)倾向:它们能正确地解决涉及多个知识概念的复合问题,却无法回答子问题。我们预计WE-MATH将为LMM视觉数学推理的发展开辟新的途径。WE-MATH数据和评估代码可在此https URL中找到。
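
The four-dimensional metric described above can be read as a decision rule over one composite problem and its sub-problems. The sketch below is an assumption, not the benchmark's evaluation code: the abstract only spells out the Rote Memorization case (composite solved, sub-problems failed), and the other three branches are inferred from the category names. The function name `classify` is hypothetical.

```python
from typing import Sequence

def classify(composite_correct: bool, sub_correct: Sequence[bool]) -> str:
    """Assign one of IK / IG / CM / RM to a single composite problem.

    Assumed reading of the four categories:
      CM: composite solved and every sub-problem solved
      RM: composite solved but at least one sub-problem failed
      IG: every sub-problem solved but the composite failed
      IK: composite failed and at least one sub-problem failed
    """
    all_subs_solved = all(sub_correct)
    if composite_correct:
        return "CM" if all_subs_solved else "RM"
    return "IG" if all_subs_solved else "IK"

# A model that answers the two-step problem but misses one of its steps.
print(classify(True, [True, False]))
```

Counting these labels over a dataset would give per-category rates; how the benchmark aggregates or weights them is not described in the abstract.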

[NLP-35] Leveraging Large Language Models for Actionable Course Evaluation Student Feedback to Lecturers
[NLP-35] 利用大型语言模型进行可操作课程评估学生向讲师的反馈

链接: https://arxiv.org/abs/2407.01274
作者: Mike Zhang,Euan D Lindsay,Frederik Bode Thorbensen,Danny Bøgsted Poulsen,Johannes Bjerva
关键词: End of semester, dominant mechanism, semester student evaluations, End, feedback
中文关键词: 学期结束,主导机制,学期学生评估,结束,反馈
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Accepted to SEFI 2024

点击查看摘要

Abstract:End of semester student evaluations of teaching are the dominant mechanism for providing feedback to academics on their teaching practice. For large classes, however, the volume of feedback makes these tools impractical for this purpose. This paper explores the use of open-source generative AI to synthesise factual, actionable and appropriate summaries of student feedback from these survey responses. In our setup, we have 742 student responses ranging over 75 courses in a Computer Science department. For each course, we synthesise a summary of the course evaluations and actionable items for the instructor. Our results reveal a promising avenue for enhancing teaching practices in the classroom setting. Our contribution lies in demonstrating the feasibility of using generative AI to produce insightful feedback for teachers, thus providing a cost-effective means to support educators’ development. Overall, our work highlights the possibility of using generative AI to produce factual, actionable, and appropriate feedback for teachers in the classroom setting.
摘要:学期末学生对教学的评价是向学者反馈教学实践的主要机制。然而,对于大班来说,反馈的数量使这些工具不适用于此目的。本文探讨了如何使用开源的生成性人工智能来从这些调查答复中综合出事实的、可操作的和适当的学生反馈摘要。在我们的设置中,我们有742名学生回应,涉及计算机科学系的75门课程。对于每门课程,我们为教师综合课程评估和可操作项目的摘要。我们的结果揭示了在课堂环境中加强教学实践的一条很有希望的途径。我们的贡献在于证明了使用生成性人工智能为教师产生有洞察力的反馈的可行性,从而为支持教育工作者的发展提供了一种经济有效的手段。总体而言,我们的工作突出了使用生成性人工智能在课堂环境中为教师产生事实的、可操作的和适当的反馈的可能性。

[NLP-36] Show Less Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER
[NLP-36] 少展示,多指导:用定义和指南丰富零样本NER的提示

链接: https://arxiv.org/abs/2407.01272
作者: Andrew Zamai,Andrea Zugarini,Leonardo Rigutini,Marco Ernandes,Marco Maggini
关键词: instruction-tuned Large Language, Large Language Models, specialized instruction-tuned Large, Large Language, Named Entity Recognition
中文关键词: 指令微调的大型语言、大型语言模型、专门的指令微调大型、大型语言、命名实体识别
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, several specialized instruction-tuned Large Language Models (LLMs) for Named Entity Recognition (NER) have emerged. Compared to traditional NER approaches, these models have strong generalization capabilities. Existing LLMs mainly focus on zero-shot NER in out-of-domain distributions, being fine-tuned on an extensive number of entity classes that often highly or completely overlap with test sets. In this work instead, we propose SLIMER, an approach designed to tackle never-seen-before named entity tags by instructing the model on fewer examples, and by leveraging a prompt enriched with definition and guidelines. Experiments demonstrate that definition and guidelines yield better performance, faster and more robust learning, particularly when labelling unseen Named Entities. Furthermore, SLIMER performs comparably to state-of-the-art approaches in out-of-domain zero-shot NER, while being trained on a reduced tag set.
摘要:最近,出现了几种专门用于命名实体识别(NER)的指令微调大型语言模型(LLM)。与传统NER方法相比,这些模型具有很强的泛化能力。现有的LLM主要关注域外分布中的零样本NER,并在大量通常与测试集高度或完全重叠的实体类上进行微调。相反,在这项工作中,我们提出了SLIMER,这是一种旨在通过在更少的示例上指导模型并利用富含定义和指南的提示来处理从未见过的命名实体标签的方法。实验表明,定义和指南可以带来更好的性能以及更快、更稳健的学习,特别是在标注未见过的命名实体时。此外,SLIMER在域外零样本NER中的表现与最先进的方法相当,同时只在精简的标签集上进行训练。
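
To make "a prompt enriched with definition and guidelines" concrete, here is a minimal sketch of how such a prompt might be assembled for one entity tag. The template wording, the field labels, and the function name `build_ner_prompt` are all hypothetical; the abstract does not give SLIMER's actual prompt format.

```python
def build_ner_prompt(text: str, tag: str, definition: str, guidelines: str) -> str:
    """Assemble a zero-shot NER prompt for a single entity tag (hypothetical template)."""
    return "\n".join([
        f"Extract every span of type {tag} from the text below.",
        f"DEFINITION: {definition}",
        f"GUIDELINES: {guidelines}",
        f"TEXT: {text}",
        "Return the spans as a JSON list of strings.",
    ])

prompt = build_ner_prompt(
    text="Marie Curie was born in Warsaw.",
    tag="LOCATION",
    definition="A named geographical place such as a city, country, or region.",
    guidelines="Do not label nationalities or institutions; label only the place name itself.",
)
print(prompt)
```

The point of the extra fields is that an unseen tag name alone may be ambiguous, whereas a definition plus annotation guidelines tells the model what to include and exclude.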

[NLP-37] First Place Solution of 2023 Global Artificial Intelligence Technology Innovation Competition Track 1
[NLP-37] 2023年全球人工智能技术创新大赛第一赛道解决方案第一名

链接: https://arxiv.org/abs/2407.01271
作者: Xiangyu Wu,Hailiang Zhang,Yang Yang,Jianfeng Lu
关键词: Innovation Competition Track, Medical Imaging Diagnosis, Global Artificial Intelligence, Artificial Intelligence Technology, Intelligence Technology Innovation
中文关键词: 创新竞赛赛道、医学影像诊断、全球人工智能、人工智能技术、智能技术创新
类目: Computation and Language (cs.CL)
备注: First Place of 2023 Global Artificial Intelligence Technology Innovation Competition

点击查看摘要

Abstract:In this paper, we present our champion solution to the Global Artificial Intelligence Technology Innovation Competition Track 1: Medical Imaging Diagnosis Report Generation. We select CPT-BASE as our base model for the text generation task. During the pre-training stage, we delete the mask language modeling task of CPT-BASE and instead reconstruct the vocabulary, adopting a span mask strategy and gradually increasing the number of masking ratios to perform the denoising auto-encoder pre-training task. In the fine-tuning stage, we design iterative retrieval augmentation and noise-aware similarity bucket prompt strategies. The retrieval augmentation constructs a mini-knowledge base, enriching the input information of the model, while the similarity bucket further perceives the noise information within the mini-knowledge base, guiding the model to generate higher-quality diagnostic reports based on the similarity prompts. Surprisingly, our single model has achieved a score of 2.321 on leaderboard A, and the multiple model fusion scores are 2.362 and 2.320 on the A and B leaderboards respectively, securing first place in the rankings.
摘要:在本文中,我们介绍了我们在全球人工智能技术创新大赛第一赛道:医学影像诊断报告生成方面的冠军解决方案。我们选择CPT-BASE作为文本生成任务的基本模型。在预训练阶段,我们删除了CPT-BASE的掩码语言建模任务,代之以重建词汇表,采用跨度掩码策略,逐步增加掩蔽率来完成去噪自动编码器的预训练任务。在微调阶段,我们设计了迭代检索增强和噪声感知相似桶提示策略。检索扩充构建了一个微型知识库,丰富了模型的输入信息,而相似桶则进一步感知微型知识库中的噪声信息,指导模型基于相似性提示生成更高质量的诊断报告。令人惊讶的是,我们的单模在排行榜A上获得了2.321分,多模融合在A和B排行榜上的得分分别为2.362和2.320,确保了排名第一。

[NLP-38] The African Woman is Rhythmic and Soulful: Evaluation of Open-ended Generation for Implicit Biases
[NLP-38] 非洲女性有节奏又有灵魂:开放式生成的隐性偏见评估

链接: https://arxiv.org/abs/2407.01270
作者: Serene Lim
关键词: Large Language Models, Language Models, Large Language, LLM Decision Bias, demonstrate underlying prejudices
中文关键词: 大型语言模型、语言模型、大型语言、LLM决策偏见,展示了潜在的偏见
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates the subtle and often concealed biases present in Large Language Models (LLMs), which, despite passing explicit bias tests, can still exhibit implicit biases akin to those observed in humans who profess egalitarian beliefs yet demonstrate underlying prejudices. The challenge of measuring such biases is exacerbated as LLMs become increasingly proprietary, restricting access to their internal mechanisms such as embeddings, which are crucial for applying traditional bias measures. To tackle these issues, this study introduces innovative measures of bias inspired by psychological methodologies: the LLM Implicit Association Test (IAT) Bias and the LLM Decision Bias. The LLM IAT Bias is a prompt-based method designed to unearth implicit biases by simulating the well-known psychological IAT but adapted for use with LLMs. The LLM Decision Bias measure is developed to detect subtle discrimination in decision-making tasks, focusing on how LLMs choose between individuals in various scenarios. Open-ended generation is also utilised through thematic analysis of word generations and storytelling. The experiments revealed biases across gender and racial domains, from discriminatory categorisations to exoticisation. Our findings indicate that the prompt-based measure of implicit bias not only correlates with traditional embedding-based methods but also more effectively predicts downstream behaviors, which are crucially measured by the LLM Decision Bias. This relationship underscores the importance of relative, rather than absolute, evaluations in assessing implicit biases, reflecting psychological insights into human bias assessment. This research contributes to the broader understanding of AI ethics and provides suggestions for continually assessing and mitigating biases in advanced AI systems, emphasising the need for more qualitative and downstream focus.
摘要:这项研究调查了大型语言模型(LLM)中存在的微妙且往往被隐藏的偏见,这些模型尽管通过了显性偏见测试,但仍可能表现出内隐偏见,类似于在自称持有平等主义信念却表现出潜在偏见的人类身上观察到的情况。随着LLM变得越来越专有,限制了对其嵌入等内部机制的访问,衡量此类偏见的挑战加剧,而这些机制对于应用传统的偏见衡量标准至关重要。为了解决这些问题,本研究引入了受心理学方法论启发的创新偏见测量方法:LLM内隐联想测试(IAT)偏见和LLM决策偏见。LLM IAT偏见是一种基于提示的方法,旨在通过模拟众所周知的心理学IAT并使其适用于LLM来挖掘内隐偏见。LLM决策偏见测量用于检测决策任务中的细微歧视,重点关注LLM在不同情景下如何在个人之间做出选择。本研究还通过对词语生成和故事讲述的主题分析来利用开放式生成。这些实验揭示了性别和种族领域的偏见,从歧视性的分类到异域化。我们的发现表明,基于提示的内隐偏见测量不仅与传统的基于嵌入的方法相关,而且能更有效地预测下游行为,而下游行为主要由LLM决策偏见来衡量。这种关系突显了相对评估而不是绝对评估在评估内隐偏见方面的重要性,反映了对人类偏见评估的心理学见解。这项研究有助于更广泛地理解人工智能伦理,并为持续评估和减轻先进人工智能系统中的偏见提供建议,强调需要更多地关注定性分析和下游行为。

[NLP-39] SignCLIP: Connecting Text and Sign Language by Contrastive Learning
[NLP-39] SignCLIP:通过对比学习连接文本和手语

链接: https://arxiv.org/abs/2407.01264
作者: Zifan Jiang,Gerard Sant,Amit Moryossef,Mathias Müller,Rico Sennrich,Sarah Ebling
关键词: Contrastive Language-Image Pretraining, Contrastive Language-Image, Language-Image Pretraining, sign language, language
中文关键词: 对比语言-图像预训练,对比语言-图像,语言-图像预训练,手语,语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language which is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively for out-of-domain downstream tasks such as isolated sign language recognition upon essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
摘要:我们提出了SignCLIP,它重新利用CLIP(对比语言-图像预训练)将口语文本和手语视频这两类不同模态的自然语言投影到同一空间。SignCLIP是一种有效的方法,可以从大规模、多语言的视频-文本对中学习用于手语处理的有用的视觉表示,而不需要直接针对特定任务或通常规模有限的特定手语进行优化。我们在Spreadthesign上对SignCLIP进行了预训练,Spreadthesign是一个著名的手语词典,由多达44种手语的约50万个视频片段组成,并使用各种下游数据集对其进行评估。SignCLIP能够识别域内手语,具有显著的文本到视频/视频到文本检索精度。它还在域外下游任务方面具有竞争力,例如在基本的少样本提示或微调后进行孤立手语识别。我们分析了口语文本和手语姿势所形成的潜在空间,这为我们提供了更多的语言学见解。我们的代码和模型是公开提供的。

[NLP-40] uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation via Large-Scale Pseudo Labelling
[NLP-40] uDistil-Whisper:通过大规模伪标签进行知识蒸馏的无标签数据过滤

链接: https://arxiv.org/abs/2407.01257
作者: Abdul Waheed,Karima Kadaoui,Muhammad Abdul-Mageed
关键词: distilling Whisper knowledge, Recent work, models, reducing the size, Recent
中文关键词: 提炼Whisper知识,最近的工作,模型,缩小规模,最近
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Work in progress

点击查看摘要

Abstract:Recent work on distilling Whisper’s knowledge into small models using pseudo-labels shows promising performance while reducing the size by up to 50%. This results in small, efficient, and dedicated models. However, a critical step of distillation from pseudo-labels involves filtering high-quality predictions and using only those during training. This step requires ground truth to compare and filter bad examples making the whole process supervised. In addition to that, the distillation process requires a large amount of data thereby limiting the ability to distil models in low-resource settings. To address this challenge, we propose an unsupervised or label-free framework for distillation, thus eliminating the requirement for labeled data altogether. Through experimentation, we show that our best distilled models outperform the teacher model by 5-7 points in terms of WER. Additionally, our models are on par with or better than similar supervised data filtering setup. When we scale the data, our models significantly outperform all zero-shot and supervised models. In this work, we demonstrate that it’s possible to distill large Whisper models into relatively small models without using any labeled data. As a result, our distilled models are 25-50% more compute and memory efficient while maintaining performance equal to or better than the teacher model.
摘要:最近使用伪标签将Whisper的知识蒸馏到小模型中的工作显示出良好的性能,同时将模型大小最多减少了50%。这就产生了小型、高效和专用的模型。然而,基于伪标签进行蒸馏的一个关键步骤是筛选出高质量的预测,并在训练期间只使用这些预测。这一步需要真实标签来比较和过滤不良样本,使整个过程成为有监督的。除此之外,蒸馏过程需要大量数据,从而限制了在低资源环境下蒸馏模型的能力。为了应对这一挑战,我们提出了一个无监督或无标签的蒸馏框架,从而完全消除了对标注数据的要求。通过实验,我们发现我们最好的蒸馏模型在WER方面比教师模型好5-7个点。此外,我们的模型与类似的有监督数据过滤设置不相上下,甚至更好。当我们扩大数据规模时,我们的模型显著优于所有的零样本模型和有监督模型。在这项工作中,我们证明了在不使用任何标注数据的情况下,将大型Whisper模型蒸馏为相对较小的模型是可能的。因此,我们的蒸馏模型在保持与教师模型相同或更好的性能的同时,计算和内存效率提高了25%-50%。
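
The results above are reported in WER (word error rate), where lower is better. As a reminder of what the metric measures, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. It assumes whitespace tokenization and no text normalization, which may differ from the paper's evaluation setup; the function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

The ground-truth-based filtering that the abstract describes as the supervised step compares pseudo-labels against references with a metric of this kind, which is the dependency the proposed label-free framework removes.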

[NLP-41] Large Language Models are Zero-Shot Recognizers for Activities of Daily Living
[NLP-41] 大型语言模型是日常生活活动的零样本识别器

链接: https://arxiv.org/abs/2407.01238
作者: Gabriele Civitarese,Michele Fiori,Priyankar Choudhary,Claudio Bettini
关键词: Daily Living, Large Language Models, energy management, smart home environments, home environments enables
中文关键词: 日常生活、大型语言模型、能源管理、智能家居环境、家庭环境使
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: Currently under review

点击查看摘要

Abstract:The sensor-based recognition of Activities of Daily Living (ADLs) in smart home environments enables several applications in the areas of energy management, safety, well-being, and healthcare. ADLs recognition is typically based on deep learning methods requiring large datasets to be trained. Recently, several studies proved that Large Language Models (LLMs) effectively capture common-sense knowledge about human activities. However, the effectiveness of LLMs for ADLs recognition in smart home environments still deserves to be investigated. In this work, we propose ADL-LLM, a novel LLM-based ADLs recognition system. ADL-LLM transforms raw sensor data into textual representations, that are processed by an LLM to perform zero-shot ADLs recognition. Moreover, in the scenario where a small labeled dataset is available, ADL-LLM can also be empowered with few-shot prompting. We evaluated ADL-LLM on two public datasets, showing its effectiveness in this domain.
摘要:智能家居环境中基于传感器的日常生活活动(ADL)识别实现了能源管理、安全、福祉和医疗保健领域的多种应用。ADL识别通常基于需要大型数据集进行训练的深度学习方法。最近,几项研究证明,大型语言模型(LLM)可以有效地捕获有关人类活动的常识知识。然而,LLM在智能家居环境中用于ADL识别的有效性仍然值得研究。在这项工作中,我们提出了ADL-LLM,这是一种新型的基于LLM的ADL识别系统。ADL-LLM将原始传感器数据转换为文本表示,由LLM处理以执行零样本ADL识别。此外,在有小规模标注数据集可用的情况下,ADL-LLM还可以通过少样本提示得到增强。我们在两个公共数据集上评估了ADL-LLM,展示了其在该领域的有效性。

[NLP-42] MIRAI: Evaluating LLM Agents for Event Forecasting
[NLP-42] MIRAI:评估LLM代理的事件预测

链接: https://arxiv.org/abs/2407.01231
作者: Chenchen Ye,Ziniu Hu,Yihe Deng,Zijie Huang,Mingyu Derek Ma,Yanqiao Zhu,Wei Wang
关键词: solve complex problems, LLM agents, Large Language Models, empowered LLM agents, Recent advancements
中文关键词: 解决复杂问题、LLM代理、大型语言模型、授权LLM代理、最新进展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 66 pages, 8 figures, 6 tables; Website: this https URL

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, increasing interests have been put into employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite such a growing interest, there is a lack of a rigorous benchmark of LLM agents’ forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents’ abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents’ capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes using domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.
摘要:大型语言模型(LLM)的最新进展使LLM代理能够自主地收集世界信息,并在这些信息上进行推理来解决复杂的问题。鉴于这一能力,越来越多的人对使用LLM代理来预测国际事件产生了兴趣,而这类预测可能会影响决策并在国际范围内影响政策制定。尽管人们的兴趣与日俱增,但LLM代理的预测能力和可靠性缺乏严格的基准。为了弥补这一差距,我们引入了MIRAI,这是一个新的基准,旨在系统地评估LLM代理在国际事件背景下作为时间预测者的作用。我们的基准具有代理环境,具有访问历史、结构化事件和文本新闻文章的广泛数据库的工具。我们通过仔细的清理和解析来精炼GDELT事件数据库,以构建一系列具有不同预测时间跨度的关系预测任务,评估LLM代理从短期到长期的预测能力。我们进一步实现了API,使LLM代理能够通过基于代码的接口使用不同的工具。总而言之,MIRAI从三个方面全面评估了代理的能力:1)自主地从大型全球数据库中获取和集成关键信息;2)使用特定于领域的API和库编写代码以供工具使用;3)联合推理来自不同格式和时间的历史知识,以准确预测未来事件。通过全面的基准测试,我们的目标是建立一个可靠的框架,以评估LLM代理预测国际事件的能力,从而有助于开发更准确和可靠的国际关系分析模型。

[NLP-43] Searching for Best Practices in Retrieval-Augmented Generation
[NLP-43] 寻找检索增强生成的最佳实践

链接: https://arxiv.org/abs/2407.01219
作者: Xiaohua Wang,Zhenghua Wang,Xuan Gao,Feiran Zhang,Yixin Wu,Zhibo Xu,Tianyuan Shi,Zhengyuan Wang,Shizheng Li,Qi Qian,Ruicheng Yin,Changze Lv,Xiaoqing Zheng,Xuanjing Huang
关键词: enhancing response quality, mitigating hallucinations, effective in integrating, specialized domains, Retrieval-augmented generation
中文关键词: 提高响应质量,减轻幻觉,有效整合,专业领域,检索增强生成
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a “retrieval as generation” strategy.
摘要:检索增强生成(RAG)技术已被证明在整合最新信息、减轻幻觉和提高响应质量方面是有效的,特别是在专业领域。虽然已经提出了许多RAG方法来通过依赖于查询的检索来增强大型语言模型,但这些方法仍然存在实现复杂和响应时间延长的问题。通常,RAG工作流程涉及多个处理步骤,每个步骤都可以以各种方式执行。在这里,我们调查现有的RAG方法及其潜在的组合,以确定最佳的RAG实践。通过大量的实验,我们提出了几种平衡性能和效率的RAG部署策略。此外,我们还证明了多模态检索技术可以显著提高关于视觉输入的问答能力,并使用“以检索为生成”的策略来加速多模态内容的生成。

[NLP-44] EconNLI: Evaluating Large Language Models on Economics Reasoning
[NLP-44] EconNLI:评估经济推理中的大型语言模型

链接: https://arxiv.org/abs/2407.01212
作者: Yue Guo,Yi Yang
关键词: Large Language Models, providing financial advice, lacks systematic evaluation, Large Language, Language Models
中文关键词: 大型语言模型,提供财务建议,缺乏系统评估,大型语言,语言模型
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used for writing economic analysis reports or providing financial advice, but their ability to understand economic knowledge and reason about potential results of specific economic events lacks systematic evaluation. To address this gap, we propose a new dataset, natural language inference on economic events (EconNLI), to evaluate LLMs’ knowledge and reasoning abilities in the economic domain. We evaluate LLMs on (1) their ability to correctly classify whether a premise event will cause a hypothesis event and (2) their ability to generate reasonable events resulting from a given premise. Our experiments reveal that LLMs are not sophisticated in economic reasoning and may generate wrong or hallucinated answers. Our study raises awareness of the limitations of using LLMs for critical decision-making involving economic reasoning and analysis. The dataset and codes are available at this https URL.
摘要:大型语言模型(LLM)被广泛用于撰写经济分析报告或提供财务建议,但其理解经济知识和对特定经济事件潜在结果进行推理的能力缺乏系统评估。为了弥补这一差距,我们提出了一个新的数据集,即经济事件的自然语言推理(EconNLI),来评估LLM在经济领域的知识和推理能力。我们评估LLM的指标包括:(1)它们正确分类前提事件是否会导致假设事件的能力,以及(2)它们生成由给定前提产生的合理事件的能力。我们的实验表明,LLM在经济推理方面并不成熟,可能会产生错误或幻觉的答案。我们的研究提高了人们对使用LLM进行涉及经济推理和分析的关键决策的局限性的认识。数据集和代码可在此https URL中获取。

[NLP-45] Memory^3: Language Modeling with Explicit Memory
[NLP-45] Memory^3:具有显式记忆的语言建模

链接: https://arxiv.org/abs/2407.01178
作者: Hongkang Yang,Zehao Lin,Wenjin Wang,Hao Wu,Zhiyu Li,Bo Tang,Wenqiang Wei,Jinbo Wang,Zeyun Tang,Shichao Song,Chenyang Xi,Yu Yu,Kai Chen,Feiyu Xiong,Linpeng Tang,Weinan E
关键词: large language models, meaningful computation, large language, costly process, process that transports
中文关键词: 大型语言模型、有意义的计算、大型语言、昂贵的过程、传输的过程
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining “abstract knowledge”. As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named Memory^3, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.
摘要:大型语言模型(LLM)的训练和推理合在一起是一个昂贵的过程,它将知识从原始数据传输到有意义的计算。受人脑记忆层次的启发,我们通过为LLM配备显式记忆来降低这一成本,这是一种比模型参数和文本检索增强生成(RAG)更便宜的记忆格式。从概念上讲,随着LLM的大部分知识外化到显式记忆中,LLM可以享受较小的参数大小、训练成本和推理成本,所有这些都与剩余的“抽象知识”的数量成比例。作为概念的初步验证,我们从零开始训练一个2.4B的LLM,它获得了比大得多的LLM和RAG模型更好的性能,并保持了比RAG更高的解码速度。该模型被命名为Memory^3,因为显式记忆是LLM中继隐式记忆(模型参数)和工作记忆(上下文键-值)之后的第三种记忆形式。我们引入了支持知识外部化的记忆电路理论,并提出了新的技术,包括使存储易于处理的记忆稀疏化机制和促进记忆形成的两阶段预训练方案。

[NLP-46] Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
[NLP-46] 面向覆盖条件检索增强生成的探索与选择学习

链接: https://arxiv.org/abs/2407.01158
作者: Takyoung Kim,Kyungjae Lee,Young Rok Jang,Ji Yong Cho,Gangwoo Kim,Minseok Cho,Moontae Lee
关键词: extensive parametric capacities, typically yield long-form, Interactions with billion-scale, yield long-form responses, long-form responses due
中文关键词: 广泛的参数能力,通常产生长形式,与十亿级的相互作用,产生长形式响应,长形式响应
类目: Computation and Language (cs.CL)
备注: Work in progress. Resources are available at this https URL

点击查看摘要

Abstract:Interactions with billion-scale large language models typically yield long-form responses due to their extensive parametric capacities, along with retrieval-augmented features. While detailed responses provide insightful viewpoints on a specific subject, they frequently generate redundant and less engaging content that does not meet user interests. In this work, we focus on the role of query outlining (i.e., selected sequence of queries) in scenarios where users request a specific range of information, namely coverage-conditioned (C^2) scenarios. For simulating C^2 scenarios, we construct QTree, 10K sets of information-seeking queries decomposed with various perspectives on certain topics. By utilizing QTree, we train QPlanner, a 7B language model generating customized query outlines that follow coverage-conditioned queries. We analyze the effectiveness of generated outlines through automatic and human evaluation, targeting retrieval-augmented generation (RAG). Moreover, the experimental results demonstrate that QPlanner with alignment training can further provide outlines satisfying diverse user interests. Our resources are available at this https URL.
摘要:与数十亿参数规模的大型语言模型交互时,由于其庞大的参数容量以及检索增强功能,通常会得到长篇回复。虽然详细的回复能够提供对特定主题的深入见解,但它们经常包含冗余且不太吸引人的内容,不符合用户的兴趣。在这项工作中,我们关注查询大纲(即选定的查询序列)在用户请求特定范围信息的场景中的作用,即覆盖条件(C^2)场景。为了模拟C^2场景,我们构建了QTree,它包含10K组信息搜索查询,这些查询从特定主题的不同视角进行分解。通过使用QTree,我们训练了QPlanner,这是一个7B语言模型,可生成遵循覆盖条件查询的定制查询大纲。我们针对检索增强生成(RAG),通过自动评价和人工评价来分析生成的大纲的有效性。此外,实验结果表明,经过对齐训练的QPlanner可以进一步提供满足不同用户兴趣的大纲。我们的资源可以在此 https URL 上找到。

[NLP-47] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models
[NLP-47] 解除一切对齐:或在多模态模型中将任意文本与任意图像对齐

链接: https://arxiv.org/abs/2407.01157
作者: Shaeke Salman,Md Montasir Bin Shams,Xiuwen Liu
关键词: unprecedented zero-shot capabilities, exhibit unprecedented zero-shot, shared embedding space, models exhibit unprecedented, zero-shot capabilities
中文关键词: 前所未有的零样本能力,展现出前所未有的零样本,共享嵌入空间,模型展现出前所未有的,零样本能力
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2401.15568 , arXiv:2402.08473

点击查看摘要

Abstract:Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.
摘要:利用共享的嵌入空间,新兴的多模态模型显示出前所未有的零样本能力。然而,如果不同模态可能被错位,共享嵌入空间可能会导致新的漏洞。在本文中,我们扩展并利用了最近开发的一种有效的基于梯度的方法,该方法允许我们通过对图像进行最小限度的修改来匹配给定文本的嵌入。利用该方法,我们证明了在联合图文模型中,通过不可察觉的对抗性攻击,可以将可区分文本的嵌入与任何图像对齐,从而揭示了语义无关的图像可以具有相同文本的嵌入,同时视觉上不可区分的图像可以与非常不同的文本的嵌入相匹配。将该技术应用于多个来源的文本数据集和图像时,成功率达到100%。如果不克服这一漏洞,多模态模型就无法以语义上有意义的方式稳健地对齐来自不同模态的输入。警告:本文中使用的文本数据具有毒性,可能会冒犯某些读者。

[NLP-48] Sociocultural Considerations in Monitoring Anti-LGBTQ Content on Social Media
[NLP-48] 监控社交媒体上反LGBTQ内容的社会文化考虑

链接: https://arxiv.org/abs/2407.01149
作者: Sidney G.-J. Wong
关键词: sociocultural factors, open-source training data, open-source hate speech, hate speech data, hate speech detection
中文关键词: 社会文化因素、开源训练数据、开源仇恨言论、仇恨言论数据、仇恨言论检测
类目: Computation and Language (cs.CL)
备注: Accepted Manuscript ACL 2024 Workshop C3NLP

点击查看摘要

Abstract:The purpose of this paper is to ascertain the influence of sociocultural factors (i.e., social, cultural, and political) in the development of hate speech detection systems. We set out to investigate the suitability of using open-source training data to monitor levels of anti-LGBTQ+ content on social media across different national varieties of English. Our findings suggest the social and cultural alignment of open-source hate speech data sets influences the predicted outputs. Furthermore, the keyword-search approach of anti-LGBTQ+ slurs in the development of open-source training data encourages detection models to overfit on slurs; therefore, anti-LGBTQ+ content may go undetected. We recommend combining empirical outputs with qualitative insights to ensure these systems are fit for purpose.
摘要:本文的目的是确定社会文化因素(即社会、文化和政治因素)在仇恨言论检测系统开发中的影响。我们着手调查使用开源训练数据来监控不同国家英语变体的社交媒体上反LGBTQ+内容水平的适用性。我们的研究结果表明,开源仇恨言论数据集的社会和文化取向会影响预测的输出。此外,在开发开源训练数据时针对反LGBTQ+侮辱性词汇的关键词搜索方法会促使检测模型对这些词汇过拟合;因此,反LGBTQ+内容可能无法被检测到。我们建议将实证输出与定性见解相结合,以确保这些系统符合使用目的。

[NLP-49] An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification
[NLP-49] 产品属性-价值识别生成方法的实证比较

链接: https://arxiv.org/abs/2407.01137
作者: Kassem Sabeh,Robert Litschko,Mouna Kacimi,Barbara Plank,Johann Gamper
关键词: e-commerce platforms, supporting applications, applications like search, question answering, Product attributes
中文关键词: 电子商务平台、支持应用程序、搜索等应用程序、问答、产品属性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Product attributes are crucial for e-commerce platforms, supporting applications like search, recommendation, and question answering. The task of Product Attribute and Value Identification (PAVI) involves identifying both attributes and their values from product information. In this paper, we formulate PAVI as a generation task and provide, to the best of our knowledge, the most comprehensive evaluation of PAVI so far. We compare three different attribute-value generation (AVG) strategies based on fine-tuning encoder-decoder models on three datasets. Experiments show that the end-to-end AVG approach, which is computationally efficient, outperforms other strategies. However, there are differences depending on model sizes and the underlying language model. The code to reproduce all experiments is available at: this https URL
摘要:产品属性对于电子商务平台至关重要,支持搜索、推荐和问答等应用程序。产品属性和价值识别(PAVI)的任务涉及从产品信息中识别属性及其价值。在本文中,我们将PAVI制定为一项生成任务,并提供了据我们所知迄今为止对PAVI最全面的评估。我们基于三个数据集上的微调编码器-解码器模型,比较了三种不同的属性值生成(AVG)策略。实验表明,端到端AVG方法计算效率高,优于其他策略。然而,根据模型大小和基础语言模型的不同,存在差异。复制所有实验的代码可在以下网址获取:此https URL

[NLP-50] Cross-Lingual Transfer Learning for Speech Translation
[NLP-50] 语音翻译的跨语言迁移学习

链接: https://arxiv.org/abs/2407.01130
作者: Rao Ma,Yassir Fathullah,Mengjie Qian,Siyuan Tang,Mark Gales,Kate Knill
关键词: increasing interest, interest in building, building multilingual foundation, NLP, NLP tasks
中文关键词: 增加兴趣,对建立、建立多语言基础、NLP、NLP任务的兴趣
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:There has been increasing interest in building multilingual foundation models for NLP and speech research. Zero-shot cross-lingual transfer has been demonstrated on a range of NLP tasks where a model fine-tuned on task-specific data in one language yields performance gains in other languages. Here, we explore whether speech-based models exhibit the same transfer capability. Using Whisper as an example of a multilingual speech foundation model, we examine the utterance representation generated by the speech encoder. Despite some language-sensitive information being preserved in the audio embedding, words from different languages are mapped to a similar semantic space, as evidenced by a high recall rate in a speech-to-speech retrieval task. Leveraging this shared embedding space, zero-shot cross-lingual transfer is demonstrated in speech translation. When the Whisper model is fine-tuned solely on English-to-Chinese translation data, performance improvements are observed for input utterances in other languages. Additionally, experiments on low-resource languages show that Whisper can perform speech translation for utterances from languages unseen during pre-training by utilizing cross-lingual representations.
摘要:为自然语言处理和语音研究建立多语言基础模型越来越受到人们的关注。零样本跨语言迁移已在一系列NLP任务上得到验证:在一种语言的特定任务数据上微调的模型,在其他语言上也能获得性能提升。在这里,我们探讨基于语音的模型是否表现出相同的迁移能力。以Whisper作为多语言语音基础模型的一个例子,我们检查了语音编码器生成的话语表示。尽管在音频嵌入中保留了一些语言敏感信息,但来自不同语言的单词被映射到相似的语义空间,语音到语音检索任务中的高召回率证明了这一点。利用这种共享的嵌入空间,我们在语音翻译中展示了零样本跨语言迁移。当只使用英译中翻译数据对Whisper模型进行微调时,可以观察到其他语言的输入话语的性能改善。此外,在低资源语言上的实验表明,Whisper可以利用跨语言表征对预训练中未见过的语言的话语进行语音翻译。

[NLP-51] Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation
[NLP-51] 研究稀疏专家混合在多领域神经机器翻译中的潜力

链接: https://arxiv.org/abs/2407.01126
作者: Nadezhda Chirkova,Vassilina Nikoulina,Jean-Luc Meunier,Alexandre Bérard
关键词: Neural Machine Translation, multi-domain Neural Machine, Machine Translation, Neural Machine, developing efficient models
中文关键词: 神经机器翻译,多域神经机器,机器翻译,神经机器,开发高效模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We focus on multi-domain Neural Machine Translation, with the goal of developing efficient models which can handle data from various domains seen during training and are robust to domains unseen during training. We hypothesize that Sparse Mixture-of-Experts (SMoE) models are a good fit for this task, as they enable efficient model scaling, which helps to accommodate a variety of multi-domain data, and allow flexible sharing of parameters between domains, potentially enabling knowledge transfer between similar domains and limiting negative transfer. We conduct a series of experiments aimed at validating the utility of SMoE for the multi-domain scenario, and find that a straightforward width scaling of Transformer is a simpler and surprisingly more efficient approach in practice, and reaches the same performance level as SMoE. We also search for a better recipe for robustness of multi-domain systems, highlighting the importance of mixing-in a generic domain, i.e. Paracrawl, and introducing a simple technique, domain randomization.
摘要:我们致力于多领域神经机器翻译的研究,目的是开发高效的模型,能够处理训练过程中看到的不同领域的数据,并且对训练过程中看不到的领域具有健壮性。我们假设稀疏专家混合(SMOE)模型很适合这项任务,因为它们支持有效的模型缩放,这有助于容纳各种多域数据,并允许域之间灵活地共享参数,潜在地使相似域之间的知识转移成为可能,并限制负转移。我们进行了一系列实验,旨在验证SMOE在多域场景中的实用性,并发现Transformer的直接宽度缩放在实践中是一种更简单、更高效的方法,并且达到了与SMOE相同的性能水平。我们还寻找了一种更好的方法来提高多域系统的健壮性,强调了混合在通用域中的重要性,即Paracrawl,并引入了一种简单的技术,域随机化。

[NLP-52] Calibrated Large Language Models for Binary Question Answering
[NLP-52] 用于二元问题解答的校准大型语言模型

链接: https://arxiv.org/abs/2407.01122
作者: Patrizio Giovannotti,Alexander Gammerman
关键词: large language models, binary text classification, text classification tasks, classification tasks remains, remains a challenge
中文关键词: 大型语言模型、二进制文本分类、文本分类任务、分类任务仍然是一个挑战
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to COPA 2024 (13th Symposium on Conformal and Probabilistic Prediction with Applications)

点击查看摘要

Abstract:Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model’s predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn–Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.
摘要:在二进制文本分类任务中,量化大语言模型预测的不确定性仍然是一个挑战。在LLMS的上下文中,校准指的是模型的预测概率与其预测的实际正确性之间的对准。一个经过良好校准的模型应该产生能够准确反映其预测正确的可能性的概率。我们提出了一种新的方法,它利用归纳的Venn-Abers预测器(IVAP)来校准与对应于二进制标签的输出标记相关联的概率。我们在使用Llama 2模型的BoolQ数据集上的实验表明,对于各种标签标记选择,IVAP始终优于常用的温度缩放方法,在保持高预测质量的同时实现了良好校准的概率。我们的发现有助于理解LLMS的校准技术,并为在二元问答任务中获得可靠的不确定性估计提供了实用的解决方案,提高了LLM预测的可解释性和可信度。
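As a rough illustration of the calibration setting in this abstract, the sketch below implements the temperature-scaling baseline the abstract mentions, plus a binned calibration error. It is not the paper's IVAP method. The function names, the use of a single "yes minus no" logit, and the choice of expected calibration error (ECE) as the metric are assumptions made for illustration only.

```python
import math


def temperature_scale(logit, temperature):
    """Binary temperature scaling: sigmoid of (logit / T).

    `logit` is assumed to be the score of the positive label minus the
    score of the negative label (a hypothetical setup, not from the paper).
    """
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned gap between predicted P(label = 1) and the observed positive rate.

    probs  : predicted probabilities of the positive label, each in [0, 1]
    labels : ground-truth labels, each 0 or 1
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_prob = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_prob - pos_rate)
    return ece
```

A temperature above 1 pulls probabilities toward 0.5, which is why it can reduce over-confidence; a model whose predicted 0.9 is right only half of the time would show an ECE of about 0.4 under this metric.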

[NLP-53] Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
[NLP-53] Pron vs Prompt:大型语言模型是否已经可以在创意文本写作上挑战世界级小说作家?

链接: https://arxiv.org/abs/2407.01119
作者: Guillermo Marco,Julio Gonzalo,Ramón del Castillo,María Teresa Mateo Girona
关键词: Large Language Models, creative text writing, report research results, creative writing skills, outperform average humans
中文关键词: 大型语言模型、创意文本写作、报告研究结果、创意写作技能,优于普通人
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages 6 figures

点击查看摘要

Abstract:It has become routine to report research results where Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks, and creative text writing is no exception. It seems natural, then, to raise the bid: Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer for this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top performing LLMs), in the spirit of AI-human duels such as Deep Blue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent’s. Then, we prepared an evaluation rubric inspired by Boden’s definition of creativity, and we collected 5,400 manual assessments provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer, and that such a level of autonomous creative writing skill probably cannot be reached simply with larger language models.
摘要:报告大型语言模型(LLM)在各种与语言相关的任务中表现优于普通人类的研究成果已成为一种惯例,创意文本写作也不例外。那么,提高赌注似乎是很自然的:LLM准备好与顶尖(而不是一般)小说家在创意写作技能上竞争了吗?为了给这个问题提供初步答案,我们在Patricio Pron(获奖小说家,被认为是他那一代人中最好的小说家之一)和GPT-4(表现最好的LLM之一)之间进行了一场比赛,其精神类似于Deep Blue vs Kasparov和AlphaGo vs Lee Sedol等人工智能与人类的对决。我们要求Pron和GPT-4各提供30个标题,然后为自己的标题和对手的标题撰写短篇小说。然后,我们受Boden对创造力的定义启发编制了一个评价量表,并收集了由文学评论家和学者提供的5,400份人工评估。我们的实验结果表明,LLM还远远不能挑战人类顶尖的创意作家,而这种水平的自主创意写作技能可能无法仅靠更大的语言模型来达到。

[NLP-54] BERGEN: A Benchmarking Library for Retrieval-Augmented Generation
[NLP-54] BERGEN:检索增强生成的基准测试库

链接: https://arxiv.org/abs/2407.01102
作者: David Rau,Hervé Déjean,Nadezhda Chirkova,Thibault Formal,Shuai Wang,Vassilina Nikoulina,Stéphane Clinchant
关键词: Large Language Models, enhance Large Language, Large Language, Language Models, Retrieval-Augmented Generation
中文关键词: 大型语言模型、增强大型语言、大型语言、语言模型、检索增强生成
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 29 pages

点击查看摘要

Abstract:Retrieval-Augmented Generation allows Large Language Models to be enhanced with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve an intricate number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available under this https URL.
摘要:检索增强生成允许使用外部知识增强大型语言模型。随着生成式LLM最近的流行,人们提出了许多RAG方法,其中涉及大量错综复杂的不同配置,例如评估数据集、文档集合、指标、检索器和LLM。不一致的基准测试给比较各种方法和了解流水线中每个组件的影响带来了重大挑战。在这项工作中,我们研究了为RAG的系统评估奠定基础的最佳实践,并介绍了BERGEN,这是一个用于可复现研究的端到端库,可将RAG实验标准化。在一项专注于QA的广泛研究中,我们对不同的最先进的检索器、重排序器和LLM进行了基准测试。此外,我们还分析了现有的RAG指标和数据集。我们的开源库BERGEN可在此 https URL 获取。

[NLP-55] Eliminating Position Bias of Language Models: A Mechanistic Approach
[NLP-55] 消除语言模型的位置偏差:一种机制性方法

链接: https://arxiv.org/abs/2407.01100
作者: Ziqi Wang,Hanlin Zhang,Xiner Li,Kuan-Hao Huang,Chi Han,Shuiwang Ji,Sham M. Kakade,Hao Peng,Heng Ji
关键词: Position bias, prevalent issue, issue of modern, Position, bias
中文关键词: 位置偏差,普遍问题,现代的问题,位置,偏差
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 5 figures

点击查看摘要

Abstract:Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset. 
摘要:位置偏差已被证明是现代语言模型(LM)中的一个普遍问题,即模型根据内容在给定上下文中的位置来确定内容的优先级。这种偏差通常会导致意外的模型故障,并损害各种应用中的性能、稳健性和可靠性。我们的机制分析将位置偏差归因于几乎所有最先进的LM中使用的两个组成部分:因果注意力和相对位置编码。具体地说,基于对检索增强问答(QA)的分析,我们发现因果注意力通常会导致模型偏向距离较远的内容,而RoPE等相对位置编码则偏向附近的内容。此外,我们对目标检测的实证研究表明,位置偏差也存在于视觉语言模型(VLM)中。基于以上分析,我们提出以无需训练的零样本方式消除由不同输入片段顺序(例如,LM-as-a-judge中的选项、QA中检索到的文档)造成的位置偏差。我们的方法将片段之间的因果注意力改为双向注意力,并利用模型的注意力值来确定片段的相对顺序,而不是使用输入提示中提供的顺序,从而在片段级别实现位置不变推理(Position-INvariant inferencE, PINE)。通过消除位置偏差,模型在位置偏差广泛存在的下游任务中获得了更好的性能和可靠性,例如LM-as-a-judge和检索增强QA。值得注意的是,在将LM用于评估推理对时,PINE特别有用:它在大多数情况下都能稳定地带来8到10个百分点的性能提升,并使Llama-3-70B-Instruct在RewardBench推理子集上的表现甚至优于GPT-4-0125-preview。

[NLP-56] IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation
[NLP-56] IBSEN:导演-演员代理合作,创造可控和互动的戏剧剧本

链接: https://arxiv.org/abs/2407.01093
作者: Senyu Han,Lu Chen,Li-Min Lin,Zhengshan Xu,Kai Yu
关键词: Large language models, human-like character role-playing, Large language, language model agents, Current language model
中文关键词: 大型语言模型、类人角色扮演、大型语言、语言模型代理、当前语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted by ACL 2024 Main

点击查看摘要

Abstract:Large language models have demonstrated their capabilities in storyline creation and human-like character role-playing. Current language model agents mainly focus on reasonable behaviors from the level of individuals, and their behaviors might be hard to constrain at the level of the whole storyline. In this paper we introduce IBSEN, a director-actor coordinate agent framework that generates drama scripts and makes the plot played by agents more controllable. The director agent writes plot outlines that the user desires to see, instructs the actor agents to role-play their characters, and reschedules the plot when human players participate in the scenario to ensure the plot is progressing towards the objective. To evaluate the framework, we create a novel drama plot that involves several actor agents and check the interactions between them under the instruction of the director agent. Evaluation results show that our framework could generate complete, diverse drama scripts from only a rough outline of plot objectives, meanwhile maintaining the characteristics of characters in the drama. Our codes and prompts are available at this https URL.
摘要:大型语言模型在故事情节创作和类人角色扮演方面已展现出能力。目前的语言模型代理主要关注个体层面的合理行为,其行为可能难以在整个故事情节的层面上加以约束。在本文中,我们介绍了IBSEN,这是一个导演与演员协同的代理框架,它生成戏剧剧本,并使代理演绎的情节更具可控性。导演代理编写用户希望看到的情节大纲,指示演员代理扮演各自的角色,并在人类玩家参与场景时重新安排情节,以确保情节朝着目标推进。为了评估该框架,我们创建了一个涉及多个演员代理的新颖戏剧情节,并在导演代理的指导下检查它们之间的互动。评估结果表明,我们的框架仅凭情节目标的粗略大纲就能生成完整、多样的戏剧剧本,同时保持剧中人物的特点。我们的代码和提示可在此 https URL 获取。

[NLP-57] M2QA: Multi-domain Multilingual Question Answering
[NLP-57] M2QA:多领域多语言问答

链接: https://arxiv.org/abs/2407.01091
作者: Leon Engländer,Hannah Sterz,Clifton Poth,Jonas Pfeiffer,Ilia Kuznetsov,Iryna Gurevych
关键词: Generalization and robustness, machine learning research, robustness to input, core desiderata, desiderata of machine
中文关键词: 概括性和鲁棒性、机器学习研究、输入鲁棒性、核心需求、机器的需求
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation. We witness 1) considerable performance variations across domain-language combinations within model classes and 2) considerable performance drops between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary. We make M2QA publicly available at this https URL.
摘要:泛化和对输入变化的稳健性是机器学习研究的核心要求。语言沿着几个轴变化,其中最重要的是语言实例(例如法语)和领域(例如新闻)。虽然将自然语言处理模型适配到单一领域内的新语言或单一语言内的新领域已得到广泛研究,但由于缺乏评估数据集,联合适配的研究受到阻碍。这阻碍了自然语言处理系统从资源丰富的语言和领域迁移到非主流的语言-领域组合。为了弥补这一差距,我们引入了M2QA,一个多领域多语言问答基准。M2QA包括13,500个SQuAD 2.0风格的德语、土耳其语和中文问答实例,涵盖产品评论、新闻和创意写作领域。我们使用M2QA来探索微调模型和最先进LLM的跨语言跨领域性能,并研究领域和语言适配的模块化方法。我们观察到:1)同一模型类别内不同领域-语言组合之间存在显著的性能差异;2)在所有模型规模下,源语言-领域组合与目标语言-领域组合之间存在显著的性能下降。我们证明M2QA远未被解决,需要新的方法来有效地迁移语言信息和领域特定信息。我们通过此 https URL 公开提供M2QA。

[NLP-58] Rethinking LLM-based Preference Evaluation
[NLP-58] 重新思考基于LLM的偏好评估

链接: https://arxiv.org/abs/2407.01085
作者: Zhengyu Hu,Linxin Song,Jieyu Zhang,Zheyuan Xiao,Jingang Wang,Zhenyu Chen,Jieyu Zhao,Hui Xiong
关键词: large language model, based preference evaluation, large language, widely adopted, adopted to compare
中文关键词: 大语言模型,基于偏好评估,大语言,广泛采用,采用进行比较
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, large language model (LLM)-based preference evaluation has been widely adopted to compare pairs of model responses. However, a severe bias towards lengthy responses has been observed, raising concerns about the reliability of this evaluation method. In this work, we designed a series of controlled experiments to study the major impacting factors of the metric of LLM-based preference evaluation, i.e., win rate, and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. We find that length impacts the existing evaluations by influencing information mass. However, a reliable evaluation metric should not only assess content quality but also ensure that the assessment is not confounded by extraneous factors such as response length. Therefore, we propose a simple yet effective adjustment, AdapAlpaca, to the existing practice of win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model’s answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation.
摘要:近年来,基于大语言模型(LLM)的偏好评估被广泛用于比较成对的模型回复。然而,人们观察到这种评估严重偏向冗长的回复,这引起了对该评估方法可靠性的担忧。在这项工作中,我们设计了一系列对照实验来研究基于LLM的偏好评估指标(即胜率)的主要影响因素,并得出结论:胜率受模型回复两个维度的影响,即可取性(desirability)和信息量(information mass),其中前者与长度无关且与可信性有关,后者与长度相关且可用条件熵来表示。我们发现,长度通过影响信息量来影响现有的评估。然而,可靠的评估指标不仅应该评估内容质量,还应该确保评估不会受到回复长度等无关因素的干扰。因此,我们针对现有的胜率测量做法提出了一种简单而有效的调整方法AdapAlpaca。具体地说,通过调整参考答案的长度,使其与测试模型的答案处于同一长度区间,我们消除了信息量中与长度相关的偏差,确保了公平的模型评估。
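The length-matching idea in this abstract (compare a test answer only against a reference answer from the same length interval, then compute the win rate) can be sketched as follows. This is a loose illustration, not the AdapAlpaca implementation: the word-count buckets, the 200-word interval width, and all function names are hypothetical, and the judge verdicts are assumed to come from elsewhere.

```python
def length_bucket(text, width=200):
    """Index of the length interval a text falls into (word count // width)."""
    return len(text.split()) // width


def pick_reference(test_answer, references, width=200):
    """Return the first reference in the same length interval, or None."""
    target = length_bucket(test_answer, width)
    for ref in references:
        if length_bucket(ref, width) == target:
            return ref
    return None


def win_rate(verdicts):
    """Fraction of comparisons won; verdicts are 1 (win) or 0 (loss)."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(verdicts) / len(verdicts)
```

Restricting each comparison to a same-interval reference is what keeps the judge from rewarding an answer merely for being longer than its reference.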

[NLP-59] Min P Sampling: Balancing Creativity and Coherence at High Temperature
[NLP-59] Min P采样:平衡高温下的创造力和一致性

链接: https://arxiv.org/abs/2407.01082
作者: Minh Nguyen,Andrew Baker,Andreas Kirsch,Clement Neo
关键词: Large Language Models, Large Language, Language Models, generate longform text, generate longform
中文关键词: 大型语言模型,大型语言,语言模型,生成长格式文本,生成长格式
类目: Computation and Language (cs.CL)
备注: 8 Pages

点击查看摘要

Abstract:Large Language Models (LLMs) generate longform text by successively sampling the next token based on the probability distribution of the token vocabulary at each decoding step. Current popular truncation sampling methods such as top-p sampling, also known as nucleus sampling, often struggle to balance coherence and creativity in generating text, particularly when using higher temperatures. To address this issue, we propose min-p, a dynamic truncation sampling method that establishes a minimum base percentage threshold for tokens, which scales according to the probability of the top candidate token. Through experiments on several benchmarks, such as GPQA, GSM8K and AlpacaEval Creative Writing, we demonstrate that min-p improves the coherence and quality of generated text even at high temperatures, while also facilitating more creative and diverse outputs compared to top-p and other sampling methods. As of writing, min-p has been adopted by multiple open-source LLM implementations, and has been independently assessed by members of the open-source LLM community, further validating its practical utility and potential.
摘要:大语言模型通过在每个解码步骤中根据令牌词汇量的概率分布连续采样下一个令牌来生成长文本。目前流行的截断抽样方法,如top-p抽样,也被称为核抽样,通常难以在生成文本时平衡连贯性和创造性,特别是在使用更高的温度时。为了解决这个问题,我们提出了一种动态截断抽样方法MIN-P,它为令牌建立了一个最小基本百分比阈值,该阈值根据顶级候选令牌的概率进行缩放。通过在GPQA、GSM8K和AlpacaEval Creative Writing等几个基准测试上的实验,我们证明了min-p即使在高温下也提高了生成文本的连贯性和质量,同时也促进了比top-p和其他采样方法更具创造性和多样性的输出。在撰写本文时,min-p已经被多个开源LLM实现采用,并且已经由开源LLM社区的成员独立评估,进一步验证了它的实用价值和潜力。
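The min-p rule described in this abstract fits in a few lines: keep every token whose probability is at least a base fraction of the top token's probability, then renormalize and sample. The following is a minimal sketch rather than the authors' implementation; `p_base`, the function names, and applying temperature before the filter are assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, p_base):
    """Zero out tokens below p_base * max(probs), then renormalize."""
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # never zero: the top token always passes
    return [p / total for p in kept]


def sample_min_p(logits, p_base=0.1, temperature=1.0, rng=random):
    """Sample one token index using min-p truncation."""
    probs = softmax(logits, temperature)
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
```

Because the cutoff scales with the top probability, a peaked distribution keeps very few tokens while a flat one keeps many, which is the behavior the abstract contrasts with the fixed cumulative mass of top-p.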

[NLP-60] CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
[NLP-60] CVLUE:中文视觉语言理解评估的新基准数据集

链接: https://arxiv.org/abs/2407.01081
作者: Yuxuan Wang,Yijun Liu,Fei Yu,Chen Huang,Kexin Li,Zhiguo Wan,Wanxiang Che
关键词: Chinese vision-language models, constructed on Western-centric, Chinese vision-language, Chinese, Chinese culture
中文关键词: 中国视觉语言模型,建立在以西方为中心的中国视觉语言、中国、中国文化的基础上
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs’ understanding of Chinese culture.
摘要:尽管中文视觉语言模型(VLM)发展迅速,但现有的大多数中文视觉语言(VL)数据集都是基于已有英语VL数据集中以西方为中心的图像构建的。图像中的文化偏见使得这些数据集不适合评估中国文化背景下的VLM。为了解决这个问题,我们提出了一个新的中文视觉语言理解评估(CVLUE)基准数据集,其中对象类别和图像的选择完全由中文母语者主导,确保源图像能够代表中国文化。该基准包含四个不同的VL任务,涵盖图像-文本检索、视觉问答、视觉定位和视觉对话。我们对CVLUE进行了详细的统计分析,并提供了几个开源多语言VLM在CVLUE及其英文对应数据集上的基线性能分析,以揭示它们在英文和中文之间的性能差距。我们深入的类别级分析表明,现有的VLM缺乏中国文化知识。我们还发现,在与中国文化相关的VL数据集上进行微调能有效地增强VLM对中国文化的理解。

[NLP-61] Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese
[NLP-61] Face4RAG:中文检索增强生成的事实一致性评估

链接: https://arxiv.org/abs/2407.01080
作者: Yunqi Xu,Tianchi Cai,Jiyan Jiang,Xierui Song
关键词: Retrieval Augmented Generation, conventional Retrieval Augmented, Augmented Generation, Retrieval Augmented, Large Language Models
中文关键词: 检索增强生成、传统检索增强、增强生成、检索增强、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose the first comprehensive FCE benchmark Face4RAG for RAG independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology for factuality inconsistency error and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions. On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference. To fix this issue, we further propose a new method called L-Face4RAG with two novel designs of logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably beyond the RAG task from which it is originally motivated. Both the benchmark and our proposed method are publicly available at this https URL.
摘要:传统检索增强生成(RAG)中普遍存在的事实不一致错误问题推动了事实一致性评价(FCE)的研究。尽管前面提出了各种FCE方法,但这些方法是在特定的大型语言模型(LLM)生成的数据集上进行评估的。在没有全面的基准的情况下,这些FCE方法如何在具有不同错误分布甚至是看不见的错误类型的其他LLM上执行仍然是未被探索的,因为这些方法可能无法检测由其他LLM产生的错误类型。为了填补这一空白,在本文中,我们提出了第一个独立于底层LLM的RAG的全面FCE基准Face4RAG。我们的基准包括一个建立在针对真实性不一致性错误的精心设计的类型学基础上的合成数据集,以及一个由六个常用的LLM构建的真实世界数据集,从而能够对特定错误类型或真实世界错误分布的FCE方法进行评估。在提出的基准测试中,我们发现了现有FCE方法在检测逻辑谬误方面的失败,逻辑谬误指的是答案和检索到的引用之间的逻辑结构不匹配。为了解决这个问题,我们进一步提出了一种新的方法,称为L-Face4RAG,它具有两个新的设计,即保留逻辑的答案分解和事实逻辑FCE。广泛的实验表明,L-Face4RAG在许多任务上的事实不一致性检测性能大大优于以前的方法,特别是在它最初动机所在的RAG任务之外。基准和我们提出的方法均已公开,可通过此 https URL 访问。

[NLP-62] Human-like object concept representations emerge naturally in multimodal large language models
[NLP-62] 类人对象概念表示在多模式大型语言模型中自然出现

链接: https://arxiv.org/abs/2407.01067
作者: Changde Du,Kaicheng Fu,Bincheng Wen,Yi Sun,Jie Peng,Wei Wei,Ying Gao,Shengpei Wang,Chuncheng Zhang,Jinpeng Li,Shuang Qiu,Le Chang,Huiguang He
关键词: offering crucial insights, Large Language Models, intrigued cognitive scientists, long intrigued cognitive, scientists and neuroscientists
中文关键词: 提供重要见解、大型语言模型、引起兴趣的认知科学家、长期引起兴趣的认知科学家和神经科学家
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

Abstract:The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.
摘要:人类头脑中自然物体的概念化和分类长期以来一直是认知科学家和神经科学家的兴趣所在,为人类的感知和认知提供了至关重要的见解。近年来,大型语言模型(LLM)的快速发展提出了一个引人注目的问题,即这些模型是否也可以通过接触大量的语言和多模态数据来形成类似于人类的对象表征。在这项研究中,我们结合行为和神经成像分析方法来揭示LLM中的对象概念表征如何与人类的概念表征相关联。通过从LLM和多模态LLM(MLLM)收集470万个三元组判断的大规模数据集,我们能够推导出低维嵌入,这些嵌入捕捉了1854个自然对象的潜在相似结构。由此得到的66维嵌入被发现具有高度的稳定性和预测性,并表现出类似于人类心理表征的语义聚类。有趣的是,这些嵌入背后的维度的可解释性表明,LLM和MLLM已经形成了对自然对象的类似于人类的概念表征。进一步的分析表明,在许多功能定义的脑ROI(如EBA、PPA、RSC和FFA)中,识别的模型嵌入与神经活动模式之间具有很强的一致性。这提供了令人信服的证据,表明LLM中的对象表征虽然与人类中的对象表征不完全相同,但具有基本的共性,反映了人类概念知识的关键图式。这项研究促进了我们对机器智能的理解,并为更多类似人类的人工认知系统的发展提供了信息。

[NLP-63] Development of Cognitive Intelligence in Pre-trained Language Models
[NLP-63] 预训练语言模型中认知智能的发展

链接: https://arxiv.org/abs/2407.01047
作者: Raj Sanjay Shah,Khushi Bhardwaj,Sashank Varma
关键词: Large Pre-trained Language, Pre-trained Language Models, Recent studies show, Large Pre-trained, Pre-trained Language
中文关键词: 大型预训练语言,预训练语言模型,最近的研究表明,大型预训练语言,预训练语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent studies show evidence for emergent cognitive abilities in Large Pre-trained Language Models (PLMs). The increasing cognitive alignment of these models has made them candidates for cognitive science theories. Prior research into the emergent cognitive abilities of PLMs has largely been path-independent to model training, i.e., has focused on the final model weights and not the intermediate steps. However, building plausible models of human cognition using PLMs would benefit from considering the developmental alignment of their performance during training to the trajectories of children’s thinking. Guided by psychometric tests of human intelligence, we choose four sets of tasks to investigate the alignment of ten popular families of PLMs and evaluate their available intermediate and final training steps. These tasks are Numerical ability, Linguistic abilities, Conceptual understanding, and Fluid reasoning. We find a striking regularity: regardless of model size, the developmental trajectories of PLMs consistently exhibit a window of maximal alignment to human cognitive development. Before that window, training appears to endow “blank slate” models with the requisite structure to be poised to rapidly learn from experience. After that window, training appears to serve the engineering goal of reducing loss but not the scientific goal of increasing alignment with human cognition.
摘要:最近的研究表明,大型预训练语言模型(PLM)中出现了涌现的认知能力。这些模型与人类认知日益增强的一致性,使其成为认知科学理论的候选对象。以往对PLM涌现认知能力的研究在很大程度上与模型训练的路径无关,即专注于最终的模型权重,而不是中间步骤。然而,使用PLM建立可信的人类认知模型将受益于考虑它们在训练期间的表现与儿童思维轨迹的发展一致性。在人类智力心理测量学测试的指导下,我们选择了四组任务来考察十个流行PLM系列的一致性,并评估它们可用的中间和最终训练步骤。这些任务是数值能力、语言能力、概念理解和流体推理。我们发现了一个惊人的规律:无论模型大小如何,PLM的发展轨迹始终显示出一个与人类认知发展最大程度一致的窗口。在这一窗口之前,训练似乎赋予了“白板”模型必要的结构,使其能够迅速从经验中学习。在这一窗口之后,训练似乎服务于降低损失的工程目标,而不是提高与人类认知一致性的科学目标。

[NLP-64] FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models
[NLP-64] FRoG:评估大型语言模型中广义量词的模糊推理

链接: https://arxiv.org/abs/2407.01046
作者: Yiyuan Li,Shichao Sun,Pengfei Liu
关键词: daily contexts, vital due, imprecise information, information in daily, Fuzzy reasoning
中文关键词: 日常上下文、重要原因、不精确信息、日常信息、模糊推理
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

Abstract:Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.
摘要:由于日常环境中经常使用不精确的信息,模糊推理至关重要。然而,当前大型语言模型(LLM)处理此类推理的能力在很大程度上仍然未知。在本文中,我们引入了一个新的模糊推理基准FRoG,其特点是包含广义量词的现实世界数学应用题。我们的实验结果表明,模糊推理继续给LLM带来重大挑战。此外,我们发现旨在增强推理的现有方法并不能始终如一地提高涉及模糊逻辑的任务的性能。此外,我们的结果显示了LLM在FRoG上的性能存在逆缩放效应。有趣的是,我们还证明,强大的数学推理能力并不一定意味着能在我们的基准上取得成功。

[NLP-65] PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs
[NLP-65] PocketLLM:为个性化LLM启用设备上微调

链接: https://arxiv.org/abs/2407.01031
作者: Dan Peng,Zhihui Fu,Jun Wang
关键词: Recent advancements, large language models, impressive capabilities, advancements in large, large language
中文关键词: 最近的进步、大型语言模型、令人印象深刻的功能、大型语言的进步
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to the ACL 2024 Workshop on Privacy in Natural Language Processing (PrivateNLP)

Abstract:Recent advancements in large language models (LLMs) have indeed showcased their impressive capabilities. On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly due to the memory-intensive nature of derivative-based optimization required for saving gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLM, even on memory-limited mobile devices. Empirical results demonstrate that the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone using around 4GB and 6.5GB of memory respectively, using derivative-free optimization techniques. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy.
摘要:大型语言模型(LLM)的最新进展确实展示了它们令人印象深刻的能力。在移动设备上,每天产生的宝贵的非公开数据为本地微调个性化LLM提供了巨大的希望,同时通过设备上的处理保持隐私。然而,移动设备资源的限制给直接在设备上进行LLM微调带来了挑战,这主要是因为基于导数的优化需要保存梯度和优化器状态,内存开销很大。为了解决这个问题,我们建议使用免导数优化技术来在设备上实现LLM的微调,即使在内存有限的移动设备上也是如此。实验结果表明,借助免导数优化技术,RoBERTa-large模型和OPT-1.3B可以在OPPO Reno 6智能手机上进行本地微调,分别使用约4 GB和6.5 GB的内存。这突出了在移动设备上进行设备上LLM微调的可行性,为在资源受限的设备上实现个性化LLM铺平了道路,同时保护了数据隐私。
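
As a rough illustration of the PocketLLM entry above: the abstract says only that derivative-free optimization replaces gradient-based training, and this digest does not show which optimizer the authors use. The sketch below therefore assumes one common choice, a two-point zeroth-order estimate along a random Gaussian direction. The names `zo_step` and `quadratic_loss`, the hyperparameters, and the toy quadratic standing in for a model loss are all hypothetical.

```python
import random


def zo_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order update; only forward evaluations of loss_fn are used."""
    rng = random.Random(seed)
    # Random perturbation direction, reproducible from the seed.
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Scalar estimate of the directional derivative of the loss along z.
    g = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    # Move against the estimated slope along the same direction.
    return [p - lr * g * zi for p, zi in zip(params, z)]


def quadratic_loss(w):
    # Toy stand-in for a model loss (assumption), minimised at (1, -2).
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


w = [0.0, 0.0]
for step in range(2000):
    w = zo_step(w, quadratic_loss, seed=step)
print(quadratic_loss(w))
```

Each step needs two loss evaluations and one scalar, so no backpropagated gradients or optimizer states have to be stored, which is the memory argument the abstract makes. For readability the sketch keeps the direction `z` in memory; it could be regenerated from the seed instead.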

[NLP-66] Augmenting Document-level Relation Extraction with Efficient Multi-Supervision
[NLP-66] 通过高效的多重监督增强文档级关系提取

链接: https://arxiv.org/abs/2407.01026
作者: Xiangyu Lin,Weijia Jia,Zhiguo Gong
关键词: low information density, distantly supervised data, document-level relation extraction, relation extraction due, sentence-level relation extraction
中文关键词: 低信息密度、远程监督数据、文档级关系提取、关系提取到期、句子级关系提取
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Despite its popularity in sentence-level relation extraction, distantly supervised data is rarely utilized by existing work in document-level relation extraction due to its noisy nature and low information density. Among its current applications, distantly supervised data is mostly used as a whole for pretraining, which is of low time efficiency. To fill in the gap of efficient and robust utilization of distantly supervised training data, we propose Efficient Multi-Supervision for document-level relation extraction, in which we first select a subset of informative documents from the massive dataset by combining distant supervision with expert supervision, then train the model with Multi-Supervision Ranking Loss that integrates the knowledge from multiple sources of supervision to alleviate the effects of noise. The experiments demonstrate the effectiveness of our method in improving the model performance with higher time efficiency than existing baselines.
摘要:尽管远程监督数据在句子级关系提取中很受欢迎,但由于其噪声大、信息密度低,文档级关系提取的现有工作很少利用远程监督数据。在其当前的应用中,远程监督数据大多被整体用于预训练,时间效率较低。为了填补远程监督训练数据高效、稳健利用的空白,我们提出了用于文档级关系提取的高效多监督(Efficient Multi-Supervision)方法:首先通过将远程监督与专家监督相结合,从海量数据集中选择信息量大的文档子集,然后使用多监督排序损失(Multi-Supervision Ranking Loss)训练模型,该损失整合来自多个监督来源的知识,以减轻噪声的影响。实验证明了我们的方法在提高模型性能方面的有效性,且时间效率比现有基线更高。

[NLP-67] DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models
[NLP-67] DynaThink:快还是慢?大型语言模型的动态决策框架

链接: https://arxiv.org/abs/2407.01009
作者: Jiabao Pan,Yan Zhang,Chen Zhang,Zuozhu Liu,Hongwei Wang,Haizhou Li
关键词: Large language models, Large language, demonstrated emergent capabilities, language models, fast COT approach
中文关键词: 大型语言模型、大型语言、演示的紧急能力、语言模型、快速COT方法
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via popular Chains-of-Thought (COT) prompting. However, such a simple and fast COT approach often encounters limitations in dealing with complicated problems, while a thorough method, which considers multiple reasoning pathways and verifies each step carefully, results in slower inference. This paper addresses the challenge of enabling LLMs to autonomously select between fast and slow inference methods, thereby optimizing both efficiency and effectiveness. We introduce a dynamic decision-making framework that categorizes tasks into two distinct pathways: ‘Fast’, designated for tasks where the LLM quickly identifies a high-confidence solution, and ‘Slow’, allocated for tasks that the LLM perceives as complex and for which it has low confidence in immediate solutions as well as requiring more reasoning paths to verify. Experiments on five popular reasoning benchmarks demonstrated the superiority of the DynaThink over baselines.
摘要:大型语言模型(LLM)通过流行的思维链(CoT)提示,在不同的推理任务中表现出了涌现的能力。然而,这种简单、快速的CoT方法在处理复杂问题时往往会遇到局限性,而考虑多条推理路径并仔细验证每一步的彻底方法会导致推理速度较慢。本文解决了使LLM能够在快速和慢速推理方法之间自主选择的挑战,从而同时优化效率和有效性。我们引入了一个动态决策框架,将任务分类为两条不同的路径:“快速”,指定给LLM能够快速识别高置信度解决方案的任务;“慢速”,分配给LLM认为复杂、对即时解决方案信心较低且需要更多推理路径来验证的任务。在五个流行的推理基准上的实验证明了DynaThink相对于基线的优越性。
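
The fast/slow split in the DynaThink entry above can be pictured with a small routing function. The abstract does not say how confidence is measured, so this sketch assumes agreement among several sampled answers as a stand-in. The function name `route` and the `min_agreement` threshold are hypothetical, not taken from the paper.

```python
from collections import Counter


def route(sampled_answers, min_agreement=0.6):
    """Return ('fast', answer) when one answer clearly dominates, else ('slow', None)."""
    if not sampled_answers:
        return "slow", None
    # Most frequent answer and how many samples voted for it.
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    if votes / len(sampled_answers) >= min_agreement:
        return "fast", answer
    # Low agreement: defer to the slower path with more reasoning samples.
    return "slow", None


print(route(["42", "42", "42", "17", "42"]))  # 4 of 5 samples agree
print(route(["42", "17", "8", "42", "3"]))    # no answer reaches the threshold
```

A question routed to "slow" would then receive additional reasoning paths before an answer is committed. How many paths, and how they are verified, is left open here.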

[NLP-68] Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components
[NLP-68] 工程对话搜索系统:应用、架构和功能组件综述

链接: https://arxiv.org/abs/2407.00997
作者: Phillip Schneider,Wessel Poelman,Michael Rovatsos,Florian Matthes
关键词: multiple dialogue turns, Conversational search systems, users’ information gain, maximizing users’ information, enable information retrieval
中文关键词: 多次对话回合、对话式搜索系统、用户信息获得、最大化用户信息、实现信息检索
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 NLP4ConvAI Workshop

Abstract:Conversational search systems enable information retrieval via natural language interactions, with the goal of maximizing users’ information gain over multiple dialogue turns. The increasing prevalence of conversational interfaces adopting this search paradigm challenges traditional information retrieval approaches, stressing the importance of better understanding the engineering process of developing these systems. We undertook a systematic literature review to investigate the links between theoretical studies and technical implementations of conversational search systems. Our review identifies real-world application scenarios, system architectures, and functional components. We consolidate our results by presenting a layered architecture framework and explaining the core functions of conversational search systems. Furthermore, we reflect on our findings in light of the rapid progress in large language models, discussing their capabilities, limitations, and directions for future research.
摘要:会话搜索系统通过自然语言交互实现信息检索,其目标是在多次对话中最大化用户的信息收益。采用这种搜索模式的对话界面越来越普遍,这对传统的信息检索方法提出了挑战,强调了更好地了解开发这些系统的工程过程的重要性。我们进行了系统的文献回顾,以调查对话搜索系统的理论研究和技术实现之间的联系。我们的审查确定了真实世界的应用程序场景、系统架构和功能组件。我们通过提出分层架构框架和解释对话式搜索系统的核心功能来巩固我们的结果。此外,鉴于大型语言模型的快速发展,我们反思了我们的发现,讨论了它们的能力、局限性和未来研究的方向。

[NLP-69] Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?
[NLP-69] 小型语言模型能否学习、遗忘并保留噪声模式?

链接: https://arxiv.org/abs/2407.00996
作者: Nicy Scaria,Silvester John Joseph Kennedy,Deepak Subramani
关键词: Small Language Models, large language models, Language Models, Small Language, large language
中文关键词: 小语言模型,大语言模型,语言模型,小语言,大语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Small Language Models (SLMs) are generally considered to be more compact versions of large language models (LLMs), typically having fewer than 7 billion parameters. This study investigates the ability of small language models to learn, retain, and subsequently eliminate noise that is typically not found on the internet, where most pretraining datasets are sourced. For this, four pre-trained SLMs were utilized: Olmo 1B, Qwen1.5 1.8B, Gemma 2B, and Phi2 2.7B. The models were instruction-tuned without noise and tested for task execution with in-context learning. Afterward, noise patterns were introduced to evaluate the models’ learning and unlearning capabilities. We evaluated the models’ performance at various training levels. Phi consistently excelled with word-level noise but performed the worst with character-level noise. Despite being the smallest with approximately 1 billion parameters, Olmo performed consistently well on tasks.
摘要:小型语言模型(SLM)通常被认为是大型语言模型(LLM)的更紧凑版本,通常参数少于70亿个。这项研究调查了小型语言模型学习、保留并随后消除噪声的能力,这类噪声通常不会出现在作为大多数预训练数据集来源的互联网上。为此,使用了四个预训练的SLM:Olmo 1B、Qwen1.5 1.8B、Gemma 2B和Phi2 2.7B。这些模型先在无噪声的情况下进行指令微调,并通过上下文学习测试任务执行情况。随后,引入噪声模式来评估模型的学习和遗忘能力。我们评估了模型在不同训练水平下的表现。Phi在单词级噪声方面一直表现出色,但在字符级噪声方面表现最差。尽管Olmo是最小的模型,参数约为10亿个,但它在各项任务中始终表现良好。

[NLP-70] LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation
[NLP-70] 通过有向蕴含图和声明级响应增强进行LLM不确定性量化

链接: https://arxiv.org/abs/2407.00994
作者: Longchao Da,Tiejin Chen,Lu Cheng,Hua Wei
关键词: Large language models, showcased superior capabilities, Large language, stemming from basic, basic question-answer
中文关键词: 大型语言模型,展示了卓越的能力,大型语言,源于基本的、基本的问答
类目: Computation and Language (cs.CL)
备注: 11 pages main content, 5 pages appendix

Abstract:Large language models (LLMs) have showcased superior capabilities in sophisticated tasks across various domains. Having started from basic question-answer (QA) tasks, they are nowadays used as decision assistants or explainers for unfamiliar content. However, they are not always correct, owing to data sparsity in domain-specific corpora or to the model’s hallucination problems. Given this, how much should we trust the responses from LLMs? This paper presents a novel way to evaluate uncertainty that captures directional instability by constructing a directed graph from entailment probabilities. Given the asymmetric property of the constructed directed graph, we apply the Random Walk Laplacian, and the uncertainty is then aggregated from the eigenvalues derived from the Laplacian process. We also provide a way to incorporate the existing work’s semantics uncertainty with our proposed layer. In addition, this paper identifies the vagueness issues in the raw response set and proposes an augmentation approach to mitigate such a problem. We conducted extensive empirical experiments and demonstrated the superiority of our proposed solutions.
摘要:大型语言模型(LLM)在各种领域的复杂任务中显示出了优越的能力;从基本的问答(QA)起步,它们现在被用作决策助手或陌生内容的解释器。然而,由于特定领域语料库中的数据稀疏,或者模型的幻觉问题,它们并不总是正确的。有鉴于此,我们应该在多大程度上信任LLM的回应?本文提出了一种新的方法来评估反映方向不稳定性的不确定性:先根据蕴含概率构造有向图,再针对所构造有向图的非对称性质对其进行随机游走拉普拉斯运算,然后利用从拉普拉斯过程得到的特征值来聚合不确定性。我们还提供了一种将现有工作的语义不确定性合并到我们提议的层中的方法。此外,本文识别了原始响应集合中的模糊问题,并提出了一种增强方法来缓解这一问题。我们进行了大量的实证实验,证明了我们所提出的解决方案的优越性。

[NLP-71] Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
[NLP-71] Mobile-Bench:基于LLM的移动代理的评估基准

链接: https://arxiv.org/abs/2407.00993
作者: Shihan Deng,Weikai Xu,Hongda Sun,Wei Liu,Tao Tan,Jianfeng Liu,Ang Li,Jian Luan,Bin Wang,Rui Yan,Shuo Shang
关键词: large language models, LLM-based mobile agents, mobile agents, language models, human-computer interaction
中文关键词: 大型语言模型、基于LLM的移动代理、移动代理、语言模型、人机交互
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:With the remarkable advancements of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a scarcity of benchmarks available for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations imposes limitations to task evaluation. (2) Specific instructions within a singular application lack adequacy for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics are insufficient to accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to accelerate the efficiency of task completion. Subsequently, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capabilities for mobile agents, our data is categorized into three distinct groups: SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. Furthermore, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
摘要:随着大型语言模型(LLM)的显著进步,基于LLM的智能体已成为人机交互领域的研究热点。然而,基于LLM的移动代理缺乏可用的基准。对这些代理进行基准测试通常面临三个主要挑战:(1)仅限用户界面的操作效率低下,对任务评估造成了限制。(2)单一应用程序中的特定指令不足以评估LLM移动代理的多维推理和决策能力。(3)现有的评价指标不足以准确地评估顺序动作的过程。为此,我们提出了Mobile-Bench,一个用于评估基于LLM的移动代理能力的新基准。首先,我们通过整合103个收集的API来扩展传统的UI操作,以提高任务完成的效率。随后,我们通过将真实用户查询与LLM的扩充相结合来收集评估数据。为了更好地评估移动代理的不同规划能力级别,我们的数据被分为三个不同的组:SAST、SAMT和MAMT,反映了不同级别的任务复杂性。Mobile-Bench包含832个数据条目,其中200多个任务专门设计用于评估多应用协作场景。此外,我们引入了一种更准确的评估指标,称为CheckPoint,用于评估基于LLM的移动代理在规划和推理过程中是否到达关键点。

[NLP-72] VisEval: A Benchmark for Data Visualization in the Era of Large Language Models
[NLP-72] VisEval:大型语言模型时代数据可视化的基准

链接: https://arxiv.org/abs/2407.00981
作者: Nan Chen,Yuge Zhang,Jiahang Xu,Kan Ren,Yuqing Yang
关键词: Translating natural language, visual data analysis, shown great promise, natural language processing, Translating natural
中文关键词: 翻译自然语言,视觉数据分析,表现出巨大的前景,自然语言处理,翻译自然
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

Abstract:Translating natural language to visualization (NL2VIS) has shown great promise for visual data analysis, but it remains a challenging task that requires multiple low-level implementations, such as natural language processing and visualization design. Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language. However, the lack of a comprehensive and reliable benchmark hinders our understanding of LLMs’ capabilities in visualization generation. In this paper, we address this gap by proposing a new NL2VIS benchmark called VisEval. Firstly, we introduce a high-quality and large-scale dataset. This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths. Secondly, we advocate for a comprehensive automated evaluation methodology covering multiple dimensions, including validity, legality, and readability. By systematically scanning for potential issues with a number of heterogeneous checkers, VisEval provides reliable and trustworthy evaluation outcomes. We run VisEval on a series of state-of-the-art LLMs. Our evaluation reveals prevalent challenges and delivers essential insights for future advancements.
摘要:自然语言到可视化的转换(NL2VIS)在可视化数据分析方面显示出巨大的潜力,但它仍然是一项具有挑战性的任务,需要自然语言处理和可视化设计等多种底层实现。最近在预先训练的大型语言模型(LLM)中的进展为从自然语言中生成可视化开辟了新的途径。然而,缺乏一个全面可靠的基准,阻碍了我们对LLMS在可视化生成方面的能力的理解。在本文中,我们通过提出一种新的NL2VIS基准来解决这一差距,该基准称为VisEval。首先,我们介绍了一个高质量、大规模的数据集。该数据集包括2,524个代表性查询,涉及146个数据库,并与准确标记的基本事实配对。其次,我们提倡一种涵盖多个维度的全面的自动化评估方法,包括有效性、合法性和可读性。通过使用多个不同的检查器系统地扫描潜在问题,VisEval提供了可靠和值得信赖的评估结果。我们在一系列最先进的LLM上运行VisEval。我们的评估揭示了普遍存在的挑战,并为未来的发展提供了重要的见解。

[NLP-73] Universal Approximation Theory: The basic theory for large language models
[NLP-73] 普遍逼近理论:大型语言模型的基本理论

链接: https://arxiv.org/abs/2407.00958
作者: Wei Wang,Qing Li
关键词: artificial intelligence, innovations like ChatGPT, area of focus, focus in artificial, introduction of groundbreaking
中文关键词: 人工智能、ChatGPT等创新、重点领域、专注于人工、开创性的引入
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains-the theoretical foundations of large language models (LLMs). What makes Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs’ ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality of pruning LLMs? To address these critical questions and explore the technological strategies within LLMs, we leverage the Universal Approximation Theory (UAT) to offer a theoretical backdrop, shedding light on the mechanisms that underpin these advancements.
摘要:语言模型已经成为人工智能领域的一个关键焦点,特别是随着ChatGPT等突破性创新的引入。大规模的Transformer网络已经迅速成为推进自然语言处理算法的主要方法。这些模型构建在Transformer架构之上,能够实现高度模拟人类交流的交互,并凭借丰富的知识,甚至可以帮助指导人类任务。尽管它们的能力令人印象深刻,而且日益复杂,但一个关键问题仍然存在:大型语言模型(LLM)的理论基础。是什么让Transformer在支持智能语言应用(如翻译和编码)方面如此有效?LLM的上下文学习(ICL)能力的基础是什么?LoRA方案如何增强LLM的微调?什么支持剪枝LLM的实用性?为了解决这些关键问题并探索LLM中的技术策略,我们利用通用近似理论(UAT)提供了一个理论背景,揭示了支撑这些进步的机制。

[NLP-74] SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models
[NLP-74] SplitLoRA:用于大型语言模型的分离参数高效微调框架

链接: https://arxiv.org/abs/2407.00952
作者: Zheng Lin,Xuanjie Hu,Yuxin Zhang,Zhe Chen,Zihan Fang,Xianhao Chen,Ang Li,Praneeth Vepakomma,Yue Gao
关键词: LLM fine-tuning, LLM fine-tuning paradigm, LLM, large language models, handling high-complexity models
中文关键词: LLM微调,LLM微调范式,LLM,大型语言模型,处理高复杂性模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 9 pages, 3 figures

Abstract:The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm recently has been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activation/activation’s gradients with smaller data sizes rather than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at this https URL.
摘要:大型语言模型(LLM)在处理高复杂性模型和大规模数据集方面的可扩展性带来了在关键领域的巨大成功。虽然迫切需要为LLM获取更多的训练数据,但一个令人担忧的现实是,高质量的公共数据集在几年内就会枯竭。有鉴于此,最近提出了联邦学习(FL)LLM微调范式,以促进分布式私有数据上的协作式LLM微调,其中多个数据所有者在不共享原始数据的情况下协作微调共享的LLM。然而,LLM惊人的模型规模给客户端带来了沉重的计算和通信负担,对FL LLM微调范式的普及构成了重大障碍。为了解决这个问题,分割学习(SL)已经成为一种有前途的解决方案,它通过模型分区将主要训练工作负载卸载到服务器,同时只交换数据量较小的激活值及其梯度,而不是整个LLM。遗憾的是,对SL LLM微调范式的研究仍处于初级阶段。为了填补这一空白,本文提出了第一个SL LLM微调框架SplitLoRA。SplitLoRA建立在分割联邦学习(SFL)框架上,融合了FL并行训练和SL模型分割的优点,极大地提高了训练效率。值得注意的是,SplitLoRA是第一个用于SL LLM微调的开源基准,为致力于推进SL LLM微调的研究工作提供了基础。大量的仿真验证了SplitLoRA比最先进的LLM微调框架在显著更少的时间内达到目标精度,展示了SplitLoRA优越的训练性能。该项目页面可通过此 https URL 访问。

[NLP-75] The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs
[NLP-75] 庄家永远是赢家:评估LLM中战略欺骗的框架

链接: https://arxiv.org/abs/2407.00948
作者: Tanush Chopra,Michael Li
关键词: large language models, language models, large language, evaluating strategic deception, fair play
中文关键词: 大型语言模型,语言模型,大型语言,评估战略欺骗,公平竞争
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Research conducted at the Deception Detection Hackathon 2024 hosted by Apart Apollo Research

Abstract:We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because neither the action space nor the strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the “house.” Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.
摘要:我们提出了一个评估大型语言模型(LLM)中战略欺骗的框架。在这个框架中,LLM在两种情况下充当游戏主持人:一种是采用随机游戏机制,另一种是可以在随机或故意动作之间进行选择。作为示例,我们使用二十一点,因为其动作空间和策略都不涉及欺骗。我们在二十一点中对Llama3-70B、GPT-4-Turbo和Mixtral进行基准测试,将结果与公平竞争中的预期分布进行比较,以确定LLM是否会形成有利于“庄家”的策略。我们的研究结果表明,当给予隐性随机性指令时,LLM表现出与公平竞争的显著偏差,这表明在模糊场景中存在战略操纵的倾向。然而,当提出明确的选择时,LLM在很大程度上遵守公平竞争,这表明指令的表述方式在引发或减轻人工智能系统中潜在的欺骗行为方面发挥着至关重要的作用。
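
Comparing dealt outcomes against a fair-play distribution, as the entry above describes, can be illustrated with a goodness-of-fit statistic. The abstract does not name the test that was used, so this sketch assumes a Pearson chi-square statistic. The tally of 100 cards is made up for illustration, and the 4/13 probability of a ten-valued card assumes a single draw from a full 52-card deck.

```python
def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic of observed counts against fair-play probabilities."""
    total = sum(observed.values())
    stat = 0.0
    for outcome, prob in expected_probs.items():
        expected = total * prob
        stat += (observed.get(outcome, 0) - expected) ** 2 / expected
    return stat


# Hypothetical tally of 100 cards dealt by an LLM game master.
observed = {"ten_valued": 40, "other": 60}
# Fair-play probabilities for one draw from a full 52-card deck (16 of 52 are ten-valued).
fair = {"ten_valued": 4 / 13, "other": 9 / 13}

stat = chi_square_statistic(observed, fair)
# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
print(stat > 3.841)
```

With these made-up counts the statistic is about 4.0. That is just above the 5% critical value, so a deviation of this size would be flagged as unlikely under fair dealing.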

[NLP-76] ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions
[NLP-76] ProductAgent:通过提出澄清问题来对对话式产品搜索代理进行基准测试

链接: https://arxiv.org/abs/2407.00942
作者: Jingheng Ye,Yong Jiang,Xiaobin Wang,Yinghui Li,Yangning Li,Hai-Tao Zheng,Pengjun Xie,Fei Huang
关键词: tailored product searching, clarification question generation, strategic clarification question, e-commercial scenario, paper introduces
中文关键词: 定制产品搜索、澄清问题生成、战略澄清问题、电子商务场景、论文介绍
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 13 tables, 6 figures. Under review

Abstract:This paper introduces the task of product demand clarification within an e-commercial scenario, where the user commences the conversation with ambiguous queries and the task-oriented agent is designed to achieve more accurate and tailored product searching by asking clarification questions. To address this task, we propose ProductAgent, a conversational information seeking agent equipped with abilities of strategic clarification question generation and dynamic product retrieval. Specifically, we develop the agent with strategies for product feature summarization, query generation, and product retrieval. Furthermore, we propose the benchmark called PROCLARE to evaluate the agent’s performance both automatically and qualitatively with the aid of an LLM-driven user simulator. Experiments show that ProductAgent interacts positively with the user and enhances retrieval performance with increasing dialogue turns, where user demands become gradually more explicit and detailed. All the source codes will be released after the review anonymity period.
摘要:本文介绍了电子商务场景中的产品需求澄清任务,其中用户以含糊的问题开始对话,面向任务的代理被设计为通过提出澄清问题来实现更准确和定制的产品搜索。为了解决这一问题,我们提出了一种对话式信息搜索代理ProductAgent,它具有策略澄清问题生成和动态产品检索的能力。具体地说,我们开发了具有产品特征摘要、查询生成和产品检索策略的代理。此外,我们还提出了一个称为PROCLARE的基准测试程序,它可以在LLM驱动的用户模拟器的帮助下,自动和定性地评估代理的性能。实验表明,随着对话次数的增加,ProductAgent与用户进行了积极的交互,提高了检索性能,用户的需求逐渐变得更加明确和详细。所有源代码将在审查匿名期后发布。

[NLP-77] MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities
[NLP-77] MalAlgoQA:评估反事实推理能力的教学方法

链接: https://arxiv.org/abs/2407.00938
作者: Naiming Liu,Shashank Sonkar,Myco Le,Richard Baraniuk
关键词: Large Language Models, Large Language, capabilities of Large, answer rationale identification, paper introduces MalAlgoQA
中文关键词: 大型语言模型、大型语言、大型功能、答案原理识别、论文介绍MalAlgoQA
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

Abstract:This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models (LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by four answer choices and their corresponding rationales. We focus on the incorrect answer rationales, termed “malgorithms”, which highlights flawed reasoning steps leading to incorrect answers and offers valuable insights into erroneous thought processes. We also propose the Malgorithm Identification task, where LLMs are assessed based on their ability to identify corresponding malgorithm given an incorrect answer choice. To evaluate the model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct answer rationale identification, and Malgorithm Identification Accuracy (MIA) for incorrect answer rationale identification. The task is challenging since state-of-the-art LLMs exhibit significant drops in MIA as compared to AIA. Moreover, we find that the chain-of-thought prompting technique not only fails to consistently enhance MIA, but can also lead to underperformance compared to simple prompting. These findings hold significant implications for the development of more cognitively-inspired LLMs to improve their counterfactual reasoning abilities, particularly through a pedagogical perspective where understanding and rectifying student misconceptions are crucial.
摘要:本文介绍了一个新的数据集MalAlgoQA,该数据集旨在通过教学方法评估大型语言模型(LLM)的反事实推理能力。数据集包括数学和阅读理解问题,每个问题都附有四个答案选项及其相应的理由。我们专注于错误答案的理由,称为“malgorithm”(错误算法),它突出了导致错误答案的有缺陷的推理步骤,并提供了对错误思维过程的有价值的见解。我们还提出了错误算法识别(Malgorithm Identification)任务,根据LLM在给定错误答案选项的情况下识别相应错误算法的能力对其进行评估。为了评估模型的性能,我们引入了两个指标:用于正确答案理由识别的算法识别准确率(AIA),以及用于错误答案理由识别的错误算法识别准确率(MIA)。这项任务具有挑战性,因为与AIA相比,最先进的LLM在MIA上表现出显著下降。此外,我们发现,思维链(chain-of-thought)提示技术不仅不能稳定地提高MIA,而且与简单提示相比还可能导致表现更差。这些发现对于开发更多受认知启发的LLM以提高其反事实推理能力具有重要意义,特别是从教学角度来看,理解和纠正学生的错误概念至关重要。
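
The two metrics in this abstract can be illustrated with a minimal sketch. It assumes each evaluation record stores whether the presented answer choice is correct, the gold rationale ID, and the rationale ID the model selected. The `Record` layout and the function name `identification_accuracy` are hypothetical and are not taken from the paper.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    # Hypothetical record layout, for illustration only.
    choice_is_correct: bool   # True if the presented answer choice is the right answer
    gold_rationale: str       # ID of the rationale that truly matches the choice
    predicted_rationale: str  # ID of the rationale the model selected


def identification_accuracy(records: Iterable[Record], correct_choices: bool) -> float:
    """Share of records of the requested kind whose rationale was identified.

    correct_choices=True gives an AIA-style score, False an MIA-style score.
    """
    subset = [r for r in records if r.choice_is_correct == correct_choices]
    if not subset:
        raise ValueError("no records of the requested kind")
    hits = sum(r.predicted_rationale == r.gold_rationale for r in subset)
    return hits / len(subset)


# Made-up records: two correct choices, two incorrect choices.
records = [
    Record(True, "r1", "r1"),
    Record(True, "r2", "r2"),
    Record(False, "m1", "m1"),
    Record(False, "m2", "m3"),
]
aia = identification_accuracy(records, correct_choices=True)
mia = identification_accuracy(records, correct_choices=False)
```

Reporting the two scores separately is what exposes the gap the abstract describes: a model can identify rationales for correct answers well while struggling on the incorrect ones.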

[NLP-78] Large Language Model Enhanced Knowledge Representation Learning: A Survey
[NLP-78] 大语言模型增强知识表示学习:综述

链接: https://arxiv.org/abs/2407.00936
作者: Xin Wang,Zirui Chen,Haofen Wang,Leong Hou U,Zhao Li,Wenbin Guo
关键词: Large Language Models, Knowledge Representation Learning, complex knowledge structures, Large Language, utilize complex knowledge
中文关键词: 大型语言模型、知识表示学习、复杂知识结构、大型语言、利用复杂知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The integration of Large Language Models (LLMs) with Knowledge Representation Learning (KRL) signifies a pivotal advancement in the field of artificial intelligence, enhancing the ability to capture and utilize complex knowledge structures. This synergy leverages the advanced linguistic and contextual understanding capabilities of LLMs to improve the accuracy, adaptability, and efficacy of KRL, thereby expanding its applications and potential. Despite the increasing volume of research focused on embedding LLMs within the domain of knowledge representation, a thorough review that examines the fundamental components and processes of these enhanced models is conspicuously absent. Our survey addresses this by categorizing these models based on three distinct Transformer architectures, and by analyzing experimental data from various KRL downstream tasks to evaluate the strengths and weaknesses of each approach. Finally, we identify and explore potential future research directions in this emerging yet underexplored domain, proposing pathways for continued progress.
摘要:大型语言模型(LLM)与知识表示学习(KRL)的结合标志着人工智能领域的一大进步,增强了捕捉和利用复杂知识结构的能力。这种协同利用了LLM的高级语言和上下文理解能力,以提高KRL的准确性、适应性和有效性,从而扩大其应用和潜力。尽管越来越多的研究集中于将LLM嵌入知识表示领域,但明显缺乏对这些增强模型的基本组成部分和过程的全面综述。我们的综述根据三种不同的Transformer架构对这些模型进行分类,并通过分析来自各种KRL下游任务的实验数据来评估每种方法的优缺点,从而填补了这一空白。最后,我们确定并探索了这一新兴但尚未充分探索的领域的潜在未来研究方向,提出了继续取得进展的途径。

[NLP-79] Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining
[NLP-79] 向前看还是环顾四周?自回归与掩蔽预训练的理论比较

链接: https://arxiv.org/abs/2407.00935
作者: Qi Zhang,Tianqi Du,Haotian Huang,Yifei Wang,Yisen Wang
关键词: masked SSL, SSL, generative self-supervised learning, autoregressive SSL, generative SSL paradigms
中文关键词: 屏蔽SSL、SSL、生成性自我监督学习、自回归SSL、生成性SSL范式
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other’s strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at this https URL.
摘要:近年来,生成式自监督学习(SSL)范式的兴起在视觉、语言和多模态领域表现出了令人印象深刻的性能。虽然生成式SSL目标的不同设计导致了下游任务中的不同特性,但对这些差异的理论理解在很大程度上仍未被探索。在本文中,我们首次对两种主要的生成式SSL范式(自回归SSL和掩蔽SSL)进行了理论比较。通过建立理论框架,我们阐明了自回归SSL和掩蔽SSL在分类和内容生成这两项主要评估任务中的优势和局限性。我们的发现表明,在分类任务中,与自回归SSL中目标标记的固定位置相比,掩蔽SSL中目标标记的灵活性可以促进更多的样本间连接,从而产生更好的聚类性能。在内容生成任务中,测试样本的可变长度与掩蔽SSL中未掩蔽文本的固定长度之间的不一致(相比之下,自回归SSL中条件文本的长度是可变的)阻碍了其生成性能。为了取长补短,我们提出了多样性增强的自回归目标和可变长度的掩蔽目标,大大提高了自回归SSL的分类性能和掩蔽SSL的生成性能。代码可在此 https URL 上找到。

[NLP-80] CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
[NLP-80] CLEME2.0:通过解耦编辑实现更具可解释性的语法错误纠正评估

链接: https://arxiv.org/abs/2407.00934
作者: Jingheng Ye,Zishan Xu,Yinghui Li,Xuxin Cheng,Linlin Song,Qingyu Zhou,Hai-Tao Zheng,Ying Shen,Xin Su
关键词: Grammatical Error Correction, Error Correction, Grammatical Error, interpretability of Grammatical, previous studies
中文关键词: 语法错误纠正,错误纠正,语法错误,语法的可解释性,以前的研究
类目: Computation and Language (cs.CL)
备注: 16 pages, 8 tables, 2 figures. Under review

Abstract:The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics, which has received little attention in previous studies. To bridge the gap, we propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems, namely hit-correction, error-correction, under-correction, and over-correction. They collectively contribute to revealing the critical characteristics and locating drawbacks of GEC systems. Evaluating systems by combining these dimensions leads to higher human consistency than other reference-based and reference-less metrics. Extensive experiments on 2 human judgement datasets and 6 reference datasets demonstrate the effectiveness and robustness of our method. All the codes will be released after the peer review.
摘要:本文的重点是提高语法错误纠正(GEC)指标的可解释性,而这在之前的研究中很少受到关注。为了弥合这一差距,我们提出了CLEME2.0,这是一种基于参考的评估策略,可以描述GEC系统的四个基本维度,即命中纠正、错误纠正、不足纠正和过度纠正。它们共同有助于揭示GEC系统的关键特征并定位其缺陷。结合这些维度来评估系统,可以获得比其他基于参考和无参考的指标更高的人类一致性。在2个人类判断数据集和6个参考数据集上的大量实验证明了我们方法的有效性和稳健性。所有代码将在同行评审后发布。

[NLP-81] FoldGPT: Simple and Effective Large Language Model Compression Scheme
[NLP-81] FoldGPT:简单有效的大型语言模型压缩方案

链接: https://arxiv.org/abs/2407.00928
作者: Songwei Liu,Chao Zeng,Lianqiang Li,Chenqian Yan,Lean Fu,Xing Mei,Fangmin Chen
关键词: escalating data security, data security concerns, deploying large language, mobile devices continues, large language models
中文关键词: 数据安全升级、数据安全担忧、部署大型语言、移动设备继续、大型语言模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:The demand for deploying large language models (LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and find that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of the LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing. This strategy consists of three parts: (1) Based on the learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks. Then we delete some redundant layers based on the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these blocks, we “cure” the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve the performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.
摘要:在不断升级的数据安全担忧和云成本的推动下,在移动设备上部署大型语言模型(LLM)的需求持续增加。然而,网络带宽和内存限制给在移动设备上部署十亿级别的模型带来了挑战。在这项研究中,我们考察了不同规模LLM中各层的输出,发现大多数层的输出表现出显著的相似性。此外,随着模型规模的增加,这种相似性变得更加明显,这表明LLM在深度方向上存在大量冗余。基于这一观察,我们提出了一种结合块移除和块参数共享的高效模型体积压缩策略FoldGPT。该策略由三部分组成:(1)基于可学习的门控参数,在对块之间的耦合效应进行建模的同时,确定块的重要性排序,然后根据给定的移除率删除一些冗余层。(2)对于保留的块,我们采用了专门设计的组参数共享策略,其中同一组内的块共享相同的权重,显著压缩了参数数量,并略微降低了延迟开销。(3)在共享这些块之后,我们用少量的微调来“治愈”稀疏性造成的不匹配,并引入了尾层蒸馏策略来提高性能。实验表明,FoldGPT在高效模型压缩方面优于以往的最先进(SOTA)方法,证明了通过简单的块移除和参数共享来实现模型轻量化的可行性。
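
Two ideas from this abstract, layer-output similarity and group parameter sharing, can be shown with a toy sketch in plain Python. This is not the FoldGPT implementation: the gating-based importance ranking, fine-tuning, and tail-layer distillation are omitted, the hidden-state numbers are made up, and reusing the first block's weights as the shared copy for each group is an assumption made only for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def share_within_groups(block_weights: List[List[float]], group_size: int) -> List[List[float]]:
    """Make every block in a group reuse the first block's weight list.

    Toy stand-in for group parameter sharing: the shared list is the same
    Python object, so only one copy per group is stored.
    """
    shared: List[List[float]] = []
    for start in range(0, len(block_weights), group_size):
        leader = block_weights[start]  # assumption: the first block provides the weights
        group_len = min(group_size, len(block_weights) - start)
        shared.extend([leader] * group_len)
    return shared


# Toy hidden states from two adjacent layers (made-up numbers).
layer_5 = [0.9, 0.1, 0.4]
layer_6 = [1.0, 0.1, 0.5]
similarity = cosine_similarity(layer_5, layer_6)

# Four retained blocks folded into two groups of two.
blocks = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
shared_blocks = share_within_groups(blocks, group_size=2)
unique_copies = len({id(w) for w in shared_blocks})
```

A similarity close to 1 between adjacent layers is the kind of redundancy signal the abstract refers to, and `unique_copies` shows how sharing halves the number of stored weight sets in this toy case.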

[NLP-82] EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction
[NLP-82] EXCGEC:编辑式可解释中文语法错误更正的基准

链接: https://arxiv.org/abs/2407.00924
作者: Jingheng Ye,Shang Qin,Yinghui Li,Xuxin Cheng,Libo Qin,Hai-Tao Zheng,Peng Xing,Zishan Xu,Guo Cheng,Zhao Wei
关键词: Grammatical Error Correction, Existing studies explore, Grammatical Error, Error Correction, Existing studies
中文关键词: 语法错误纠正,现有研究探索,语法错误,错误纠正,现有研究
类目: Computation and Language (cs.CL)
备注: 22 pages, 10 tables, 9 figures. Under review

Abstract:Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations. To bridge the gap, this paper introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of both correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We benchmark several series of LLMs in multiple settings, covering post-explaining and pre-explaining. To promote the development of the task, we introduce a comprehensive suite of automatic metrics and conduct human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. All the codes and data will be released after the review.
摘要:现有的研究仅在有限的场景中探索语法错误纠正(GEC)的可解释性,忽视了纠正和解释之间的相互作用。为了弥合这一差距,本文介绍了可解释GEC(EXGEC)任务,重点关注纠正和解释两项任务的整体作用。为了促进这项任务,我们提出了EXCGEC,这是一个为中文EXGEC量身定制的基准,由8,216个解释增强样本组成,采用了混合的编辑级解释设计。我们在多种设置下对几个系列的LLM进行基准测试,涵盖后解释和预解释。为了促进该任务的发展,我们引入了一套全面的自动指标,并进行了人工评估实验,以证明自由文本解释的自动指标与人类判断的一致性。所有代码和数据将在审查后发布。

[NLP-83] Preserving Multilingual Quality While Tuning Query Encoder on English Only
[NLP-83] 在仅以英语调整查询编码器的同时保留多语言质量

链接: https://arxiv.org/abs/2407.00923
作者: Oleg Vasilyev,Randy Sawaya,John Bohannon
关键词: relevant text passages, dense passage retrieval, passage retrieval system, dense passage, text passages
中文关键词: 相关文本段落,密集段落检索,段落检索系统,密集段落,文本段落
类目: Computation and Language (cs.CL)
备注:

Abstract:A dense passage retrieval system can serve as the initial stages of information retrieval, selecting the most relevant text passages for downstream tasks. In this work we conducted experiments with the goal of finding how much the quality of a multilingual retrieval could be degraded if the query part of a dual encoder is tuned on an English-only dataset (assuming scarcity of cross-lingual samples for the targeted domain or task). Specifically, starting with a high quality multilingual embedding model, we observe that an English-only tuning may not only preserve the original quality of the multilingual retrieval, but even improve it.
摘要:密集段落检索系统可以作为信息检索的初始阶段,为下游任务选择最相关的文本段落。在这项工作中,我们进行了实验,目标是找出如果双编码器的查询部分在纯英语数据集上调整,多语言检索的质量可能会降低多少(假设目标领域或任务的跨语言样本稀缺)。具体来说,从高质量的多语言嵌入模型开始,我们观察到纯英语调优不仅可以保留多语言检索的原始质量,甚至可以改进它。

[NLP-84] Deep Image-to-Recipe Translation
[NLP-84] 深度图像到食谱翻译

链接: https://arxiv.org/abs/2407.00911
作者: Jiangqin Ma,Bilal Mawji,Franz Williams
关键词: profound level, reflecting the intricate, intricate connection, Eat, cherished food memories
中文关键词: 层次深刻,反映出错综复杂、错综复杂的联系,吃、珍惜的食物记忆
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:The modern saying, “You Are What You Eat” resonates on a profound level, reflecting the intricate connection between our identities and the food we consume. Our project, Deep Image-to-Recipe Translation, is an intersection of computer vision and natural language generation that aims to bridge the gap between cherished food memories and the art of culinary creation. Our primary objective involves predicting ingredients from a given food image. For this task, we first develop a custom convolutional network and then compare its performance to a model that leverages transfer learning. We pursue an additional goal of generating a comprehensive set of recipe steps from a list of ingredients. We frame this process as a sequence-to-sequence task and develop a recurrent neural network that utilizes pre-trained word embeddings. We address several challenges of deep learning including imbalanced datasets, data cleaning, overfitting, and hyperparameter selection. Our approach emphasizes the importance of metrics such as Intersection over Union (IoU) and F1 score in scenarios where accuracy alone might be misleading. For our recipe prediction model, we employ perplexity, a commonly used and important metric for language models. We find that transfer learning via pre-trained ResNet-50 weights and GloVe embeddings provide an exceptional boost to model performance, especially when considering training resource constraints. Although we have made progress on the image-to-recipe translation, there is an opportunity for future exploration with advancements in model architectures, dataset scalability, and enhanced user interaction.
摘要:现代谚语“你吃什么,你就是什么”在深刻的层面上引起共鸣,反映了我们的身份和我们所吃的食物之间错综复杂的联系。我们的项目“深度图像到食谱翻译”是计算机视觉和自然语言生成的交叉,旨在弥合珍贵的食物记忆和烹饪创作艺术之间的差距。我们的主要目标是根据给定的食物图像预测配料。对于这项任务,我们首先开发一个定制的卷积网络,然后将其性能与利用迁移学习的模型进行比较。我们追求的另一个目标是从配料列表生成一套完整的食谱步骤。我们将这个过程建模为一个序列到序列的任务,并开发了一个利用预训练词嵌入的循环神经网络。我们解决了深度学习的几个挑战,包括数据集不平衡、数据清理、过拟合和超参数选择。我们的方法强调了交并比(IoU)和F1分数等指标在仅凭准确率可能产生误导的场景中的重要性。对于我们的食谱预测模型,我们采用了困惑度(perplexity),这是语言模型的一个常用且重要的指标。我们发现,通过预训练的ResNet-50权重和GloVe嵌入进行迁移学习可以极大地提高模型的性能,特别是在考虑训练资源限制的情况下。尽管我们在图像到食谱的翻译方面取得了进展,但随着模型架构、数据集可扩展性和用户交互方面的进步,未来仍有探索的空间。
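
The metrics named in this abstract have standard definitions, sketched below on made-up ingredient sets. The function names and the example data are illustrative assumptions, not the authors' evaluation code.

```python
import math
from typing import Sequence, Set


def iou(predicted: Set[str], actual: Set[str]) -> float:
    """Intersection over Union of two ingredient sets."""
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0


def f1(predicted: Set[str], actual: Set[str]) -> float:
    """Harmonic mean of precision and recall over ingredient labels."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to the reference tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Made-up prediction for one food image.
predicted = {"flour", "sugar", "egg"}
actual = {"flour", "egg", "butter", "milk"}
image_iou = iou(predicted, actual)
image_f1 = f1(predicted, actual)

# Made-up probabilities a language model gave to two reference recipe tokens.
recipe_ppl = perplexity([0.25, 0.25])
```

With a large ingredient vocabulary most labels are absent from any single dish, so a model that predicts nothing can still score high per-label accuracy. IoU and F1 ignore those true negatives, which is why they are more informative in this setting.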

[NLP-85] FineSurE: Fine-grained Summarization Evaluation using LLMs
[NLP-85] FineSurE:使用LLM进行细粒度总结评估

链接: https://arxiv.org/abs/2407.00908
作者: Hwanjun Song,Hang Su,Igor Shalyminov,Jason Cai,Saab Mansour
关键词: Automated evaluation, streamlining text summarization, crucial for streamlining, streamlining text, costly and time-consuming
中文关键词: 自动评估、简化文本摘要,对于简化、简化文本至关重要,成本高昂且耗时
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2024 (main, long)

Abstract:Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at this https URL.
摘要:鉴于人工评估成本高昂且耗时,自动评估对于简化文本摘要的基准测试和模型开发至关重要。像ROUGE这样的传统方法与人类判断的相关性不高,而最近提出的基于LLM的指标只提供使用Likert量表分数的摘要级评估。这限制了更深层次的模型分析,例如,我们只能在摘要级别分配一个幻觉分数,而在句子级别,我们可以统计包含幻觉的句子。为了弥补这些局限性,我们提出了FineSurE,这是一个使用大型语言模型(LLM)、专门为摘要任务量身定做的细粒度评估器。除忠实性外,它还采用了完整性和简洁性标准,从而实现多维度评估。我们比较了各种开源和专有LLM作为FineSurE的主干。此外,我们针对SOTA方法(包括基于NLI、QA和LLM的方法)对FineSurE进行了广泛的基准测试,结果显示其性能有所提升,特别是在完整性和简洁性维度上。代码可在此HTTPS URL上找到。
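
The contrast between one summary-level score and sentence-level counting can be sketched as follows. The per-sentence flags would come from an LLM judge in practice; here they are made-up inputs, and the function name `sentence_level_faithfulness` is hypothetical rather than FineSurE's actual interface.

```python
from typing import Sequence


def sentence_level_faithfulness(has_hallucination: Sequence[bool]) -> float:
    """Fraction of summary sentences that were not flagged as hallucinated."""
    if not has_hallucination:
        raise ValueError("summary has no sentences")
    clean = sum(not flag for flag in has_hallucination)
    return clean / len(has_hallucination)


# Made-up judge output for a four-sentence summary: only sentence 2 is flagged.
flags = [False, True, False, False]
score = sentence_level_faithfulness(flags)
flagged_sentences = [i + 1 for i, flag in enumerate(flags) if flag]
```

Besides the aggregate score, the per-sentence flags say which sentences to inspect, which a single Likert-scale rating cannot do.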

[NLP-86] From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
[NLP-86] 从内省到最佳实践:多模态上下文学习中演示的原则性分析

链接: https://arxiv.org/abs/2407.00902
作者: Nan Xu,Fei Wang,Sheng Zhang,Hoifung Poon,Muhao Chen
关键词: Large Language models, Large Language, multiple image-text pairs, similar ICL abilities, capabilities of Large
中文关键词: 大型语言模型、大型语言、多个图像-文本对、类似的ICL能力、大型的能力
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Motivated by in-context learning (ICL) capabilities of Large Language models (LLMs), multimodal LLMs with additional visual modality are also exhibited with similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively less work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Considering such modality impact, we further utilize modality-driven demonstration strategies to boost ICL performance. We also identify that demonstration selection is closely related to the models’ ability to capture task inductive biases from multimodal ICL. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks even if those tasks are not seen in or even contradict pretraining data.
摘要:受大型语言模型(LLM)上下文学习(ICL)能力的启发,具有额外视觉模态的多模态LLM在提供多个图文对作为演示时也表现出类似的ICL能力。然而,关于多模态ICL如何以及为何起作用的原理,相关研究相对较少。我们在一系列新颖而关键的任务上,对不同规模的模型进行了系统且有原则的多模态ICL评估。通过对不同模态信息的扰动,我们表明在多模态ICL中,各模态在不同任务中的重要性不同。考虑到这种模态影响,我们进一步利用模态驱动的演示策略来提高ICL的性能。我们还发现,演示选择与模型从多模态ICL中捕获任务归纳偏置的能力密切相关。我们的原则性分析提供了一种全面的方式来理解演示在多模态上下文学习中的作用,并有助于在广泛的任务上有效改进多模态ICL,即使这些任务未出现在预训练数据中,甚至与之相矛盾。

[NLP-87] MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula
[NLP-87] MathCAMPS:基于人类课程的数学问题细粒度合成

链接: https://arxiv.org/abs/2407.00900
作者: Shubhra Mishra,Gabriel Poesia,Belinda Mo,Noah D. Goodman
关键词: Large Language Models, Large Language, Language Models, important capability, Mathematics Common Core
中文关键词: 大型语言模型,大型语言,语言模型,重要能力,数学共同核心
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Dataset and code: this https URL

Abstract:Mathematical problem solving is an important skill for Large Language Models (LLMs), both as an important capability and a proxy for a range of reasoning abilities. Existing benchmarks probe a diverse set of skills, but they yield aggregate accuracy metrics, obscuring specific abilities or weaknesses. Furthermore, they are difficult to extend with new problems, risking data contamination over time. To address these challenges, we propose MathCAMPS: a method to synthesize high-quality mathematical problems at scale, grounded on 44 fine-grained “standards” from the Mathematics Common Core (CC) Standard for K-8 grades. We encode each standard in a formal grammar, allowing us to sample diverse symbolic problems and their answers. We then use LLMs to realize the symbolic problems into word problems. We propose a cycle-consistency method for validating problem faithfulness. Finally, we derive follow-up questions from symbolic structures and convert them into follow-up word problems - a novel task of mathematical dialogue that probes for robustness in understanding. Experiments on 23 LLMs show surprising failures even in the strongest models (in particular when asked simple follow-up questions). Moreover, we evaluate training checkpoints of Pythia 12B on MathCAMPS, allowing us to analyze when particular mathematical skills develop during its training. Our framework enables the community to reproduce and extend our pipeline for a fraction of the typical cost of building new high-quality datasets.
摘要:数学问题求解是大型语言模型(LLM)的一项重要技能,它既是一种重要的能力,也是一系列推理能力的代理指标。现有的基准考察了多种不同的技能,但它们只给出汇总的准确率指标,掩盖了具体的能力或弱点。此外,它们很难用新问题加以扩展,随着时间的推移存在数据污染的风险。为了应对这些挑战,我们提出了MathCAMPS:一种大规模合成高质量数学问题的方法,基于K-8年级数学共同核心(CC)标准中的44个细粒度“标准”。我们用形式语法对每个标准进行编码,使我们能够采样多样的符号问题及其答案。然后,我们使用LLM将符号问题转化为文字应用题。我们提出了一种循环一致性方法来验证问题的忠实性。最后,我们从符号结构中推导出后续问题,并将其转换为后续应用题,这是一项考察理解稳健性的数学对话新任务。在23个LLM上的实验显示,即使是最强的模型也会出现令人惊讶的失败(特别是在被问及简单的后续问题时)。此外,我们在MathCAMPS上评估了Pythia 12B的训练检查点,使我们能够分析特定的数学技能在其训练过程中何时形成。我们的框架使社区能够以构建新的高质量数据集的典型成本的一小部分来复现和扩展我们的流程。
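
The idea of sampling symbolic problems together with their answers from a grammar, and then deriving a follow-up from the symbolic structure, can be sketched with a toy grammar. The grammar, the `SymbolicProblem` class, and both functions below are hypothetical illustrations. The paper's grammars encode Common Core standards, and the LLM step that turns symbolic problems into word problems is left out here.

```python
import random
from dataclasses import dataclass


@dataclass
class SymbolicProblem:
    expression: str  # symbolic form, e.g. "7 + 5"
    answer: int      # ground-truth answer computed during sampling


def sample_problem(rng: random.Random, max_operand: int = 20) -> SymbolicProblem:
    """Sample from a toy grammar: problem -> operand op operand, op -> '+' | '-'."""
    left = rng.randint(1, max_operand)
    right = rng.randint(1, max_operand)
    op = rng.choice(["+", "-"])
    answer = left + right if op == "+" else left - right
    return SymbolicProblem(f"{left} {op} {right}", answer)


def follow_up(problem: SymbolicProblem, rng: random.Random, max_operand: int = 20) -> SymbolicProblem:
    """Derive a follow-up that adds one more quantity to the previous result."""
    extra = rng.randint(1, max_operand)
    return SymbolicProblem(f"({problem.expression}) + {extra}", problem.answer + extra)


rng = random.Random(0)  # fixed seed so the sampled problems are reproducible
first = sample_problem(rng)
second = follow_up(first, rng)
```

Because the answer is computed while sampling, every generated problem comes with a known ground truth, and fresh problems can be drawn at any time, which is what limits the contamination risk the abstract mentions.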

[NLP-88] How to Leverage Digit Embeddings to Represent Numbers?
[NLP-88] 如何利用数字嵌入来代表数字?

链接: https://arxiv.org/abs/2407.00894
作者: Jasivan Alex Sivakumar,Nafise Sadat Moosavi
关键词: performing arithmetic operations, existing language models, arithmetic operations, Sivakumar and Moosavi, performing arithmetic
中文关键词: 执行算术运算、现有语言模型、算术运算、Sivakumar和Moosavi、执行算术
类目: Computation and Language (cs.CL)
备注:

Abstract:Apart from performing arithmetic operations, understanding numbers themselves is still a challenge for existing language models. Simple generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance (Sivakumar and Moosavi, 2023). Among various techniques, character-level embeddings of numbers have emerged as a promising approach to improve number representation. However, this method has limitations as it leaves the task of aggregating digit representations to the model, which lacks direct supervision for this process. In this paper, we explore the use of mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models. This can be achieved either by adding a special token to the input embeddings or by introducing an additional loss function to enhance correct predictions. We evaluate the effectiveness of incorporating this explicit aggregation, analysing its strengths and shortcomings, and discuss future directions to better benefit from this approach. Our methods, while simple, are compatible with any pretrained model and require only a few lines of code, which we have made publicly available.
摘要:除了执行算术运算外,理解数字本身对于现有的语言模型来说仍然是一个挑战。简单的泛化,例如求解100+200而不是1+2,可以显著影响模型的性能(Sivakumar和Moosavi,2023)。在各种技术中,数字的字符级嵌入已经成为一种很有前途的改进数字表示的方法。然而,这种方法有局限性,因为它将聚合各位数字表示的任务留给了模型,而模型缺乏对这一过程的直接监督。在本文中,我们探索使用数学先验来计算聚合的数字嵌入,并显式地将这些聚合结果融入Transformer模型。这可以通过向输入嵌入添加特殊标记,或通过引入额外的损失函数来增强正确预测来实现。我们评估了引入这种显式聚合的有效性,分析了它的优点和缺点,并讨论了未来的方向,以更好地从这种方法中受益。我们的方法虽然简单,但与任何预训练模型兼容,只需要几行代码,我们已经公开了这些代码。
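
Aggregating digit embeddings explicitly, instead of leaving it to the model, can be sketched as a position-weighted average. The weighting below (more significant digits get larger weights, decreasing linearly) is an assumption chosen for illustration and is not claimed to be the paper's prior. The two-dimensional `digit_table` is made up, and the sketch covers only the aggregation step, not the special token or the auxiliary loss.

```python
from typing import Dict, List


def aggregate_digits(number: str, digit_table: Dict[str, List[float]]) -> List[float]:
    """Weighted average of per-digit vectors, weighting leading digits higher."""
    if not number or any(ch not in digit_table for ch in number):
        raise ValueError("expected a non-empty string of known digits")
    n = len(number)
    # Assumed prior: the digit at position i (0 = most significant) gets weight n - i.
    weights = [float(n - i) for i in range(n)]
    total = sum(weights)
    dim = len(digit_table[number[0]])
    aggregated = [0.0] * dim
    for ch, w in zip(number, weights):
        for k, value in enumerate(digit_table[ch]):
            aggregated[k] += (w / total) * value
    return aggregated


# Made-up 2-dimensional embedding for each digit: [digit value, constant 1.0].
digit_table = {str(d): [float(d), 1.0] for d in range(10)}
emb_12 = aggregate_digits("12", digit_table)
emb_21 = aggregate_digits("21", digit_table)
```

Unlike a plain mean, the weighted aggregate distinguishes "12" from "21", so digit order is reflected in the single vector that would then be handed to the model.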

[NLP-89] Papez: Resource-Efficient Speech Separation with Auditory Working Memory
[NLP-89] Papez:具有听觉工作记忆的资源高效语音分离

链接: https://arxiv.org/abs/2407.00888
作者: Hyunseok Oh,Juheon Yi,Youngki Lee
关键词: Transformer-based models recently, extreme computational load, computational load makes, single-channel speech separation, models recently reached
中文关键词: 基于转换器的模型最近,极端的计算负载,计算负载使得,单通道语音分离,最近达到的模型
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 5 pages. Accepted by ICASSP 2023

Abstract:Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; however, their extreme computational load makes it difficult to deploy them in resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. We first replace the inter-chunk Transformer with small-sized auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters through the recurrent transformer. Our extensive evaluation shows that Papez achieves the best resource and accuracy tradeoffs with a large margin. We publicly share our source code at this https URL
摘要:基于Transformer的模型最近达到了最先进的单通道语音分离准确度;然而,它们极高的计算负载使得难以在资源受限的移动或物联网设备中部署它们。因此,我们提出了Papez,一种轻量级且计算效率高的单通道语音分离模型。Papez基于三项关键技术。首先,我们用小尺寸的听觉工作记忆替换块间Transformer。其次,我们自适应地修剪不需要进一步处理的输入标记。最后,我们通过循环Transformer减少参数的数量。我们的广泛评估表明,Papez以较大的优势实现了最佳的资源和准确性权衡。我们在此 https URL 上公开分享我们的源代码

[NLP-90] Mechanistic Interpretation through Contextual Decomposition in Transformers
[NLP-90] 通过Transformer中的上下文分解进行机制解释

链接: https://arxiv.org/abs/2407.00886
作者: Aliyah R. Hsu,Yeshwanth Cherapanamjeri,Anobel Y. Odisho,Peter R. Carroll,Bin Yu
关键词: black boxes due, complex nonlinear relationships, regarded as black, black boxes, boxes due
中文关键词: 由于黑匣子,复杂的非线性关系,被视为黑匣子,黑匣子,由于盒
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Transformers exhibit impressive capabilities but are often regarded as black boxes due to challenges in understanding the complex nonlinear relationships between features. Interpreting machine learning models is of paramount importance to mitigate risks, and mechanistic interpretability is in particular of current interest as it opens up a window for guiding manual modifications and reverse-engineering solutions. In this work, we introduce contextual decomposition for transformers (CD-T), extending a prior work on CD for RNNs and CNNs, to address mechanistic interpretation computationally efficiently. CD-T is a flexible interpretation method for transformers. It can capture contributions of combinations of input features or source internal components (e.g. attention heads, feed-forward networks) to (1) final predictions or (2) the output of any target internal component. Using CD-T, we propose a novel algorithm for circuit discovery. On a real-world pathology report classification task: we show CD-T distills a more faithful circuit of attention heads with improved computational efficiency (speed up 2x) than a prior benchmark, path patching. As a versatile interpretation method, CD-T also exhibits exceptional capabilities for local interpretations. CD-T is shown to reliably find words and phrases of contrasting sentiment/topic on SST-2 and AGNews datasets. Through human experiments, we demonstrate CD-T enables users to identify the more accurate of two models and to better trust a model’s outputs compared to alternative interpretation methods such as SHAP and LIME.
摘要:Transformer展现出令人印象深刻的能力,但由于难以理解特征之间复杂的非线性关系,通常被视为黑匣子。解释机器学习模型对于降低风险至关重要,而机制可解释性(mechanistic interpretability)目前尤其受到关注,因为它为指导人工修改和逆向工程解决方案打开了一扇窗。在这项工作中,我们引入了面向Transformer的上下文分解(CD-T),扩展了先前针对RNN和CNN的CD工作,以在计算上高效地进行机制解释。CD-T是一种灵活的Transformer解释方法。它可以捕获输入特征或源内部组件(例如注意力头、前馈网络)的组合对(1)最终预测或(2)任何目标内部组件输出的贡献。利用CD-T,我们提出了一种新的电路发现算法。在一个真实的病理报告分类任务中,我们表明,与先前的基准方法路径修补(path patching)相比,CD-T提炼出的注意力头电路更忠实,且计算效率更高(提速2倍)。作为一种通用的解释方法,CD-T在局部解释方面也表现出了出色的能力。CD-T能够在SST-2和AGNews数据集上可靠地找到情感/主题相反的单词和短语。通过人类实验,我们证明,与SHAP和LIME等其他解释方法相比,CD-T使用户能够识别两个模型中更准确的一个,并更加信任模型的输出。

[NLP-91] MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting
[NLP-91] MoE-CT:一种抗灾难性遗忘的大型语言模型训练的新型方法

链接: https://arxiv.org/abs/2407.00875
作者: Tianhao Li,Shangjie Li,Binbin Xie,Deyi Xiong,Baosong Yang
关键词: Conventional Continual Training, leaving a disparity, advent of large, predominantly catered, Continual Training
中文关键词: 传统的持续培训,留下了差距,大规模的、主要有迎合性的持续培训的出现
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures

Abstract:The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model’s original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model’s learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model’s original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model’s integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies.
摘要:大型语言模型(LLM)的出现主要迎合了高资源语言,使得低资源语言在性能上存在差距。旨在弥补这一差距的传统持续训练(CT)方法在扩展到多语言环境时往往会损害模型原有的语言能力。针对这一问题,我们引入了一种新颖的MoE-CT架构,这一范式创新性地将基础模型的学习与多语言扩展过程分离。我们的设计冻结了原始的LLM参数,从而保护了它在高资源语言中的性能,而附加的MoE模块在多样的语言数据集上训练,增强了低资源语言的能力。实验证明,我们的方法明显优于传统的CT方法,在不牺牲模型原始语言性能的情况下,在多语言基准上有显著的改善。此外,我们的MoE-CT框架显示出更强的抗遗忘能力和卓越的迁移学习能力。通过保持基础模型的完整性并专注于战略性的参数扩展,我们的方法推进了多语言建模,代表了将低资源语言纳入LLM的重要一步,为语言技术的未来研究指明了一个富有成效的方向。

[NLP-92] Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles
[NLP-92] 角色扮演:使领域专家能够通过启发和遵守原则来创建LLM模拟患者

链接: https://arxiv.org/abs/2407.00870
作者: Ryan Louie(1),Ananjan Nandi(1),William Fang(1),Cheng Chang(1),Emma Brunskill(1),Diyi Yang(1) ((1) Stanford University)
关键词: Recent works leverage, realistic social scenarios, works leverage LLMs, Recent works, roleplay realistic social
中文关键词: 最近的作品利用、现实的社会场景、作品利用LLM、最近的作品、角色扮演现实的社会
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 34 pages, 24 figures, 11 Tables

Abstract:Recent works leverage LLMs to roleplay realistic social scenarios, aiding novices in practicing their social skills. However, simulating sensitive interactions, such as in mental health, is challenging. Privacy concerns restrict data access, and collecting expert feedback, although vital, is laborious. To address this, we develop Roleplay-doh, a novel human-LLM collaboration pipeline that elicits qualitative feedback from a domain-expert, which is transformed into a set of principles, or natural language rules, that govern an LLM-prompted roleplay. We apply this pipeline to enable senior mental health supporters to create customized AI patients for simulated practice partners for novice counselors. After uncovering issues in GPT-4 simulations not adhering to expert-defined principles, we also introduce a novel principle-adherence prompting pipeline which shows 30% improvements in response quality and principle following for the downstream task. Via a user study with 25 counseling experts, we demonstrate that the pipeline makes it easy and effective to create AI patients that more faithfully resemble real patients, as judged by creators and third-party counselors.

[NLP-93] Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

Link: https://arxiv.org/abs/2407.00869
Authors: Yue Zhou,Henry Peng Zou,Barbara Di Eugenio,Yang Zhang
Keywords: difficulties generating fallacious, difficulties generating, deceptive reasoning, language models, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:We find that language models have difficulties generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, language models tend to leak honest counterparts but believe them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits an aligned language model for malicious output. Specifically, we query the model to generate a fallacious yet deceptively real procedure for the harmful behavior. Since a fallacious procedure is generally considered fake and thus harmless by LLMs, it helps bypass the safeguard mechanism. Yet the output is factually harmful since the LLM cannot fabricate fallacious solutions but proposes truthful ones. We evaluate our approach over five safety-aligned large language models, comparing four previous jailbreak methods, and show that our approach achieves competitive performance with more harmful outputs. We believe the findings could be extended beyond model safety, such as self-verification and hallucination.

[NLP-94] Towards Robust Speech Representation Learning for Thousands of Languages

Link: https://arxiv.org/abs/2407.00837
Authors: William Chen,Wangyou Zhang,Yifan Peng,Xinjian Li,Jinchuan Tian,Jiatong Shi,Xuankai Chang,Soumi Maiti,Karen Livescu,Shinji Watanabe
Keywords: Self-supervised learning, helped extend speech, extend speech technologies, helped extend, Self-supervised
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 20 pages

Abstract:Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world’s 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are found in this https URL.

[NLP-95] NAIST Simultaneous Speech Translation System for IWSLT 2024

Link: https://arxiv.org/abs/2407.00826
Authors: Yuka Ko,Ryo Fukuda,Yuta Nishikawa,Yasumasa Kano,Tomoya Yanagita,Kosuke Doi,Mana Makinae,Haotian Tan,Makoto Sakai,Sakriani Sakti,Katsuhito Sudoh,Satoshi Nakamura
Keywords: paper describes NAIST, describes NAIST submission, Evaluation Campaign, describes NAIST, NAIST submission
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: IWSLT 2024 system paper

Abstract:This paper describes NAIST’s submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-German, Japanese, Chinese speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
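The Local Agreement (LA) policy named in the abstract is commonly described as committing only the part of the output on which consecutive partial hypotheses agree. The sketch below illustrates that idea under stated assumptions: hypotheses are plain token lists, and the function names `common_prefix` and `local_agreement` are hypothetical, not taken from the NAIST system.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest shared prefix of two token sequences."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def local_agreement(hypotheses: List[List[str]]) -> List[List[str]]:
    """For each new partial hypothesis, return the tokens newly committed.

    A token is committed once the hypotheses produced for two consecutive
    input chunks agree on it. Real systems usually force the decoder to
    continue from the committed prefix; this sketch simply never retracts it.
    """
    committed: List[str] = []
    previous: List[str] = []
    emitted: List[List[str]] = []
    for hyp in hypotheses:
        stable = common_prefix(previous, hyp)
        new_tokens = stable[len(committed):]
        committed.extend(new_tokens)
        emitted.append(new_tokens)
        previous = hyp
    return emitted


# Toy hypotheses for three successive audio chunks (illustrative only).
chunks = [
    ["Das", "ist"],
    ["Das", "ist", "ein", "Test"],
    ["Das", "ist", "ein", "Haus", "."],
]
steps = local_agreement(chunks)
```

Nothing is emitted for the first chunk because there is no earlier hypothesis to agree with; this built-in delay is the latency cost that the policy trades for stability.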

[NLP-96] Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Link: https://arxiv.org/abs/2407.00782
Authors: Zimu Lu,Aojun Zhou,Ke Wang,Houxing Ren,Weikang Shi,Junting Pan,Mingjie Zhan
Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, large language models, SCDPO
Categories: Computation and Language (cs.CL)

Abstract:Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method.
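For readers unfamiliar with DPO, the following minimal sketch computes the standard DPO loss for a single preference pair from sequence log-probabilities. How SCDPO builds its pairs is only summarized in the abstract, so treating a correct solution as `chosen` and a solution that starts erring at a specified step as `rejected` is an assumption here; the function name and the default `beta` are illustrative.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each full response under
    the policy being trained and under the frozen reference model.
    """
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed SCDPO-style pair: the rejected sample shares the first steps with
# a correct solution and begins to make errors at a chosen step.
loss_neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
loss_better = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss shrinks as the policy raises the probability of the chosen response relative to the reference model while lowering that of the rejected one.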

[NLP-97] Characterizing Stereotypical Bias from Privacy-preserving Pre-Training

Link: https://arxiv.org/abs/2407.00764
Authors: Stefan Arnold,Rene Gröbner,Annika Schreiner
Keywords: Differential Privacy, embedding space, applied to raw, exploiting the spatial, spatial arrangement
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Differential Privacy (DP) can be applied to raw text by exploiting the spatial arrangement of words in an embedding space. We investigate the implications of such text privatization on Language Models (LMs) and their tendency towards stereotypical associations. Since previous studies documented that linguistic proficiency correlates with stereotypical bias, one could assume that techniques for text privatization, which are known to degrade language modeling capabilities, would cancel out undesirable biases. By testing BERT models trained on texts containing biased statements primed with varying degrees of privacy, our study reveals that while stereotypical bias generally diminishes when privacy is tightened, text privatization does not uniformly equate to diminishing bias across all social domains. This highlights the need for careful diagnosis of bias in LMs that undergo text privatization.

[NLP-98] A Comparative Study of Quality Evaluation Methods for Text Summarization

Link: https://arxiv.org/abs/2407.00747
Authors: Huyen Nguyen,Haihua Chen,Lavanya Pobbathi,Junhua Ding
Keywords: natural language processing, Evaluating text summarization, challenging task, task in natural, NLP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The paper is under review at Empirical Methods in Natural Language Processing (EMNLP) 2024. It has 15 pages and 4 figures

Abstract:Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics which heavily rely on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study on eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models were evaluated. We perform extensive experiments and analysis on datasets with patent documents. Our results show that LLM evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose an LLM-powered framework for automatically evaluating and improving text summarization, which is beneficial and could attract wide attention among the community.
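To make the dependence of reference-based metrics on a reference summary concrete, here is a minimal sketch of ROUGE-2 as bigram overlap between a candidate and a single reference. It assumes whitespace tokenization and lowercasing; published ROUGE implementations add stemming and other options, so scores from this sketch will not match them exactly.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(tokens: List[str]) -> Counter:
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of bigram overlap."""
    cand = bigrams(candidate.lower().split())
    ref = bigrams(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge2("the cat sat on the mat", "the cat lay on the mat")
```

A summary that paraphrases the reference with different wording shares few bigrams and scores low even when it is faithful, which is one reason such metrics can diverge from human judgment.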

[NLP-99] AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Link: https://arxiv.org/abs/2407.00743
Authors: Sheng Wu,Jiaxing Liu,Longbiao Wang,Dongxiao He,Xiaobao Wang,Jianwu Dang
Keywords: natural language processing, Recognition in Conversations, Emotion Recognition, speaker in conversations, language processing
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Abstract:Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network which performs rich representation learning through dimension transformation of different modalities and parameter-efficient inception block. On the other hand, the Modality Interaction Network performs interaction fusion of extracted inter-modal features and intra-modal features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics compared to the state-of-the-art (SOTA) models.

[NLP-100] LocateEdit: Energy-based Text Editing for Efficient, Flexible, and Faithful Controlled Text Generation

Link: https://arxiv.org/abs/2407.00740
Authors: Hye Ryung Son,Jay-Yoon Lee
Keywords: Recent approaches, base language models, decoding time, base, approaches to controlled
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 2 figures

Abstract:Recent approaches to controlled text generation (CTG) often involve manipulating the weights or logits of base language models (LMs) at decoding time. However, these methods are inapplicable to the latest black-box LMs and ineffective at preserving the core semantics of the base LM’s original generations. In this work, we propose LocateEdit (LE), an efficient and flexible energy-based approach to CTG, which edits text outputs from a base LM using off-the-shelf energy models. Given text outputs from the base LM, LE first locates spans that are most relevant to constraints (e.g., toxicity) utilizing energy models, and then edits these spans by replacing them with more suitable alternatives. Importantly, our method is compatible with black-box LMs, as it requires only the text outputs. Also, since LE doesn’t mandate specific architecture for its component models, it can work with a diverse combination of available off-the-shelf models. Moreover, LE preserves the base LM’s original generations, by selectively modifying constraint-related aspects of the texts and leaving others unchanged. These targeted edits also ensure that LE operates efficiently. Our experiments confirm that LE achieves superior semantic preservation of the base LM generations and speed, while simultaneously obtaining competitive or improved constraint satisfaction. Furthermore, we analyze how the granularity of energy distribution impacts CTG performance and find that fine-grained, regression-based energy models improve constraint satisfaction, compared to conventional binary classifier energy models.

[NLP-101] Large Language Models Struggle in Token-Level Clinical Named Entity Recognition

Link: https://arxiv.org/abs/2407.00731
Authors: Qiuhao Lu,Rui Li,Andrew Wen,Jinlian Wang,Liwei Wang,Hongfang Liu
Keywords: Large Language Models, Large Language, Named Entity Recognition, Language Models, token-level NER
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AMIA 2024 Annual Symposium Proceedings

Abstract:Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPT for token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.
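The distinction the abstract draws between document-level and token-level NER can be illustrated with BIO tagging, a common token-level labeling scheme. The sketch below is an assumption-laden illustration: the example sentence, the labels `DISEASE` and `SYMPTOM`, and both function names are hypothetical and not taken from the paper.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, label)


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Token-level output: one BIO tag per token, so positions are explicit."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def document_level_entities(tokens: List[str], spans: List[Span]) -> Set[Tuple[str, str]]:
    """Document-level output: which entities occur, with no positions."""
    return {(" ".join(tokens[start:end]), label) for start, end, label in spans}


tokens = ["Patient", "has", "cystic", "fibrosis", "and", "cough"]
spans = [(2, 4, "DISEASE"), (5, 6, "SYMPTOM")]
tags = spans_to_bio(tokens, spans)
entities = document_level_entities(tokens, spans)
```

A model can list the right entities at document level and still fail the token-level task if it cannot align each mention with its exact token positions.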

[NLP-102] Scaling Technology Acceptance Analysis with Large Language Model (LLM) Annotation Systems

Link: https://arxiv.org/abs/2407.00702
Authors: Pawel Robert Smolinski,Joseph Januszewicz,Jacek Winiarski
Keywords: models effectively predict, effectively predict, acceptance models effectively, technology products, Technology
Categories: Computation and Language (cs.CL)
Comments: This is a preprint of a paper accepted for the 32nd International Conference on Information Systems Development (ISD 2024), Gdansk, Poland

Abstract:Technology acceptance models effectively predict how users will adopt new technology products. Traditional surveys, often expensive and cumbersome, are commonly used for this assessment. As an alternative to surveys, we explore the use of large language models for annotating online user-generated content, like digital reviews and comments. Our research involved designing an LLM annotation system that transforms reviews into structured data based on the Unified Theory of Acceptance and Use of Technology (UTAUT) model. We conducted two studies to validate the consistency and accuracy of the annotations. Results showed moderate-to-strong consistency of LLM annotation systems, improving further by lowering the model temperature. LLM annotations achieved close agreement with human expert annotations and outperformed the agreement between experts for UTAUT variables. These results suggest that LLMs can be an effective tool for analyzing user sentiment, offering a practical alternative to traditional survey methods and enabling deeper insights into technology design and adoption.

[NLP-103] BAPO: Base-Anchored Preference Optimization for Personalized Alignment in Large Language Models

Link: https://arxiv.org/abs/2407.00693
Authors: Gihun Lee,Minchan Jeong,Yujin Kim,Hojung Jung,Jaehoon Oh,Sangmook Kim,Se-Young Yun
Keywords: align Large Language, Large Language Models, Large Language, shown remarkable success, align Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: under review

Abstract:While learning to align Large Language Models (LLMs) with human preferences has shown remarkable success, aligning these models to meet diverse user preferences presents further challenges in preserving previous knowledge. This paper examines the impact of personalized preference optimization on LLMs, revealing that the extent of knowledge loss varies significantly with preference heterogeneity. Although previous approaches have utilized the KL constraint between the reference model and the policy model, we observe that they fail to maintain general knowledge and alignment when facing personalized preferences. To this end, we introduce Base-Anchored Preference Optimization (BAPO), a simple yet effective approach that utilizes the initial responses of the reference model to mitigate forgetting while accommodating personalized alignment. BAPO effectively adapts to diverse user preferences while minimally affecting global knowledge or general alignment. Our experiments demonstrate the efficacy of BAPO in various setups.

[NLP-104] HRDE: Retrieval-Augmented Large Language Models for Chinese Health Rumor Detection and Explainability

Link: https://arxiv.org/abs/2407.00668
Authors: Yanfang Chen,Ding Chen,Shichao Song,Simin Niu,Hanyu Wang,Zeyun Tang,Feiyu Xiong,Zhiyu Li
Keywords: people increasingly prioritize, health information, health, Chinese health, health information dissemination
Categories: Computation and Language (cs.CL)

Abstract:As people increasingly prioritize their health, the speed and breadth of health information dissemination on the internet have also grown. At the same time, the presence of false health information (health rumors) intermingled with genuine content poses a significant potential threat to public health. However, current research on Chinese health rumors still lacks a large-scale, public, and open-source dataset of health rumor information, as well as effective and reliable rumor detection methods. This paper addresses this gap by constructing a dataset containing 1.12 million health-related rumors (HealthRCN) through web scraping of common health-related questions and a series of data processing steps. HealthRCN is the largest known dataset of Chinese health information rumors to date. Based on this dataset, we propose retrieval-augmented large language models for Chinese health rumor detection and explainability (HRDE). This model leverages retrieved relevant information to accurately determine whether the input health information is a rumor and provides explanatory responses, effectively aiding users in verifying the authenticity of health information. In evaluation experiments, we compared multiple models and found that HRDE outperformed them all, including GPT-4-1106-Preview, in rumor detection accuracy and answer quality. HRDE achieved an average accuracy of 91.04% and an F1 score of 91.58%.

[NLP-105] Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

Link: https://arxiv.org/abs/2407.00653
Authors: Yifei Zhang,Xintao Wang,Jiaqing Liang,Sirui Xia,Lida Chen,Yanghua Xiao
Keywords: Large Language Models, natural language processing, Large Language, exhibited impressive proficiency, involve increasingly complex
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have exhibited impressive proficiency in various natural language processing (NLP) tasks, which involve increasingly complex reasoning. Knowledge reasoning, a primary type of reasoning, aims at deriving new knowledge from existing knowledge. While it has been widely studied in the context of knowledge graphs (KGs), knowledge reasoning in LLMs remains underexplored. In this paper, we introduce Chain-of-Knowledge (CoK), a comprehensive framework for knowledge reasoning, including methodologies for both dataset construction and model learning. For dataset construction, we create KnowReason via rule mining on KGs. For model learning, we observe rule overfitting induced by naive training. Hence, we enhance CoK with a trial-and-error mechanism that simulates the human process of internal knowledge exploration. We conduct extensive experiments with KnowReason. Our results show the effectiveness of CoK in refining LLMs not only in knowledge reasoning but also on general reasoning benchmarks.

[NLP-106] LegalTurk Optimized BERT for Multi-Label Text Classification and NER

Link: https://arxiv.org/abs/2407.00648
Authors: Farnaz Zeidi,Mehmet Fatih Amasyali,Çiğdem Erol
Keywords: Transformer neural network, Transformer neural, legal Turkish domain, BERT, neural network
Categories: Computation and Language (cs.CL)

Abstract:The introduction of the Transformer neural network, along with techniques like self-supervised pre-training and transfer learning, has paved the way for advanced models like BERT. Despite BERT’s impressive performance, opportunities for further enhancement exist. To our knowledge, most efforts focus on improving BERT’s performance in English and in general domains, with no study specifically addressing the legal Turkish domain. Our study is primarily dedicated to enhancing the BERT model within the legal Turkish domain through modifications in the pre-training phase. In this work, we introduce our innovative modified pre-training approach by combining diverse masking strategies. In the fine-tuning task, we focus on two essential downstream tasks in the legal domain: named entity recognition and multi-label text classification. To evaluate our modified pre-training approach, we fine-tuned all customized models alongside the original BERT models to compare their performance. Our modified approach demonstrated significant improvements in both NER and multi-label text classification tasks compared to the original BERT model. Finally, to showcase the impact of our proposed models, we trained our best models with different corpus sizes and compared them with BERTurk models. The experimental results demonstrate that our innovative approach, despite being pre-trained on a smaller corpus, competes with BERTurk.

[NLP-107] A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy

Link: https://arxiv.org/abs/2407.00638
Authors: Stephen Meisenbacher,Maulik Chevli,Florian Matthes
Keywords: proposed mechanism operates, Differential Privacy, NLP must distinguish, Differential Privacy approaches, textit
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 2 figures, 9 tables. Accepted to PrivateNLP 2024

Abstract:Applications of Differential Privacy (DP) in NLP must distinguish between the syntactic level on which a proposed mechanism operates, often taking the form of word-level or document-level privatization. Recently, several word-level Metric Differential Privacy approaches have been proposed, which rely on this generalized DP notion for operating in word embedding spaces. These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence- or document-level is only possible by a basic composition of word perturbations. In this work, we strive to address these challenges by operating between the word and sentence levels, namely with collocations. By perturbing n-grams rather than single words, we devise a method where composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
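As a rough illustration of how word-level metric DP mechanisms of this kind typically operate, the sketch below adds noise with density proportional to exp(-ε‖z‖) to an embedding and maps the result back to the nearest vocabulary entry. This is a generic sketch and not the authors' implementation; the toy two-dimensional vocabulary, including the collocation entry `new_york`, and all function names are hypothetical.

```python
import math
import random
from typing import Dict, List


def sample_noise(dim: int, epsilon: float, rng: random.Random) -> List[float]:
    """Sample z with density proportional to exp(-epsilon * ||z||).

    The direction is uniform on the unit sphere and the radius follows a
    Gamma distribution with shape `dim` and scale 1 / epsilon.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * x / norm for x in direction]


def privatize(term: str, embeddings: Dict[str, List[float]],
              epsilon: float, rng: random.Random) -> str:
    """Perturb the term's embedding and return the nearest vocabulary entry."""
    vector = embeddings[term]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + n for v, n in zip(vector, noise)]
    return min(embeddings, key=lambda w: math.dist(embeddings[w], noisy))


# Toy vocabulary in which a unigram and a bigram collocation share one space.
toy_embeddings = {
    "cat": [0.0, 0.0],
    "dog": [1.0, 0.0],
    "new_york": [0.0, 5.0],
}
replacement = privatize("cat", toy_embeddings, epsilon=0.5, rng=random.Random(0))
```

Smaller values of ε produce larger noise and therefore more frequent substitutions; storing collocations as single entries means a multi-word unit is replaced as a whole rather than word by word.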

[NLP-108] DP-MLM: Differentially Private Text Rewriting Using Masked Language Models

Link: https://arxiv.org/abs/2407.00637
Authors: Stephen Meisenbacher,Maulik Chevli,Juraj Vladika,Florian Matthes
Keywords: privatization using Differential, Differential Privacy, Differential, text, language models
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 2 figures, 8 tables. Accepted to ACL 2024 (Findings)

Abstract:The task of text privatization using Differential Privacy has recently taken the form of \textittext rewriting , in which an input text is obfuscated via the use of generative (large) language models. While these methods have shown promising results in the ability to preserve privacy, these methods rely on autoregressive models which lack a mechanism to contextualize the private rewriting process. In response to this, we propose \textbfDP-MLM , a new method for differentially private text rewriting based on leveraging masked language models (MLMs) to rewrite text in a semantically similar \textitand obfuscated manner. We accomplish this with a simple contextualization technique, whereby we rewrite a text one token at a time. We find that utilizing encoder-only MLMs provides better utility preservation at lower \varepsilon levels, as compared to previous methods relying on larger models with a decoder. In addition, MLMs allow for greater customization of the rewriting mechanism, as opposed to generative approaches. We make the code for \textbfDP-MLM public and reusable, found at this https URL .
摘要:使用差分隐私的文本私有化任务最近采用了文本重写的形式,即通过使用生成式(大)语言模型来混淆输入文本。虽然这些方法在保护隐私方面显示出有希望的结果,但它们依赖于自回归模型,而这类模型缺乏将私有重写过程上下文化的机制。针对这一问题,我们提出了DP-MLM,一种新的差分隐私文本重写方法,它利用掩码语言模型(MLM)以语义相似且经过混淆的方式重写文本。我们使用一种简单的上下文化技术来实现这一点,即一次重写文本的一个词元。我们发现,与以前依赖于带解码器的较大模型的方法相比,使用仅编码器的MLM在较低的ε水平上提供了更好的效用保持。此外,与生成式方法相比,MLM允许对重写机制进行更大程度的定制。我们将DP-MLM的代码公开并可复用,见此https URL。
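
The abstract above does not spell out how a token is chosen privately. A common way to make per-token selection differentially private is the exponential mechanism, realised as temperature sampling over clipped logits. The sketch below illustrates that general idea only. The function name `dp_sample_token`, the clipping bounds, and the toy logits are assumptions for illustration, not the authors' released code.

```python
import math
import random


def dp_sample_token(logits, epsilon, clip_min, clip_max, rng=None):
    """Sample one token index with the exponential mechanism.

    Logits are clipped to [clip_min, clip_max], so the utility score has
    bounded sensitivity (clip_max - clip_min). Sampling proportionally to
    exp(epsilon * u / (2 * sensitivity)) is the standard exponential mechanism.
    All names and bounds here are hypothetical.
    """
    rng = rng or random.Random()
    sensitivity = clip_max - clip_min
    clipped = [min(max(x, clip_min), clip_max) for x in logits]
    scaled = [epsilon * x / (2.0 * sensitivity) for x in clipped]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


if __name__ == "__main__":
    # Toy logits standing in for an MLM's scores at one masked position.
    index, probs = dp_sample_token([5.0, 1.0, -3.0], epsilon=2.0,
                                   clip_min=-2.0, clip_max=2.0,
                                   rng=random.Random(0))
    print(index, [round(p, 3) for p in probs])
```

A smaller `epsilon` flattens the distribution toward uniform (more privacy, less utility), and a larger one concentrates it on the highest-scoring token.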

[NLP-109] CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations
[NLP-109] CAMON:具有基于LLM的对话的多对象导航合作代理

链接: https://arxiv.org/abs/2407.00632
作者: Pengying Wu,Yao Mu,Kangjie Zhou,Ji Ma,Junting Chen,Chang Liu
关键词: Visual navigation tasks, Visual navigation, household service robots, Visual, service robots
中文关键词: 视觉导航任务,视觉导航,家庭服务机器人,视觉,服务机器人
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: Accepted to the RSS 2024 Workshop: GROUND

Abstract:Visual navigation tasks are critical for household service robots. As these tasks become increasingly complex, effective communication and collaboration among multiple robots become imperative to ensure successful completion. In recent years, large language models (LLMs) have exhibited remarkable comprehension and planning abilities in the context of embodied agents. However, their application in household scenarios, specifically in the use of multiple agents collaborating to complete complex navigation tasks through communication, remains unexplored. Therefore, this paper proposes a framework for decentralized multi-agent navigation, leveraging LLM-enabled communication and collaboration. By designing the communication-triggered dynamic leadership organization structure, we achieve faster team consensus with fewer communication instances, leading to better navigation effectiveness and collaborative exploration efficiency. With the proposed novel communication scheme, our framework promises to be conflict-free and robust in multi-object navigation tasks, even when there is a surge in team size.
摘要:视觉导航任务是家用服务机器人的关键任务。随着这些任务变得越来越复杂,多个机器人之间的有效沟通和协作变得至关重要,以确保成功完成。近年来,大型语言模型(LLM)在具身智能体的背景下表现出了显著的理解和规划能力。然而,它们在家庭场景中的应用,特别是在使用多个代理协作通过通信完成复杂的导航任务方面,仍未得到探索。因此,本文提出了一种分布式多智能体导航的框架,利用LLM支持的通信和协作。通过设计沟通触发的动态领导组织结构,以较少的沟通实例实现更快的团队共识,从而获得更好的导航效果和协同探索效率。通过提出的新的通信方案,我们的框架在多目标导航任务中保证了无冲突和健壮性,即使在团队规模激增的情况下也是如此。

[NLP-110] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
[NLP-110] 迭代纳什策略优化:通过无悔学习将LLM与一般偏好对齐

链接: https://arxiv.org/abs/2407.00617
作者: Yuheng Zhang,Dian Yu,Baolin Peng,Linfeng Song,Ye Tian,Mingyue Huo,Nan Jiang,Haitao Mi,Dong Yu
关键词: achieved great success, aligning large language, large language models, Human Feedback, Reinforcement Learning
中文关键词: 取得了巨大成功,调整大型语言、大型语言模型、人类反馈、强化学习
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注:

Abstract:Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
摘要:基于人类反馈的强化学习(RLHF)在使大型语言模型(LLM)与人类偏好对齐方面取得了巨大的成功。流行的RLHF方法是基于奖励的,遵循Bradley-Terry(BT)模型假设,该假设可能不能完全捕捉到人类偏好的复杂性。在本文中,我们在一般偏好框架下探讨了RLHF,并从博弈论的角度对其进行了研究。具体地说,我们将该问题描述为一个两人博弈问题,并提出了一种新的算法:迭代纳什策略优化(INPO)。关键的想法是让策略通过无悔学习来与自己对抗,从而逼近纳什策略。与以前的方法不同,INPO避免了估计单个回复的预期胜率的需要,而这通常会导致较高的计算或标注成本。相反,我们引入了一个新的损失目标,该目标可直接在偏好数据集上最小化。我们为我们的方法提供了理论分析,并通过在各种有代表性的基准上的实验证明了它的有效性。使用基于LLaMA-3-8B的SFT模型,INPO在AlpacaEval 2.0上实现了41.5%的长度控制胜率,在Arena-Hard上实现了38.3%的胜率,比BT模型假设下最先进的迭代算法[Dong et al., 2024]有了实质性的改进。此外,我们的消融研究强调了引入KL正则化对回复长度控制的好处。
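
The key idea named in the abstract, letting a policy play against itself with a no-regret learner so that it approaches the Nash policy, can be seen on a toy problem. The sketch below is not the INPO loss, which the abstract does not give. It runs the classic Hedge (multiplicative weights) update on a small, made-up preference matrix where `pref[i][j]` is the probability that response `i` is preferred over response `j`.

```python
import math


def self_play_hedge(pref, steps=4000, eta=0.05):
    """Self-play with the Hedge update on a preference matrix.

    Returns the time-averaged policy, which approximates the Nash policy of
    the symmetric two-player preference game. The matrix, step count, and
    learning rate are illustrative assumptions.
    """
    n = len(pref)
    policy = [1.0 / n] * n
    average = [0.0] * n
    for _ in range(steps):
        for i in range(n):
            average[i] += policy[i] / steps
        # Expected win rate of each response against the current policy.
        win = [sum(pref[i][j] * policy[j] for j in range(n)) for i in range(n)]
        weights = [policy[i] * math.exp(eta * win[i]) for i in range(n)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return average


def exploitability(pref, policy):
    """How much the best single response beats `policy` beyond a 50% win rate."""
    n = len(pref)
    best = max(sum(pref[i][j] * policy[j] for j in range(n)) for i in range(n))
    return best - 0.5


if __name__ == "__main__":
    # Cyclic, non-transitive preferences that no single reward score can express.
    pref = [[0.5, 0.7, 0.2],
            [0.3, 0.5, 0.9],
            [0.8, 0.1, 0.5]]
    nash = self_play_hedge(pref)
    print([round(p, 3) for p in nash], round(exploitability(pref, nash), 4))
```

Because the preferences here are cyclic, a Bradley-Terry reward cannot rank the three responses consistently, while the averaged self-play policy is still well defined and nearly unexploitable.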

[NLP-111] Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace
[NLP-111] 利用文本子空间高效个性化文本到图像生成

链接: https://arxiv.org/abs/2407.00608
作者: Shian Du,Xiaotian Cheng,Qi Qian,Henglu Wei,Yi Xu,Xiangyang Ji
关键词: attracted unprecedented attention, generating highly-personalized images, input concept dataset, textual prompt, input textual prompt
中文关键词: 引起前所未有的关注,生成高度个性化的图像、输入概念数据集、文本提示、输入文本提示
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Personalized text-to-image generation has attracted unprecedented attention in recent years due to its unique capability of generating highly personalized images from an input concept dataset and a novel textual prompt. However, previous methods focus solely on the performance of the reconstruction task, degrading their ability to combine with different textual prompts. Besides, optimizing in the high-dimensional embedding space usually leads to an unnecessarily time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding can not only faithfully reconstruct the input image, but also significantly improve its alignment with a novel input textual prompt. Furthermore, we observe that optimizing in the textual subspace leads to a significant improvement in robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.
摘要:近年来,个性化文本到图像的生成受到了前所未有的关注,因为它能够利用输入的概念数据集和新颖的文本提示来生成高度个性化的图像。然而,以往的方法只关注重建任务的性能,降低了其与不同文本提示相结合的能力。此外,在高维嵌入空间中进行优化通常会导致训练过程不必要地耗时且收敛缓慢。为了解决这些问题,我们从自表达特性中得到启发,提出了一种在文本子空间中探索目标嵌入的高效方法。此外,我们还提出了一种高效的选择策略来确定文本子空间的基向量。实验结果表明,学习到的嵌入不仅能忠实地重建输入图像,而且能显著提高其与新输入文本提示的对齐程度。此外,我们观察到在文本子空间中进行优化可以显著提高对初始词的鲁棒性,放松了要求用户输入最相关初始词的限制。我们的方法为个性化文本到图像的生成打开了更高效的表示学习的大门。

[NLP-112] MasonTigers at SemEval-2024 Task 10: Emotion Discovery and Flip Reasoning in Conversation with Ensemble of Transformers and Prompting
[NLP-112] MasonTigers在SemEval-2024任务10:利用Transformer集成与提示进行对话中的情感发现和翻转推理

链接: https://arxiv.org/abs/2407.00581
作者: Al Nahian Bin Emran,Amrita Ganguly,Sadiya Sayara Chowdhury Puspo,Nishat Raihan,Dhiman Goswami
关键词: Hindi-English code-mixed dialogues, present MasonTigers’ participation, code-mixed dialogues, Hindi-English code-mixed, emotion flip reasoning
中文关键词: 印度语-英语代码混合对话,呈现MasonTigers的参与,代码混合对话,印度语-英语代码混合,情感翻转推理
类目: Computation and Language (cs.CL)
备注:

Abstract:In this paper, we present MasonTigers’ participation in SemEval-2024 Task 10, a shared task aimed at identifying emotions and understanding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for Hindi-English code-mixed dialogues, emotion flip reasoning for Hindi-English code-mixed dialogues, and emotion flip reasoning for English dialogues. Our team, MasonTigers, contributed to each subtask, focusing on developing methods for accurate emotion recognition and reasoning. By leveraging our approaches, we attained impressive F1-scores of 0.78 for the first task and 0.79 for both the second and third tasks. This performance not only underscores the effectiveness of our methods across different aspects of the task but also secured us the top rank in the first and third subtasks, and the 2nd rank in the second subtask. Through extensive experimentation and analysis, we provide insights into our system’s performance and contributions to each subtask.
摘要:在本文中,我们介绍了MasonTigers参与SemEval-2024任务10的情况,这是一项共享任务,旨在识别情绪并理解其在单语英语和印地语-英语语码混合对话中发生翻转的原因。这项任务包括三个不同的子任务:印地语-英语语码混合对话的情感识别、印地语-英语语码混合对话的情感翻转推理和英语对话的情感翻转推理。我们的团队MasonTigers参与了每个子任务,专注于开发准确的情感识别和推理方法。利用我们的方法,我们在第一项子任务上取得了0.78的F1分数,在第二项和第三项子任务上均取得了0.79的F1分数。这一表现不仅突出了我们的方法在任务不同方面的有效性,而且使我们在第一和第三子任务中排名第一,在第二子任务中排名第二。通过广泛的实验和分析,我们对系统的性能及其在每个子任务上的贡献提供了见解。

[NLP-113] Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
[NLP-113] 调查和缓解大型视觉语言模型中的多模式幻觉滚雪球

链接: https://arxiv.org/abs/2407.00569
作者: Weihong Zhong,Xiaocheng Feng,Liang Zhao,Qiming Li,Lei Huang,Yuxuan Gu,Weitao Ma,Yuan Xu,Bing Qin
关键词: Large Vision-Language Models, Large Vision-Language, understanding visual information, human languages, generated hallucinations
中文关键词: 大型视觉语言模型,大型视觉语言,理解视觉信息,人类语言,产生的幻觉
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACL 2024 Main Conference. 21 pages, 20 figures

Abstract:Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs’ subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs’ behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31%, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than 24% of the snowballed multimodal hallucination while maintaining capabilities.
摘要:尽管大型视觉语言模型在理解人类语言的视觉信息方面取得了很大进展,但它仍然存在多通道幻觉。一个自然的担忧是,在多模式互动过程中,产生的幻觉可能会影响LVLM的下一代。因此,我们提出了一个问题:当出现与先前产生的幻觉相关的问题时,即使地面视觉信息存在,LVLMS是否会被误导并做出错误的反应?为了回答这个问题,我们提出了一个名为MMHalSnowball的框架来评估LVLMS在遇到产生的幻觉时的行为,其中LVLMS被要求在经过策划的幻觉对话中回答特定的视觉问题。重要的是,我们的实验表明,开源的LVLMS的性能至少下降了31%,这表明LVLMS容易接受产生的幻觉,并做出错误的声明,如果没有分心的话,他们就不会支持。我们称这种现象为多模式幻觉滚雪球。为了缓解这一问题,我们进一步提出了一种称为残差视觉解码的无需训练的方法,其中我们用从残差视觉输入得到的输出分布来修正LVLM的输出分布,为模型提供了直接访问视觉信息的途径。实验表明,我们的方法可以在保持能力的情况下缓解滚雪球般的多通道幻觉超过24%。

[NLP-114] Answering real-world clinical questions using large language model based systems
[NLP-114] 使用基于大型语言模型的系统回答现实世界的临床问题

链接: https://arxiv.org/abs/2407.00541
作者: Yen Sia Low(1),Michael L. Jackson(1),Rebecca J. Hyde(1),Robert E. Brown(1),Neil M. Sanghavi(1),Julian D. Baldwin(1),C. William Pike(1),Jananee Muralidharan(1),Gavin Hui(1 and 2),Natasha Alexander(3),Hadeel Hassan(3),Rahul V. Nene(4),Morgan Pike(5),Courtney J. Pokrzywa(6),Shivam Vedak(7),Adam Paul Yan(3),Dong-han Yao(7),Amy R. Zipursky(3),Christina Dinh(1),Philip Ballentine(1),Dan C. Derieg(1),Vladimir Polony(1),Rehan N. Chawdry(1),Jordan Davies(1),Brigham B. Hyde(1),Nigam H. Shah(1 and 7),Saurabh Gombar(1 and 8) ((1) Atropos Health, New York NY, USA, (2) Department of Medicine, University of California, Los Angeles CA, USA, (3) Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada, (4) Department of Emergency Medicine, University of California, San Diego CA, USA, (5) Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA, (6) Department of Surgery, Columbia University, New York NY, USA, (7) Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA (8) Department of Pathology, Stanford University, Stanford CA, USA)
关键词: guide healthcare decisions, contextualizing existing research, guide healthcare, healthcare decisions, difficulty in contextualizing
中文关键词: 指导医疗保健决策,结合现有研究的背景,指导医疗保健,医疗保健决策,背景困难
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 28 pages (2 figures, 3 tables) inclusive of 8 pages of supplemental materials (4 supplemental figures and 4 supplemental tables)

Abstract:Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.
摘要:指导医疗决策的证据往往受限于缺乏相关和可信的文献,以及难以将现有研究与特定患者联系起来。大型语言模型(LLM)可以通过总结已发表的文献或基于真实世界数据(RWD)生成新的研究来潜在地解决这两个挑战。我们评估了5个基于LLM的系统回答50个临床问题的能力,并让9名独立医生审查了回答的相关性、可靠性和可操作性。目前,通用的LLMS(ChatGPT-4、Claude 3 Opus、Gemini Pro 1.5)很少产生被认为相关和基于证据的答案(2%-10%)。相比之下,基于检索增强生成(RAG)的智能LLM系统为24%(OpenEvidence)到58%(ChatRWD)的问题提供了相关和基于证据的答案。与其他LLM相比,只有智能型ChatRWD能够回答新问题(65%比0-9%)。这些结果表明,尽管通用的LLMS不应按原样使用,但基于RAG的专门构建的证据摘要系统和用于协同工作的新证据生成系统将提高患者护理相关证据的可用性。

[NLP-115] ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees
[NLP-115] ConU:具有正确覆盖保证的大型语言模型中的保形不确定性

链接: https://arxiv.org/abs/2407.00499
作者: Zhiyuan Wang,Jinhao Duan,Lu Cheng,Yue Zhang,Qingni Wang,Hengtao Shen,Xiaofeng Zhu,Xiaoshuang Shi,Kaidi Xu
关键词: natural language generation, recent large language, large language models, open-ended NLG tasks, language generation
中文关键词: 自然语言生成、最近的大型语言、大型语言模型、开放式NLG任务、语言生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 9 figures, 6 tables

Abstract:Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of the recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model’s unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains, while the small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
摘要:自然语言生成(NLG)任务中的不确定性量化(UQ)仍然是一个开放的挑战,最近的大型语言模型(LLM)的复杂性质加剧了这一挑战。通过构造预测集,将任何不确定性的启发式度量转化为严格的理论保证,研究了在开放式NLG任务中对黑箱LLMS的适形预测。我们提出了一种利用自洽的基于抽样的不确定性度量方法,并将与正确性一致的不确定性条件融入到CP算法的设计中,提出了一种共形不确定性准则。实验结果表明,我们的不确定性度量总体上超过了现有的方法。此外,我们在模型的非固定答案分布范围内对预测集进行了校准,并在4个自由形式的NLG数据集上实现了对6个LLM的正确覆盖率的严格控制,而小的平均集大小进一步突出了该方法在为实际的开放式NLG应用提供可信保证方面的有效性。
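
For readers unfamiliar with conformal prediction, the calibration step the abstract relies on can be sketched in a few lines. This is the generic split-conformal recipe, not the ConU criterion itself. The uncertainty scores, the answer strings, and the function names are hypothetical, and the score is assumed to be lower for more trustworthy answers.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of the calibration nonconformity scores.

    Each calibration score is assumed to be the uncertainty assigned to a
    known-correct answer. With n scores, the ceil((n + 1) * (1 - alpha))-th
    smallest score gives at least 1 - alpha coverage under exchangeability.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every sampled answer whose uncertainty does not exceed the threshold."""
    return [answer for answer, score in candidate_scores.items() if score <= threshold]


if __name__ == "__main__":
    calibration = [i / 10 for i in range(1, 11)]  # toy scores 0.1 ... 1.0
    q = conformal_threshold(calibration, alpha=0.2)
    print(q, prediction_set({"Paris": 0.3, "Lyon": 0.95}, q))
```

The smaller the resulting sets are at a fixed coverage level, the more informative the underlying uncertainty measure is, which is the efficiency property the abstract highlights.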

[NLP-116] LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement
[NLP-116] LLMs-as-Instructors(以LLM为教师):从错误中学习,实现模型改进自动化

链接: https://arxiv.org/abs/2407.00497
作者: Jiahao Ying,Mingbao Lin,Yixin Cao,Wei Tang,Bo Wang,Qianru Sun,Xuanjing Huang,Shuicheng Yan
关键词: Large Language Models, advanced Large Language, smaller target models, advanced Large, Large Language
中文关键词: 大型语言模型、高级大型语言、较小的目标模型、高级大型、大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper introduces the innovative “LLMs-as-Instructors” framework, which leverages the advanced Large Language Models (LLMs) to autonomously enhance the training of smaller target models. Inspired by the theory of “Learning from Errors”, this framework employs an instructor LLM to meticulously analyze the specific errors within a target model, facilitating targeted and efficient training cycles. Within this framework, we implement two strategies: “Learning from Error,” which focuses solely on incorrect responses to tailor training data, and “Learning from Error by Contrast”, which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors. Our empirical studies, conducted with several open-source models, demonstrate significant improvements across multiple benchmarks, including mathematical reasoning, coding abilities, and factual knowledge. Notably, the refined Llama-3-8b-Instruction has outperformed ChatGPT, illustrating the effectiveness of our approach. By leveraging the strengths of both strategies, we have attained a more balanced performance improvement on both in-domain and out-of-domain benchmarks. Our code can be found at this https URL.
摘要:本文介绍了创新的“LLMs-as-Instructors”框架,它利用先进的大语言模型(LLM)自主地增强对较小目标模型的训练。在“从错误中学习”理论的启发下,该框架使用一个教师LLM来仔细分析目标模型中的具体错误,从而促进有针对性且高效的训练周期。在这个框架内,我们实施了两种策略:“从错误中学习”,它只关注不正确的回答,以定制训练数据;以及“通过对比从错误中学习”,它使用对比学习来分析正确和不正确的回答,以更深入地理解错误。我们用几个开源模型进行的实证研究表明,在多个基准测试中,包括数学推理、编码能力和事实知识在内,都有显著的改进。值得注意的是,改进后的Llama-3-8b-Instruction的性能优于ChatGPT,说明了我们方法的有效性。通过利用这两种策略的优势,我们在域内和域外基准测试中实现了更均衡的性能改进。我们的代码可以在此https URL中找到。

[NLP-117] PFME: A Modular Approach for Fine-grained Hallucination Detection and Editing of Large Language Models
[NLP-117] PFME:一种用于细粒度幻觉检测和编辑大型语言模型的模块化方法

链接: https://arxiv.org/abs/2407.00488
作者: Kunquan Deng,Zeyu Huang,Chen Li,Chenghua Lin,Min Gao,Wenge Rong
关键词: Large Language Models, Large Language, producing inaccurate content, risk producing inaccurate, Fine-grained Hallucination Detection
中文关键词: 大型语言模型、大型语言、产生不准确内容、产生不准确、细粒度幻觉检测的风险
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large Language Models (LLMs) excel in fluency but risk producing inaccurate content, called “hallucinations.” This paper outlines a standardized process for categorizing fine-grained hallucination types and proposes an innovative framework–the Progressive Fine-grained Model Editor (PFME)–specifically designed to detect and correct fine-grained hallucinations in LLMs. PFME consists of two collaborative modules: the Real-time Fact Retrieval Module and the Fine-grained Hallucination Detection and Editing Module. The former identifies key entities in the document and retrieves the latest factual evidence from credible sources. The latter further segments the document into sentence-level text and, based on relevant evidence and previously edited context, identifies, locates, and edits each sentence’s hallucination type. Experimental results on FavaBench and FActScore demonstrate that PFME outperforms existing methods in fine-grained hallucination detection tasks. Particularly, when using the Llama3-8B-Instruct model, PFME’s performance in fine-grained hallucination detection with external knowledge assistance improves by 8.7 percentage points (pp) compared to ChatGPT. In editing tasks, PFME further enhances the FActScore of FActScore-Alpaca13B and FActScore-ChatGPT datasets, increasing by 16.2pp and 4.6pp, respectively.
摘要:大型语言模型(LLM)在流利性方面表现出色,但存在产生不准确内容的风险,这种情况被称为“幻觉”。本文概述了对细粒度幻觉类型进行分类的标准化过程,并提出了一个创新的框架,即渐进式细粒度模型编辑器(PFME),专门设计用于检测和纠正LLM中的细粒度幻觉。PFME由两个协作模块组成:实时事实检索模块和细粒度幻觉检测和编辑模块。前者确定文件中的关键实体,并从可靠的来源检索最新的事实证据。后者进一步将文档分割成句子级别的文本,并基于相关证据和先前编辑的上下文,识别、定位和编辑每个句子的幻觉类型。在FavaBench和FActScore上的实验结果表明,PFME在细粒度幻觉检测任务中的性能优于现有方法。特别是,当使用Llama3-8B-Instruct模型时,与ChatGPT相比,PFME在外部知识辅助下的细粒度幻觉检测性能提高了8.7个百分点(pp)。在编辑任务中,PFME进一步提高了FActScore-Alpaca13B和FActScore-ChatGPT数据集的FActScore,分别增加了16.2pp和4.6pp。

[NLP-118] Its Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization
[NLP-118] 其变形时间:通过多目标优化释放多个LLM的潜力

链接: https://arxiv.org/abs/2407.00487
作者: Bingdong Li,Zixiang Di,Yanting Yang,Hong Qian,Peng Yang,Hao Hao,Ke Tang,Aimin Zhou
关键词: large language model, model merging, large language, merging, multi-objective optimization algorithms
中文关键词: 大语言模型,模型合并,大语言,合并,多目标优化算法
类目: Computation and Language (cs.CL)
备注:

Abstract:In this paper, we introduce a novel approach for large language model merging via black-box multi-objective optimization algorithms. The goal of model merging is to combine multiple models, each excelling in different tasks, into a single model that outperforms any of the individual source models. However, model merging faces two significant challenges: First, existing methods rely heavily on human intuition and customized strategies. Second, parameter conflicts often arise during merging, and while methods like DARE [1] can alleviate this issue, they tend to stochastically drop parameters, risking the loss of important delta parameters. To address these challenges, we propose the MM-MO method, which automates the search for optimal merging configurations using multi-objective optimization algorithms, eliminating the need for human intuition. During the configuration searching process, we use estimated performance across multiple diverse tasks as optimization objectives in order to alleviate the parameter conflicting between different source models without losing crucial delta parameters. We conducted comparative experiments with other mainstream model merging methods, demonstrating that our method consistently outperforms them. Moreover, our experiments reveal that even task types not explicitly targeted as optimization objectives show performance improvements, indicating that our method enhances the overall potential of the model rather than merely overfitting to specific task types. This approach provides a significant advancement in model merging techniques, offering a robust and plug-and-play solution for integrating diverse models into a unified, high-performing model.
摘要:提出了一种新的基于黑盒多目标优化算法的大型语言模型融合方法。模型合并的目标是将多个模型(每个模型都在不同的任务中表现出色)组合成一个表现优于任何单个源模型的模型。然而,模型融合面临着两个重大挑战:第一,现有方法严重依赖于人类的直觉和定制的策略。其次,在合并过程中经常会出现参数冲突,虽然像DARE[1]这样的方法可以缓解这个问题,但它们往往会随机丢弃参数,从而冒着丢失重要增量参数的风险。为了应对这些挑战,我们提出了MM-MO方法,该方法使用多目标优化算法自动搜索最优合并配置,消除了对人类直觉的需要。在配置搜索过程中,我们以多个不同任务的估计性能作为优化目标,以缓解不同源模型之间的参数冲突,而不丢失关键的Delta参数。我们与其他主流的模型融合方法进行了对比实验,证明了我们的方法始终优于它们。此外,我们的实验表明,即使没有明确作为优化目标的任务类型也表现出了性能改进,这表明我们的方法增强了模型的整体潜力,而不仅仅是过度适应特定的任务类型。这种方法大大改进了模型合并技术,为将不同的模型集成到统一的、高性能的模型中提供了一个健壮的、即插即用的解决方案。
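
The abstract refers to DARE-style random dropping of delta parameters and to searching over merging configurations. A minimal sketch of the drop-and-rescale step followed by a weighted merge is given below, on plain Python lists rather than real model tensors. The function names and the idea that a "configuration" is a drop rate plus per-model weights are assumptions for illustration; the multi-objective search itself is not shown.

```python
import random


def dare_delta(base, finetuned, drop_rate, rng):
    """Randomly drop delta parameters and rescale the survivors.

    Each delta (finetuned - base) is kept with probability 1 - drop_rate and
    divided by that probability, so the expected delta is unchanged.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep = 1.0 - drop_rate
    delta = []
    for b, f in zip(base, finetuned):
        d = f - b
        delta.append(d / keep if rng.random() < keep else 0.0)
    return delta


def merge(base, deltas, weights):
    """Add weighted deltas from several source models onto the base parameters."""
    merged = list(base)
    for delta, weight in zip(deltas, weights):
        for i, d in enumerate(delta):
            merged[i] += weight * d
    return merged


if __name__ == "__main__":
    rng = random.Random(0)
    base = [1.0, 2.0, 3.0]
    math_model = [1.5, 2.0, 2.0]   # toy "fine-tuned" parameters
    code_model = [1.0, 2.5, 3.5]
    deltas = [dare_delta(base, m, 0.5, rng) for m in (math_model, code_model)]
    print(merge(base, deltas, weights=[0.6, 0.4]))
```

A search procedure like the one the abstract describes would treat `drop_rate` and `weights` as decision variables and score each merged model on several tasks at once.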

[NLP-119] Towards Massive Multilingual Holistic Bias
[NLP-119] 迈向大规模多语言Holistic Bias

链接: https://arxiv.org/abs/2407.00486
作者: Xiaoqing Ellen Tan,Prangthip Hansanti,Carleigh Wood,Bokai Yu,Christophe Ropers,Marta R. Costa-jussà
关键词: mitigate demographic biases, MASSIVE MULTILINGUAL HOLISTICBIAS, automatic language generation, current landscape, biases as existing
中文关键词: 减轻人口统计偏见、MASSIVE MULTILINGUAL HOLISTICBIAS(大规模多语言整体偏见)、自动语言生成、当前格局、现有偏见
类目: Computation and Language (cs.CL)
备注:

Abstract:In the current landscape of automatic language generation, there is a need to understand, evaluate, and mitigate demographic biases as existing models are becoming increasingly multilingual. To address this, we present the initial eight languages from the MASSIVE MULTILINGUAL HOLISTICBIAS (MMHB) dataset and benchmark consisting of approximately 6 million sentences representing 13 demographic axes. We propose an automatic construction methodology to further scale up MMHB sentences in terms of both language coverage and size, leveraging limited human annotation. Our approach utilizes placeholders in multilingual sentence construction and employs a systematic method to independently translate sentence patterns, nouns, and descriptors. Combined with human translation, this technique carefully designs placeholders to dynamically generate multiple sentence variations and significantly reduces the human translation workload. The translation process has been meticulously conducted to avoid an English-centric perspective and include all necessary morphological variations for languages that require them, improving from the original English HOLISTICBIAS. Finally, we utilize MMHB to report results on gender bias and added toxicity in machine translation tasks. On the gender analysis, MMHB unveils: (1) a lack of gender robustness showing almost +4 chrf points in average for masculine semantic sentences compared to feminine ones and (2) a preference to overgeneralize to masculine forms by reporting more than +12 chrf points in average when evaluating with masculine compared to feminine references. MMHB triggers added toxicity up to 2.3%.
摘要:在当前的自动语言生成环境中,随着现有模型越来越多地使用多种语言,需要理解、评估和缓解人口统计偏差。为了解决这一问题,我们提供了大规模多语言HOLISTICBIAS(MMHB)数据集和基准中的最初八种语言,该数据集和基准由代表13个人口轴的大约600万个句子组成。我们提出了一种自动构建方法,利用有限的人工标注,在语言覆盖和大小方面进一步扩大MMHB语句的规模。我们的方法在多语言句子结构中使用占位符,并使用系统的方法来独立翻译句型、名词和描述符。与人工翻译相结合,该技术精心设计占位符,动态生成多个句子变体,显著减少了人工翻译的工作量。翻译过程一丝不苟地进行,以避免以英语为中心的观点,并包括需要它们的语言的所有必要的形态变体,比原始的英语HOLISTICBIAS有所改进。最后,我们利用MMHB来报告机器翻译任务中的性别偏见和添加毒性的结果。在性别分析方面,MMHB揭示了:(1)缺乏性别稳健性,与女性相比,男性语义句子平均+4个chrf分;(2)当评估男性与女性参照时,平均超过+12个chrf分,倾向于过度泛化为男性形式。MMHB会引发高达2.3%的额外毒性。
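
The placeholder-based construction described above can be pictured as filling sentence patterns with independently translated nouns and descriptors. The sketch below shows only that combinatorial step in English. The patterns, nouns, and descriptors are made-up examples, and the morphological agreement that the abstract says is handled for languages that need it is deliberately left out.

```python
from itertools import product


def expand(patterns, nouns, descriptors):
    """Fill every pattern with every (descriptor, noun) pair.

    Patterns use the hypothetical placeholders {descriptor} and {noun}.
    """
    return [
        pattern.format(descriptor=descriptor, noun=noun)
        for pattern, descriptor, noun in product(patterns, descriptors, nouns)
    ]


if __name__ == "__main__":
    sentences = expand(
        patterns=["I am {descriptor} {noun}.", "I love being {descriptor} {noun}."],
        nouns=["parent", "student"],
        descriptors=["a young", "a retired"],
    )
    for sentence in sentences:
        print(sentence)
```

Translating 2 patterns, 2 nouns, and 2 descriptors yields 8 sentences here, which is why translating the components separately reduces the human translation workload compared with translating every sentence.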

[NLP-120] Large Language Models for Power Scheduling: A User-Centric Approach
[NLP-120] 电力调度的大型语言模型:以用户为中心的方法

链接: https://arxiv.org/abs/2407.00476
作者: Thomas Mongaillard,Samson Lasaulce,Othman Hicheur,Chao Zhang,Lina Bariah,Vineeth S. Varma,Hang Zou,Qiyang Zhao,Merouane Debbah
关键词: predefined system requirements, meet fixed, personalized services, aiming to achieve, achieve high
中文关键词: 预定义的系统要求,满足固定的、个性化的服务,旨在实现,实现高
类目: Computation and Language (cs.CL); Systems and Control (eess.SY)
备注:

Abstract:While traditional optimization and scheduling schemes are designed to meet fixed, predefined system requirements, future systems are moving toward user-driven approaches and personalized services, aiming to achieve high quality-of-experience (QoE) and flexibility. This challenge is particularly pronounced in wireless and digitalized energy networks, where users’ requirements have largely not been taken into consideration due to the lack of a common language between users and machines. The emergence of powerful large language models (LLMs) marks a radical departure from traditional system-centric methods into more advanced user-centric approaches by providing a natural communication interface between users and devices. In this paper, for the first time, we introduce a novel architecture for resource scheduling problems by constructing three LLM agents to convert an arbitrary user’s voice request (VRQ) into a resource allocation vector. Specifically, we design an LLM intent recognition agent to translate the request into an optimization problem (OP), an LLM OP parameter identification agent, and an LLM OP solving agent. To evaluate system performance, we construct a database of typical VRQs in the context of electric vehicle (EV) charging. As a proof of concept, we primarily use Llama 3 8B. Through testing with different prompt engineering scenarios, the obtained results demonstrate the efficiency of the proposed architecture. The conducted performance analysis allows key insights to be extracted. For instance, having a larger set of candidate OPs to model the real-world problem might degrade the final performance because of a higher recognition/OP classification noise level. All results and codes are open source.
摘要:传统的优化和调度方案是为满足固定的、预先定义的系统需求而设计的,而未来的系统正朝着用户驱动的方法和个性化服务的方向发展,旨在实现高体验质量(QoE)和灵活性。这一挑战在无线和数字化能源网络中尤其明显,由于用户和机器之间缺乏共同语言,用户的需求在很大程度上没有得到考虑。强大的大型语言模型(LLM)的出现标志着从传统的以系统为中心的方法转变为更先进的以用户为中心的方法,它提供了用户和设备之间的自然通信接口。本文首次提出了一种解决资源调度问题的新体系结构,通过构造三个LLM代理来将任意用户的语音请求(VRQ)转换为资源分配向量。具体地说,我们设计了一个LLM意图识别代理来将请求转化为一个优化问题(OP),一个LLM OP参数识别代理和一个LLM OP求解代理。为了评估系统的性能,我们构建了一个电动汽车(EV)充电环境下的典型VRQ数据库。作为概念验证,我们主要使用Llama 3 8B。通过对不同的提示工程场景进行测试,得到的结果验证了该体系结构的有效性。所进行的性能分析可以提取关键的见解。例如,由于较高的识别/OP分类噪声级别,使用较大的候选OP集合对真实世界问题进行建模可能会降低最终性能。所有结果和代码都是开源的。

[NLP-121] Classifier identification in Ancient Egyptian as a low-resource sequence-labelling task
[NLP-121] 古埃及中的分类器识别是一项低资源序列标签任务

链接: https://arxiv.org/abs/2407.00475
作者: Dmitry Nikolaev,Jorke Grotenhuis,Haleli Harel,Orly Goldwasser
关键词: complex Ancient Egyptian, hieroglyphic signs clarifying, Ancient Egyptian, writing system, hieroglyphic signs
中文关键词: 复杂的古埃及,象形文字符号澄清,古埃及,书写系统,象形文字符号
类目: Computation and Language (cs.CL)
备注: Accepted to ML4AL 2024 (First Machine Learning for Ancient Languages Workshop)

Abstract:The complex Ancient Egyptian (AE) writing system was characterised by widespread use of graphemic classifiers (determinatives): silent (unpronounced) hieroglyphic signs clarifying the meaning or indicating the pronunciation of the host word. The study of classifiers has intensified in recent years with the launch and quick growth of the iClassifier project, a web-based platform for annotation and analysis of classifiers in ancient and modern languages. Thanks to the data contributed by the project participants, it is now possible to formulate the identification of classifiers in AE texts as an NLP task. In this paper, we make first steps towards solving this task by implementing a series of sequence-labelling neural models, which achieve promising performance despite the modest amount of training data. We discuss tokenisation and operationalisation issues arising from tackling AE texts and contrast our approach with frequency-based baselines.
摘要:复杂的古埃及(AE)书写系统的特点是广泛使用字形分类符(限定词):无声(不发音)的象形文字符号,澄清主词的含义或指示发音。近年来,随着iClassifier项目的推出和快速发展,对分类器的研究得到了加强,iClassifier项目是一个用于注释和分析古代和现代语言分类器的基于网络的平台。得益于项目参与者提供的数据,现在可以将AE文本中分类器的识别制定为NLP任务。在本文中,我们通过实施一系列序列标记神经模型来解决这一任务迈出了第一步,尽管训练数据量有限,这些模型仍取得了令人鼓舞的性能。我们讨论了处理AE文本中出现的标记化和操作化问题,并将我们的方法与基于频率的基线进行了比较。
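
The abstract contrasts neural sequence labellers with frequency-based baselines. One plausible form of such a baseline, assumed here rather than taken from the paper, tags each sign with the label it carried most often in training. The sign codes and the tags `CL` (classifier) and `O` (other) below are toy values.

```python
from collections import Counter, defaultdict


def train_frequency_baseline(tagged_words):
    """Map each sign to its most frequent tag in the training data.

    `tagged_words` is a list of words, each a list of (sign, tag) pairs.
    """
    counts = defaultdict(Counter)
    for word in tagged_words:
        for sign, tag in word:
            counts[sign][tag] += 1
    return {sign: tags.most_common(1)[0][0] for sign, tags in counts.items()}


def predict(model, signs, default="O"):
    """Tag a sequence of signs; unseen signs fall back to `default`."""
    return [model.get(sign, default) for sign in signs]


if __name__ == "__main__":
    training = [
        [("G17", "O"), ("A1", "CL")],
        [("A1", "CL"), ("N35", "O")],
        [("A1", "O")],
    ]
    model = train_frequency_baseline(training)
    print(predict(model, ["N35", "A1", "Z1"]))
```

A baseline of this kind ignores context entirely, so it cannot tell apart a sign used as a classifier in one word and phonetically in another, which is the gap a sequence-labelling model is meant to close.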

[NLP-122] MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
[NLP-122] MMEvalPro:校准多模式基准以实现值得信赖和高效的评估

链接: https://arxiv.org/abs/2407.00468
作者: Jinsheng Huang,Liang Chen,Taian Guo,Fu Zeng,Yusheng Zhao,Bohan Wu,Ye Yuan,Haozhe Zhao,Zhihui Guo,Yichi Zhang,Jingyang Yuan,Wei Ju,Luchen Liu,Tianyu Liu,Baobao Chang,Ming Zhang
关键词: Large Multimodal Models, exhibit impressive cross-modal, impressive cross-modal understanding, Multimodal Models, Large Language Models
中文关键词: 大型多模式模型,展现出令人印象深刻的跨模式、令人印象深刻的跨模式理解、多模式模型、大型语言模型
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, code released at this https URL , Homepage at this https URL

点击查看摘要

Abstract:Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73% , compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09% , whereas the gap for previous benchmarks is just 14.64% ). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
摘要:大型多模态模型(LMM)表现出令人印象深刻的跨模态理解和推理能力,通常通过包括图像、问题和多个选项的多项选择题(MCQ)进行评估。然而,用于这类评价的许多基准存在系统性偏差。值得注意的是,没有任何视觉感知能力的大型语言模型(LLM)也取得了不可忽视的性能,破坏了这些评估的可信度。为了解决这个问题,同时保持MCQ评估的效率,我们提出了MMEvalPro,这是一个基准,旨在通过三部曲评估管道和更严格的度量来避免第一类(Type-I)错误。对于现有基准中的每个原始问题,人工注释员通过细致的标注过程创建一个感知问题和一个知识锚问题,从而对其进行扩充。MMEvalPro包括2,138个问题三元组,总计6,414个不同的问题。其中三分之二的问题是由人类专家手动标注的,其余的来自现有的基准(MMMU、ScienceQA和MathVista)。与现有的基准相比,我们用最新的LLM和LMM进行的实验表明,MMEvalPro更具挑战性(最好的LMM落后人类性能31.73%,而以前的基准的平均差距为8.03%),也更可信(最好的LLM比最好的LMM落后23.09%,而以前的基准的差距仅为14.64%)。我们的深入分析解释了性能差距较大的原因,并证明了评估的可信度,强调了其推动未来研究的巨大潜力。
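The triplet design described in the abstract suggests a stricter way to score a model than plain per-question accuracy. The sketch below is only an illustration under an assumption: a triplet is counted as solved only when the original, perception, and knowledge-anchor questions are all answered correctly. The names `average_accuracy`, `triplet_accuracy`, and the `Triplet` layout are hypothetical and are not taken from the MMEvalPro code.

```python
from typing import Sequence, Tuple

# One triplet = correctness flags for the (original, perception, knowledge-anchor)
# questions. This layout is an assumption made for the sketch.
Triplet = Tuple[bool, bool, bool]


def average_accuracy(triplets: Sequence[Triplet]) -> float:
    """Plain per-question accuracy over every question in every triplet."""
    flags = [flag for triplet in triplets for flag in triplet]
    return sum(flags) / len(flags)


def triplet_accuracy(triplets: Sequence[Triplet]) -> float:
    """Stricter score: a triplet counts only if all three answers are correct."""
    return sum(all(triplet) for triplet in triplets) / len(triplets)


if __name__ == "__main__":
    # Size check from the abstract: 2,138 triplets of 3 questions each.
    assert 2138 * 3 == 6414

    results = [
        (True, True, True),     # consistent across all three questions
        (True, False, True),    # original answered, perception question missed
        (True, False, False),   # original answered without perception or knowledge
        (False, False, False),
    ]
    print(average_accuracy(results))  # per-question view
    print(triplet_accuracy(results))  # triplet-level view
```

On the toy data, the per-question score is higher than the triplet-level score, which is the kind of gap a text-only model that guesses the original question correctly would be expected to show.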

[NLP-123] BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science
[NLP-123] BioKGBench:生物医学科学人工智能代理的知识图谱检查基准

链接: https://arxiv.org/abs/2407.00466
作者: Xinna Lin,Siqi Ma,Junjie Shan,Xiaojing Zhang,Shell Xu Hu,Tiannan Guo,Stan Z. Li,Kaicheng Yu
关键词: Pursuing artificial intelligence, Large Language Models, Pursuing artificial, artificial intelligence, Language Models
中文关键词: 追求人工智能,大型语言模型,追求人工,人工智能,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pursuing artificial intelligence for biomedical science, a.k.a. AI Scientist, draws increasing attention, where one common approach is to build a copilot agent driven by Large Language Models (LLMs). However, to evaluate such systems, people either rely on direct Question-Answering (QA) to the LLM itself, or in a biomedical experimental manner. How to precisely benchmark biomedical agents from an AI Scientist perspective remains largely unexplored. To this end, we draw inspiration from one most important abilities of scientists, understanding the literature, and introduce BioKGBench. In contrast to traditional evaluation benchmark that only focuses on factual QA, where the LLMs are known to have hallucination issues, we first disentangle “Understanding Literature” into two atomic abilities, i) “Understanding” the unstructured text from research papers by performing scientific claim verification, and ii) Ability to interact with structured Knowledge-Graph Question-Answering (KGQA) as a form of “Literature” grounding. We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation (RAG) to identify the factual errors of existing large-scale knowledge graph databases. We collect over two thousand data for two atomic tasks and 225 high-quality annotated data for the agent task. Surprisingly, we discover that state-of-the-art agents, both daily scenarios and biomedical ones, have either failed or inferior performance on our benchmark. We then introduce a simple yet effective baseline, dubbed BKGAgent. On the widely used popular knowledge graph, we discover over 90 factual errors which provide scenarios for agents to make discoveries and demonstrate the effectiveness of our approach. The code and data are available at this https URL.
摘要:面向生物医学科学的人工智能(又称AI科学家)受到越来越多的关注,其中一种常见的方法是建立一个由大型语言模型(LLM)驱动的Copilot代理。然而,为了评价这样的系统,人们要么依靠对LLM本身的直接问答(QA),要么以生物医学实验的方式。如何从AI科学家的角度准确地对生物医学代理进行基准测试在很大程度上仍有待探索。为此,我们从科学家最重要的能力之一(理解文献)中获得灵感,并提出BioKGBench。传统评估基准只关注事实QA,而LLM在这方面已知存在幻觉问题;与之不同,我们首先将“理解文献”分解为两种原子能力:i)通过执行科学主张验证来“理解”研究论文中的非结构化文本,以及ii)与结构化的知识图谱问答(KGQA)交互的能力,作为“文献”基础的一种形式。然后,我们使用KGQA和基于领域的检索增强生成(RAG)来制定一种新的代理任务KGCheck,以识别现有大规模知识图谱数据库中的事实错误。我们为两个原子任务收集了2000多条数据,并为代理任务收集了225条高质量的注释数据。令人惊讶的是,我们发现最先进的代理,无论是日常场景还是生物医学场景的代理,在我们的基准测试中要么失败,要么表现较差。然后,我们介绍一个简单但有效的基线,称为BKGAgent。在广泛使用的流行知识图谱上,我们发现了90多个事实错误,这些错误为代理做出发现提供了场景,并证明了我们方法的有效性。代码和数据可在此https URL上找到。

[NLP-124] Open-Source Conversational AI with SpeechBrain 1.0
[NLP-124] SpeechBrain 1.0开源对话人工智能

链接: https://arxiv.org/abs/2407.00463
作者: Mirco Ravanelli,Titouan Parcollet,Adel Moumen,Sylvain de Langen,Cem Subakan,Peter Plantinga,Yingzhi Wang,Pooneh Mousavi,Luca Della Libera,Artem Ploujnikov,Francesco Paissan,Davide Borra,Salah Zaiem,Zeyu Zhao,Shucong Zhang,Georgios Karakasidis,Sung-Lin Yeh,Aku Rouhe,Rudolf Braun,Florian Mai,Juan Zuluaga-Gomez,Seyed Mahed Mousavi,Andreas Nautsch,Xuechen Liu,Sangeet Sagar,Jarod Duret,Salima Mdhaffar,Gaelle Laperriere,Renato De Mori,Yannick Esteve
关键词: http URL promotes, URL promotes transparency, open-source Conversational, http URL, URL promotes
中文关键词: http URL促进,URL促进透明度,开源对话,http URL,URL促进
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: Submitted to JMLR (Machine Learning Open Source Software)

点击查看摘要

Abstract:SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete “recipes” of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.
摘要:SpeechBrain是一个基于PyTorch的开源对话人工智能工具包,特别专注于语音处理任务,例如语音识别、语音增强、说话人识别、文本到语音等等。它通过发布预训练的模型以及训练它们所需的完整代码和算法“配方”来提高透明度和可复现性。本文介绍了SpeechBrain 1.0,这是该工具包发展过程中的一个重要里程碑,该工具包目前拥有200多种语音、音频和语言处理任务的配方,以及100多种可在Hugging Face上获取的模型。SpeechBrain 1.0引入了新技术来支持多样化的学习模式、大型语言模型(LLM)集成和高级解码策略,以及新颖的模型、任务和模态。它还包括一个新的基准存储库,为研究人员提供了一个统一的平台,用于评估跨不同任务的模型。

[NLP-125] Polarization and Morality: Lexical Analysis of Abortion Discourse on Reddit
[NLP-125] 两极分化与道德:Reddit上堕胎话语的词汇分析

链接: https://arxiv.org/abs/2407.00455
作者: Tessa Stanier,Hagyeong Shin
关键词: Moral Foundations Theory, Moral Foundations Dictionary, study investigates, investigates whether division, division on political
中文关键词: 道德基础理论,道德基础词典,研究调查,调查是否分裂,政治上的分裂
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This study investigates whether division on political topics is mapped with the distinctive patterns of language use. We collect a total 145,832 Reddit comments on the abortion debate and explore the languages of subreddit communities r/prolife and r/prochoice. With consideration of the Moral Foundations Theory, we examine lexical patterns in three ways. First, we compute proportional frequencies of lexical items from the Moral Foundations Dictionary in order to make inferences about each group’s moral considerations when forming arguments for and against abortion. We then create n-gram models to reveal frequent collocations from each stance group and better understand how commonly used words are patterned in their linguistic context and in relation to morality values. Finally, we use Latent Dirichlet Allocation to identify underlying topical structures in the corpus data. Results show that the use of morality words is mapped with the stances on abortion.
摘要:本研究调查政治话题的分歧是否与语言使用的独特模式相对应。我们总共收集了145,832条关于堕胎辩论的Reddit评论,并探索子Reddit社区r/prolife和r/prochoice的语言。考虑到道德基础理论,我们通过三种方式检查词汇模式。首先,我们计算《道德基础词典》中词汇项的比例频率,以便在形成支持和反对堕胎的论点时对每个群体的道德考虑做出推论。然后,我们创建n-gram模型来揭示每个立场组的频繁搭配,并更好地了解常用词如何在其语言背景中以及与道德价值观的关系中形成模式。最后,我们使用潜在Dirichlet分配来识别文集数据中的潜在主题结构。结果表明,道德词语的使用与堕胎的立场相对应。
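The first two steps of the analysis above (proportional frequencies of dictionary items, then frequent n-gram collocations) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the tokenizer is a simple regular expression, and the lexicon passed in is a toy stand-in, not the real Moral Foundations Dictionary. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def proportional_frequency(comments: Iterable[str], lexicon: Set[str]) -> float:
    """Share of all tokens in the comments that appear in the lexicon."""
    tokens = [tok for comment in comments for tok in tokenize(comment)]
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in lexicon)
    return hits / len(tokens)


def top_bigrams(comments: Iterable[str], k: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Most frequent adjacent word pairs, counted within each comment."""
    counts: Counter = Counter()
    for comment in comments:
        toks = tokenize(comment)
        counts.update(zip(toks, toks[1:]))
    return counts.most_common(k)


if __name__ == "__main__":
    # Toy lexicon for illustration only; NOT the Moral Foundations Dictionary.
    toy_lexicon = {"protect", "innocent", "sacred"}
    group_a = ["Protect the innocent life", "Life is sacred"]
    print(proportional_frequency(group_a, toy_lexicon))
    print(top_bigrams(group_a, k=2))
```

Running `proportional_frequency` separately on the comments of each subreddit would give one number per group and per lexicon category, which can then be compared across stances.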

[NLP-126] Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models
[NLP-126] 自翻译训练:大型语言模型跨语言迁移的简单但强大的基线

链接: https://arxiv.org/abs/2407.00454
作者: Ryokan Ri,Shun Kiyono,Sho Takase
关键词: Cross-lingual transfer, promising technique, improve performance, utilizing data, language
中文关键词: 跨语言迁移、有前途的技术、提高性能、利用数据、语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cross-lingual transfer is a promising technique for utilizing data in a source language to improve performance in a target language. However, current techniques often require an external translation system or suffer from suboptimal performance due to over-reliance on cross-lingual generalization of multi-lingual pretrained language models. In this study, we propose a simple yet effective method called Self-Translate-Train. It leverages the translation capability of a large language model to generate synthetic training data in the target language and fine-tunes the model with its own generated data. We evaluate the proposed method on a wide range of tasks and show substantial performance gains across several non-English languages.
摘要:跨语言传输是一种很有前途的技术,可以利用源语言的数据来提高目标语言的性能。然而,当前的技术通常需要外部翻译系统,或者由于过度依赖多语言预训练语言模型的跨语言概括而导致性能不佳。在这项研究中,我们提出了一种简单而有效的方法,称为自翻译训练。它利用大型语言模型的翻译能力来生成目标语言的合成训练数据,并使用其自己生成的数据对模型进行微调。我们在广泛的任务中评估了所提出的方法,并在几种非英语语言中表现出了显着的性能提升。

[NLP-127] PerSEval: Assessing Personalization in Text Summarizers
[NLP-127] PerSEval:评估文本摘要中的个性化

链接: https://arxiv.org/abs/2407.00453
作者: Sourish Dasgupta,Ankush Chander,Parth Borad,Isha Motiyani,Tanmoy Chakraborty
关键词: individuals’ subjective understanding, understanding of saliency, topics of attention, cater to individuals’, individuals’ subjective
中文关键词: 个人的主观理解,对显着性的理解,关注话题,迎合个人的,个人的主观
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Personalized summarization models cater to individuals’ subjective understanding of saliency, as represented by their reading history and current topics of attention. Existing personalized text summarizers are primarily evaluated based on accuracy measures such as BLEU, ROUGE, and METEOR. However, a recent study argued that accuracy measures are inadequate for evaluating the degree of personalization of these models and proposed EGISES, the first metric to evaluate personalized text summaries. It was suggested that accuracy is a separate aspect and should be evaluated standalone. In this paper, we challenge the necessity of an accuracy leaderboard, suggesting that relying on accuracy-based aggregated results might lead to misleading conclusions. To support this, we delve deeper into EGISES, demonstrating both theoretically and empirically that it measures the degree of responsiveness, a necessary but not sufficient condition for degree-of-personalization. We subsequently propose PerSEval, a novel measure that satisfies the required sufficiency condition. Based on the benchmarking of ten SOTA summarization models on the PENS dataset, we empirically establish that – (i) PerSEval is reliable w.r.t human-judgment correlation (Pearson’s r = 0.73; Spearman’s ρ = 0.62; Kendall’s τ = 0.42), (ii) PerSEval has high rank-stability, (iii) PerSEval as a rank-measure is not entailed by EGISES-based ranking, and (iv) PerSEval can be a standalone rank-measure without the need of any aggregated ranking.
摘要:个性化摘要模型迎合个体对显著性的主观理解,这种理解体现在其阅读历史和当前关注的话题中。现有的个性化文本摘要器主要基于BLEU、ROUGE和METEOR等准确度指标进行评估。然而,最近的一项研究认为,准确性度量不足以评估这些模型的个性化程度,并提出了EGISES,这是第一个评估个性化文本摘要的度量。该研究认为,准确性是一个单独的方面,应该单独评估。在这篇文章中,我们对准确性排行榜的必要性提出了质疑,认为依赖基于准确性的汇总结果可能会导致误导性结论。为了支持这一点,我们更深入地研究了EGISES,从理论和经验上证明了它衡量的是响应性的程度,这是个性化程度的必要条件,但不是充分条件。我们随后提出了PerSEval,这是一种满足所要求的充分性条件的新度量。基于对10个SOTA摘要模型在PENS数据集上的基准测试,我们实证地证明:(i)PerSEval在与人类判断的相关性方面是可靠的(Pearson’s r = 0.73;Spearman’s ρ = 0.62;Kendall’s τ = 0.42),(ii)PerSEval具有很高的排名稳定性,(iii)PerSEval作为排名度量并不能由基于EGISES的排名推出,以及(iv)PerSEval可以作为独立的排名度量,不需要任何聚合排名。

[NLP-128] A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models
[NLP-128] 多语言大语言模型并行库开发方案

链接: https://arxiv.org/abs/2407.00436
作者: Peiqin Lin,André F. T. Martins,Hinrich Schütze
关键词: Recent studies, parallel corpora, multilingual large language, large language models, exploiting parallel corpora
中文关键词: 最近的研究,平行库,多语言大型语言,大型语言模型,利用平行库
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
摘要:最近的研究强调了利用平行语料库来增强多语言大型语言模型的潜力,从而提高了双语任务(如机器翻译)和通用任务(如文本分类)的性能。在这些发现的基础上,我们的全面研究旨在确定利用平行语料库的最有效策略。我们调查了并行语料库的质量和数量、培训目标和模型规模对跨不同语言和任务的并行语料库增强的多语言大型语言模型性能的影响。我们的分析揭示了几个关键的见解:(I)过滤噪音翻译对于有效利用平行语料库至关重要,而语言识别和短句过滤效果甚微;(Ii)即使是只包含10K个平行句子的语料库,也可以产生与从更大的数据集获得的结果相当的结果;(Iii)在各种培训目标及其组合中,仅使用机器翻译目标可以产生最好的结果;(Iv)较大的多语言模型从平行语料库中受益更多,因为它们具有更强的跨任务迁移能力。我们的研究为优化利用平行语料库来增强多语言大型语言模型提供了有价值的见解,将以前的发现从有限的语言和任务扩展到更广泛的场景。

[NLP-129] Brevity is the soul of wit: Pruning long files for code generation
[NLP-129] 简洁是智慧的灵魂:修剪长文件以生成代码

链接: https://arxiv.org/abs/2407.00434
作者: Aaditya K. Singh,Yu Yang,Kushal Tirumala,Mostafa Elhoushi,Ari S. Morcos
关键词: Data, higher quality data, curation is commonly, commonly considered, higher quality
中文关键词: 数据,更高质量的数据,策展通常被普遍认为是更高质量的
类目: Computation and Language (cs.CL)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:Data curation is commonly considered a “secret-sauce” for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a larger and larger focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher quality data, can lead to efficiency or performance improvements. Generally, three types of methods are used to filter internet-scale corpora: embedding-based, heuristic-based, and classifier-based. In this work, we contrast the former two in the domain of finetuning LLMs for code generation. We find that embedding-based methods are often confounded by length, and that a simple heuristic–pruning long files–outperforms other methods in compute-limited regimes. Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval (while matching compute). However, we find that perplexity on held-out long files can increase, begging the question of whether optimizing data mixtures for common coding benchmarks (HumanEval, MBPP) actually best serves downstream use cases. Overall, we hope our work builds useful intuitions about code data (specifically, the low quality of extremely long code files) provides a compelling heuristic-based method for data pruning, and brings to light questions in how we evaluate code generation models.
摘要:数据管理通常被认为是LLM训练的“秘密武器”,更高质量的数据通常会带来更好的LLM性能。考虑到互联网抓取的语料库的规模,数据修剪已成为越来越受关注的焦点。具体地说,许多研究已经证明,对数据去重或筛选更高质量的数据可以提高效率或性能。通常,过滤互联网规模的语料库有三类方法:基于嵌入的、基于启发式的和基于分类器的。在这项工作中,我们在为代码生成微调LLM的领域对比了前两类方法。我们发现,基于嵌入的方法经常受到长度的干扰,而一个简单的启发式方法(修剪长文件)在计算受限的情况下优于其他方法。我们的方法可以在训练中产生高达2倍的效率收益(在性能相当时),或者在HumanEval上产生3.5%的绝对性能提升(在计算量相当时)。然而,我们发现留出(held-out)的长文件上的困惑度可能会增加,这就提出了一个问题,即针对常见编码基准(HumanEval、MBPP)优化数据混合是否真的最能服务于下游用例。总而言之,我们希望我们的工作建立关于代码数据的有用直觉(特别是超长代码文件的低质量),为数据修剪提供一种有说服力的基于启发式的方法,并揭示我们在如何评估代码生成模型方面的问题。
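The length heuristic named in the abstract is simple enough to sketch directly. The version below is an assumption-laden illustration, not the authors' implementation: it measures length in characters (the paper may well use tokens), and the `keep_fraction` parameter and function name `prune_long_files` are hypothetical.

```python
from typing import List


def prune_long_files(files: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Keep the shortest `keep_fraction` of files, dropping the longest ones.

    Length is measured in characters here, which is an assumption of this sketch.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(files, key=len)                  # shortest first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    corpus = [
        "print('hi')\n",
        "x = 1\n" * 500,        # a very long, repetitive file
        "def f():\n    return 1\n",
        "import os\n",
    ]
    kept = prune_long_files(corpus, keep_fraction=0.5)
    print([len(f) for f in kept])
```

Because the filter only needs file lengths, it costs a single sort over the corpus, which is part of why a heuristic like this is attractive in compute-limited settings compared with embedding-based selection.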

[NLP-130] Fontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey
[NLP-130] Fontes:中世纪拉丁语文本的词性标注和词形还原。跨体裁调查

链接: https://arxiv.org/abs/2407.00418
作者: Krzysztof Nowak,Jędrzej Ziębura,Krzysztof Wróbel,Aleksander Smywiński-Pohl
关键词: Medieval Latin texts, Polish Medieval Latin, automatic linguistic annotation, Medieval Latin, Latin texts
中文关键词: 中世纪拉丁文本,波兰中世纪拉丁语,自动语言注释,中世纪拉丁语,拉丁文本
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models’ performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.
摘要:本研究介绍了用于中世纪拉丁语文本自动语言注释的eFontes模型,重点关注词形还原、词性标注和形态特征确定。使用Transformers库,这些模型在Universal Dependencies(UD)语料库和新开发的波兰中世纪拉丁语eFontes语料库上进行训练。该研究评估了模型的性能,应对了拼写变体和拉丁化白话术语整合等挑战。这些模型实现了很高的准确率:词形还原为92.60%,词性标注为83.29%,形态特征确定为88.57%。这些研究结果强调了高质量标注语料库的重要性,并提出了未来的增强措施,包括将模型扩展到命名实体识别。

[NLP-131] Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
[NLP-131] 训练太晚,使用太早?低资源孟加拉语LLM的必要性和可行性研究

链接: https://arxiv.org/abs/2407.00416
作者: Tamzeed Mahfuz,Satak Kumar Dey,Ruwad Naswan,Hasnaen Adil,Khondker Salman Sayeed,Haz Sameen Shahgir
关键词: English-oriented Large Language, English-oriented Large, exhibits enhanced cross-lingual, enhanced cross-lingual transfer, cross-lingual transfer capabilities
中文关键词: 以英语为导向的大型语言,以英语为导向的大型,展现出增强的跨语言、增强的跨语言迁移、跨语言迁移能力
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia. We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include inefficient tokenization of Bengali script by existing LLMs, leading to increased computational costs and potential performance degradation. Additionally, we highlight biases in machine-translated datasets commonly used for Bengali NLP tasks. We conclude that there is a significant need for a Bengali-oriented LLM, but the field currently lacks the high-quality pretraining and instruction-tuning datasets necessary to develop a highly effective model.
摘要:每一代面向英语的大型语言模型(LLM)都显示出增强的跨语言迁移能力,并且在低资源语言上的表现明显优于旧的LLM。这就引出了一个问题:是否需要专门针对特定低资源语言的LLM?我们的目标是针对孟加拉语探索这个问题,孟加拉语是南亚孟加拉地区的一种低到中等资源的印度-雅利安语。我们比较了开放权重和闭源的LLM(如LLaMA-3和GPT-4)与微调的编码器-解码器模型在一组多样的孟加拉语下游任务上的性能,包括翻译、摘要、释义、问答和自然语言推理。我们的发现表明,虽然LLM通常在推理任务中表现出色,但在需要生成孟加拉文字的任务中,它们的表现并不一致。关键挑战包括现有LLM对孟加拉文字的低效分词,导致计算成本增加和潜在的性能下降。此外,我们还强调了通常用于孟加拉语NLP任务的机器翻译数据集中的偏差。我们的结论是,对面向孟加拉语的LLM有很大的需求,但该领域目前缺乏开发高效模型所需的高质量预训练和指令微调数据集。

[NLP-132] SHADE: Semantic Hypernym Annotator for Domain-specific Entities – DnD Domain Use Case
[NLP-132] SHADE:领域特定实体的语义上位词注释器 -- DnD领域用例

链接: https://arxiv.org/abs/2407.00407
作者: Akila Peiris,Nisansa de Silva
关键词: important NLP task, Manual data annotation, important NLP, Manual data, NLP task
中文关键词: 重要NLP任务、手动数据注释、重要NLP、手动数据、NLP任务
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Manual data annotation is an important NLP task but one that takes considerable amount of resources and effort. In spite of the costs, labeling and categorizing entities is essential for NLP tasks such as semantic evaluation. Even though annotation can be done by non-experts in most cases, due to the fact that this requires human labor, the process is costly. Another major challenge encountered in data annotation is maintaining the annotation consistency. Annotation efforts are typically carried out by teams of multiple annotators. The annotations need to maintain the consistency in relation to both the domain truth and annotation format while reducing human errors. Annotating a specialized domain that deviates significantly from the general domain, such as fantasy literature, will see a lot of human error and annotator disagreement. So it is vital that proper guidelines and error reduction mechanisms are enforced. One such way to enforce these constraints is using a specialized application. Such an app can ensure that the notations are consistent, and the labels can be pre-defined or restricted reducing the room for errors. In this paper, we present SHADE, an annotation software that can be used to annotate entities in the high fantasy literature domain. Specifically in Dungeons and Dragons lore extracted from the Forgotten Realms Fandom Wiki.
摘要:人工数据标注是一项重要的自然语言处理任务,但需要花费大量的资源和精力。尽管代价高昂,但对实体进行标记和分类对于语义评估等自然语言处理任务是必不可少的。尽管在大多数情况下,注释可以由非专家完成,但由于这需要人力,这一过程代价高昂。在数据标注中遇到的另一个主要挑战是维护标注的一致性。注释工作通常由多个注释员组成的团队执行。注释需要保持与领域事实和注释格式相关的一致性,同时减少人为错误。注释一个明显偏离一般领域的专业领域,如奇幻文学,会看到许多人为错误和注释者的不同意见。因此,执行适当的指导方针和减少错误的机制至关重要。实施这些约束的一种方法是使用专门的应用程序。这样的应用程序可以确保符号的一致性,并且可以预定义或限制标签,以减少出错的空间。在本文中,我们介绍了一个标注软件Shade,它可以用来标注高级幻想文学领域的实体。特别是从被遗忘的王国粉丝维基那里摘录的《地下城与龙》的故事。

[NLP-133] Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
[NLP-133] 如果您只需要检索,那还真的是长上下文吗?迈向真正困难的长上下文NLP

链接: https://arxiv.org/abs/2407.00402
作者: Omer Goldman,Alon Jacovi,Aviv Slobodkin,Aviya Maimon,Ido Dagan,Reut Tsarfaty
关键词: language models’ capabilities, Improvements in language, making long-context evaluation, language models’, models’ capabilities
中文关键词: 语言模型的能力、语言的改进、进行长上下文评估、语言模型、模型的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Improvements in language models’ capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of “long-context”, defined simply by the total length of the model’s input, including - for example - Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, is severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.
摘要:语言模型能力的提高将其应用推向了更长的上下文,使长上下文的评估和开发成为一个活跃的研究领域。然而,许多不同的用例被归类到“长上下文”这一总括术语下,该术语简单地由模型输入的总长度定义,例如大海捞针(Needle-in-a-Haystack)任务、图书摘要和信息聚合。考虑到它们的难度各不相同,在这份立场文件中,我们认为仅凭上下文长度把不同任务混为一谈是徒劳的。作为一个社区,我们需要更精确的词汇来理解长上下文任务的相似或不同之处。我们建议根据使任务在更长上下文下变得更难的属性来拆解长上下文的分类。我们提出了两个正交的难度轴:(I)扩散:在上下文中找到必要的信息有多难?(II)范围:需要查找多少必要信息?我们调查了关于长上下文的文献,论证了这一分类作为信息性描述符的合理性,并据此对文献进行定位。我们的结论是,最困难和最有趣的设置,即必要信息非常长且在输入中高度分散的设置,严重缺乏探索。通过使用描述性词汇并讨论长上下文中难度的相关属性,我们可以在这一领域进行更有见地的研究。我们呼吁仔细设计具有明显长上下文的任务和基准,同时考虑使其与较短上下文有质的不同的特点。

[NLP-134] A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model
[NLP-134] 使用大语言模型生成与SAPPhIRE模型相关的技术内容中参考知识选择的影响研究

链接: https://arxiv.org/abs/2407.00396
作者: Kausik Bhattacharya,Anubhab Majumder,Amaresh Chakrabarti
关键词: SAPPhIRE model, Large Language Model, stimulus in design, model, inspirational stimulus
中文关键词: SAPPhIRE模型、大型语言模型、设计刺激、模型、鼓舞人心的刺激
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to generate technical content accurately relevant to the SAPPhIRE model of causality using a Large Language Model, also called LLM. This paper, which is the first part of the two-part research, presents a method for hallucination suppression using Retrieval Augmented Generating with LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The result from this research shows that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.
摘要:使用SAPPhIRE因果关系模型来表示系统可以成为设计中的灵感刺激。然而,创建技术或自然系统的SAPPhIRE模型需要从有关系统如何工作的多个技术文档中获取技术知识。这项研究调查了如何使用大型语言模型(也称为LLM)准确地生成与SAPPhIRE因果关系模型相关的技术内容。本文是两部分研究的第一部分,提出了一种结合LLM使用检索增强生成来抑制幻觉的方法,以生成由与SAPPhIRE结构(construct)相关的科学信息支持的技术内容。这项研究的结果表明,在为LLM提供上下文以生成技术内容时,参考知识的选择非常重要。本研究的成果被用来构建一个软件支持工具,以生成给定技术系统的SAPPhIRE模型。

[NLP-135] Advancing Process Verification for Large Language Models via Tree-Based Preference Learning
[NLP-135] 通过基于树的偏好学习推进大型语言模型的流程验证

链接: https://arxiv.org/abs/2407.00390
作者: Mingqian He,Yongliang Shen,Wenqi Zhang,Zeqi Tan,Weiming Lu
关键词: Large Language Models, Large Language, Language Models, demonstrated remarkable potential, introducing extra verifiers
中文关键词: 大型语言模型、大型语言、语言模型、展示了显著的潜力、引入额外的验证器
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales. Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, thereby limiting the effectiveness of the feedback provided. To overcome this limitation, we propose Tree-based Preference Learning Verifier (Tree-PLV), a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training. Compared to traditional binary classification, step-level preferences more finely capture the nuances between reasoning steps, allowing for a more precise evaluation of the complete reasoning path. We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks. For instance, Tree-PLV achieved substantial performance gains over the Mistral-7B self-consistency baseline on GSM8K (67.55% to 82.79%), MATH (17.00% to 26.80%), CSQA (68.14% to 72.97%), and StrategyQA (82.86% to 83.25%). Additionally, our study explores the appropriate granularity for applying preference learning, revealing that step-level guidance provides feedback that better aligns with the evaluation of the reasoning process.
摘要:大型语言模型(LLM)通过生成逐步推理过程,在处理复杂推理任务方面显示出巨大的潜力。一些方法通过引入额外的验证器来评估这些路径,已被证明能有效提高准确率。然而,现有的验证器通常在二元标注的推理路径上训练,不能充分利用中间步骤的相对优劣,从而限制了所提供反馈的有效性。为了克服这一局限性,我们提出了基于树的偏好学习验证器(Tree-PLV),这是一种通过最佳优先搜索算法构建推理树,并收集步骤级配对数据进行偏好训练的新方法。与传统的二元分类相比,步骤级偏好更精细地捕捉推理步骤之间的细微差别,允许更精确地评估完整的推理路径。我们在一系列算术和常识推理任务中对Tree-PLV进行了实证评估,在这些任务中,它的性能显著优于现有基准。例如,与Mistral-7B自洽性(self-consistency)基线相比,Tree-PLV在GSM8K(67.55%到82.79%)、MATH(17.00%到26.80%)、CSQA(68.14%到72.97%)和StrategyQA(82.86%到83.25%)上取得了显著的性能提升。此外,我们的研究探索了应用偏好学习的合适粒度,揭示了步骤级指导提供的反馈更符合对推理过程的评估。

[NLP-136] GraphArena: Benchmarking Large Language Models on Graph Computational Problems
[NLP-136] GraphArena:在图计算问题上对大型语言模型进行基准测试

链接: https://arxiv.org/abs/2407.00379
作者: Jianheng Tang,Qifan Zhang,Yuhan Li,Jia Li
关键词: Large Language Models, Large Language, arms race, Language Models, examine their progresses
中文关键词: 大型语言模型,大型语言,军备竞赛,语言模型,检查他们的进展
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The “arms race” of Large Language Models (LLMs) demands novel, challenging, and diverse benchmarks to faithfully examine their progresses. We introduce GraphArena, a benchmarking tool designed to evaluate LLMs on graph computational problems using million-scale real-world graphs from diverse scenarios such as knowledge graphs, social networks, and molecular structures. GraphArena offers a suite of 10 computational tasks, encompassing four polynomial-time (e.g., Shortest Distance) and six NP-complete challenges (e.g., Travelling Salesman Problem). It features a rigorous evaluation framework that classifies LLM outputs as correct, suboptimal (feasible but not optimal), or hallucinatory (properly formatted but infeasible). Evaluation of 10 leading LLMs, including GPT-4o and LLaMA3-70B-Instruct, reveals that even top-performing models struggle with larger, more complex graph problems and exhibit hallucination issues. Despite the application of strategies such as chain-of-thought prompting, these issues remain unresolved. GraphArena contributes a valuable supplement to the existing LLM benchmarks and is open-sourced at this https URL.
摘要:大型语言模型(LLM)的“军备竞赛”需要新颖、具有挑战性且多样化的基准来如实检验其进展。我们介绍了GraphArena,这是一个基准测试工具,旨在使用来自不同场景(如知识图谱、社交网络和分子结构)的百万级真实图来评估LLM在图计算问题上的表现。GraphArena提供了一套10个计算任务,包括4个多项式时间任务(例如最短距离)和6个NP完全挑战(例如旅行商问题)。它的特点是有一个严格的评估框架,将LLM的输出归类为正确、次优(可行但不是最优)或幻觉(格式正确但不可行)。对包括GPT-4o和LLaMA3-70B-Instruct在内的10个领先LLM的评估显示,即使是表现最好的模型,在面对更大、更复杂的图问题时也会遇到困难,并表现出幻觉问题。尽管采用了思维链提示等策略,这些问题仍未解决。GraphArena对现有的LLM基准测试做出了有价值的补充,并已在此 https URL 开源。
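
The three-way labelling described in the abstract (correct, suboptimal, hallucinatory) can be illustrated with a minimal sketch for the Shortest Distance task. This is not GraphArena's code: the function names, the adjacency-dict graph format, and the toy graph are assumptions made for illustration only.

```python
from collections import deque

def shortest_distance(adj, src, dst):
    """Hop count of the shortest path in an unweighted graph (None if unreachable)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def classify_path(adj, src, dst, path):
    """Label a model-proposed path as correct, suboptimal, or hallucinatory."""
    # Infeasible answers: wrong endpoints or an edge that does not exist.
    if not path or path[0] != src or path[-1] != dst:
        return "hallucinatory"
    for u, v in zip(path, path[1:]):
        if v not in adj.get(u, ()):
            return "hallucinatory"
    # Feasible answers are compared against the true optimum.
    best = shortest_distance(adj, src, dst)
    return "correct" if len(path) - 1 == best else "suboptimal"

# Hypothetical toy graph (undirected, stored as adjacency sets).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
labels = [
    classify_path(adj, "a", "d", ["a", "b", "d"]),       # optimal route
    classify_path(adj, "a", "d", ["a", "b", "c", "d"]),  # valid but longer
    classify_path(adj, "a", "d", ["a", "d"]),            # uses a missing edge
]
```

Separating feasibility from optimality is what allows a well-formatted but impossible answer to be counted apart from a merely suboptimal one.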

[NLP-137] The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention
[NLP-137] 多样性干预下文本到图像生成的事实性代价:基准和事实增强干预

链接: https://arxiv.org/abs/2407.00377
作者: Yixin Wan,Di Wu,Haoran Wang,Kai-Wei Chang
关键词: models depicting individuals, commonly adopted, depicting individuals, Prompt-based, diversity interventions
中文关键词: 描绘个人的模型,常用的,描绘个人的,基于提示的,多样性干预
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

Abstract:Prompt-based “diversity interventions” are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.
摘要:基于提示的“多样性干预”通常被用来提高文本到图像(T2I)模型在描绘具有不同种族或性别特征的个体时的多样性。然而,这种策略是否会导致不符合事实的人口分布,特别是在生成真实的历史人物时?在这项工作中,我们提出了人口事实性表示(DemOgraphic FActualIty Representation,DoFaiR),这是一个基准,用于系统地量化在T2I模型中使用多样性干预与保持人口事实性之间的权衡。DoFaiR由756个经过仔细事实核查的测试实例组成,通过一个自动化的、有证据支持的评估流程来揭示各种多样性提示的事实性代价。DoFaiR上的实验表明,以多样性为导向的指令增加了DALLE-3生成结果中不同性别和种族群体的数量,代价是人口分布在历史上不准确。为了解决这一问题,我们提出了事实增强干预(FAI),它指示一个大型语言模型(LLM)对以文字表述或检索得到的、关于历史上生成对象的性别和种族构成的事实信息进行反思,并将其纳入T2I模型的生成上下文中。通过利用反思得到的历史事实来引导模型生成,FAI在保持多样性的同时,显著改善了多样性干预下的人口事实性。

[NLP-138] How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models
[NLP-138] 如何训练你的事实核查器:基于多模态开放模型的知识迁移

链接: https://arxiv.org/abs/2407.00369
作者: Jaeyoung Lee,Ximing Lu,Jack Hessel,Faeze Brahman,Youngjae Yu,Yonatan Bisk,Yejin Choi,Saadia Gabriel
关键词: provide effective real-time, effective real-time verification, social media, growing influx, provide effective
中文关键词: 提供有效的实时、有效的实时验证、社交媒体、不断增长的涌入、提供有效的
类目: Computation and Language (cs.CL)
备注:

Abstract:Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Large language or multimodal model based verification has been proposed to scale up online policing mechanisms for mitigating spread of false and harmful content. While these can potentially reduce burden on human fact-checkers, such efforts may be hampered by foundation model training data becoming outdated. In this work, we test the limits of improving foundation model performance without continual updating through an initial study of knowledge transfer using either existing intra- and inter- domain benchmarks or explanations generated from large language models (LLMs). We evaluate on 12 public benchmarks for fact-checking and misinformation detection as well as two other tasks relevant to content moderation – toxicity and stance detection. Our results on two recent multi-modal fact-checking benchmarks, Mocheg and Fakeddit, indicate that knowledge transfer strategies can improve Fakeddit performance over the state-of-the-art by up to 1.7% and Mocheg performance by up to 2.9%.
摘要:鉴于新闻和社交媒体上不断涌入的错误信息,迫切需要能够对新闻论断提供有效实时验证的系统。已有研究提出基于大型语言模型或多模态模型的验证,以扩大在线监管机制的规模,减少虚假和有害内容的传播。虽然这些方法可能减轻人类事实核查人员的负担,但基础模型的训练数据会逐渐过时,这可能会阻碍此类努力。在这项工作中,我们利用现有的域内和域间基准,或大型语言模型(LLM)生成的解释,对知识迁移进行初步研究,以检验在不持续更新的情况下提升基础模型性能的极限。我们在12个用于事实核查和错误信息检测的公共基准上进行了评估,并评估了与内容审核相关的另外两项任务:毒性检测和立场检测。我们在最近的两个多模态事实核查基准Mocheg和Fakeddit上的结果表明,知识迁移策略可以使Fakeddit上的性能比最先进水平最多提高1.7%,使Mocheg上的性能最多提高2.9%。

[NLP-139] Financial Knowledge Large Language Model
[NLP-139] 金融知识大语言模型

链接: https://arxiv.org/abs/2407.00365
作者: Cehao Yang,Chengjin Xu,Yiyan Qi
关键词: making significant strides, Artificial intelligence, large language models, processed and interpreted, intelligence is making
中文关键词: 取得重大进展,人工智能、大型语言模型、处理和解释,智能正在取得
类目: Computation and Language (cs.CL)
备注: 66 pages

Abstract:Artificial intelligence is making significant strides in the finance industry, revolutionizing how data is processed and interpreted. Among these technologies, large language models (LLMs) have demonstrated substantial potential to transform financial services by automating complex tasks, enhancing customer service, and providing detailed financial analysis. Firstly, we introduce IDEA-FinBench, an evaluation benchmark specifically tailored for assessing financial knowledge in large language models (LLMs). This benchmark utilizes questions from two globally respected and authoritative financial professional exams, aiming to comprehensively evaluate the capability of LLMs to directly address exam questions pertinent to the finance sector. Secondly, we propose IDEA-FinKER, a Financial Knowledge Enhancement framework designed to facilitate the rapid adaptation of general LLMs to the financial domain, introducing a retrieval-based few-shot learning method for real-time context-level knowledge injection, and a set of high-quality financial knowledge instructions for fine-tuning any general LLM. Finally, we present IDEA-FinQA, a financial question-answering system powered by LLMs. This system is structured around a scheme of real-time knowledge injection and factual enhancement using external knowledge. IDEA-FinQA is comprised of three main modules: the data collector, the data querying module, and LLM-based agents tasked with specific functions.
摘要:人工智能正在金融行业取得重大进展,彻底改变了数据的处理和解释方式。在这些技术中,大型语言模型(LLM)通过自动化复杂的任务、增强客户服务和提供详细的金融分析,显示了变革金融服务的巨大潜力。首先,我们介绍了IDEA-FinBench,一个专门为评估大型语言模型(LLM)中的金融知识而定制的评估基准。这一基准利用了两个全球知名且权威的金融专业考试的试题,旨在全面评估LLM直接解答金融领域相关考试问题的能力。其次,我们提出了金融知识增强框架IDEA-FinKER,该框架旨在促进通用LLM快速适应金融领域,引入了一种基于检索的少样本学习方法来实时注入上下文级知识,并引入了一套高质量的金融知识指令来微调任何通用LLM。最后,我们提出了一个由LLM驱动的金融问答系统IDEA-FinQA。该系统围绕一种利用外部知识进行实时知识注入和事实增强的方案构建。IDEA-FinQA由三个主要模块组成:数据收集器、数据查询模块,以及负责特定功能的基于LLM的代理。

[NLP-140] From RAG to RICHES: Retrieval Interlaced with Sequence Generation
[NLP-140] 从RAG到RICHES:检索与序列生成交织

链接: https://arxiv.org/abs/2407.00361
作者: Palak Jain,Livio Baldini Soares,Tom Kwiatkowski
关键词: present RICHES, RICHES, conventional RAG systems, sequence generation tasks, sequence generation
中文关键词: 提出RICHES、RICHES、传统RAG系统、序列生成任务、序列生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures, Preprint

Abstract:We present RICHES, a novel approach that interleaves retrieval with sequence generation tasks. RICHES offers an alternative to conventional RAG systems by eliminating the need for separate retriever and generator. It retrieves documents by directly decoding their contents, constrained on the corpus. Unifying retrieval with generation allows us to adapt to diverse new tasks via prompting alone. RICHES can work with any Instruction-tuned model, without additional training. It provides attributed evidence, supports multi-hop retrievals and interleaves thoughts to plan on what to retrieve next, all within a single decoding pass of the LLM. We demonstrate the strong performance of RICHES across ODQA tasks including attributed and multi-hop QA.
摘要:我们介绍了RICHES,这是一种将检索与序列生成任务交织在一起的新方法。RICHES无需单独的检索器和生成器,为传统RAG系统提供了一种替代方案。它通过直接解码文档内容来检索文档,解码过程以语料库为约束。将检索与生成统一起来,使我们能够仅通过提示来适应多样化的新任务。RICHES可以与任何经过指令微调的模型一起工作,无需额外训练。它提供可归因的证据,支持多跳检索,并穿插思考以规划接下来检索什么,所有这些都在LLM的单次解码过程中完成。我们展示了RICHES在ODQA任务(包括归因和多跳QA)中的强劲性能。
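
The idea of retrieving by "decoding contents, constrained on the corpus" can be sketched as greedy decoding where each step may only emit a token that keeps the output a prefix of some corpus passage. This is a toy illustration, not the authors' implementation: the linear-scan prefix check, the `score` stand-in for language-model scores, and the three-passage corpus are all assumptions.

```python
def allowed_next_tokens(corpus, prefix):
    """Tokens that keep `prefix` a prefix of at least one corpus passage."""
    n = len(prefix)
    return {p[n] for p in corpus if len(p) > n and p[:n] == prefix}

def constrained_greedy_decode(corpus, score, max_len=10):
    """Greedy decoding restricted, at every step, to corpus-consistent tokens."""
    prefix = []
    while len(prefix) < max_len:
        candidates = allowed_next_tokens(corpus, prefix)
        if not candidates:
            break  # the decoded text has reached the end of a passage
        # sorted() makes tie-breaking deterministic.
        prefix.append(max(sorted(candidates), key=lambda tok: score(prefix, tok)))
    return prefix

# Hypothetical tokenized corpus.
corpus = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["paris", "is", "a", "city"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]

# Hypothetical stand-in for model scores; a real system would use LM logits.
preferred = {"paris": 2.0, "the": 1.0, "capital": 1.0}

def score(prefix, token):
    return preferred.get(token, 0.0)

evidence = constrained_greedy_decode(corpus, score)
```

Because every decoded sequence is guaranteed to occur in the corpus, the output itself serves as attributable evidence. A practical system would replace the linear scan with an index structure suited to large corpora.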

[NLP-141] Korean Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering
[NLP-141] 通过隐式特征对齐和语料库过滤进行韩语基于方面的情感分析

链接: https://arxiv.org/abs/2407.00342
作者: Kibeom Nam
关键词: Aspect-Based Sentiment Analysis, Sentiment Analysis, Korean restaurant reviews, Investigations into Aspect-Based, Aspect-Based Sentiment
中文关键词: 基于方面的情感分析,情感分析,韩国餐厅评论,对基于方面的调查,基于方面的情感
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, EMNLP 2024 (submitted), DMLR@ICML 2024

Abstract:Investigations into Aspect-Based Sentiment Analysis (ABSA) for Korean restaurant reviews are notably lacking in the existing literature. Our research proposes an intuitive and effective framework for ABSA in low-resource languages such as Korean. It optimizes prediction labels by integrating translated benchmark and unlabeled Korean data. Using a model fine-tuned on translated data, we pseudo-labeled the actual Korean NLI set. Subsequently, we applied LaBSE and MSP-based filtering to this pseudo-NLI set as implicit feature, enhancing Aspect Category Detection and Polarity determination through additional training. Incorporating dual filtering, this model bridged dataset gaps, achieving positive results in Korean ABSA with minimal resources. Through additional data injection pipelines, our approach aims to utilize high-resource data and construct effective models within communities, whether corporate or individual, in low-resource language countries. Compared to English ABSA, our framework showed an approximately 3% difference in F1 scores and accuracy. We release the dataset and our code for Korean ABSA, at this link.
摘要:在现有文献中,针对韩语餐馆评论的基于方面的情感分析(ABSA)研究明显不足。我们的研究为韩语等低资源语言的ABSA提出了一个直观而有效的框架。它通过整合翻译后的基准数据和未标注的韩语数据来优化预测标签。我们使用在翻译数据上微调的模型,对实际的韩语NLI集进行了伪标注。随后,我们将LaBSE和基于MSP的过滤应用于这个伪NLI集,将其作为隐式特征,并通过额外的训练来增强方面类别检测和极性判定。结合双重过滤,该模型弥合了数据集之间的差距,以最少的资源在韩语ABSA中取得了积极的结果。通过额外的数据注入管道,我们的方法旨在利用高资源数据,为低资源语言国家的社区(无论是企业还是个人)构建有效的模型。与英语ABSA相比,我们的框架在F1分数和准确率上显示出大约3%的差异。我们通过此链接发布韩语ABSA的数据集和代码。

[NLP-142] Iterative Data Augmentation with Large Language Models for Aspect-based Sentiment Analysis
[NLP-142] 使用大型语言模型进行迭代数据增强,用于基于方面的情感分析

链接: https://arxiv.org/abs/2407.00341
作者: Haiyun Li,Qihuang Zhong,Ke Zhu,Juhua Liu,Bo Du,Dacheng Tao
关键词: Aspect-based Sentiment Analysis, sentiment analysis task, important sentiment analysis, Sentiment Analysis, Aspect-based Sentiment
中文关键词: 基于方面的情感分析,情感分析任务,重要的情感分析,情感分析,基于方面的情感
类目: Computation and Language (cs.CL)
备注: Work in process

Abstract:Aspect-based Sentiment Analysis (ABSA) is an important sentiment analysis task, which aims to determine the sentiment polarity towards an aspect in a sentence. Due to the expensive and limited labeled data, data augmentation (DA) has become the standard for improving the performance of ABSA. However, current DA methods usually have some shortcomings: 1) poor fluency and coherence, 2) lack of diversity of generated data, and 3) reliance on some existing labeled data, hindering its applications in real-world scenarios. In response to these problems, we propose a systematic Iterative Data augmentation framework, namely IterD, to boost the performance of ABSA. The core of IterD is to leverage the powerful ability of large language models (LLMs) to iteratively generate more fluent and diverse synthetic labeled data, starting from an unsupervised sentence corpus. Extensive experiments on 4 widely-used ABSA benchmarks show that IterD brings consistent and significant performance gains among 5 baseline ABSA models. More encouragingly, the synthetic data generated by IterD can achieve comparable or even better performance against the manually annotated data.
摘要:基于方面的情感分析(ABSA)是一项重要的情感分析任务,其目的是确定句子中针对某一方面的情感极性。由于标注数据昂贵且有限,数据增强(DA)已成为提高ABSA性能的标准做法。然而,目前的DA方法通常存在以下缺点:1)流畅性和连贯性差;2)生成的数据缺乏多样性;3)依赖于一些已有的标注数据,阻碍了其在实际场景中的应用。针对这些问题,我们提出了一个系统的迭代数据增强框架,即IterD,以提高ABSA的性能。IterD的核心是利用大型语言模型(LLM)的强大能力,从无监督的句子语料库出发,迭代地生成更流畅、更多样化的合成标注数据。在4个广泛使用的ABSA基准上的大量实验表明,IterD为5个基线ABSA模型带来了一致且显著的性能提升。更令人鼓舞的是,IterD生成的合成数据可以达到与人工标注数据相当甚至更好的性能。

[NLP-143] LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods
[NLP-143] LLM生成的自然语言遇上标度律:新的探索和数据增强方法

链接: https://arxiv.org/abs/2407.00322
作者: Zhenhua Wang,Guang Xu,Ming Ren
关键词: large language models, natural language processing, natural language, witnessed enhancements, LLM
中文关键词: 大型语言模型、自然语言处理、自然语言、见证增强、LLM
类目: Computation and Language (cs.CL)
备注:

Abstract:With the ascent of large language models (LLM), natural language processing has witnessed enhancements, such as LLM-based data augmentation. Nonetheless, prior research harbors two primary concerns: firstly, a lack of contemplation regarding whether the natural language generated by LLM (LLMNL) truly aligns with human natural language (HNL), a critical foundational question; secondly, an oversight that augmented data is randomly generated by LLM, implying that not all data may possess equal training value, that could impede the performance of classifiers. To address these challenges, we introduce the scaling laws to intrinsically calculate LLMNL and HNL. Through extensive experiments, we reveal slight deviations (approximately 0.2 Mandelbrot exponent) from Mandelbrot’s law in LLMNL, underscore a complexity advantage in HNL, and supplement an interpretive discussion on language style. This establishes a solid foundation for LLM’s expansion. Further, we introduce a novel data augmentation method for few-shot text classification, termed ZGPTDA, which leverages fuzzy computing mechanisms driven by the conformity to scaling laws to make decisions about GPT-4 augmented data. Extensive experiments, conducted in real-world scenarios, confirms the effectiveness (improving F1 of Bert and RoBerta by 7-10%) and competitiveness (surpassing recent AugGPT and GENCO methods by about 2% accuracy on DeBerta) of ZGPTDA. In addition, we reveal some interesting insights, e.g., Hilberg’s law and Taylor’s law can impart more benefits to text classification, etc.
摘要:随着大型语言模型(LLM)的兴起,自然语言处理得到了改进,例如基于LLM的数据增强。然而,以往的研究存在两个主要问题:第一,缺乏对LLM生成的自然语言(LLMNL)是否真正与人类自然语言(HNL)一致的思考,而这是一个关键的基础性问题;第二,忽视了增强数据是由LLM随机生成的,这意味着并非所有数据都具有同等的训练价值,可能会妨碍分类器的性能。为了应对这些挑战,我们引入标度律,从内在层面对LLMNL和HNL进行计算。通过大量实验,我们揭示了LLMNL对Mandelbrot定律的轻微偏离(Mandelbrot指数约0.2),强调了HNL在复杂性上的优势,并补充了关于语言风格的解释性讨论。这为LLM的拓展奠定了坚实的基础。此外,我们提出了一种新的用于少样本文本分类的数据增强方法ZGPTDA,它利用由对标度律的符合程度驱动的模糊计算机制,对GPT-4增强的数据进行决策。在真实场景中进行的大量实验证实了ZGPTDA的有效性(将Bert和RoBerta的F1提高了7-10%)和竞争力(在DeBerta上的准确率比最近的AugGPT和GENCO方法高约2%)。此外,我们还揭示了一些有趣的见解,例如Hilberg定律和Taylor定律可以给文本分类带来更多的好处等。

[NLP-144] LiteSearch: Efficacious Tree Search for LLM
[NLP-144] LiteSearch:高效的LLM树搜索

链接: https://arxiv.org/abs/2407.00320
作者: Ante Wang,Linfeng Song,Ye Tian,Baolin Peng,Dian Yu,Haitao Mi,Jinsong Su,Dong Yu
关键词: Monte Carlo Tree, Recent research suggests, dramatically boost LLM, mathematical reasoning tasks, Monte Carlo
中文关键词: 蒙特卡洛树、最近的研究表明、显著提升LLM、数学推理任务、蒙特卡洛
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.
摘要:最近的研究表明,树搜索算法(如蒙特卡洛树搜索)可以显著提高LLM在复杂数学推理任务中的性能。然而,由于搜索策略存在浪费,它们所需的计算资源往往是贪心解码的10倍以上,难以部署到实际应用中。针对这一问题,本研究提出了一种新的引导式树搜索算法,它带有动态节点选择和节点级探索预算(最大子节点数)计算。通过考虑朝向最终答案的搜索进度(历史)以及来自价值网络的指导(未来,该网络的训练不需要任何逐步标注),我们的算法迭代地选择最有希望的树节点,然后在所分配的计算预算范围内对其进行扩展。在GSM8K和TabMWP数据集上进行的实验表明,与基线方法相比,我们的方法不仅性能具有竞争力,而且计算成本显著更低。
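
The loop the abstract describes (select the most promising node, then expand it within a node-level child budget) can be sketched as a best-first search with a per-node cap on children. This is a toy under stated assumptions, not the paper's algorithm: the integer "reasoning states", the `value` function standing in for a value network, and the `budget` rule (fewer children for high-value nodes) are placeholders chosen for illustration.

```python
import heapq

def guided_tree_search(root, expand, value, is_final, budget, max_expansions=100):
    """Best-first search that caps how many children each node may spawn."""
    counter = 0  # tie-breaker so the heap never compares nodes directly
    frontier = [(-value(root), counter, root)]
    expansions = 0
    while frontier and expansions < max_expansions:
        neg_value, _, node = heapq.heappop(frontier)  # most promising node
        if is_final(node):
            return node, expansions
        # Only the first `budget(value)` candidate children are explored.
        for child in expand(node)[: budget(-neg_value)]:
            counter += 1
            heapq.heappush(frontier, (-value(child), counter, child))
        expansions += 1
    return None, expansions

# Hypothetical toy task: reach TARGET from 0 by adding 1, 2, or 3 per step.
TARGET = 10

def expand(n):
    return [n + 3, n + 2, n + 1]

def value(n):
    return -abs(TARGET - n)  # stand-in for a learned value network

def is_final(n):
    return n == TARGET

def budget(v):
    return 1 if v >= -1 else 3  # placeholder rule, not the paper's formula

answer, expansions = guided_tree_search(0, expand, value, is_final, budget)
```

The cost saving comes from the budget: nodes that already look promising spawn few children, so compute is concentrated on uncertain parts of the tree.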

[NLP-145] From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
[NLP-145] 从本地概念到普遍性:评估视觉语言模型的多元文化理解

链接: https://arxiv.org/abs/2407.00263
作者: Mehar Bhatia,Sahithya Ravi,Aditya Chinchure,Eunjeong Hwang,Vered Shwartz
关键词: non-western cultures due, performance remains suboptimal, training datasets, recent advancements, remains suboptimal
中文关键词: 由于非西方文化,性能仍然次优,训练数据集,最近的进步,仍然次优
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under peer review

Abstract:Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models’ cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures – underscoring the necessity for enhancing multicultural understanding in vision-language models.
摘要:尽管视觉语言模型最近取得了进步,但由于训练数据集中的代表性不足,它们在非西方文化图像上的表现仍然不佳。人们提出了各种基准来测试模型的文化包容性,但它们对文化的覆盖范围有限,并且没有充分评估普遍以及特定文化的当地概念的文化多样性。为了解决这些限制,我们引入了GlobalRG基准,其中包括两项具有挑战性的任务:跨共性的检索和文化视觉基础。前一项任务需要检索来自50个国家的普遍概念的文化多样性图像,而后一项任务旨在将特定文化的概念植根于来自15个国家的图像中。我们对各种模型的评估表明,不同文化的表现存在显着差异,这凸显了在视觉语言模型中增强多元文化理解的必要性。

[NLP-146] One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts
[NLP-146] 一个提示是不够的:自动构建专家混合提示

链接: https://arxiv.org/abs/2407.00256
作者: Ruochen Wang,Sohyun An,Minhao Cheng,Tianyi Zhou,Sung Ju Hwang,Cho-Jui Hsieh
关键词: Large Language Models, Large Language, Language Models, exhibit strong generalization, strong generalization capabilities
中文关键词: 大型语言模型,大型语言,语言模型,表现出强大的泛化性,强大的泛化能力
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: ICML 2024. code available at this https URL

Abstract:Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; Each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: Inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: A region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior arts across several major benchmarks.
摘要:在语言指令和上下文示例(demo)的提示下,大型语言模型(LLM)对新任务表现出很强的泛化能力。由于这种能力敏感地依赖于提示的质量,人们已经探索了各种方法来自动化指令设计。虽然这些方法显示了有希望的结果,但它们也将搜索到的提示限制为一条指令。这种简化大大限制了它们的能力,因为单条不含示例的指令可能无法覆盖目标任务的整个复杂问题空间。为了缓解这个问题,我们采用混合专家范式,将问题空间划分为一组子区域;每个子区域由一名专门的专家负责,该专家同时配备一条指令和一组示例。我们设计了一个两阶段过程来为每个区域构建专门的专家:(1)示例分配:受上下文学习与核回归之间理论联系的启发,我们根据示例的语义相似度将其分组到各个专家;(2)指令分配:基于区域的联合搜索为每个专家找到一条指令,与分配给它的示例互补,产生协同效应。由此产生的方法代号为Mixture-of-Prompts(MoP),在几个主要基准上相对于现有技术取得了81%的平均胜率。

[NLP-147] Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription
[NLP-147] 注意空缺:用基于Transformer的转写分析文本缺损

链接: https://arxiv.org/abs/2407.00250
作者: Jaydeep Borkar,David A. Smith
关键词: illegible text resulting, documents frequently suffer, storage damage, frequently suffer, illegible text
中文关键词: 导致文本难以辨认,文档经常遭受损失,存储损坏,经常遭受损失,文本难以辨认
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ICDAR 2024 Workshop on Computational Paleography

Abstract:Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.
摘要:历史文档经常受到损坏和不一致的影响,包括由于孔洞、墨水问题和存储损坏等而导致的文本缺失或难以辨认。这些缺失的部分或空缺被称为缺损(lacunae)。在这项研究中,我们使用基于Transformer的光学字符识别(OCR)模型,并以有监督的方式在包含缺损的合成数据上对其进行训练。我们证明了它们在检测和修复缺损方面的有效性,成功率达到65%,而不了解缺损的基础模型仅实现了5%的修复。此外,我们还研究了该模型的机制性质,例如转写的对数概率,它可以在不直接检查图像的情况下识别行图像中的缺损和其他错误(例如,由于复杂的书写或墨水问题而导致的误转写)。对于希望把包含缺损或错误的图像与干净图像区分开来的学者来说,这种能力可能很有价值。虽然我们探索了注意力机制在标记缺损和转写错误方面的潜力,但我们的发现表明它不是一个重要的因素。我们的工作突出了利用基于Transformer的OCR模型来修复或分析受损历史文档这一有前途的方向。
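
The abstract's point that transcription log probability can flag lacunae without inspecting the image can be sketched as a simple threshold on the mean token log probability per line. The data layout, the line IDs, the numbers, and the threshold below are hypothetical; the paper's actual criterion may differ.

```python
def flag_suspect_lines(line_logprobs, threshold=-1.0):
    """Return (line_id, mean_logprob) pairs whose mean falls below `threshold`."""
    flagged = []
    for line_id, logprobs in line_logprobs.items():
        if not logprobs:
            continue  # nothing was transcribed for this line
        mean_lp = sum(logprobs) / len(logprobs)
        if mean_lp < threshold:
            flagged.append((line_id, mean_lp))
    # Least confident lines first, so they can be reviewed first.
    return sorted(flagged, key=lambda item: item[1])

# Hypothetical per-token log probabilities emitted by an OCR decoder.
line_logprobs = {
    "line_01": [-0.10, -0.20, -0.05],
    "line_02": [-0.30, -2.90, -3.40, -0.20],  # low confidence mid-line
    "line_03": [-0.60, -0.40],
}
suspects = flag_suspect_lines(line_logprobs)
```

In practice the threshold would need tuning on held-out lines, since clean but unusual handwriting can also lower the model's confidence.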

[NLP-148] DiffuseDef: Improved Robustness to Adversarial Attacks
[NLP-148] DiffuseDef:增强对抗攻击的鲁棒性

链接: https://arxiv.org/abs/2407.00248
作者: Zhenhao Li,Marek Rei,Lucia Specia
关键词: Pretrained language models, natural language processing, significantly advanced performance, Pretrained language, language processing tasks
中文关键词: 预训练的语言模型、自然语言处理、显着提高的性能、预训练的语言、语言处理任务
类目: Computation and Language (cs.CL)
备注:

Abstract:Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over different existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.
摘要:预训练的语言模型在各种自然语言处理任务中显著提高了性能。然而,对抗性攻击继续对使用这些模型构建的系统构成严峻挑战,因为精心设计的对抗性文本可以利用这些系统的漏洞。受到扩散模型在计算机视觉中预测和减少噪声的能力的启发,我们提出了一种新颖且灵活的面向语言分类任务的对抗防御方法DiffuseDef,它在编码器和分类器之间引入了一个扩散层作为降噪器。在推理过程中,对抗性隐藏状态首先与采样噪声相结合,然后迭代去噪,最后进行集成以产生稳健的文本表示。通过结合对抗性训练、去噪和集成技术,我们证明了DiffuseDef优于多种现有的对抗防御方法,并在对抗常见对抗性攻击时实现了最先进的性能。

[NLP-149] EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models
[NLP-149] EHRmonize:使用大型语言模型从电子健康记录中提取医疗概念的框架

链接: https://arxiv.org/abs/2407.00242
作者: João Matos,Jack Gallifant,Jian Pei,A. Ian Wong
关键词: Electronic health records, significant clinical expertise, requiring significant clinical, Electronic health, costly task requiring
中文关键词: 电子健康记录、重要的临床专业知识、需要重要的临床、电子健康、需要昂贵的任务
类目: Computation and Language (cs.CL)
备注: submitted for review, total of 10 pages

Abstract:Electronic health records (EHRs) contain vast amounts of complex data, but harmonizing and processing this information remains a challenging and costly task requiring significant clinical expertise. While large language models (LLMs) have shown promise in various healthcare applications, their potential for abstracting medical concepts from EHRs remains largely unexplored. We introduce EHRmonize, a framework leveraging LLMs to abstract medical concepts from EHR data. Our study uses medication data from two real-world EHR databases to evaluate five LLMs on two free-text extraction and six binary classification tasks across various prompting strategies. GPT-4o with 10-shot prompting achieved the highest performance in all tasks, accompanied by Claude-3.5-Sonnet in a subset of tasks. GPT-4o achieved an accuracy of 97% in identifying generic route names, 82% for generic drug names, and 100% in performing binary classification of antibiotics. While EHRmonize significantly enhances efficiency, reducing annotation time by an estimated 60%, we emphasize that clinician oversight remains essential. Our framework, available as a Python package, offers a promising tool to assist clinicians in EHR data abstraction, potentially accelerating healthcare research and improving data harmonization processes.
摘要:电子健康记录(EHR)包含大量复杂的数据,但协调和处理这些信息仍然是一项具有挑战性且成本高昂的任务,需要大量的临床专业知识。虽然大型语言模型(LLM)在各种医疗保健应用中显示出了良好的前景,但它们从EHR中抽象医学概念的潜力在很大程度上仍未被探索。我们介绍了EHRmonize,这是一个利用LLM从EHR数据中抽象医学概念的框架。我们的研究使用来自两个真实世界EHR数据库的用药数据,在两个自由文本提取任务和六个二分类任务上,采用多种提示策略评估了五个LLM。采用10样本提示的GPT-4o在所有任务中取得了最高的性能,在部分任务中与Claude-3.5-Sonnet并列。GPT-4o识别通用给药途径名称的准确率为97%,识别通用药物名称的准确率为82%,对抗生素进行二分类的准确率为100%。虽然EHRmonize显著提高了效率,估计将标注时间减少了60%,但我们强调,临床医生的监督仍然是必不可少的。我们的框架以Python包的形式提供,是一个有前景的工具,可以帮助临床医生进行EHR数据抽象,有望加速医疗保健研究并改进数据协调流程。

[NLP-150] Evaluating Human Alignment and Model Faithfulness of LLM Rationale
[NLP-150] 评估LLM理由的人类对齐度和模型忠实性

链接: https://arxiv.org/abs/2407.00219
作者: Mohsen Fayyaz,Fan Yin,Jiao Sun,Nanyun Peng
关键词: large language models, explain their generations, large language, input texts, texts that reflect
中文关键词: 大型语言模型,解释它们的生成结果,大型语言,输入文本,反映的文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:We study how well large language models (LLMs) explain their generations with rationales – a set of tokens extracted from the input texts that reflect the decision process of LLMs. We examine LLM rationales extracted with two methods: 1) attribution-based methods that use attention or gradients to locate important tokens, and 2) prompting-based methods that guide LLMs to extract rationales using prompts. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales, and demonstrate reasonable alignment with humans even when model performance is poor. We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions. By fine-tuning these models on the corresponding datasets, both prompting and attribution methods demonstrate improved faithfulness. Our study sheds light on more rigorous and fair evaluations of LLM rationales, especially for prompting-based ones.
摘要:我们研究了大型语言模型(LLM)用理由(rationale)解释其生成结果的效果如何,这里的理由是指从输入文本中提取的、反映LLM决策过程的一组词元。我们考察了用两种方法提取的LLM理由:1)基于归因的方法,它使用注意力或梯度来定位重要的词元;2)基于提示的方法,它通过提示引导LLM提取理由。通过大量的实验,我们表明,与基于归因的理由相比,基于提示的理由与人类标注的理由更为一致,并且即使在模型性能较差的情况下,也表现出与人类的合理一致性。此外,我们还发现,以往工作中指出的基于提示的方法在忠实性上的局限,可能与其坍缩的预测有关。通过在相应的数据集上微调这些模型,提示方法和归因方法都表现出更高的忠实性。我们的研究有助于对LLM理由,尤其是基于提示的理由,进行更严格、更公正的评估。

[NLP-151] Detection and Measurement of Syntactic Templates in Generated Text
[NLP-151] 生成文本中句法模板的检测和测量

链接: https://arxiv.org/abs/2407.00211
作者: Chantal Shaib,Yanai Elazar,Junyi Jessy Li,Byron C. Wallace
关键词: Recent work, focused on word-level, Recent, templates, word-level features
中文关键词: 最近的工作,重点是单词级、最近的、模板、单词级功能
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning processes such as RLHF. This connection to the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.
摘要:最近评估LLM生成文本多样性的工作主要集中在词级特征上。在这里,我们对句法特征进行分析,以刻画模型中超出高频n-gram范围的一般性重复。具体地说,我们定义了句法模板,并表明模型在下游任务中产生模板化文本的比例高于人类参考文本。我们发现,模型生成文本中的大多数(76%)模板可以在预训练数据中找到(相比之下,人类撰写的文本中这一比例仅为35%),并且在RLHF等微调过程中不会被覆盖。这种与预训练数据的联系使我们能够在无法获得预训练数据的模型中分析句法模板。我们还发现,作为特征的模板能够区分模型、任务和领域,并且有助于定性评估常见的模型构造。最后,我们展示了模板可以作为一种有用的工具,用于分析LLM对训练数据的风格记忆。
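
One way to make "syntactic templates" concrete is to treat a template as a part-of-speech n-gram that recurs across a collection of texts, then measure how many texts contain at least one. The sketch below assumes exactly that; the n-gram length, the repetition threshold, and the pre-tagged toy inputs are illustrative choices, not the paper's settings.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """All contiguous POS-tag n-grams of one text."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def find_templates(tagged_texts, n=4, min_count=2):
    """POS n-grams that occur at least `min_count` times in the collection."""
    counts = Counter(g for tags in tagged_texts for g in pos_ngrams(tags, n))
    return {g for g, c in counts.items() if c >= min_count}

def template_rate(tagged_texts, templates, n=4):
    """Fraction of texts containing at least one template."""
    hits = sum(
        1 for tags in tagged_texts
        if any(g in templates for g in pos_ngrams(tags, n))
    )
    return hits / len(tagged_texts)

# Hypothetical inputs, already POS-tagged (Penn Treebank style tags).
tagged_texts = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["PRP", "VBD", "RB"],
]
templates = find_templates(tagged_texts)
rate = template_rate(tagged_texts, templates)
```

Working on tag sequences rather than words is what lets this capture repeated structure even when the surface vocabulary differs.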

[NLP-152] MetaKP: On-Demand Keyphrase Generation
[NLP-152] MetaKP:按需关键短语生成

链接: https://arxiv.org/abs/2407.00191
作者: Di Wu,Xiaoxian Shen,Kai-Wei Chang
关键词: Traditional keyphrase prediction, prediction methods predict, Traditional keyphrase, failing to cater, predict a single
中文关键词: 传统关键短语预测,预测方法预测,传统关键短语,未能迎合,预测单一
类目: Computation and Language (cs.CL)
备注:

Abstract:Traditional keyphrase prediction methods predict a single set of keyphrases per document, failing to cater to the diverse needs of users and downstream applications. To bridge the gap, we introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that conform to specific high-level goals or intents. For this task, we present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases. Leveraging MetaKP, we design both supervised and unsupervised methods, including a multi-task fine-tuning approach and a self-consistency prompting method with large language models. The results highlight the challenges of supervised fine-tuning, whose performance is not robust to distribution shifts. By contrast, the proposed self-consistency prompting approach greatly improves the performance of large language models, enabling GPT-4o to achieve 0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model. Finally, we demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.
摘要:传统的关键短语预测方法对每个文档只预测一组关键短语,不能满足用户和下游应用的多样化需求。为了弥补这一差距,我们引入了按需关键短语生成,这是一种新的范式,要求关键短语符合特定的高级目标或意图。对于这项任务,我们提出了MetaKP,这是一个大规模基准,包括四个数据集、7500个文档和3760个目标,涵盖新闻和生物医学领域,并带有人工标注的关键短语。利用MetaKP,我们设计了有监督和无监督的方法,包括多任务微调方法和使用大型语言模型的自洽性提示方法。结果突出了监督微调面临的挑战,其性能对分布偏移不够鲁棒。相比之下,所提出的自洽性提示方法大大提高了大型语言模型的性能,使GPT-4o达到了0.548的SemF1,超过了完全微调的BART-base模型。最后,我们展示了我们的方法作为通用NLP基础设施的潜力,例如它在社交媒体流行病事件检测中的应用。
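
The self-consistency idea mentioned in the abstract can be illustrated with a small voting sketch. This is an assumption about the general technique (sample several answers, keep what most samples agree on), not the MetaKP authors' actual prompting method; the function name, the `min_fraction` parameter, and the sample data are hypothetical, and the sampled keyphrase lists stand in for repeated LLM outputs.

```python
from collections import Counter

def self_consistent_keyphrases(samples, min_fraction=0.5):
    """Keep keyphrases that appear in at least `min_fraction` of the samples.

    `samples` is a list of keyphrase lists, one list per sampled generation.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    counts = Counter()
    for sample in samples:
        # Normalise case/whitespace and count each phrase once per sample.
        counts.update({kp.strip().lower() for kp in sample})
    threshold = min_fraction * len(samples)
    kept = [kp for kp, c in counts.items() if c >= threshold]
    # Most frequently agreed-upon phrases first, ties broken alphabetically.
    return sorted(kept, key=lambda kp: (-counts[kp], kp))

# Stand-in for three sampled generations for the same document and goal.
samples = [
    ["Vaccine rollout", "side effects"],
    ["vaccine rollout", "clinical trial"],
    ["vaccine rollout", "side effects", "dosage"],
]
result = self_consistent_keyphrases(samples)
print(result)  # ['vaccine rollout', 'side effects']
```

Phrases produced by only one of the three samples are dropped, which is the stabilising effect that voting across samples is meant to provide.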

[NLP-153] Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach
[NLP-153] GPT-4可以帮助检测戒电子烟意图吗?自动数据标注方法的探索

链接: https://arxiv.org/abs/2407.00167
作者: Sai Krishna Revanth Vuruma,Dezhi Wu,Saborny Sen Gupta,Lucas Aust,Valerie Lookingbill,Wyatt Bellamy,Yang Ren,Erin Kasson,Li-Shiun Chen,Patricia Cavazos-Rehg,Dian Hu,Ming Huang
关键词: use-associated lung injury, United States, EVALI outbreak, vaping use-associated lung, comprehend vaping behaviors
中文关键词: 使用相关肺损伤,美国,EVALI爆发,电子烟使用相关肺,了解电子烟行为
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: Accepted for the AI Applications in Public Health and Social Services workshop at the 22nd International Conference on Artificial Intelligence in Medicine (AIME 2024)

Abstract:In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users’ quit-vaping intentions. Leveraging OpenAI’s latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users’ subtle intentions that may elude human detection.
摘要:近年来,美国吸电子烟的人数大幅增加,导致电子烟使用相关肺损伤(EVALI)病例显著上升,并在2019年EVALI疫情期间造成住院和死亡,突显了了解电子烟使用行为并制定有效戒断策略的紧迫性。由于社交媒体平台无处不在,全球超过47亿用户使用它们进行联络、交流、获取新闻和娱乐,其中很大一部分讨论与健康有关,从而使社交媒体数据成为公共卫生研究的宝贵有机数据资源。在这项研究中,我们从Reddit上的一个电子烟子社区中提取了一个样本数据集,以分析用户戒电子烟的意图。本研究利用OpenAI最新的大型语言模型GPT-4进行句子级的戒电子烟意图检测,并将该模型的结果与非专业人员和临床专家的标注进行比较。我们采用零样本、单样本、少样本和思维链提示等不同的提示策略,编写了8个详细程度不同的提示来向GPT-4解释任务,并对各策略的表现进行了相互比较。这些初步发现强调了GPT-4在社交媒体数据分析方面的潜力,特别是在识别可能被人类忽略的用户微妙意图方面。

[NLP-154] The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic
[NLP-154] Qiyas基准:用阿拉伯语衡量ChatGPT的数学和语言理解能力

链接: https://arxiv.org/abs/2407.00146
作者: Shahad Al-Khalifa,Hend Al-Khalifa
关键词: models pre-trained exclusively, language models pre-trained, Arabic data, Arabic, growing importance
中文关键词: 专门预训练的模型,预训练的语言模型,阿拉伯语数据,阿拉伯语,重要性日益增长
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has led to limited benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models’ mathematical reasoning and language understanding abilities in Arabic. These benchmarks are derived from a General Aptitude Test (GAT) called the Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia. For validation purposes, we assess the performance of ChatGPT-3.5-turbo and ChatGPT-4 on our benchmarks. Our findings reveal that these benchmarks pose a significant challenge, with ChatGPT-4 achieving an overall average accuracy of 64%, while ChatGPT-3.5-turbo achieved an overall accuracy of 49% across the various question types in the Qiyas benchmark. We believe the release of these benchmarks will pave the way for enhancing the mathematical reasoning and language understanding capabilities of future models tailored for the low-resource Arabic language.
摘要:尽管阿拉伯语作为一种全球语言的重要性与日俱增,但明显缺乏专门基于阿拉伯语数据预训练的语言模型。这一短缺导致可用于评估语言模型阿拉伯语性能的基准有限。为了弥补这一差距,我们引入了两个新的基准,旨在评估模型在阿拉伯语中的数学推理和语言理解能力。这些基准来自一种名为Qiyas考试的通用能力倾向测试(GAT),这是一种在沙特阿拉伯大学招生中广泛使用的标准化考试。为了进行验证,我们在我们的基准上评估了ChatGPT-3.5-turbo和ChatGPT-4的性能。我们的研究结果表明,这些基准构成了巨大的挑战:ChatGPT-4的总体平均准确率为64%,而ChatGPT-3.5-turbo在Qiyas基准的各种题型上的总体准确率为49%。我们相信,这些基准的发布将为增强未来面向低资源阿拉伯语的模型的数学推理和语言理解能力铺平道路。

[NLP-155] A Simple Attention-Based Mechanism for Bimodal Emotion Classification
[NLP-155] 一种基于注意力的简单双模态情感分类机制

链接: https://arxiv.org/abs/2407.00134
作者: Mazen Elabd,Sardar Jaf
关键词: Big data, learning important features, Big, important features, learning
中文关键词: 大数据,学习重要特征,大,重要特征,学习
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 5 figures, 5 tables

Abstract:Big data contain rich information for machine learning algorithms to utilize when learning important features during classification tasks. Human beings express their emotion using certain words, speech (tone, pitch, speed) or facial expression. Artificial Intelligence approaches to emotion classification are largely based on learning from textual information. However, public datasets containing text and speech data provide sufficient resources to train machine learning algorithms for the task of emotion classification. In this paper, we present novel bimodal deep learning-based architectures enhanced with an attention mechanism, trained and tested on text and speech data for emotion classification. We report details of different deep learning based architectures and show the performance of each architecture including rigorous error analyses. Our findings suggest that deep learning based architectures trained on different types of data (text and speech) outperform architectures trained only on text or speech. Our proposed attention-based bimodal architecture outperforms several state-of-the-art systems in emotion classification.
摘要:大数据包含了丰富的信息,机器学习算法在分类任务中学习重要特征时可以利用这些信息。人类通过特定的词语、语音(语气、音高、语速)或面部表情来表达情感。人工智能的情感分类方法在很大程度上基于对文本信息的学习。然而,包含文本和语音数据的公共数据集为训练用于情感分类任务的机器学习算法提供了足够的资源。在本文中,我们提出了新的基于深度学习的双模态架构,这些架构通过注意力机制得到增强,并在文本和语音数据上进行训练和测试以用于情感分类。我们报告了不同的基于深度学习的架构的详细信息,并展示了每种架构的性能,包括严格的错误分析。我们的发现表明,在不同类型数据(文本和语音)上训练的基于深度学习的架构优于仅在文本或语音上训练的架构。我们提出的基于注意力的双模态架构在情感分类方面优于几个最先进的系统。

[NLP-156] Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
[NLP-156] Granite函数调用模型:通过颗粒任务的多任务学习引入函数调用能力

链接: https://arxiv.org/abs/2407.00121
作者: Ibrahim Abdelaziz,Kinjal Basu,Mayank Agarwal,Sadhana Kumaravel,Matthew Stallone,Rameswar Panda,Yara Rizk,GP Bhargav,Maxwell Crouse,Chulaka Gunasekara,Shajith Ikbal,Sachin Joshi,Hima Karanam,Vineet Kumar,Asim Munawar,Sumit Neelam,Dinesh Raghu,Udit Sharma,Adriana Meza Soria,Dheeraj Sreedhar,Praveen Venkateswaran,Merve Unuvar,David Cox,Salim Roukos,Luis Lastras,Pavan Kapanipathi
关键词: Large language models, recently shown tremendous, shown tremendous promise, Large language, function calling
中文关键词: 大型语言模型,最近展示了巨大的潜力,展示了巨大的前景,大型语言,函数调用
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex tasks. These tasks together are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling, those being Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets comparing GRANITE-20B-FUNCTIONCALLING to more than 15 other best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets.
摘要:大型语言模型(LLM)最近在充当智能体系统的主干方面显示出了巨大的前景,它们在SWE-Bench和Agent-Bench等多方面、具有挑战性的基准测试中的表现就证明了这一点。然而,要实现LLM作为自主智能体的真正潜力,它们必须学会识别、调用外部工具和应用程序接口(API)并与其交互,以完成复杂的任务。这些任务统称为函数调用。赋予LLM函数调用能力会带来许多优势,例如访问数据库和知识源中的最新信息和特定领域信息,以及将任务外包给能够可靠执行这些任务的工具(例如Python解释器或计算器)。虽然在使用LLM进行函数调用方面已经取得了重大进展,但仍然缺乏性能与GPT、Claude和Gemini等专有LLM相当的开放模型。因此,在这项工作中,我们以Apache 2.0许可证发布了GRANITE-20B-FUNCTIONCALLING模型。该模型使用多任务训练方法,在函数调用所包含的七个基本任务上进行训练,这些任务分别是嵌套函数调用、函数链接、并行函数、函数名称检测、参数-值对检测、下一最佳函数和响应生成。我们在多个域外数据集上进行了综合评估,将GRANITE-20B-FUNCTIONCALLING与其他超过15个最好的专有和开放模型进行了比较。GRANITE-20B-FUNCTIONCALLING在Berkeley Function Calling Leaderboard上取得了所有开放模型中的最佳性能,总体排名第四。由于用于训练模型的任务和数据集具有多样性,我们表明GRANITE-20B-FUNCTIONCALLING在七个不同评估数据集的多个任务上具有更好的泛化能力。
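
To make "function calling" concrete, here is a minimal sketch of the dispatch side: the model emits a structured call and the host program runs the matching tool. The JSON shape (`name` plus `arguments`), the tool names, and the `dispatch` helper are all assumptions for illustration; they are not the Granite model's actual output format or API.

```python
import json

# Hypothetical tools; names and signatures are illustrative only.
def add(a: float, b: float) -> float:
    return a + b

def get_length(text: str) -> int:
    return len(text)

TOOLS = {"add": add, "get_length": get_length}

def dispatch(call_json: str):
    """Run one model-emitted call of the assumed form
    {"name": "<tool>", "arguments": {...}}."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        # Function name detection failed: the model asked for an unknown tool.
        raise KeyError(f"unknown tool: {name}")
    # Parameter-value pairs are passed through as keyword arguments.
    return TOOLS[name](**call["arguments"])

# Stand-in for text a model might emit.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(dispatch(model_output))  # 5
```

Nested or chained calls, two of the seven tasks listed in the abstract, would feed one call's result into the arguments of the next; that step is left out to keep the sketch short.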

[NLP-157] Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
[NLP-157] 用于对话中多模态情感识别的高效长距离潜在关系感知图神经网络

链接: https://arxiv.org/abs/2407.00119
作者: Yuntao Shou,Wei Ai,Jiayi Du,Tao Meng,Haiyan Liu
关键词: genuine emotional state, graph neural networks, aims to analyze, multi-modal emotion recognition, analyze the genuine
中文关键词: 真实的情感状态,图神经网络,旨在分析、多模式情感识别、分析真实的
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 3 tables

Abstract:The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.
摘要:对话中的多模态情感识别(MERC)任务旨在根据对话中的多模态信息分析每个话语的真实情感状态,这对对话理解至关重要。现有的方法侧重于使用图神经网络(GNN)来建模对话关系并捕获上下文潜在语义关系。然而,由于GNN的复杂性,现有的方法不能高效地捕捉长距离话语之间的潜在依赖关系,这限制了MERC的性能。本文提出了一种高效的长距离潜在关系感知图神经网络(ELR-GNN),用于对话中的多模态情感识别。具体地说,我们首先使用预先提取的文本、视频和音频特征作为Bi-LSTM的输入,以捕获上下文语义信息并获得低层话语特征。然后,我们利用低层话语特征来构建对话情感交互图。为了高效地捕获长距离话语之间的潜在依赖关系,我们使用扩张广义前向推送算法预先计算全局话语之间的情感传播,并设计了一个情感关系感知算子来捕捉不同话语之间的潜在语义关联。此外,我们结合早期融合和自适应后期融合机制来融合说话人关系信息与上下文之间的潜在依赖信息。最后,我们获得高层话语特征,并将其输入MLP进行情感预测。大量实验结果表明,ELR-GNN在基准数据集IEMOCAP和MELD上达到了最先进的性能,运行时间分别减少了52%和35%。
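
The abstract refers to a dilated generalized forward push algorithm without giving details. As background only, the sketch below shows the classic forward push procedure for approximate personalized PageRank, which precomputes how much "mass" from a source node reaches other nodes. It is an assumption that the paper's variant belongs to this family; the graph, `alpha`, and `eps` values are hypothetical.

```python
def forward_push(adj, source, alpha=0.15, eps=1e-4):
    """Classic forward push for approximate personalized PageRank.

    adj: dict mapping node -> list of neighbour nodes.
    Returns (estimate, residual); a node is pushed while its residual
    is at least eps times its degree.
    """
    estimate = {}
    residual = {source: 1.0}
    frontier = [source]
    while frontier:
        u = frontier.pop()
        degree = len(adj[u])
        mass = residual.get(u, 0.0)
        if degree == 0 or mass < eps * degree:
            continue  # nothing worth pushing from this node
        # Keep an alpha fraction at u, spread the rest over its neighbours.
        estimate[u] = estimate.get(u, 0.0) + alpha * mass
        residual[u] = 0.0
        share = (1.0 - alpha) * mass / degree
        for v in adj[u]:
            residual[v] = residual.get(v, 0.0) + share
            if residual[v] >= eps * len(adj[v]):
                frontier.append(v)
    return estimate, residual

# Toy utterance graph (hypothetical): nodes are utterances, edges are links.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
estimate, residual = forward_push(graph, source=0)
```

Because each push only touches a node's neighbours, the cost depends on `eps` rather than on the full graph size, which is the usual reason this kind of precomputation is used for long-range propagation.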

[NLP-158] OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
[NLP-158] OmniJARVIS:统一的视觉-语言-动作词元化赋能开放世界指令跟随智能体

链接: https://arxiv.org/abs/2407.00114
作者: Zihao Wang,Shaofei Cai,Zhancun Mu,Haowei Lin,Ceyao Zhang,Xuejie Liu,Qing Li,Anji Liu,Xiaojian Ma,Yitao Liang
关键词: open-world instruction-following agents, instruction-following agents, open-world Minecraft, behavior, tokens
中文关键词: 开放世界指令跟随智能体,指令跟随智能体,开放世界Minecraft,行为,词元
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.
摘要:我们提出了OmniJARVIS,这是一种新的视觉-语言-动作(VLA)模型,用于开放世界Minecraft中的开放世界指令跟随智能体。与以往要么向独立的控制器输出文本目标、要么直接产生控制命令的工作相比,OmniJARVIS寻求一条不同的途径,通过对多模态交互数据进行统一的词元化来确保强大的推理能力和高效的决策能力。首先,我们引入了一种自监督方法来学习行为编码器,它为行为轨迹 $\tau = \{o_0, a_0, \dots\}$ 产生离散化的词元,以及以这些词元为条件的模仿学习(IL)策略解码器。这些额外的行为词元将被添加到预训练多模态语言模型(MLM)的词汇表中。然后,使用这个编码器,我们将涉及任务指令、记忆、思考、观察、文本响应、行为轨迹等的长期多模态交互打包成统一的词元序列,并使用自回归Transformer对它们进行建模。得益于语义上有意义的行为词元,最终得到的VLA模型OmniJARVIS可以推理(通过生成思维链)、规划、回答问题和行动(通过为IL策略解码器生成行为词元)。OmniJARVIS在开放世界Minecraft中一系列全面的原子任务、程序性任务和开放式任务上表现出色。我们的分析进一步揭示了交互数据构建、统一词元化及其扩展潜力方面的关键设计原则。

[NLP-159] Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models
[NLP-159] 利用微调小语言模型准确预测配体-蛋白质相互作用亲和力

链接: https://arxiv.org/abs/2407.00111
作者: Ben Fauber
关键词: instruction fine-tuned pretrained, fine-tuned pretrained generative, pretrained generative small, generative small language, small language models
中文关键词: 指令微调预训练、微调预训练生成式、预训练生成式小型、生成式小型语言、小型语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.
摘要:我们描述了使用经过指令微调的预训练生成式小语言模型(SLM)对配体-蛋白质相互作用(LPI)亲和力(也称为药物-靶点相互作用(DTI))的准确预测。我们在零样本设置下,对样本外数据上与配体-蛋白质相互作用相关的一系列亲和力值实现了准确预测。仅使用配体的SMILES字符串和蛋白质的氨基酸序列作为模型输入。我们的结果表明,在准确预测一系列配体-蛋白质相互作用亲和力方面,与基于机器学习(ML)和自由能微扰(FEP+)的方法相比有了明显的改进,这可用于进一步加速针对具有挑战性的治疗靶点的药物发现工作。

[NLP-160] A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling
[NLP-160] 专业字幕场景下的上下文机器翻译案例研究

链接: https://arxiv.org/abs/2407.00108
作者: Sebastian Vincent,Charlotte Prescott,Chris Bayliss,Chris Oakley,Carolina Scarton
关键词: enhance translation quality, Incorporating extra-textual context, translation quality, Incorporating extra-textual, machine translation
中文关键词: 提高翻译质量,融入文本外上下文,翻译质量,融入文本外,机器翻译
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted to EAMT 2024

Abstract:Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV subtitles with a focus on how leveraging extra-textual context impacts post-editing. We found that post-editors marked significantly fewer context-related errors when correcting the outputs of MTCue, the context-aware model, as opposed to non-contextual models. We also present the results of a survey of the employed post-editors, which highlights contextual inadequacy as a significant gap consistently observed in MT. Our findings strengthen the motivation for further work within fully contextual MT.
摘要:正如最近工作中的自动评估所表明的那样,将电影元数据等文本外上下文融入机器翻译(MT)流程可以提高翻译质量。然而,此类系统在行业中的积极影响尚未得到证实。我们报告了一项工业案例研究,旨在调查MT在翻译电视字幕的专业场景中的好处,重点关注利用文本外上下文如何影响译后编辑。我们发现,与非上下文模型相比,在修正上下文感知模型MTCue的输出时,译后编辑人员标记的上下文相关错误明显较少。我们还介绍了对受雇译后编辑人员的调查结果,该调查强调上下文不足是MT中一贯观察到的重大缺陷。我们的发现增强了在完全上下文化的MT方向上进一步开展工作的动力。

[NLP-161] UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
[NLP-161] UnUnlearning:Unlearning不足以实现先进生成人工智能中的内容监管

链接: https://arxiv.org/abs/2407.00106
作者: Ilia Shumailov,Jamie Hayes,Eleni Triantafillou,Guillermo Ortiz-Jimenez,Nicolas Papernot,Matthew Jagielski,Itay Yona,Heidi Howard,Eugene Bagdasaryan
关键词: Exact unlearning, allowed a user, user to retract, retract their data, data from machine
中文关键词: 精确的取消学习,允许用户、用户撤回、撤回他们的数据、来自机器的数据
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

Abstract:Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning is often discussed as an approach for removal of impermissible knowledge, i.e., knowledge that the model should not possess, such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation. We discuss feasibility of ununlearning for modern LLMs and examine broader implications.
摘要:精确遗忘(exact unlearning)最初是作为一种隐私机制引入的,它允许用户按请求从机器学习模型中撤回他们的数据。不久之后,不精确的方案被提出,以减轻与精确遗忘相关的不切实际的成本。最近,遗忘常被当作一种移除不允许知识的方法来讨论,即模型不应拥有的知识,例如未经许可的受版权保护的信息、不准确的信息或恶意信息。其承诺是,如果模型不具备某种恶意能力,它就不能被用于相关的恶意目的。在本文中,我们重新审视了在大型语言模型(LLM)中使用遗忘的范式,并强调了由上下文学习引起的一个潜在的不一致。遗忘可以成为训练阶段的有效控制机制,但它并不能阻止模型在推理过程中执行不被允许的行为。我们引入了ununlearning的概念,即被遗忘的知识在上下文中被重新引入,从而使模型能够表现得好像它知道被遗忘的知识一样。因此,我们认为需要针对不允许的知识进行内容过滤,即使是精确遗忘方案也不足以实现有效的内容监管。我们讨论了ununlearning对现代LLM的可行性,并考察了更广泛的影响。

[NLP-162] Curriculum Learning with Quality-Driven Data Selection
[NLP-162] 质量驱动数据选择的课程学习

链接: https://arxiv.org/abs/2407.00102
作者: Biao Wu,Fang Meng,Ling Chen
关键词: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, impressive multimodal capabilities
中文关键词: 多模式大型语言、大型语言模型、大型语言、多模式大型、令人印象深刻的多模式能力
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The impressive multimodal capabilities demonstrated by OpenAI’s GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two attributes, mapping data quality into a two-dimensional space that allows for the selection of data based on their location within this distribution. By utilizing this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in five commonly assessed capabilities compared to using the complete dataset. Our codes, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4
摘要:OpenAI的GPT-4展示了令人印象深刻的多模态能力,这引起了人们对多模态大型语言模型(MLLM)开发的浓厚兴趣。使用机器生成的指令跟随数据对MLLM进行视觉指令微调,已被证明可以增强各种任务上的零样本能力。然而,在控制指令数据质量方面的探索一直有限。目前,MLLM中的数据选择方法往往依赖于单一的、不可靠的分数,或使用下游任务进行选择,这既耗时,又可能导致在所选评估数据集上过拟合。为了缓解这些局限性,我们提出了一种新的数据选择方法,该方法利用图文相关性和模型困惑度来评估和选择不同质量的数据。该方法利用这两个属性的不同分布,将数据质量映射到二维空间,从而允许根据数据在该分布中的位置来选择数据。通过利用这个空间,我们可以分析用作提示的任务类型设置对数据质量的影响。此外,这个空间可以用来构建不同质量的多阶段子集,以促进课程学习。我们的研究包括在各种数据集上进行的全面实验。结果表明,与使用完整数据集相比,五项常用评估能力均有显著提升。我们的代码、数据和模型可在以下网址公开获得:https://anonymous.4open.science/r/EHIT-31B4

[NLP-163] Enhancing In-Context Learning via Implicit Demonstration Augmentation
[NLP-163] 通过内隐演示增强增强上下文学习

链接: https://arxiv.org/abs/2407.00100
作者: Xiaoling Zhou,Wei Ye,Yidong Wang,Chaoya Jiang,Zhemg Lee,Rui Xie,Shikun Zhang
关键词: enables large pre-trained, pre-trained language models, large pre-trained language, ICL effectiveness heavily, in-context learning
中文关键词: 实现大型预训练、预训练语言模型、大型预训练语言、ICL有效性、上下文学习
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2024 Main; 19 pages, 10 figures

Abstract:The emergence of in-context learning (ICL) enables large pre-trained language models (PLMs) to make predictions for unseen inputs without updating parameters. Despite its potential, ICL’s effectiveness heavily relies on the quality, quantity, and permutation of demonstrations, commonly leading to suboptimal and unstable performance. In this paper, we tackle this challenge for the first time from the perspective of demonstration augmentation. Specifically, we start with enriching representations of demonstrations by leveraging their deep feature distribution. We then theoretically reveal that when the number of augmented copies approaches infinity, the augmentation is approximately equal to a novel logit calibration mechanism integrated with specific statistical properties. This insight results in a simple yet highly efficient method that significantly improves the average and worst-case accuracy across diverse PLMs and tasks. Moreover, our method effectively reduces performance variance among varying demonstrations, permutations, and templates, and displays the capability to address imbalanced class distributions.
摘要:上下文学习(ICL)的出现使大型预训练语言模型(PLM)能够在不更新参数的情况下对未见过的输入进行预测。尽管有潜力,但ICL的有效性在很大程度上依赖于演示的质量、数量和排列顺序,通常会导致次优和不稳定的性能。在本文中,我们首次从演示增强的角度来应对这一挑战。具体地说,我们首先利用演示的深层特征分布来丰富演示的表示。然后,我们从理论上揭示,当增强副本的数量趋于无穷大时,这种增强近似等价于一种结合了特定统计特性的新的logit校准机制。这一洞察带来了一种简单但高效的方法,显著提高了不同PLM和任务上的平均准确率和最差情况准确率。此外,我们的方法有效地减少了不同演示、排列和模板之间的性能方差,并显示了处理类别分布不平衡的能力。

[NLP-164] ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
[NLP-164] ARES:交替强化学习和有监督的微调,通过多样化的人工智能反馈增强多模式思维链推理

链接: https://arxiv.org/abs/2407.00087
作者: Ju-Seung Byun,Jiyun Chun,Jihyung Kil,Andrew Perrault
关键词: Large Multimodal Models, Large Multimodal, comprehending human instructions, demonstrate remarkable results, excel at comprehending
中文关键词: 大型多模式模型,大型多模式,理解人类指令,表现出显着的结果,擅长理解
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning them with specific preferences. These methods primarily use ranking-based feedback for entire generations. With advanced AI models (Teacher), such as GPT-4 and Claude 3 Opus, we can request various types of detailed feedback that are expensive for humans to provide. We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT). This sentence-level feedback allows us to consider individual valuable segments, providing more granular rewards for the RL procedure. Second, we ask the Teacher to correct the wrong reasoning after the RL stage. The RL procedure requires massive efforts for hyperparameter tuning and often generates errors like repetitive words and incomplete sentences. With the correction feedback, we stabilize the RL fine-tuned model through SFT. We conduct experiments on the multi-modal datasets ScienceQA and A-OKVQA to demonstrate the effectiveness of our proposal. ARES rationale reasoning achieves around 70% win rate against baseline models judged by GPT-4o. Additionally, we observe that the improved rationale reasoning leads to a 2.5% increase in inference answer accuracy on average for the multi-modal datasets.
摘要:大型多模态模型(LMM)擅长理解人类的指令,并在广泛的任务上取得了显著的结果。基于人类反馈的强化学习(RLHF)和基于人工智能反馈的强化学习(RLAIF)通过使LLM与特定偏好对齐来进一步改进LLM。这些方法主要对整段生成结果使用基于排名的反馈。借助先进的人工智能模型(教师),如GPT-4和Claude 3 Opus,我们可以请求各种类型的详细反馈,而这些反馈由人类提供成本高昂。我们提出了一种交替进行强化学习(RL)和监督微调(SFT)的两阶段算法ARES。首先,我们要求教师对思维链(CoT)中每句话对解决问题的贡献程度进行评分。这种句子级反馈使我们能够考虑各个有价值的片段,为RL过程提供更细粒度的奖励。其次,我们要求教师在RL阶段之后纠正错误的推理。RL过程需要大量的超参数调整工作,并且经常会产生重复单词和不完整句子等错误。借助纠正反馈,我们通过SFT稳定RL微调后的模型。我们在多模态数据集ScienceQA和A-OKVQA上进行了实验,以证明我们方案的有效性。在由GPT-4o评判的情况下,ARES的推理依据相对于基线模型取得了约70%的胜率。此外,我们观察到,改进的推理依据使多模态数据集上的推理答案准确率平均提高2.5%。

[NLP-165] Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
[NLP-165] Logicbreaks:理解基于规则的推理颠覆的框架

链接: https://arxiv.org/abs/2407.00075
作者: Anton Xue,Avishree Khare,Rajeev Alur,Surbhi Goel,Eric Wong
关键词: subvert language models, propositional Horn logic, language models, models, large language models
中文关键词: 颠覆语言模型、命题Horn逻辑、语言模型、模型、大型语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

Abstract:We study how to subvert language models from following the rules. We model rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if P and Q, then R" for some propositions P, Q, and R. We prove that although transformers can faithfully abide by such rules, maliciously crafted prompts can nevertheless mislead even theoretically constructed models. Empirically, we find that attacks on our theoretical models mirror popular attacks on large language models. Our work suggests that studying smaller theoretical models can help understand the behavior of large language models in rule-based settings like logical reasoning and jailbreak attacks.
摘要:我们研究如何颠覆语言模型,使其不再遵循规则。我们将规则遵循建模为命题Horn逻辑中的推理,这是一个数学系统,其中规则具有“如果P且Q,则R”的形式,P、Q和R为某些命题。我们证明,尽管Transformer可以忠实地遵守这些规则,但恶意构造的提示仍然可以误导理论上构建的模型。从经验上看,我们发现对我们理论模型的攻击与针对大型语言模型的流行攻击如出一辙。我们的工作表明,研究较小的理论模型有助于理解大型语言模型在逻辑推理和越狱攻击等基于规则的场景中的行为。
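
Inference in propositional Horn logic, as described in the abstract, can be sketched with plain forward chaining: keep applying rules of the form "if P and Q, then R" until no new proposition can be derived. This is the textbook procedure, shown only to illustrate what rule-following means here; the rule set and helper name are hypothetical and this is not the authors' transformer construction.

```python
def forward_chain(rules, facts):
    """Derive everything that follows from `facts` under Horn `rules`.

    rules: list of (premises, conclusion) pairs, premises being a frozenset.
    facts: iterable of propositions known to be true.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule once all of its premises are known.
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical rule set: "if P and Q then R", "if R then S", "if T then U".
rules = [
    (frozenset({"P", "Q"}), "R"),
    (frozenset({"R"}), "S"),
    (frozenset({"T"}), "U"),
]
derived = forward_chain(rules, {"P", "Q"})
print(sorted(derived))  # ['P', 'Q', 'R', 'S']
```

In this framing, a successful attack would make a model's output differ from `derived`, for example by omitting S or by asserting U even though T was never given.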

[NLP-166] Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation
[NLP-166] Pistis-RAG:迈向可信检索增强生成的可扩展级联框架

链接: https://arxiv.org/abs/2407.00072
作者: Yu Bai,Yukai Miao,Li Chen,Dan Li,Yanyu Ren,Hongtao Xie,Ce Yang,Xuhui Cai
关键词: Pistis symbolized good, symbolized good faith, Greek mythology, Pistis symbolized, good faith
中文关键词: 皮蒂斯象征善良,象征诚信,希腊神话,皮蒂斯象征,诚信
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

Abstract:In Greek mythology, Pistis symbolized good faith, trust, and reliability, echoing the core principles of RAG in LLM systems. Pistis-RAG, a scalable multi-stage framework, effectively addresses the challenges of large-scale retrieval-augmented generation (RAG). Each stage plays a distinct role: matching refines the search space, pre-ranking prioritizes semantically relevant documents, and ranking aligns with the large language model’s (LLM) preferences. The reasoning and aggregating stage supports the implementation of complex chain-of-thought (CoT) methods within this cascading structure. We argue that the lack of strong alignment between LLMs and the external knowledge ranking methods used in RAG tasks is relevant to the reliance on the model-centric paradigm in RAG frameworks. A content-centric approach would prioritize seamless integration between the LLMs and external information sources, optimizing the content transformation process for each specific task. Critically, our ranking stage deviates from traditional RAG approaches by recognizing that semantic relevance alone may not directly translate to improved generation. This is due to the sensitivity of the few-shot prompt order, as highlighted in prior work \citelu2021fantastically. Current RAG frameworks fail to account for this crucial factor. We introduce a novel ranking stage specifically designed for RAG systems. It adheres to information retrieval principles while considering the unique business scenario captured by LLM preferences and user feedback. Our approach integrates in-context learning (ICL) methods and reasoning steps to incorporate user feedback, ensuring efficient alignment. Experiments on the MMLU benchmark demonstrate a 9.3% performance improvement. The model and code will be open-sourced on GitHub. Experiments on real-world, large-scale data validate our framework’s scalability.

[NLP-167] Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization

Link: https://arxiv.org/abs/2407.00071
Authors: Mert Esencan,Tarun Advaith Kumar,Ata Akbari Asanjan,P. Aaron Lott,Masoud Mohseni,Can Unlu,Davide Venturelli,Alan Ho
Keywords: Recent Large Language, Large Language Models, Recent Large, Language Models, Large Language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures

Abstract:Recent Large Language Models (LLMs) have demonstrated impressive capabilities at tasks that require human intelligence and are a significant step towards human-like artificial intelligence (AI). Yet the performance of LLMs at reasoning tasks has been subpar and the reasoning capability of LLMs is a matter of significant debate. While it has been shown that the choice of the prompting technique to the LLM can alter its performance on a multitude of tasks, including reasoning, the best performing techniques require human-made prompts with the knowledge of the tasks at hand. We introduce a framework for what we call Combinatorial Reasoning (CR), a fully-automated prompting method, where reasons are sampled from an LLM pipeline and mapped into a Quadratic Unconstrained Binary Optimization (QUBO) problem. The framework investigates whether QUBO solutions can be profitably used to select a useful subset of the reasons to construct a Chain-of-Thought style prompt. We explore the acceleration of CR with specialized solvers. We also investigate the performance of simpler zero-shot strategies such as linear majority rule or random selection of reasons. Our preliminary study indicates that coupling a combinatorial solver to generative AI pipelines is an interesting avenue for AI reasoning and elucidates design principles for future CR methods.
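
As a rough illustration of the QUBO step described in this abstract, the sketch below selects a subset of candidate reasons by minimizing x^T Q x over binary vectors x. The matrix values, the function name `solve_qubo_bruteforce`, and the reading of diagonal entries as per-reason usefulness and off-diagonal entries as pairwise redundancy or synergy are all hypothetical, chosen only for this example. Exhaustive search stands in for the specialized solvers the abstract mentions and is only feasible for a handful of variables.

```python
from itertools import product


def solve_qubo_bruteforce(Q):
    """Minimize sum_ij Q[i][j] * x_i * x_j over x in {0,1}^n by enumeration."""
    n = len(Q)
    best_x, best_energy = None, float("inf")
    for bits in product((0, 1), repeat=n):
        energy = sum(
            Q[i][j] * bits[i] * bits[j] for i in range(n) for j in range(n)
        )
        if energy < best_energy:
            best_x, best_energy = bits, energy
    return best_x, best_energy


# Hypothetical upper-triangular QUBO for three candidate reasons:
# negative diagonal = reason looks useful on its own,
# positive off-diagonal = two reasons are redundant together,
# negative off-diagonal = two reasons reinforce each other.
Q = [
    [-3, 4, -1],
    [0, -2, 0],
    [0, 0, -1],
]

selected, energy = solve_qubo_bruteforce(Q)
print(selected, energy)  # (1, 0, 1) -5
```

With these made-up weights the minimizer keeps reasons 0 and 2 and drops reason 1, since the redundancy penalty between reasons 0 and 1 outweighs what reason 1 contributes alone.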

[NLP-168] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Link: https://arxiv.org/abs/2407.00066
Authors: Rickard Brüel-Gabrielsson,Jiacheng Zhu,Onkar Bhardwaj,Leshem Choshen,Kristjan Greenewald,Mikhail Yurochkin,Justin Solomon
Keywords: Fine-tuning large language, large language models, yielding numerous copies, Fine-tuning large, LLM differing
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Fine-tuning large language models (LLMs) with low-rank adapters (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRA adapters. We consider compressing adapters individually via SVD and propose a method for joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. Our experiments with up to 500 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 75% of the throughput of serving a single LoRA.

[NLP-169] A Document-based Knowledge Discovery with Microservices Architecture

Link: https://arxiv.org/abs/2407.00053
Authors: Habtom Kahsay Gidey,Mario Kesseler,Patrick Stangl,Peter Hillmann,Andreas Karcher
Keywords: digitally stored data, organizations lies, conversion of analog, digitally stored, analog data
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments:

Abstract:The first step towards digitalization within organizations lies in digitization - the conversion of analog data into digitally stored data. This basic step is the prerequisite for all following activities like the digitalization of processes or the servitization of products or offerings. However, digitization itself often leads to ‘data-rich’ but ‘knowledge-poor’ material. Knowledge discovery and knowledge extraction as approaches try to increase the usefulness of digitized data. In this paper, we point out the key challenges in the context of knowledge discovery and present an approach to addressing these using a microservices architecture. Our solution led to a conceptual design focusing on keyword extraction, similarity calculation of documents, database queries in natural language, and programming language independent provision of the extracted information. In addition, the conceptual design provides referential design guidelines for integrating processes and applications for semi-automatic learning, editing, and visualization of ontologies. The concept also uses a microservices architecture to address non-functional requirements, such as scalability and resilience. The evaluation of the specified requirements is performed using a demonstrator that implements the concept. Furthermore, this modern approach is used in the German patent office in an extended version.

[NLP-170] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

Link: https://arxiv.org/abs/2407.00047
Authors: Archit Patke,Dhemath Reddy,Saurabh Jha,Haoran Qiu,Christian Pinto,Shengkun Cui,Chandra Narayanaswami,Zbigniew Kalbarczyk,Ravishankar Iyer
Keywords: increasingly important workload, cloud providers catering, LLM serving, Large language models, Large language
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.
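
The head-of-line blocking problem named in this abstract can be seen in a toy single-server queue. The sketch below is not QLM: the request tuples are invented, and earliest-deadline-first reordering is used as a simple textbook stand-in for the stochastic-programming scheduler the authors describe. It shows only how one long request at the front of a FIFO queue can cause short requests behind it to miss their deadlines.

```python
def slo_attainment(requests):
    """Serve requests one at a time in the given order and return the
    fraction that finish within their deadline.

    Each request is a (service_time, deadline) pair; all requests are
    assumed to arrive at time 0 (a simplification for this example).
    """
    clock = 0.0
    met = 0
    for service_time, deadline in requests:
        clock += service_time
        if clock <= deadline:
            met += 1
    return met / len(requests)


# Hypothetical workload: one long request with a loose deadline arrives
# just ahead of two short requests with tight deadlines.
fifo_order = [(10.0, 100.0), (1.0, 3.0), (1.0, 4.0)]

# Earliest-deadline-first: reorder the same requests by deadline.
edf_order = sorted(fifo_order, key=lambda request: request[1])

fifo_rate = slo_attainment(fifo_order)
edf_rate = slo_attainment(edf_order)
print(f"FIFO: {fifo_rate:.2f}, EDF: {edf_rate:.2f}")  # FIFO: 0.33, EDF: 1.00
```

In FIFO order only the long request meets its deadline; reordering lets all three finish in time with the same total work.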

[NLP-171] Visual Language Model based Cross-modal Semantic Communication Systems

Link: https://arxiv.org/abs/2407.00020
Authors: Feibo Jiang,Chuanguo Tang,Li Dong,Kezhi Wang,Kun Yang,Cunhua Pan
Keywords: Shannon physical capacity, physical capacity limits, Cross-modal Semantic Communication, transcending the Shannon, Shannon physical
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures

Abstract:Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

[NLP-172] Macroeconomic Forecasting with Large Language Models

Link: https://arxiv.org/abs/2407.00890
Authors: Andrea Carriero,Davide Pettenuzzo,Shubhranshu Shekhar
Keywords: Large Language Models, Language Models, Large Language, comparative analysis evaluating, accuracy of Large
Subjects: Econometrics (econ.EM); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. In recent times, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an area of interest. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using as common ground the FRED-MD database. Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios.

[NLP-173] Decoding moral judgement from text: a pilot study

Link: https://arxiv.org/abs/2407.00039
Authors: Diana E. Gherman,Thorsten O. Zander
Keywords:
Subjects: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 2 figures, conference

Computer Vision

[CV-0] Empowering 3D Visual Grounding with Reasoning Capabilities

Link: https://arxiv.org/abs/2407.01525
Authors: Chenming Zhu,Tai Wang,Wenwei Zhang,Kai Chen,Xihui Liu
Keywords: explicit textual descriptions, reason human intentions, Large Language Model, Multi-modal Large Language, implicit instructions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ECCV24. A comprehensive and hierarchical 3D reasoning grounding benchmark in the era of foundation models. Project page: this https URL

Abstract:Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason which provides over 10K question-answer-location pairs from five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of the visual-centric reasoning module empowered by Multi-modal Large Language Model (MLLM) and the 3D grounding module to obtain accurate object locations by looking back to the enhanced geometry and fine-grained details from the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.

[CV-1] MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Link: https://arxiv.org/abs/2407.01523
Authors: Yubo Ma,Yuhang Zang,Liangyu Chen,Meiqi Chen,Yizhu Jiao,Xinze Li,Xinyuan Lu,Ziyu Liu,Yan Ma,Xiaoyi Dong,Pan Zhang,Liangming Pan,Yu-Gang Jiang,Jiaqi Wang,Yixin Cao,Aixin Sun
Keywords: long-standing and practical, practical task, Recent Large Vision-Language, Large Vision-Language Models, single-page document understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: this https URL

[CV-2] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

Link: https://arxiv.org/abs/2407.01521
Authors: Bingliang Zhang,Wenda Chu,Julius Berner,Chenlin Meng,Anima Anandkumar,Yang Song
Keywords: solving Bayesian inverse, learned data priors, Bayesian inverse problems, solving Bayesian, recently achieved success
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems. For example, we achieve a PSNR of 30.72dB on the FFHQ 256 dataset for phase retrieval, which is an improvement of 9.12dB compared to existing methods.
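
For readers less familiar with the metric quoted in this abstract, the sketch below computes PSNR under its standard definition, 10 · log10(MAX² / MSE), and converts a decibel gain into the corresponding reduction in mean squared error. The helper names and the toy signals are assumptions for illustration; nothing here reproduces the paper's evaluation pipeline.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (db_gain / 10.0)


# Toy example: a constant error of 0.1 on a [0, 1] signal gives about 20 dB.
print(round(psnr([0.0, 0.0], [0.1, 0.1]), 6))

# The 9.12 dB improvement quoted above corresponds to roughly an
# 8-fold reduction in mean squared error.
print(round(mse_ratio_for_gain(9.12), 2))
```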

[CV-3] DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

Link: https://arxiv.org/abs/2407.01519
Authors: Chang-Han Yeh,Chin-Yang Lin,Zhixiang Wang,Chi-Wei Hsiao,Ting-Hsuan Chen,Yu-Lun Liu
Keywords: pre-trained image restoration, image restoration diffusion, paper introduces, pre-trained image, video restoration
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models. Traditional video restoration methods often need retraining for different settings and struggle with limited generalization across various degradation types and datasets. Our approach uses a hierarchical token merging strategy for keyframes and local frames, combined with a hybrid correspondence mechanism that blends optical flow and feature-based nearest neighbor matching (latent merging). We show that our method not only achieves top performance in zero-shot video restoration but also significantly surpasses trained models in generalization across diverse datasets and extreme degradations (8× super-resolution and high-standard deviation video denoising). We present evidence through quantitative metrics and visual comparisons on various challenging datasets. Additionally, our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining. This research leads to more efficient and widely applicable video restoration technologies, supporting advancements in fields that require high-quality video output. See our project page for video results at this https URL.

[CV-4] Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Link: https://arxiv.org/abs/2407.01518
Authors: Hao Dong,Eleni Chatzi,Olga Fink
Keywords: Multimodal Open-Set Domain, open-set domain generalization, open-set domain, involves recognizing, Multimodal Jigsaw Puzzles
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ECCV 2024, code: this https URL

Abstract:The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging with multiple modalities as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of multimodal representative features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to also tackle the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, especially in scenarios where unlabeled data from the target domain is available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at this https URL.

[CV-5] E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness

Link: https://arxiv.org/abs/2407.01516
Authors: Robin Courant,Nicolas Dufour,Xi Wang,Marc Christie,Vicky Kalogeiton
Keywords: Stories and emotions, directing decisions, movement over time, emotions in movies, movies emerge
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024. Project page: this https URL

Abstract:Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, in this paper, we propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions encompassing descriptions of both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train on the E.T. dataset CLaTr, a Contrastive Language-Trajectory embedding for evaluation metrics. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to common users.

[CV-6] MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Link: https://arxiv.org/abs/2407.01509
Authors: Yusu Qian,Hanrong Ye,Jean-Philippe Fauconnier,Peter Grasch,Yinfei Yang,Zhe Gan
Keywords: large language models, evaluate multimodal large, multimodal large language, introduce MIA-Bench, language models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models’ ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

[CV-7] FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Link: https://arxiv.org/abs/2407.01494
Authors: Yiming Zhang,Yicheng Gu,Yanhong Zeng,Zhening Xing,Yuancheng Wang,Zhizheng Wu,Kai Chen
Keywords: study Neural Foley, Neural Foley, immersive audio-visual experience, study Neural, sound effects synchronizing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project page: this https URL

Abstract:We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: the semantic adapter for semantic alignment and the temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and code are available at this https URL.

[CV-8] Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Link: https://arxiv.org/abs/2407.01491
Authors: Siwei Li,Yifan Yang,Yifei Shen,Fangyun Wei,Zongqing Lu,Lili Qiu,Yuqing Yang
Keywords: Efficient fine-tuning plays, Efficient fine-tuning, low-rank adaptation emerging, modern large models, fine-tuning plays
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA’s expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model’s ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. The extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be released at this https URL very soon.

[CV-9] The Balanced-Pairwise-Affinities Feature Transform

Link: https://arxiv.org/abs/2407.01467
Authors: Daniel Shalam,Simon Korman
Keywords: facilitate downstream matching, grouping related tasks, designed to upgrade, items to facilitate, facilitate downstream
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2204.03065

Abstract:The Balanced-Pairwise-Affinities (BPA) feature transform is designed to upgrade the features of a set of input items to facilitate downstream matching or grouping related tasks. The transformed set encodes a rich representation of high order relations between the input features. A particular min-cost-max-flow fractional matching problem, whose entropy regularized version can be approximated by an optimal transport (OT) optimization, leads to a transform which is efficient, differentiable, equivariant, parameterless and probabilistically interpretable. While the Sinkhorn OT solver has been adapted extensively in many contexts, we use it differently by minimizing the cost between a set of features and itself and using the transport plan’s rows as the new representation. Empirically, the transform is highly effective and flexible in its use and consistently improves networks it is inserted into, in a variety of tasks and training schemes. We demonstrate state-of-the-art results in few-shot classification, unsupervised image clustering and person re-identification. Code is available at this http URL.
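
The idea of transporting a feature set onto itself and reading off the plan's rows can be sketched with plain Sinkhorn-style matrix scaling. This is a loose illustration, not the authors' implementation: the squared Euclidean cost, the zeroed diagonal that forbids matching an item to itself, the temperature `eps`, the fixed iteration count, and the name `self_transport_plan` are all assumptions made for the example. Rows and columns are normalized to sum to 1 rather than 1/n.

```python
import math


def self_transport_plan(points, eps=1.0, iters=100):
    """Approximately doubly stochastic affinity matrix of a point set with itself."""
    n = len(points)
    # Gibbs kernel exp(-cost / eps); the diagonal is zeroed so that no item
    # can be matched to itself (an assumption of this sketch).
    plan = [
        [
            0.0
            if i == j
            else math.exp(
                -sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / eps
            )
            for j in range(n)
        ]
        for i in range(n)
    ]
    for _ in range(iters):
        # Scale columns to sum to 1, then rows to sum to 1.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(n)]
        plan = [[plan[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        for i in range(n):
            row_sum = sum(plan[i])
            plan[i] = [value / row_sum for value in plan[i]]
    return plan


# Two well-separated pairs of 2D points (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
plan = self_transport_plan(points)

# Each row is the new representation of the corresponding input item.
for row in plan:
    print([round(value, 3) for value in row])
```

On this toy input each point places almost all of its mass on its nearest neighbour, so the two points within a pair receive nearly mirrored rows that differ sharply from the rows of the other pair.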

[CV-10] ColPali: Efficient Document Retrieval with Vision Language Models

Link: https://arxiv.org/abs/2407.01449
Authors: Manuel Faysse,Hugues Sibille,Tony Wu,Gautier Viaud,Céline Hudelot,Pierre Colombo
Keywords: Retrieval Augmented Generation, document retrieval, visually rich structures, information through text, modern document retrieval
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
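
The abstract does not spell out its late interaction mechanism, so the sketch below assumes the ColBERT-style MaxSim rule that the term usually refers to: every query token embedding is compared with every page patch embedding, the best match per query token is kept, and those maxima are summed. The function name `maxsim_score` and the tiny two-dimensional vectors are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of the highest dot product with any document vector."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings: two query tokens and two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a good match for both tokens
page_b = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first token well

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)  # 2.0 1.1

# Rank pages by score, best first.
ranking = sorted(
    [("page_a", score_a), ("page_b", score_b)],
    key=lambda item: item[1],
    reverse=True,
)
print([name for name, _ in ranking])  # ['page_a', 'page_b']
```

Because each query token is scored against individual patches instead of one pooled page vector, a page can rank highly when different regions of it answer different parts of the query.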

[CV-11] FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources

链接: https://arxiv.org/abs/2407.01445
作者: Xiyuan Wei,Fanjiang Ye,Ori Yonay,Xingyu Chen,Baixi Sun,Dingwen Tao,Tianbao Yang
关键词: Contrastive Language-Image Pretraining, large batch size, Existing studies, Language-Image Pretraining, data involve hundreds
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages

点击查看摘要

Abstract:Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds of or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been demonstrated effective for removing the requirement of large batch size, their performance on large-scale data remains underexplored and not optimized. To bridge the gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques while designed and optimized for the distributed setting. Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rules of the temperature parameter and the model parameters, respectively. Experiments on different strategies for each component shed light on how to conduct CLIP training more efficiently. Finally, we benchmark the performance of FastCLIP and the state-of-the-art training baseline (OpenCLIP) on different compute scales up to 32 GPUs on 8 nodes, and three data scales ranging from 2.7 million, 9.1 million to 315 million image-text pairs to demonstrate the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at this https URL .

[CV-12] AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction

链接: https://arxiv.org/abs/2407.01436
作者: Dubing Chen,Wencheng Han,Jin Fang,Jianbing Shen
关键词: Challenge at CVPR, Open-Occ Dataset Challenge, Flow Prediction track, Dataset Challenge, technical report
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 2nd Place in the 3D Occupancy and Flow Prediction Challenge (CVPR24)

点击查看摘要

Abstract:In this technical report, we present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ Dataset Challenge at CVPR 2024. Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling. Initially, we independently train the occupancy model, followed by flow prediction using sequential frame integration. Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth. Experimental results on the nuScenes dataset demonstrate significant improvements in accuracy and robustness, showcasing the effectiveness of our approach in real-world scenarios. Our single model based on Swin-Base ranks second on the public leaderboard, validating the potential of our method in advancing autonomous car perception systems.

[CV-13] Scarecrow monitoring system: employing mobilenet ssd for enhanced animal supervision

链接: https://arxiv.org/abs/2407.01435
作者: Balaji VS,Mahi AR,Anirudh Ganapathy PS,Manju M
关键词: Mobile Net SSD, SSD Mobile Net, wildlife wreaking havoc, Mobile Net, Net SSD model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 10 figures

点击查看摘要

Abstract:Agriculture faces a growing challenge with wildlife wreaking havoc on crops, threatening sustainability. The project employs advanced object detection, the system utilizes the Mobile Net SSD model for real-time animal classification. The methodology initiates with the creation of a dataset, where each animal is represented by annotated images. The SSD Mobile Net architecture facilitates the use of a model for image classification and object detection. The model undergoes fine-tuning and optimization during training, enhancing accuracy for precise animal classification. Real-time detection is achieved through a webcam and the OpenCV library, enabling prompt identification and categorization of approaching animals. By seamlessly integrating intelligent scarecrow technology with object detection, this system offers a robust solution to field protection, minimizing crop damage and promoting precision farming. It represents a valuable contribution to agricultural sustainability, addressing the challenge of wildlife interference with crops. The implementation of the Intelligent Scarecrow Monitoring System stands as a progressive tool for proactive field management and protection, empowering farmers with an advanced solution for precision agriculture. Keywords: Machine learning, Deep Learning, Computer Vision, MobileNet SSD

[CV-14] FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

链接: https://arxiv.org/abs/2407.01425
作者: Pratheba Selvaraju,Tianyu Ding,Tianyi Chen,Ilya Zharkov,Luming Liang
关键词: generating high-quality images, images and videos, largely due, facto choice, choice for generating
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos, largely due to their scalability, which enables the construction of larger models for enhanced performance. However, the increased size of these models leads to higher inference costs, making them less attractive for real-time applications. We present Fast-FORward CAching (FORA), a simple yet effective approach designed to accelerate DiT by exploiting the repetitive nature of the diffusion process. FORA implements a caching mechanism that stores and reuses intermediate outputs from the attention and MLP layers across denoising steps, thereby reducing computational overhead. This approach does not require model retraining and seamlessly integrates with existing transformer-based diffusion models. Experiments show that FORA can speed up diffusion transformers several times over while only minimally affecting performance metrics such as the IS Score and FID. By enabling faster processing with minimal trade-offs in quality, FORA represents a significant advancement in deploying diffusion transformers for real-time applications. Code will be made publicly available at: this https URL.
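
The caching mechanism described above can be sketched with a small wrapper that recomputes a layer only on some denoising steps and otherwise returns the stored output. The fixed recompute interval, the `CachedLayer` class, and the toy layer are assumptions made for illustration; the abstract does not specify the schedule.

```python
class CachedLayer:
    """Wrap a layer so its output is reused across denoising steps.

    Assumption: the output is recomputed every `interval` steps and reused
    in between (a fixed schedule chosen only for this sketch).
    """

    def __init__(self, fn, interval):
        self.fn = fn
        self.interval = interval
        self.cache = None
        self.calls = 0  # number of real forward computations

    def __call__(self, x, step):
        if self.cache is None or step % self.interval == 0:
            self.cache = self.fn(x)
            self.calls += 1
        return self.cache


# Toy "attention/MLP layer": 10 denoising steps, recomputed every 3rd step.
layer = CachedLayer(lambda x: x * 2, interval=3)
outputs = [layer(step, step) for step in range(10)]
```

In this toy run the wrapped function executes on steps 0, 3, 6, and 9 only, so four computations serve ten steps; the trade-off is that the steps in between see a slightly stale output.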

[CV-15] StyleShot: A Snapshot on Any Style

链接: https://arxiv.org/abs/2407.01414
作者: Junyao Gao,Yanchen Liu,Yanan Sun,Yinhao Tang,Yanhong Zeng,Kai Chen,Cairong Zhao
关键词: crucial and sufficient, sufficient for generalized, good style representation, generalized style transfer, style
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:In this paper, we show that, a good style representation is crucial and sufficient for generalized style transfer without test-time tuning. We achieve this through constructing a style-aware encoder and a well-organized style dataset called StyleGallery. With dedicated design for style learning, this style-aware encoder is trained to extract expressive style representation with decoupling training strategy, and StyleGallery enables the generalization ability. We further employ a content-fusion encoder to enhance image-driven style transfer. We highlight that, our approach, named StyleShot, is simple yet effective in mimicking various desired styles, i.e., 3D, flat, abstract or even fine-grained styles, without test-time tuning. Rigorous experiments validate that, StyleShot achieves superior performance across a wide range of styles compared to existing state-of-the-art methods. The project page is available at: this https URL.

[CV-16] Semantic Compositions Enhance Vision-Language Contrastive Learning

链接: https://arxiv.org/abs/2407.01408
作者: Maxwell Aladago,Lorenzo Torresani,Soroush Vosoughi
关键词: vision-language contrastive learning, leverage within-batch non-matching, within-batch non-matching pairs, contrastive learning, matched image-caption pairs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
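
A composite sample of the kind described above might be built as in the sketch below. The vertical half split, the " and " caption conjunction, and the `compose_pair` name are assumptions for illustration only; the abstract states just that captions are fused and 50% of each image is blended.

```python
def compose_pair(img_a, cap_a, img_b, cap_b):
    """Build one composite image-caption pair from two examples.

    Images are lists of pixel rows with equal shape. Assumption: the left
    half comes from image A and the right half from image B.
    """
    half = len(img_a[0]) // 2
    image = [row_a[:half] + row_b[half:] for row_a, row_b in zip(img_a, img_b)]
    caption = f"{cap_a} and {cap_b}"
    return image, caption


# Toy 2x4 "images" filled with constant pixel values.
img_a = [[1, 1, 1, 1], [1, 1, 1, 1]]
img_b = [[2, 2, 2, 2], [2, 2, 2, 2]]
image, caption = compose_pair(img_a, "a dog", img_b, "a cat")
```

The composite keeps the original image size, which is consistent with the claim that the technique adds no parameters and no extra compute per sample.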

[CV-17] GalLoP: Learning Global and Local Prompts for Vision-Language Models

链接: https://arxiv.org/abs/2407.01400
作者: Marc Lafon,Elias Ramzi,Clément Rambour,Nicolas Audebert,Nicolas Thome
关键词: adapt vision-language models, efficiently adapt vision-language, few-shot image classification, Prompt learning, prompt learning methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To be published at ECCV 2024

点击查看摘要

Abstract:Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade-off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new "prompt dropout" technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods on accuracy on eleven datasets in different few shots settings and with various backbones. Furthermore, GalLoP shows strong robustness performances in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.

[CV-18] Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

链接: https://arxiv.org/abs/2407.01397
作者: Matteo Mosconi,Andriy Sorokin,Aniello Panariello,Angelo Porrello,Jacopo Bonato,Marco Cotogni,Luigi Sabetta,Simone Calderara,Rita Cucchiara
关键词: deep learning models, efficiently and effectively, skeletal data, data allows deep, models to perform
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at ICPR 2024

点击查看摘要

Abstract:The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at this https URL.

[CV-19] Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

链接: https://arxiv.org/abs/2407.01394
作者: Pooya Fayyazsanavi,Antonios Anastasopoulos,Jana Košecká
关键词: spoken text presents, text presents unique, presents unique challenges, unique challenges owing, expression nuances
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and novel label-smoothing loss function exploiting gloss translation ambiguities improving significantly the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.

[CV-20] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

链接: https://arxiv.org/abs/2407.01392
作者: Boyuan Chen,Diego Marti Monso,Yilun Du,Max Simchowitz,Russ Tedrake,Vincent Sitzmann
关键词: per-token noise levels, presents Diffusion Forcing, independent per-token noise, paper presents Diffusion, Diffusion Forcing
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling-out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing’s variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution. Project website: https://boyuan.space/diffusion-forcing
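
The core training-time idea, an independent noise level for every token in a sequence, can be sketched as below. The linear beta schedule, scalar tokens, and the names `alpha_bar` and `noise_sequence` are assumptions borrowed from standard DDPM-style noising for illustration; they are not taken from the paper.

```python
import math
import random


def alpha_bar(num_levels, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per noise level (assumed linear beta schedule)."""
    values, prod = [], 1.0
    for k in range(num_levels):
        beta = beta_start + (beta_end - beta_start) * k / max(num_levels - 1, 1)
        prod *= 1.0 - beta
        values.append(prod)
    return values


def noise_sequence(tokens, num_levels=1000, rng=None):
    """Noise each token with its own independently sampled noise level."""
    rng = rng or random.Random(0)
    abar = alpha_bar(num_levels)
    noisy, levels = [], []
    for x in tokens:
        k = rng.randrange(num_levels)  # independent level per token
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(abar[k]) * x + math.sqrt(1.0 - abar[k]) * eps)
        levels.append(k)
    return noisy, levels


noisy, levels = noise_sequence([0.5, -0.2, 1.0, 0.0], num_levels=100)
```

A denoiser trained on such inputs would receive both the noisy tokens and their per-token levels, which is what lets past tokens stay nearly clean while future ones are heavily noised at sampling time.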

[CV-21] TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

链接: https://arxiv.org/abs/2407.01375
作者: André Sacilotti,Samuel Felipe dos Santos,Nicu Sebe,Jurandy Almeida
关键词: Unsupervised domain adaptation, image-based UDA techniques, Unsupervised domain, Transferable-guided Attention Block, Domain Transferable-guided Attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) in videos is a challenging task that remains not well explored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has still been little explored. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments on UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets with different backbones, like ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. Also, we demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both video and image domains. The code will be made freely available.

[CV-22] Hyperspectral Pansharpening: Critical Review, Tools and Future Perspectives

链接: https://arxiv.org/abs/2407.01355
作者: Matteo Ciotola,Giuseppe Guarino,Gemine Vivone,Giovanni Poggi,Jocelyn Chanussot,Antonio Plaza,Giuseppe Scarpa
关键词: Hyperspectral pansharpening consists, low-resolution hyperspectral image, low-resolution hyperspectral, high-resolution panchromatic band, Hyperspectral pansharpening
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Hyperspectral pansharpening consists of fusing a high-resolution panchromatic band and a low-resolution hyperspectral image to obtain a new image with high resolution in both the spatial and spectral domains. These remote sensing products are valuable for a wide range of applications, driving ever growing research efforts. Nonetheless, results still do not meet application demands. In part, this comes from the technical complexity of the task: compared to multispectral pansharpening, many more bands are involved, in a spectral range only partially covered by the panchromatic component and with overwhelming noise. However, another major limiting factor is the absence of a comprehensive framework for the rapid development and accurate evaluation of new methods. This paper attempts to address this issue. We started by designing a dataset large and diverse enough to allow reliable training (for data-driven methods) and testing of new methods. Then, we selected a set of state-of-the-art methods, following different approaches, characterized by promising performance, and reimplemented them in a single PyTorch framework. Finally, we carried out a critical comparative analysis of all methods, using the most accredited quality indicators. The analysis highlights the main limitations of current solutions in terms of spectral/spatial quality and computational efficiency, and suggests promising research directions. To ensure full reproducibility of the results and support future research, the framework (including codes, evaluation procedures and links to the dataset) is shared on this https URL, as a single Python-based reference benchmark toolbox.

[CV-23] PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

链接: https://arxiv.org/abs/2407.01349
作者: Xuan Yu,Yili Liu,Chenrui Han,Sitong Mao,Shunbo Zhou,Rong Xiong,Yiyi Liao,Yue Wang
关键词: challenging task, Panoptic reconstruction, instance, Panoptic, segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Panoptic reconstruction is a challenging task in 3D scene understanding. However, most existing methods heavily rely on pre-trained semantic segmentation models and known 3D object bounding boxes for 3D panoptic segmentation, which is not available for in-the-wild scenes. In this paper, we propose a novel zero-shot panoptic reconstruction method from RGB-D images of scenes. For zero-shot segmentation, we leverage open-vocabulary instance segmentation, but it has to face partial labeling and instance association challenges. We tackle both challenges by propagating partial labels with the aid of dense generalized features and building a 3D instance graph for associating 2D instance IDs. Specifically, we exploit partial labels to learn a classifier for generalized semantic features to provide complete labels for scenes with dense distilled features. Moreover, we formulate instance association as a 3D instance graph segmentation problem, allowing us to fully utilize the scene geometry prior and all 2D instance masks to infer global unique pseudo 3D instance ID. Our method outperforms state-of-the-art methods on the indoor dataset ScanNet V2 and the outdoor dataset KITTI-360, demonstrating the effectiveness of our graph segmentation method and reconstruction network.

[CV-24] AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition

链接: https://arxiv.org/abs/2407.01332
作者: Fadi Boutros,Vitomir Štruc,Naser Damer
关键词: compact student model, high-performing teacher model, aims at improving, improving the performance, student model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Knowledge distillation (KD) aims at improving the performance of a compact student model by distilling the knowledge from a high-performing teacher model. In this paper, we present an adaptive KD approach, namely AdaDistill, for deep face recognition. The proposed AdaDistill embeds the KD concept into the softmax loss by training the student using a margin penalty softmax loss with distilled class centers from the teacher. Being aware of the relatively low capacity of the compact student model, we propose to distill less complex knowledge at an early stage of training and more complex one at a later stage of training. This relative adjustment of the distilled knowledge is controlled by the progression of the learning capability of the student over the training iterations without the need to tune any hyper-parameters. Extensive experiments and ablation studies show that AdaDistill can enhance the discriminative learning capability of the student and demonstrate superiority over various state-of-the-art competitors on several challenging benchmarks, such as IJB-B, IJB-C, and ICCV2021-MFR.

[CV-25] Restyling Unsupervised Concept Based Interpretable Networks with Generative Models

链接: https://arxiv.org/abs/2407.01331
作者: Jayneel Parekh,Quentin Bouniot,Pavlo Mozharovskyi,Alasdair Newson,Florence d’Alché-Buc
关键词: Developing inherently interpretable, Developing inherently, inherently interpretable models, recent years, gained prominence
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page available at this https URL

点击查看摘要

Abstract:Developing inherently interpretable models for prediction has gained prominence in recent years. A subclass of these models, wherein the interpretable network relies on learning high-level concepts, are valued because of closeness of concept representations to human communication. However, the visualization and understanding of the learnt unsupervised dictionary of concepts encounters major limitations, specially for large-scale images. We propose here a novel method that relies on mapping the concept features to the latent space of a pretrained generative model. The use of a generative model enables high quality visualization, and naturally lays out an intuitive and interactive procedure for better interpretation of the learnt concepts. Furthermore, leveraging pretrained generative models has the additional advantage of making the training of the system more efficient. We quantitatively ascertain the efficacy of our method in terms of accuracy of the interpretable prediction network, fidelity of reconstruction, as well as faithfulness and consistency of learnt concepts. The experiments are conducted on multiple image recognition benchmarks for large-scale images. Project page available at this https URL

[CV-26] Learning Unsigned Distance Fields from Local Shape Functions for 3D Surface Reconstruction

链接: https://arxiv.org/abs/2407.01330
作者: Jiangbei Hu,Yanggeng Li,Fei Hou,Junhui Hou,Zhebin Zhang,Shengfa Wang,Na Lei,Ying He
关键词: Unsigned distance fields, Unsigned distance, distance fields, provide a versatile, encompassing both watertight
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 11 figures

点击查看摘要

Abstract:Unsigned distance fields (UDFs) provide a versatile framework for representing a diverse array of 3D shapes, encompassing both watertight and non-watertight geometries. Traditional UDF learning methods typically require extensive training on large datasets of 3D shapes, which is costly and often necessitates hyperparameter adjustments for new datasets. This paper presents a novel neural framework, LoSF-UDF, for reconstructing surfaces from 3D point clouds by leveraging local shape functions to learn UDFs. We observe that 3D shapes manifest simple patterns within localized areas, prompting us to create a training dataset of point cloud patches characterized by mathematical functions that represent a continuum from smooth surfaces to sharp edges and corners. Our approach learns features within a specific radius around each query point and utilizes an attention mechanism to focus on the crucial features for UDF estimation. This method enables efficient and robust surface reconstruction from point clouds without the need for shape-specific training. Additionally, our method exhibits enhanced resilience to noise and outliers in point clouds compared to existing methods. We present comprehensive experiments and comparisons across various datasets, including synthetic and real-scanned point clouds, to validate our method’s efficacy.

[CV-27] CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes

链接: https://arxiv.org/abs/2407.01328
作者: Danial Qashqai,Emad Mousavian,Shahriar Baradaran Shokouhi,Sattar Mirzakuchaki
关键词: complex visual interpretation, vehicle vision systems, autonomous vehicle vision, Semantic segmentation, multimodal semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semantic segmentation, as a crucial component of complex visual interpretation, plays a fundamental role in autonomous vehicle vision systems. Recent studies have significantly improved the accuracy of semantic segmentation by exploiting complementary information and developing multimodal methods. Despite the gains in accuracy, multimodal semantic segmentation methods suffer from high computational complexity and low inference speed. Therefore, it is a challenging task to implement multimodal methods in driving applications. To address this problem, we propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model. Specifically, we design a Cosine Similarity Attention Fusion Module (CS-AFM) that effectively rectifies and fuses features of two modalities. The CS-AFM module leverages cross-modal similarity to achieve high generalization ability. By enhancing the fusion of cross-modal features at lower levels, CS-AFM paves the way for the use of a single-branch network at higher levels. Therefore, we use dual and single-branch architectures in an encoder, along with an efficient context module and a lightweight decoder for fast and accurate predictions. To verify the effectiveness of CSFNet, we use the Cityscapes, MFNet, and ZJU datasets for the RGB-D/T/P semantic segmentation. According to the results, CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed among multimodal semantic segmentation models. It also achieves high efficiency due to its low parameter count and computational complexity. The source code for CSFNet will be available at this https URL.

[CV-28] Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks

链接: https://arxiv.org/abs/2407.01327
作者: Roberto Alcover-Couso,Marcos Escudero-Viñolo,Juan C. SanMiguel,Jesus Bescós
关键词: unsupervised domain adaptation, significant class imbalance, class imbalance remains, addressing the challenge, open issue
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In unsupervised domain adaptation (UDA), where models are trained on source data (e.g., synthetic) and adapted to target data (e.g., real-world) without target annotations, addressing the challenge of significant class imbalance remains an open issue. Despite considerable progress in bridging the domain gap, existing methods often experience performance degradation when confronted with highly imbalanced dense prediction visual tasks like semantic and panoptic segmentation. This discrepancy becomes especially pronounced due to the lack of equivalent priors between the source and target domains, turning class imbalanced techniques used for other areas (e.g., image classification) ineffective in UDA scenarios. This paper proposes a class-imbalance mitigation strategy that incorporates class-weights into the UDA learning losses, but with the novelty of estimating these weights dynamically through the loss gradient, defining a Gradient-based class weighting (GBW) learning. GBW naturally increases the contribution of classes whose learning is hindered by large-represented classes, and has the advantage of being able to automatically and quickly adapt to the iteration training outcomes, avoiding explicitly curricular learning patterns common in loss-weighing strategies. Extensive experimentation validates the effectiveness of GBW across architectures (convolutional and transformer), UDA strategies (adversarial, self-training and entropy minimization), tasks (semantic and panoptic segmentation), and datasets (GTA and Synthia). Analysing the source of advantage, GBW consistently increases the recall of low represented classes.

[CV-29] ToCoAD: Two-Stage Contrastive Learning for Industrial Anomaly Detection

链接: https://arxiv.org/abs/2407.01312
作者: Yun Liang,Zhiguang Hu,Junjie Huang,Donglin Di,Anyang Su,Lei Fan
关键词: Current unsupervised anomaly, unsupervised anomaly detection, anomaly detection approaches, detection approaches perform, Current unsupervised
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:Current unsupervised anomaly detection approaches perform well on public datasets but struggle with specific anomaly types due to the domain gap between pre-trained feature extractors and target-specific domains. To tackle this issue, this paper presents a two-stage training strategy, called ToCoAD. In the first stage, a discriminative network is trained by using synthetic anomalies in a self-supervised learning manner. This network is then utilized in the second stage to provide a negative feature guide, aiding in the training of the feature extractor through bootstrap contrastive learning. This approach enables the model to progressively learn the distribution of anomalies specific to industrial datasets, effectively enhancing its generalizability to various types of anomalies. Extensive experiments are conducted to demonstrate the effectiveness of our proposed two-stage training strategy, and our model produces competitive performance, achieving pixel-level AUROC scores of 98.21%, 98.43% and 97.70% on MVTec AD, VisA and BTAD respectively.

[CV-30] Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces

Link: https://arxiv.org/abs/2407.01310
Authors: Perusha Moodley, Pramod Kaushik, Dhillu Thambi, Mark Trovinger, Praveen Paruchuri, Xia Hong, Benjamin Rosman
Keywords: Decision Transformer architectures, Decision Transformer, enhanced Decision Transformer, existing Decision Transformer, multi-discrete action spaces
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Decision Transformers, in their vanilla form, struggle to perform on image-based environments with multi-discrete action spaces. Although enhanced Decision Transformer architectures have been developed to improve performance, these methods have not specifically addressed this problem of multi-discrete action spaces which hampers existing Decision Transformer architectures from learning good representations. To mitigate this, we propose Multi-State Action Tokenisation (M-SAT), an approach for tokenising actions in multi-discrete action spaces that enhances the model’s performance in such environments. Our approach involves two key changes: disentangling actions to the individual action level and tokenising the actions with auxiliary state information. These two key changes also improve individual action level interpretability and visibility within the attention layers. We demonstrate the performance gains of M-SAT on challenging ViZDoom environments with multi-discrete action spaces and image-based state spaces, including the Deadly Corridor and My Way Home scenarios, where M-SAT outperforms the baseline Decision Transformer without any additional data or heavy computational overheads. Additionally, we find that removing positional encoding does not adversely affect M-SAT’s performance and, in some cases, even improves it.

[CV-31] Robot Instance Segmentation with Few Annotations for Grasping

Link: https://arxiv.org/abs/2407.01302
Authors: Moshe Kimhi, David Vainshtein, Chaim Baskin, Dotan Di Castro
Keywords: manipulate objects relies, objects relies heavily, ability of robots, robots to manipulate, relies heavily
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and leverage visual consistency despite temporal gaps without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an AP_50 of 86.37, almost a 20% improvement over existing work, and obtain remarkable results in scenarios with extremely low annotation, achieving an AP_50 score of 84.89 with just 1% of annotated data compared to 72 presented in ARMBench on the fully annotated counterpart.
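The AP_50 figures above rest on an intersection-over-union (IoU) test: a predicted instance mask counts as a match when its IoU with a ground-truth mask is at least 0.5. The sketch below illustrates only that matching criterion on toy binary masks. The helper names `mask_iou` and `is_match` and the mask values are assumptions for illustration, and the full AP computation (ranking predictions by confidence and integrating the precision-recall curve) is deliberately omitted.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equally sized lists of rows."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks have no overlap to measure; report 0.0 by convention.
    return inter / union if union else 0.0


def is_match(pred, gt, threshold=0.5):
    """True when the prediction passes the IoU threshold (0.5 for AP_50)."""
    return mask_iou(pred, gt) >= threshold


# Toy 3x4 masks (hypothetical values): 3 shared pixels, 5 in the union.
gt = [[0, 1, 1, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
print(mask_iou(pred, gt), is_match(pred, gt))
```

The same prediction would be rejected under a stricter threshold such as 0.75, which is why AP_50 is usually the most lenient of the commonly reported AP variants.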

[CV-32] GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting

Link: https://arxiv.org/abs/2407.01301
Authors: Chenxin Li, Hengyu Liu, Zhiwen Fan, Wuyang Li, Yifan Liu, Panwang Pan, Yixuan Yuan
Keywords: point-based techniques pave, Recent advancements, widespread visual data, visual data distribution, real-time neural rendering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract:Recent advancements in large generative models and real-time neural rendering using point-based techniques pave the way for a future of widespread visual data distribution through sharing synthesized 3D assets. However, while standardized methods for embedding proprietary or copyright information, either overtly or subtly, exist for conventional visual content such as images and videos, this issue remains unexplored for emerging generative 3D formats like Gaussian Splatting. We present GaussianStego, a method for embedding steganographic information in the rendering of generated 3D assets. Our approach employs an optimization framework that enables the accurate extraction of hidden information from images rendered using Gaussian assets derived from large models, while maintaining their original visual quality. We conduct preliminary evaluations of our method across several potential deployment scenarios and discuss issues identified through analysis. GaussianStego represents an initial exploration into the novel challenge of embedding customizable, imperceptible, and recoverable information within the renders produced by current 3D generative models, while ensuring minimal impact on the rendered content’s quality.

[CV-33] Preserving Full Degradation Details for Blind Image Super-Resolution

Link: https://arxiv.org/abs/2407.01299
Authors: Hongda Liu, Longguang Wang, Ye Zhang, Kaiwen Xue, Shunbo Zhou, Yulan Guo
Keywords: super-resolution relies heavily, image super-resolution relies, super-resolution relies, relies heavily, degradation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures, 4 tables

Abstract:The performance of image super-resolution relies heavily on the accuracy of degradation information, especially under blind settings. Due to absence of true degradation models in real-world scenarios, previous methods learn distinct representations by distinguishing different degradations in a batch. However, the most significant degradation differences may provide shortcuts for the learning of representations such that subtle difference may be discarded. In this paper, we propose an alternative to learn degradation representations through reproducing degraded low-resolution (LR) images. By guiding the degrader to reconstruct input LR images, full degradation information can be encoded into the representations. In addition, we develop an energy distance loss to facilitate the learning of the degradation representations by introducing a bounded constraint. Experiments show that our representations can extract accurate and highly robust degradation information. Moreover, evaluations on both synthetic and real images demonstrate that our ReDSR achieves state-of-the-art performance for the blind SR tasks.

[CV-34] Formal Verification of Object Detection

Link: https://arxiv.org/abs/2407.01295
Authors: Avraham Raviv, Yizhak Y. Elboher, Michelle Aluf-Medina, Yael Leibovich Weiss, Omer Cohen, Roy Assa, Guy Katz, Hillel Kugler
Keywords: Deep Neural Networks, Deep Neural, object detection, object detection models, ubiquitous in real-world
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep Neural Networks (DNNs) are ubiquitous in real-world applications, yet they remain vulnerable to errors and adversarial attacks. This work tackles the challenge of applying formal verification to ensure the safety of computer vision models, extending verification beyond image classification to object detection. We propose a general formulation for certifying the robustness of object detection models using formal verification and outline implementation strategies compatible with state-of-the-art verification tools. Our approach enables the application of these tools, originally designed for verifying classification models, to object detection. We define various attacks for object detection, illustrating the diverse ways adversarial inputs can compromise neural network outputs. Our experiments, conducted on several common datasets and networks, reveal potential errors in object detection models, highlighting system vulnerabilities and emphasizing the need for expanding formal verification to these new domains. This work paves the way for further research in integrating formal verification across a broader range of computer vision applications.

[CV-35] We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Link: https://arxiv.org/abs/2407.01284
Authors: Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
Keywords: Large Multimodal Models, Multimodal Models, Large Multimodal, received widespread attention, Visual mathematical reasoning
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: Work in progress

Abstract:Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at this https URL.

[CV-36] Small Aerial Target Detection for Airborne Infrared Detection Systems using LightGBM and Trajectory Constraints

Link: https://arxiv.org/abs/2407.01278
Authors: Xiaoliang Sun, Liangchao Guo, Wenlong Zhang, Zi Wang, Qifeng Yu
Keywords: rapid relative motion, aerial target detection, make robust small, small aerial target, aerial target
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures

Abstract:Factors, such as rapid relative motion, clutter background, etc., make robust small aerial target detection for airborne infrared detection systems a challenge. Existing methods are facing difficulties when dealing with such cases. We consider that a continuous and smooth trajectory is critical in boosting small infrared aerial target detection performance. A simple and effective small aerial target detection method for airborne infrared detection system using light gradient boosting model (LightGBM) and trajectory constraints is proposed in this article. First, we simply formulate target candidate detection as a binary classification problem. Target candidates in every individual frame are detected via interesting pixel detection and a trained LightGBM model. Then, the local smoothness and global continuous characteristic of the target trajectory are modeled as short-strict and long-loose constraints. The trajectory constraints are used efficiently for detecting the true small infrared aerial targets from numerous target candidates. Experiments on public datasets demonstrate that the proposed method performs better than other existing methods. Furthermore, a public dataset for small aerial target detection in airborne infrared detection systems is constructed. To the best of our knowledge, this dataset has the largest data scale and richest scene types within this field.

[CV-37] OSL-ActionSpotting: A Unified Library for Action Spotting in Sports Videos

Link: https://arxiv.org/abs/2407.01265
Authors: Yassine Benzakour, Bruno Cabado, Silvio Giancola, Anthony Cioppa, Bernard Ghanem, Marc Van Droogenbroeck
Keywords: Action spotting, sports analytics, sports video analytics, providing insights, tactical decision-making
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Action spotting is crucial in sports analytics as it enables the precise identification and categorization of pivotal moments in sports matches, providing insights that are essential for performance analysis and tactical decision-making. The fragmentation of existing methodologies, however, impedes the progression of sports analytics, necessitating a unified codebase to support the development and deployment of action spotting for video analysis. In this work, we introduce OSL-ActionSpotting, a Python library that unifies different action spotting algorithms to streamline research and applications in sports video analytics. OSL-ActionSpotting encapsulates various state-of-the-art techniques into a singular, user-friendly framework, offering standardized processes for action spotting and analysis across multiple datasets. We successfully integrated three cornerstone action spotting methods into OSL-ActionSpotting, achieving performance metrics that match those of the original, disparate codebases. This unification within a single library preserves the effectiveness of each method and enhances usability and accessibility for researchers and practitioners in sports analytics. By bridging the gaps between various action spotting techniques, OSL-ActionSpotting significantly contributes to the field of sports video analysis, fostering enhanced analytical capabilities and collaborative research opportunities. The scalable and modularized design of the library ensures its long-term relevance and adaptability to future technological advancements in the domain.

[CV-38] Multi-level Reliable Guidance for Unpaired Multi-view Clustering

Link: https://arxiv.org/abs/2407.01247
Authors: Like Xin, Wanqi Yang, Lei Wang, Ming Yang
Keywords: perform effective joint, unpaired observed samples, effective joint clustering, unpaired multi-view clustering, cluster structures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper, we address the challenging problem of unpaired multi-view clustering (UMC), aiming to perform effective joint clustering using unpaired observed samples across multiple views. Commonly, traditional incomplete multi-view clustering (IMC) methods often depend on paired samples to capture complementary information between views. However, the strategy becomes impractical in UMC due to the absence of paired samples. Although some researchers have attempted to tackle the issue by preserving consistent cluster structures across views, they frequently neglect the confidence of these cluster structures, especially for boundary samples and uncertain cluster structures during the initial training. Therefore, we propose a method called Multi-level Reliable Guidance for UMC (MRG-UMC), which leverages multi-level clustering to aid in learning a trustworthy cluster structure across inner-view, cross-view, and common-view, respectively. Specifically, within each view, multi-level clustering fosters a trustworthy cluster structure across different levels and reduces clustering error. In cross-view learning, reliable view guidance enhances the confidence of the cluster structures in other views. Similarly, within the multi-level framework, the incorporation of a common view aids in aligning different views, thereby reducing the clustering error and uncertainty of cluster structure. Finally, as evidenced by extensive experiments, our method for UMC demonstrates significant efficiency improvements compared to 20 state-of-the-art methods.

[CV-39] CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation

Link: https://arxiv.org/abs/2407.01244
Authors: Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi
Keywords: typically relies solely, animals typically relies, highly under-constrained, typically relies, relies solely
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR CV4Animals Workshop 2024

Abstract:In the monocular setting, predicting 3D pose and shape of animals typically relies solely on visual information, which is highly under-constrained. In this work, we explore using audio to enhance 3D shape and motion recovery of horses from monocular video. We test our approach on two datasets: an indoor treadmill dataset for 3D evaluation and an outdoor dataset capturing diverse horse movements, the latter being a contribution to this study. Our results show that incorporating sound with visual data leads to more accurate and robust motion regression. This study is the first to investigate audio’s role in 3D animal motion recovery.

[CV-40] SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism

Link: https://arxiv.org/abs/2407.01239
Authors: Ao Liang, Wenyu Chen, Jian Fang, Huaici Zhao
Keywords: attracted widespread research, widespread research interest, research interest due, fast inference speed, inference speed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 16 figures

Abstract:The single-stage point-based 3D object detectors have attracted widespread research interest due to their advantages of lightweight and fast inference speed. However, they still face challenges such as inadequate learning of low-quality objects (ILQ) and misalignment between localization accuracy and classification confidence (MLC). In this paper, we propose SGCCNet to alleviate these two issues. For ILQ, SGCCNet adopts a Saliency-Guided Data Augmentation (SGDA) strategy to enhance the robustness of the model on low-quality objects by reducing its reliance on salient features. Specifically, we construct a classification task and then approximate the saliency scores of points by moving points towards the point cloud centroid in a differentiable process. During the training process, SGCCNet will be forced to learn from low saliency features through dropping points. Meanwhile, to avoid internal covariate shift and contextual features forgetting caused by dropping points, we add a geometric normalization module and skip connection block in each stage. For MLC, we design a Confidence Correction Mechanism (CCM) specifically for point-based multi-class detectors. This mechanism corrects the confidence of the current proposal by utilizing the predictions of other key points within the local region in the post-processing stage. Extensive experiments on the KITTI dataset demonstrate the generality and effectiveness of our SGCCNet. On the KITTI test set, SGCCNet achieves 80.82% for the metric of AP_3D on the Moderate level, outperforming all other point-based detectors, surpassing IA-SSD and Fast Point R-CNN by 2.35% and 3.42%, respectively. Additionally, SGCCNet demonstrates excellent portability for other point-based detectors.

[CV-41] DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution

Link: https://arxiv.org/abs/2407.01230
Authors: Crispian Morris, Nantheera Anantrasirichai, Fan Zhang, David Bull
Keywords: specifically target motion, recorded videos suffer, accidental focus blur, target motion blur, deblurring methods exist
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In many real-world scenarios, recorded videos suffer from accidental focus blur, and while video deblurring methods exist, most specifically target motion blur. This paper introduces a framework optimised for the joint task of focal deblurring (refocusing) and video super-resolution (VSR). The proposed method employs novel map guided transformers, in addition to image propagation, to effectively leverage the continuous spatial variance of focal blur and restore the footage. We also introduce a flow re-focusing module to efficiently align relevant features between the blurry and sharp domains. Additionally, we propose a novel technique for generating synthetic focal blur data, broadening the model’s learning capabilities to include a wider array of content. We have made a new benchmark dataset, DAVIS-Blur, available. This dataset, a modified extension of the popular DAVIS video segmentation set, provides realistic out-of-focus blur degradations as well as the corresponding blur maps. Comprehensive experiments on DAVIS-Blur demonstrate the superiority of our approach. We achieve state-of-the-art results with an average PSNR performance over 1.9dB greater than comparable existing video restoration methods. Our source code will be made available at this https URL
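To put the reported PSNR gain in perspective, the short sketch below uses the standard definition PSNR = 10 * log10(MAX^2 / MSE) to show what a fixed decibel gain implies for mean squared error. The peak value of 255 (8-bit images) and the function names are assumptions for illustration. The abstract does not say how its average PSNR is computed, so this is only a reading aid, not a reproduction of the paper's evaluation.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_reduction_factor(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


# A 1.9 dB gain corresponds to the MSE shrinking by roughly a factor of 1.55,
# independent of the peak value and of the starting MSE.
factor = mse_reduction_factor(1.9)
baseline_mse = 100.0  # hypothetical starting error
print(round(factor, 3))
print(round(psnr(baseline_mse / factor) - psnr(baseline_mse), 3))
```

Because PSNR is logarithmic, the same decibel gain always maps to the same relative reduction in MSE for a single image. Averages of per-image PSNR values do not translate to an average MSE as directly.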

[CV-42] Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Link: https://arxiv.org/abs/2407.01220
Authors: Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang
Keywords: spanning multiple domains, computer vision research, applications spanning multiple, high-dimensional CLIP features, CLIP features
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures

Abstract:Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enables open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, however, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskFields employ neural fields as binary mask generators and supervise them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes the ambiguous object boundaries by naturally introducing SAM segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

[CV-43] Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model

Link: https://arxiv.org/abs/2407.01211
Authors: Zongshuo Li, Ding Huo, Markus Meurer, Thomas Bergs
Keywords: final geometric precision, wear conditions impact, Tool wear conditions, Tool wear, geometric precision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Tool wear conditions impact the surface quality of the workpiece and its final geometric precision. In this research, we propose an efficient tool wear segmentation approach based on Segment Anything Model, which integrates U-Net as an automated prompt generator to streamline the processes of tool wear detection. Our evaluation covered three Point-of-Interest generation methods and further investigated the effects of variations in training dataset sizes and U-Net training intensities on resultant wear segmentation outcomes. The results consistently highlight our approach’s advantage over U-Net, emphasizing its ability to achieve accurate wear segmentation even with limited training datasets. This feature underscores its potential applicability in industrial scenarios where datasets may be limited.

[CV-44] Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection

Link: https://arxiv.org/abs/2407.01193
Authors: Francesco Barbato, Umberto Michieli, Jijoong Moon, Pietro Zanuttigh, Mete Ozay
Keywords: robotic systems deployed, detection robotic systems, object detection robotic, Personalized Object Detection, personal devices
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IROS 2024, 8 pages, 4 figures, 6 tables

Abstract:Recent years have seen object detection robotic systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design, i.e., they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. user’s dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work we tackle both problems at the same time, designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, showing that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with excellent time-complexity trade-off: AuXFT reaches a performance of 80% of its upper bound at just 32% of the inference time, 13% of VRAM and 19% of the model size.

[CV-45] MARS: Multimodal Active Robotic Sensing for Articulated Characterization

Link: https://arxiv.org/abs/2407.01191
Authors: Hongliang Zeng, Ping Zhang, Chengjiong Wu, Jiahua Wang, Tingyu Ye, Fang Li
Keywords: Precise perception, empowering service robots, empowering service, Precise
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Precise perception of articulated objects is vital for empowering service robots. Recent studies mainly focus on point cloud, a single-modal approach, often neglecting vital texture and lighting details and assuming ideal conditions like optimal viewpoints, unrepresentative of real-world scenarios. To address these limitations, we introduce MARS, a novel framework for articulated object characterization. It features a multi-modal fusion module utilizing multi-scale RGB features to enhance point cloud features, coupled with reinforcement learning-based active sensing for autonomous optimization of observation viewpoints. In experiments conducted with various articulated object instances from the PartNet-Mobility dataset, our method outperformed current state-of-the-art methods in joint parameter estimation accuracy. Additionally, through active sensing, MARS further reduces errors, demonstrating enhanced efficiency in handling suboptimal viewpoints. Furthermore, our method effectively generalizes to real-world articulated objects, enhancing robot interactions. Code is available at this https URL.

[CV-46] Multi-View Black-Box Physical Attacks on Infrared Pedestrian Detectors Using Adversarial Infrared Grid

Link: https://arxiv.org/abs/2407.01168
Authors: Kalibinuer Tiliwalidi, Chengyin Hu, Weiwen Shi
Keywords: extensive research exists, visible spectrum, research exists, infrared spectrum, attacks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:While extensive research exists on physical adversarial attacks within the visible spectrum, studies on such techniques in the infrared spectrum are limited. Infrared object detectors are vital in modern technological applications but are susceptible to adversarial attacks, posing significant security threats. Previous studies using physical perturbations like light bulb arrays and aerogels for white-box attacks, or hot and cold patches for black-box attacks, have proven impractical or limited in multi-view support. To address these issues, we propose the Adversarial Infrared Grid (AdvGrid), which models perturbations in a grid format and uses a genetic algorithm for black-box optimization. These perturbations are cyclically applied to various parts of a pedestrian’s clothing to facilitate multi-view black-box physical attacks on infrared pedestrian detectors. Extensive experiments validate AdvGrid’s effectiveness, stealthiness, and robustness. The method achieves attack success rates of 80.00% in digital environments and 91.86% in physical environments, outperforming baseline methods. Additionally, the average attack success rate exceeds 50% against mainstream detectors, demonstrating AdvGrid’s robustness. Our analyses include ablation studies, transfer attacks, and adversarial defenses, confirming the method’s superiority.

[CV-47] Benchmarking Predictive Coding Networks – Made Simple

Link: https://arxiv.org/abs/2407.01163
Authors: Luca Pinchetti, Chang Qi, Oleh Lokshyn, Gaspard Olivers, Cornelius Emde, Mufeng Tang, Amine M’Charrak, Simon Frieder, Bayar Menzat, Rafal Bogacz, Thomas Lukasiewicz, Tommaso Salvatori
Keywords: predictive coding networks, machine learning, predictive coding, coding networks, networks in machine
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 25 figures

Abstract:In this work, we tackle the problems of efficiency and scalability for predictive coding networks in machine learning. To do so, we first propose a library called PCX, whose focus lies on performance and simplicity, and provides a user-friendly, deep-learning oriented interface. Second, we use PCX to implement a large set of benchmarks for the community to use for their experiments. As most works propose their own tasks and architectures, do not compare one against each other, and focus on small-scale tasks, a simple and fast open-source library adopted by the whole community would address all of these concerns. Third, we perform extensive benchmarks using multiple algorithms, setting new state-of-the-art results in multiple tasks and datasets, as well as highlighting limitations inherent to PC that should be addressed. Thanks to the efficiency of PCX, we are able to analyze larger architectures than commonly used, providing baselines to galvanize community efforts towards one of the main open problems in the field: scalability. The code for PCX is available at this https URL.

[CV-48] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Link: https://arxiv.org/abs/2407.01157
Authors: Shaeke Salman,Md Montasir Bin Shams,Xiuwen Liu
Keywords: unprecedented zero-shot capabilities, exhibit unprecedented zero-shot, shared embedding space, models exhibit unprecedented, zero-shot capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2401.15568, arXiv:2402.08473

Abstract:Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

[CV-49] Integrated feature analysis for deep learning interpretation and class activation maps

Link: https://arxiv.org/abs/2407.01142
Authors: Yanli Li,Tahereh Hassanzadeh,Denis P. Shamonin,Monique Reijnierse,Annette H.M. van der Helm-van Mil,Berend C. Stoel
Keywords: integrated feature analysis, Understanding the decisions, integrated feature, feature analysis, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 11 figures, code available: this https URL. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Understanding the decisions of deep learning (DL) models is essential for the acceptance of DL to risk-sensitive applications. Although methods, like class activation maps (CAMs), give a glimpse into the black box, they do miss some crucial information, thereby limiting its interpretability and merely providing the considered locations of objects. To provide more insight into the models and the influence of datasets, we propose an integrated feature analysis method, which consists of feature distribution analysis and feature decomposition, to look closer into the intermediate features extracted by DL models. This integrated feature analysis could provide information on overfitting, confounders, outliers in datasets, model redundancies and principal features extracted by the models, and provide distribution information to form a common intensity scale, which are missing in current CAM algorithms. The integrated feature analysis was applied to eight different datasets for general validation: photographs of handwritten digits, two datasets of natural images and five medical datasets, including skin photography, ultrasound, CT, X-rays and MRIs. The method was evaluated by calculating the consistency between the CAMs average class activation levels and the logits of the model. Based on the eight datasets, the correlation coefficients through our method were all very close to 100%, and based on the feature decomposition, 5%-25% of features could generate equally informative saliency maps and obtain the same model performances as using all features. This proves the reliability of the integrated feature analysis. As the proposed methods rely on very few assumptions, this is a step towards better model interpretation and a useful extension to existing CAM algorithms. Codes: this https URL

[CV-50] M2IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension

Link: https://arxiv.org/abs/2407.01131
Authors: Xuyang Liu,Ting Liu,Siteng Huang,Yue Hu,Quanjun Yin,Donglin Wang,Honggang Chen
Keywords: Referring expression comprehension, Referring expression, expression comprehension, vision-language task, task to locate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, applying PETL to REC faces two challenges: (1) insufficient interaction between pre-trained vision and language encoders, and (2) high GPU memory usage due to gradients passing through both heavy encoders. To address these issues, we present M²IST (Multi-Modal Interactive Side-Tuning) with M³ISAs (Mixture of Multi-Modal Interactive Side-Adapters). During fine-tuning, we keep the pre-trained vision and language encoders fixed and update M³ISAs on side networks to establish connections between them, thereby achieving parameter- and memory-efficient tuning for REC. Empirical results on three benchmarks show M²IST achieves the best performance-parameter-memory trade-off compared to full fine-tuning and other PETL methods, with only 3.14M tunable parameters (2.11% of full fine-tuning) and 15.44GB GPU memory usage (39.61% of full fine-tuning). Source code will soon be publicly available.

[CV-51] RMS-FlowNet++: Efficient and Robust Multi-Scale Scene Flow Estimation for Large-Scale Point Clouds

Link: https://arxiv.org/abs/2407.01129
Authors: Ramy Battrawy,René Schuster,Didier Stricker
Keywords: operate on high-density, efficient scene flow, scene flow, full input resolution, high-density point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This version of the article has been accepted by International Journal of Computer Vision (IJCV), and published on 23.05.2024

Abstract:The proposed RMS-FlowNet++ is a novel end-to-end learning-based architecture for accurate and efficient scene flow estimation that can operate on high-density point clouds. For hierarchical scene flow estimation, existing methods rely on expensive Farthest-Point-Sampling (FPS) to sample the scenes, must find large correspondence sets across the consecutive frames and/or must search for correspondences at a full input resolution. While this can improve the accuracy, it reduces the overall efficiency of these methods and limits their ability to handle large numbers of points due to memory requirements. In contrast to these methods, our architecture is based on an efficient design for hierarchical prediction of multi-scale scene flow. To this end, we develop a special flow embedding block that has two advantages over the current methods: First, a smaller correspondence set is used, and second, the use of Random-Sampling (RS) is possible. In addition, our architecture does not need to search for correspondences at a full input resolution. Exhibiting high accuracy, our RMS-FlowNet++ provides a faster prediction than state-of-the-art methods, avoids high memory requirements and enables efficient scene flow on dense point clouds of more than 250K points at once. Our comprehensive experiments verify the accuracy of RMS-FlowNet++ on the established FlyingThings3D data set with different point cloud densities and validate our design choices. Furthermore, we demonstrate that our model has a competitive ability to generalize to the real-world scenes of the KITTI data set without fine-tuning.
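To make the cost difference between the two sampling schemes concrete, here is a minimal pure-Python sketch of standard farthest-point sampling next to uniform random sampling. It is not taken from the RMS-FlowNet++ code; the function names and the `start` and `seed` parameters are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those chosen.

    Each of the k picks scans all N points, so the cost is O(N * k).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_sq[i] = squared distance from point i to its nearest chosen point.
    min_sq = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_sq[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_sq[i]:
                min_sq[i] = d
    return chosen


def random_sampling(points, k, seed=0):
    """Uniform sampling without replacement; no distance computations."""
    return random.Random(seed).sample(range(len(points)), k)
```

FPS gives evenly spread samples at the price of a full distance pass per pick, while random sampling needs no distance computations but may leave gaps in coverage, which is the trade-off the abstract alludes to.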

[CV-52] Comprehensive Dataset for Urban Streetlight Analysis

Link: https://arxiv.org/abs/2407.01117
Authors: Eliza Femi Sherley S,Sanjay T,Shri Kaanth P,Jeffrey Samuel S
Keywords: India major streets, Chennai region, systematically from India, India major, high-resolution streetlight images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This article includes a comprehensive collection of over 800 high-resolution streetlight images taken systematically from India’s major streets, primarily in the Chennai region. The images were methodically collected following standardized methods to assure uniformity and quality. Each image has been labelled and grouped into directories based on binary class labels, which indicate whether each streetlight is functional or not. This organized dataset is intended to make it easier to train and evaluate deep neural networks, allowing for the creation of pre-trained models that have robust feature representations. Such models have several potential uses, such as improving smart city surveillance systems, automating street infrastructure monitoring, and increasing urban management efficiency. The availability of this dataset is intended to inspire future research and development in computer vision and smart city technologies, supporting innovation and practical solutions to urban infrastructure concerns. The dataset can be accessed at this https URL.

[CV-53] Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal

Link: https://arxiv.org/abs/2407.01104
Authors: Ziqi Zeng,Chen Zhao,Weiling Cai,Chenyu Dong
Keywords: Existing unsupervised methods, Existing unsupervised, shadow removal tasks, addressed the challenges, challenges of inconsistent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing unsupervised methods have addressed the challenges of inconsistent paired data and tedious acquisition of ground-truth labels in shadow removal tasks. However, GAN-based training often faces issues such as mode collapse and unstable optimization. Furthermore, due to the complex mapping between shadow and shadow-free domains, merely relying on adversarial learning is not enough to capture the underlying relationship between the two domains, resulting in low-quality generated images. To address these problems, we propose a semantic-guided adversarial diffusion framework for self-supervised shadow removal, which consists of two stages. In the first stage, a semantic-guided generative adversarial network (SG-GAN) produces a coarse result and constructs paired synthetic data through a cycle-consistent structure. In the second stage, the coarse result is refined with a diffusion-based restoration module (DBRM) to enhance texture details and reduce edge artifacts. Meanwhile, we propose a multi-modal semantic prompter (MSP) that aids in extracting accurate semantic information from real images and text, guiding the shadow removal network to restore images better in SG-GAN. We conduct experiments on multiple public datasets, and the experimental results demonstrate the effectiveness of our method.

[CV-54] Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Link: https://arxiv.org/abs/2407.01094
Authors: Mingxiang Liao,Hannan Lu,Xinyu Zhang,Fang Wan,Tianyu Wang,Yuzhong Zhao,Wangmeng Zuo,Qixiang Ye,Jingdong Wang
Keywords: Comprehensive and constructive, constructive evaluation protocols, evaluation protocols play, development of sophisticated, dynamics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at this https URL.
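The agreement figure quoted in this abstract is a Pearson correlation between metric scores and human ratings. A minimal sketch of that statistic follows; the `pearson` helper is written for illustration and the numbers in the usage example are made up, not taken from DEVIL.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-video values: automatic score vs. mean human rating.
    metric_scores = [0.12, 0.35, 0.50, 0.71, 0.90]
    human_ratings = [1.0, 2.0, 3.5, 4.0, 5.0]
    print(round(pearson(metric_scores, human_ratings), 3))
```

A value above 0.9 means the metric ranks and spaces videos almost linearly with the human scores, which is how "exceeding 90%" should be read here.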

[CV-55] Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

Link: https://arxiv.org/abs/2407.01092
Authors: Ivan Drokin
Keywords: sparked significant interest, Kolmogorov-Arnold Networks, scientific community, sparked significant, significant interest
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as Wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions, and achieving state-of-the-art results on BUSI, GlaS, and CVC datasets. We summarize all of our findings in a preliminary design guide of KAN convolutional models for computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code and implementations of convolutional layers and models, along with weights pre-trained on ImageNet1k, are available on GitHub via this https URL.

[CV-56] CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Link: https://arxiv.org/abs/2407.01081
Authors: Yuxuan Wang,Yijun Liu,Fei Yu,Chen Huang,Kexin Li,Zhiguo Wan,Wanxiang Che
Keywords: Chinese vision-language models, constructed on Western-centric, Chinese vision-language, Chinese, Chinese culture
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs’ understanding of Chinese culture.

[CV-57] Multimodal Conditional 3D Face Geometry Generation

Link: https://arxiv.org/abs/2407.01074
Authors: Christopher Otto,Prashanth Chandran,Sebastian Weiss,Markus Gross,Gaspard Zoss,Derek Bradley
Keywords: multimodal conditional, method for multimodal, output identity, identity and expression, FLAME face model
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:We present a new method for multimodal conditional 3D face geometry generation that allows user-friendly control over the output identity and expression via a number of different conditioning signals. Within a single model, we demonstrate 3D faces generated from artistic sketches, 2D face landmarks, Canny edges, FLAME face model parameters, portrait photos, or text prompts. Our approach is based on a diffusion process that generates 3D geometry in a 2D parameterized UV domain. Geometry generation passes each conditioning signal through a set of cross-attention layers (IP-Adapter), one set for each user-defined conditioning signal. The result is an easy-to-use 3D face generation tool that produces high resolution geometry with fine-grain user control.

[CV-58] Human-like object concept representations emerge naturally in multimodal large language models

Link: https://arxiv.org/abs/2407.01067
Authors: Changde Du,Kaicheng Fu,Bincheng Wen,Yi Sun,Jie Peng,Wei Wei,Ying Gao,Shengpei Wang,Chuncheng Zhang,Jinpeng Li,Shuang Qiu,Le Chang,Huiguang He
Keywords: offering crucial insights, Large Language Models, intrigued cognitive scientists, long intrigued cognitive, scientists and neuroscientists
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.

[CV-59] Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

Link: https://arxiv.org/abs/2407.01034
Authors: Han EunGi,Oh Hyun-Bin,Kim Sung-Bin,Corentin Nivelet Etcheberry,Suekyeong Nam,Janghoon Joo,Tae-Hyun Oh
Keywords: recently garnered attention, garnered attention due, multimedia production, recently garnered, garnered attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: INTERSPEECH 2024

Abstract:Speech-driven 3D facial animation has recently garnered attention due to its cost-effective usability in multimedia production. However, most current advances overlook the intelligibility of lip movements, limiting the realism of facial expressions. In this paper, we introduce a method for speech-driven 3D facial animation to generate accurate lip movements, proposing an audio-visual multimodal perceptual loss. This loss provides guidance to train the speech-driven 3D facial animators to generate plausible lip motions aligned with the spoken transcripts. Furthermore, to incorporate the proposed audio-visual perceptual loss, we devise an audio-visual lip reading expert leveraging its prior knowledge about correlations between speech and lip motions. We validate the effectiveness of our approach through broad experiments, showing noticeable improvements in lip synchronization and lip readability performance. Codes are available at this https URL.

[CV-60] Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Link: https://arxiv.org/abs/2407.01032
Authors: Jeremias Traub,Till J. Bungert,Carsten T. Lüth,Michael Baumgartner,Klaus H. Maier-Hein,Lena Maier-Hein,Paul F Jaeger
Keywords: reject low-confidence predictions, promises reliable translation, machine-learning based classification, based classification systems, low-confidence predictions
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Comments:

Abstract:Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
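The phrase "average risk of undetected failures" can be made concrete with a small empirical sketch. It assumes one particular reading of the metric: at each coverage level, the generalized risk is the fraction of all samples that are both accepted and wrong, and the area is approximated by averaging over the n possible acceptance cut-offs. The `augrc` function below is an illustration under that assumption, not the authors' reference implementation.

```python
def augrc(confidences, failures):
    """Discrete approximation of the area under the generalized
    risk-coverage curve.

    confidences: per-sample confidence scores (higher = more confident)
    failures:    per-sample 0/1 flags, 1 if the prediction was wrong
    """
    n = len(confidences)
    if n == 0 or n != len(failures):
        raise ValueError("inputs must be non-empty and of equal length")
    # Accept samples from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    accepted_failures = 0
    area = 0.0
    for i in order:
        accepted_failures += failures[i]
        # Joint rate of "accepted and wrong" at this coverage level.
        area += accepted_failures / n
    return area / n


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.7, 0.6]
    print(augrc(conf, [0, 0, 0, 1]))  # failure ranked last: low value
    print(augrc(conf, [1, 0, 0, 0]))  # failure ranked first: higher value
```

Under this reading, a confidence score that ranks failures last keeps them out of the accepted set for most thresholds, so lower values are better and a classifier with no failures scores 0.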

[CV-61] EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting

Link: https://arxiv.org/abs/2407.01029
Authors: Chenxin Li,Brandon Y. Feng,Yifan Liu,Hengyu Liu,Cheng Wang,Weihao Yu,Yixuan Yuan
Keywords: important downstream surgical, downstream surgical applications, biological tissues, key to unlock, unlock various important
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2024

Abstract:3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is usually the case in real-world clinical scenarios. To tackle this sparsity challenge, we propose a framework leveraging the prior knowledge from multiple foundation models during the reconstruction process, dubbed EndoSparse. Experimental results indicate that our proposed strategy significantly improves the geometric and appearance quality under challenging sparse-view conditions, including using only three views. In rigorous benchmarking experiments against state-of-the-art methods, EndoSparse achieves superior results in terms of accurate geometry, realistic appearance, and rendering efficiency, confirming the robustness to sparse-view limitations in endoscopic reconstruction. EndoSparse signifies a steady step towards the practical deployment of neural 3D reconstruction in real-world clinical scenarios. Project page: this https URL.

[CV-62] Blind Inversion using Latent Diffusion Priors

Link: https://arxiv.org/abs/2407.01027
Authors: Weimin Bai,Siyi Chen,Wenzheng Chen,He Sun
Keywords: complex prior distributions, exceptional ability, Diffusion models, Diffusion, inverse problems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have emerged as powerful tools for solving inverse problems due to their exceptional ability to model complex prior distributions. However, existing methods predominantly assume known forward operators (i.e., non-blind), limiting their applicability in practical settings where acquiring such operators is costly. Additionally, many current approaches rely on pixel-space diffusion models, leaving the potential of more powerful latent diffusion models (LDMs) underexplored. In this paper, we introduce LatentDEM, an innovative technique that addresses more challenging blind inverse problems using latent diffusion priors. At the core of our method is solving blind inverse problems within an iterative Expectation-Maximization (EM) framework: (1) the E-step recovers clean images from corrupted observations using LDM priors and a known forward model, and (2) the M-step estimates the forward operator based on the recovered images. Additionally, we propose two novel optimization techniques tailored for LDM priors and EM frameworks, yielding more accurate and efficient blind inversion results. As a general framework, LatentDEM supports both linear and non-linear inverse problems. Beyond common 2D image restoration tasks, it enables new capabilities in non-linear 3D inverse rendering problems. We validate LatentDEM’s performance on representative 2D blind deblurring and 3D sparse-view reconstruction tasks, demonstrating its superior efficacy over prior arts.
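The E-step/M-step alternation described in this abstract can be illustrated on a toy scalar problem. The sketch below is not LatentDEM: it assumes observations y = a·x + noise with an unknown gain `a` playing the role of the forward operator, and it replaces the latent diffusion prior with a simple Gaussian prior on x so that both steps have closed forms. All names and parameters are hypothetical.

```python
def em_blind_gain(ys, prior_mean, prior_var, noise_var,
                  a_init=1.0, iters=200):
    """EM for y_i = a * x_i + noise, x_i ~ N(prior_mean, prior_var).

    E-step: Gaussian posterior of each clean signal x_i given the
            current estimate of the forward operator a.
    M-step: re-estimate a from the posterior moments of x_i.
    """
    a = a_init
    post_means = []
    for _ in range(iters):
        # E-step: posterior variance is shared; the mean depends on y_i.
        post_var = 1.0 / (a * a / noise_var + 1.0 / prior_var)
        post_means = [
            post_var * (a * y / noise_var + prior_mean / prior_var)
            for y in ys
        ]
        # M-step: maximize the expected log-likelihood over a, which
        # gives a = sum(y * E[x]) / sum(E[x^2]).
        numerator = sum(y * m for y, m in zip(ys, post_means))
        denominator = sum(m * m + post_var for m in post_means)
        a = numerator / denominator
    return a, post_means


if __name__ == "__main__":
    # Observations generated with a true gain of 2.0 (toy data).
    observations = [2.0, 4.0, 6.0, 8.0]
    gain, recovered = em_blind_gain(observations, prior_mean=2.5,
                                    prior_var=1.25, noise_var=1.0)
    print(gain, recovered)
```

The structure carries over to the blind setting in general: the prior alone cannot identify the operator and the operator alone cannot identify the signal, so each step conditions on the other's current estimate.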

[CV-63] Coding for Intelligence from the Perspective of Category

Link: https://arxiv.org/abs/2407.01017
Authors: Wenhan Yang,Zixuan Hu,Lilang Lin,Jiaying Liu,Ling-Yu Duan
Keywords: abstract computational level, abstract computational, interweave recently, significant progress, targets compressing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Coding, which targets compressing and reconstructing data, and intelligence, often regarded at an abstract computational level as being centered around model learning and prediction, have recently become intertwined, giving rise to a series of significant advances. The recent trends demonstrate the potential homogeneity of these two fields, especially when deep-learning models aid these two categories for better probability modeling. For better understanding and describing from a unified perspective, inspired by the basic generally recognized principles in cognitive psychology, we formulate a novel problem of Coding for Intelligence (CfI) from the category theory view. Based on the three axioms: existence of ideal coding, existence of practical coding, and compactness promoting generalization, we derive a general framework to understand existing methodologies, namely that coding captures the intrinsic relationships of objects as much as possible, while ignoring information irrelevant to downstream tasks. This framework helps identify the challenges and essential elements in solving the specific derived Minimal Description Length (MDL) optimization problem from a broader range, providing opportunities to build a more intelligent system for handling multiple tasks/applications with coding ideas/tools. Centering on those elements, we systematically review recent progress towards optimizing the MDL problem in more comprehensive ways from data, model, and task perspectives, and reveal their impacts on the potential CfI technical routes. After that, we also present new technique paths to fulfill CfI and provide potential solutions with preliminary experimental evidence. Last, further directions and remaining issues are discussed as well. The discussion shows our theory can reveal many phenomena and insights about large foundation models, which mutually corroborate with recent practices in feature learning.

[CV-64] SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

Link: https://arxiv.org/abs/2407.01016
Authors: Dingkang Liang,Wei Hua,Chunsheng Shi,Zhikang Zou,Xiaoqing Ye,Xiang Bai
Keywords: hot topic recently, boost object detectors, Semi-supervised object detection, Semi-supervised Oriented Object, Oriented Object Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images are usually arbitrary orientations, small scales, and aggregation, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.92, +2.39, and +2.57 mAP under 10%, 20%, and 30% labeled data settings, respectively) with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. The code will be made available.

[CV-65] An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations

Link: https://arxiv.org/abs/2407.01014
Authors: Weimin Bai,Yifei Wang,Wenzheng Chen,He Sun
Keywords: inverse problems due, complex image priors, solving imaging inverse, imaging inverse problems, excel in solving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models excel in solving imaging inverse problems due to their ability to model complex image priors. However, their reliance on large, clean datasets for training limits their practical use where clean data is scarce. In this paper, we propose EMDiffusion, an expectation-maximization (EM) approach to train diffusion models from corrupted observations. Our method alternates between reconstructing clean images from corrupted data using a known diffusion model (E-step) and refining diffusion model weights based on these reconstructions (M-step). This iterative process leads the learned diffusion model to gradually converge to the true clean data distribution. We validate our method through extensive experiments on diverse computational imaging tasks, including random inpainting, denoising, and deblurring, achieving new state-of-the-art performance.

[CV-66] Swish-T: Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance

链接: https://arxiv.org/abs/2407.01012
作者: Youngmin Seo,Jinha Kim,Unsang Park
关键词: activation function Swish, existing non-monotonic activation, original Swish, original Swish function, Swish-T
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification creates a family of Swish-T variants, each designed to excel in different tasks, showcasing specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T$_{\textbf{C}}$ function, while Swish-T and Swish-T$_{\textbf{B}}$, byproducts of Swish-T$_{\textbf{C}}$, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T$_{\textbf{C}}$ as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at this https URL.
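The abstract says only that a Tanh bias is added to Swish. The sketch below assumes the base form f(x) = x·sigmoid(βx) + α·tanh(x). That form, the default `alpha=0.1`, and the function names are assumptions for illustration; the B and C variants are not reproduced here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Original Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_t(x, beta=1.0, alpha=0.1):
    """Swish plus a scaled tanh bias (assumed form; alpha is a hypothetical default)."""
    return swish(x, beta) + alpha * math.tanh(x)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  swish_t={swish_t(x):+.4f}")
```

Setting `alpha=0.0` recovers plain Swish, which makes the added term easy to ablate. For negative inputs the tanh term pushes the output further below zero, in line with the "broader acceptance of negative values" described above.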

[CV-67] GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

链接: https://arxiv.org/abs/2407.01007
作者: Huijie Fan,Tinghui Zhao,Qiang Wang,Baojie Fan,Yandong Tang,LianQing Liu
关键词: data association problem, multi-target multi-camera, main challenge, task of multi-target, complications arising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the task of multi-target multi-camera (MTMC) tracking of pedestrians, the data association problem is a key issue and main challenge, especially with complications arising from camera movements, lighting variations, and obstructions. However, most MTMC models adopt two-step approaches, thus heavily depending on the results of the first-step tracking in practical applications. Moreover, the same targets crossing different cameras may exhibit significant appearance variations, which further increases the difficulty of cross-camera matching. To address the aforementioned issues, we propose a global online MTMC tracking model that addresses the dependency on the first tracking stage in two-step methods and enhances cross-camera matching. Specifically, we propose a transformer-based global MTMC association module to explore target associations across different cameras and frames, generating global trajectories directly. Additionally, to integrate the appearance and spatio-temporal features of targets, we propose a feature extraction and fusion module for MTMC tracking. This module enhances feature representation and establishes correlations between the features of targets across multiple cameras. To accommodate high scene diversity and complex lighting condition variations, we have established the VisionTrack dataset, which enables the development of models that are more generalized and robust to various environments. Our model demonstrates significant improvements over comparison methods on the VisionTrack dataset and others.

[CV-68] Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images

链接: https://arxiv.org/abs/2407.01003
作者: Wenqiang Zu,Shenghao Xie,Qing Zhao,Guoqi Li,Lei Ma
关键词: natural imaging downstream, imaging downstream tasks, Foundation models, Foundation models pre-trained, prompt tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16 pages, 7 figures. arXiv admin note: text overlap with arXiv:2306.09579 , arXiv:2203.12119 by other authors

点击查看摘要

Abstract:Foundation models pre-trained on large-scale data have been widely witnessed to achieve success in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain few-shot scenarios, e.g., medical image analysis, has not been fully explored. In this work, we facilitate the study of the performance of PEFT when adapting foundation models to medical image classification tasks. Furthermore, to alleviate the limitations of mainstream prompt tuning methods on Transformer architectures, namely in how prompts are introduced and in their approximation capabilities, we propose the Embedded Prompt Tuning (EPT) method by embedding prompt tokens into the expanded channels. We also find that there are anomalies in the feature space distribution of foundation models during the pre-training process, and prompt tuning can help mitigate this negative impact. To explain this phenomenon, we also introduce a novel perspective to understand prompt tuning: prompt tuning is a distribution calibrator. We support it by analyzing the patch-wise scaling and feature separation operations contained in EPT. Our experiments show that EPT outperforms several state-of-the-art fine-tuning methods by a significant margin on few-shot medical image classification tasks, and completes the fine-tuning process within highly competitive time, indicating EPT is an effective PEFT method. Our code will be released once accepted.

[CV-69] Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

链接: https://arxiv.org/abs/2407.00985
作者: Takayuki Nishimura,Katsuyuki Kuyo,Motonari Kambara,Komei Sugiura
关键词: domestic service robots, object manipulation instruction, open vocabulary instructions, generating segmentation masks, give open vocabulary
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for presentation at IROS2024

点击查看摘要

Abstract:We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera’s field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.
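The vertex-order problem mentioned above can be made concrete with a small sketch. This is not the paper's loss: with equal, uniform weights on both vertex sets, an optimal transport plan reduces to a one-to-one assignment, so the example brute-forces the cheapest assignment for tiny polygons. The function names are hypothetical.

```python
import math
from itertools import permutations


def naive_vertex_loss(pred, target):
    """Mean distance between vertices paired by their index (order-sensitive)."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def order_invariant_vertex_loss(pred, target):
    """Mean distance under the cheapest one-to-one vertex matching.

    Brute force over permutations: factorial cost, so only for small polygons.
    A practical version would use an assignment or Sinkhorn-style solver.
    """
    if len(pred) != len(target):
        raise ValueError("polygons must have the same number of vertices")
    n = len(pred)
    best = min(
        sum(math.dist(pred[i], target[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    shifted = square[2:] + square[:2]  # same polygon, different start vertex
    print(naive_vertex_loss(square, shifted))            # large despite identical shape
    print(order_invariant_vertex_loss(square, shifted))  # zero
```

The index-paired loss penalizes a prediction that describes exactly the same polygon with a different starting vertex, while the matching-based loss does not. That is the failure case the abstract attributes to conventional polygon-based approaches.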

[CV-70] FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

链接: https://arxiv.org/abs/2407.00983
作者: Ruinan Jin,Zikang Xu,Yuan Zhong,Qiongsong Yao,Qi Dou,S. Kevin Zhou,Xiaoxiao Li
关键词: offers unprecedented opportunities, healthcare offers unprecedented, enhance medical diagnostics, advent of foundation, offers unprecedented
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 29 pages, 17 figures

点击查看摘要

Abstract:The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about their fairness, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries to evaluate and understand the fairness performance of FMs in medical imaging, leading to considerable challenges in formulating and implementing solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates with 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs, with various usages such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting in various downstream tasks – classification and segmentation. Our exhaustive analysis evaluates the fairness performance over different evaluation metrics from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs on different FMs, consistent disparities on the same datasets regardless of FMs, and limited effectiveness of existing unfairness mitigation methods. Check out FairMedFM’s project page and open-sourced codebase, which supports extensible functionalities and applications and is designed to be inclusive of long-term studies on FMs in medical imaging.

[CV-71] Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

链接: https://arxiv.org/abs/2407.00979
作者: Hanwen Su,Ge Song,Kai Huang,Jiyan Wang,Ming Yang
关键词: sketch-based image retrieval, zero-shot sketch-based image, study the problem, Description Generation Module, Cross-modal Alignment Module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). The prior methods tackle the problem in a two-modality setting with only category labels or even no textual information involved. However, the growing prevalence of Large-scale pre-trained Language Models (LLMs), which have demonstrated great knowledge learned from web-scale data, can provide us with an opportunity to conclude collective textual information. Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for each training category by prompting an LLM with several interrogative sentences, (ii) a Feature Extraction Module that includes two ViTs for sketch and image data and a transformer for extracting tokens of sentences of each training category, and (iii) a Cross-modal Alignment Module that exchanges the token features of both text-sketch and text-image using a cross-attention mechanism, and aligns the tokens locally and globally. Extensive experiments on three benchmark datasets show our superior performance over the state-of-the-art ZS-SBIR methods.

[CV-72] FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing

链接: https://arxiv.org/abs/2407.00972
作者: Donghyun Kim,Seil Kang,Seong Jae Hwang
关键词: pervasive challenge crucial, robust vision applications, addressing atmospheric interference, Frequency Adjoint Link, Image dehazing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image dehazing, addressing atmospheric interference like fog and haze, remains a pervasive challenge crucial for robust vision applications such as surveillance and remote sensing under adverse visibility. While various methodologies have evolved from early works predicting transmission matrix and atmospheric light features to deep learning and dehazing networks, they innately prioritize dehazing quality metrics, neglecting the need for real-time applicability in time-sensitive domains like autonomous driving. This work introduces FALCON (Frequency Adjoint Link with CONtinuous density mask), a single-image dehazing system achieving state-of-the-art performance on both quality and speed. Particularly, we develop a novel bottleneck module, namely, Frequency Adjoint Link, operating in the frequency space to globally expand the receptive field with minimal growth in network size. Further, we leverage the underlying haze distribution based on the atmospheric scattering model via a Continuous Density Mask (CDM) which serves as a continuous-valued mask input prior and a differentiable auxiliary loss. Comprehensive experiments involving multiple state-of-the-art methods and ablation analysis demonstrate FALCON’s exceptional performance in both dehazing quality and speed (i.e., 180 frames-per-second), quantified by metrics such as FPS, PSNR, and SSIM.

[CV-73] Deep learning for automated detection of breast cancer in deep ultraviolet fluorescence images with diffusion probabilistic model

链接: https://arxiv.org/abs/2407.00967
作者: Sepehr Salem Ghahfarokhi,Tyrell To,Julie Jorns,Tina Yen,Bing Yu,Dong Hye Ye
关键词: Data limitation, applying deep learning, significant challenge, challenge in applying, learning to medical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: IEEE International Symposium on Biomedical Imaging 2024

点击查看摘要

Abstract:Data limitation is a significant challenge in applying deep learning to medical images. Recently, the diffusion probabilistic model (DPM) has shown the potential to generate high-quality images by converting Gaussian random noise into realistic images. In this paper, we apply the DPM to augment the deep ultraviolet fluorescence (DUV) image dataset with an aim to improve breast cancer classification for intraoperative margin assessment. For classification, we divide the whole surface DUV image into small patches and extract convolutional features for each patch by utilizing the pre-trained ResNet. Then, we feed them into an XGBoost classifier for patch-level decisions and then fuse them with a regional importance map computed by Grad-CAM++ for whole surface-level prediction. Our experimental results show that augmenting the training dataset with the DPM significantly improves breast cancer detection performance in DUV images, increasing accuracy from 93% to 97%, compared to using Affine transformations and ProGAN.

[CV-74] SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection

链接: https://arxiv.org/abs/2407.00949
作者: Yanheng Wang,Xiaohan Yu,Yongsheng Gao,Jianjun Sha,Jian Wang,Lianru Gao,Yonggang Zhang,Xianhui Rong
关键词: including convolutional neural, deep learning methods, graph neural networks, convolutional neural networks, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:It has been verified that deep learning methods, including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformers, can accurately extract features from hyperspectral images (HSIs). These algorithms perform exceptionally well on HSIs change detection (HSIs-CD). However, the downside of these impressive results is the enormous number of parameters, FLOPs, GPU memory, training and test times required. In this paper, we propose a spectral Kolmogorov-Arnold Network for HSIs-CD (SpectralKAN). SpectralKAN represents a multivariate continuous function with a composition of activation functions to extract HSI features and perform classification. These activation functions are B-spline functions with different parameters that can simulate various functions. In SpectralKAN, a KAN encoder is proposed to enhance computational efficiency for HSIs. A spatial-spectral KAN encoder is also introduced, where the spatial KAN encoder extracts spatial features and compresses the spatial dimensions from patch size to one. The spectral KAN encoder then extracts spectral features and classifies them into changed and unchanged categories. We use five HSIs-CD datasets to verify the effectiveness of SpectralKAN. Experimental verification has shown that SpectralKAN maintains high HSIs-CD accuracy while requiring fewer parameters, FLOPs, GPU memory, training and testing times, thereby increasing the efficiency of HSIs-CD. The code will be available at this https URL.

[CV-75] Diffusion Transformer Model With Compact Prior for Low-dose PET Reconstruction

链接: https://arxiv.org/abs/2407.00944
作者: Bin Huang,Xubiao Liu,Lei Fang,Qiegen Liu,Bingxuan Li
关键词: Positron emission tomography, Positron emission, low-dose PET, advanced medical imaging, medical imaging technique
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Positron emission tomography (PET) is an advanced medical imaging technique that plays a crucial role in non-invasive clinical diagnosis. However, while reducing radiation exposure through low-dose PET scans is beneficial for patient safety, it often results in insufficient statistical data. This scarcity of data poses significant challenges for accurately reconstructing high-quality images, which are essential for reliable diagnostic outcomes. In this research, we propose a diffusion transformer model (DTM) guided by joint compact prior (JCP) to enhance the reconstruction quality of low-dose PET imaging. In light of current research findings, we present a pioneering PET reconstruction model that integrates diffusion and transformer models for joint optimization. This model combines the powerful distribution mapping abilities of diffusion models with the capacity of transformers to capture long-range dependencies, offering significant advantages for low-dose PET reconstruction. Additionally, the incorporation of the lesion refining block and penalized weighted least squares (PWLS) enhances the recovery capability of lesion regions and preserves detail information, addressing the blurring of lesion areas and texture details seen in most deep learning frameworks. Experimental results demonstrate the effectiveness of DTM in enhancing image quality and preserving critical clinical information for low-dose PET scans. Our approach not only reduces radiation exposure risks but also provides a more reliable PET imaging tool for early disease detection and patient management.

[CV-76] PointViG: A Lightweight GNN-based Model for Efficient Point Cloud Analysis

链接: https://arxiv.org/abs/2407.00921
作者: Qiang Zheng,Yafei Qi,Chen Wang,Chao Zhang,Jian Sun
关键词: Graph Neural Networks, Neural Networks, existing approaches encounter, approaches encounter challenges, point cloud analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the domain of point cloud analysis, despite the significant capabilities of Graph Neural Networks (GNNs) in managing complex 3D datasets, existing approaches encounter challenges like high computational costs and scalability issues with extensive scenarios. These limitations restrict the practical deployment of GNNs, notably in resource-constrained environments. To address these issues, this study introduces Point Vision GNN (PointViG), an efficient framework for point cloud analysis. PointViG incorporates a lightweight graph convolutional module to efficiently aggregate local features and mitigate over-smoothing. For large-scale point cloud scenes, we propose an adaptive dilated graph convolution technique that searches for sparse neighboring nodes within a dilated neighborhood based on semantic correlation, thereby expanding the receptive field and ensuring computational efficiency. Experiments demonstrate that PointViG achieves performance comparable to state-of-the-art models while balancing performance and complexity. On the ModelNet40 classification task, PointViG achieved 94.3% accuracy with 1.5M parameters. For the S3DIS segmentation task, it achieved an mIoU of 71.7% with 5.3M parameters. These results underscore the potential and efficiency of PointViG in point cloud analysis.

[CV-77] From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos

链接: https://arxiv.org/abs/2407.00917
作者: Tanqiu Qiao,Ruochen Li,Frederick W. B. Li,Hubert P. H. Shum
关键词: Video-based Human-Object Interaction, Video-based Human-Object, recognition explores, behavior and intentions, explores the intricate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ICPR 2024

点击查看摘要

Abstract:Video-based Human-Object Interaction (HOI) recognition explores the intricate dynamics between humans and objects, which are essential for a comprehensive understanding of human behavior and intentions. While previous work has made significant strides, effectively integrating geometric and visual features to model dynamic relationships between humans and objects in a graph framework remains a challenge. In this work, we propose a novel end-to-end category to scenery framework, CATS, starting by generating geometric features for various categories through graphs respectively, then fusing them with corresponding visual features. Subsequently, we construct a scenery interactive graph with these enhanced geometric-visual features as nodes to learn the relationships among human and object categories. This methodological advance facilitates a deeper, more structured comprehension of interactions, bridging category-specific insights with broad scenery dynamics. Our method demonstrates state-of-the-art performance on two pivotal HOI benchmarks, including the MPHOI-72 dataset for multi-person HOIs and the single-person HOI CAD-120 dataset.

[CV-78] Deep Image-to-Recipe Translation

链接: https://arxiv.org/abs/2407.00911
作者: Jiangqin Ma,Bilal Mawji,Franz Williams
关键词: profound level, reflecting the intricate, intricate connection, Eat, cherished food memories
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The modern saying, “You Are What You Eat” resonates on a profound level, reflecting the intricate connection between our identities and the food we consume. Our project, Deep Image-to-Recipe Translation, is an intersection of computer vision and natural language generation that aims to bridge the gap between cherished food memories and the art of culinary creation. Our primary objective involves predicting ingredients from a given food image. For this task, we first develop a custom convolutional network and then compare its performance to a model that leverages transfer learning. We pursue an additional goal of generating a comprehensive set of recipe steps from a list of ingredients. We frame this process as a sequence-to-sequence task and develop a recurrent neural network that utilizes pre-trained word embeddings. We address several challenges of deep learning including imbalanced datasets, data cleaning, overfitting, and hyperparameter selection. Our approach emphasizes the importance of metrics such as Intersection over Union (IoU) and F1 score in scenarios where accuracy alone might be misleading. For our recipe prediction model, we employ perplexity, a commonly used and important metric for language models. We find that transfer learning via pre-trained ResNet-50 weights and GloVe embeddings provide an exceptional boost to model performance, especially when considering training resource constraints. Although we have made progress on the image-to-recipe translation, there is an opportunity for future exploration with advancements in model architectures, dataset scalability, and enhanced user interaction.
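The three metrics named above can be sketched in a few lines. This is a minimal illustration using their standard definitions, treating ingredient prediction as set prediction and computing perplexity as the exponential of the mean negative log-likelihood. The function names and the example ingredients are hypothetical and are not taken from the project.

```python
import math


def iou(predicted, actual):
    """Intersection over Union between predicted and true ingredient sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over ingredient sets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_probs):
    """exp of the mean negative log-probability the model gave each true token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    pred = {"flour", "sugar", "egg"}
    true = {"sugar", "egg", "butter"}
    print(iou(pred, true), f1_score(pred, true))
    print(perplexity([0.25, 0.25]))
```

Accuracy can mislead here because most ingredients are absent from any given dish, so a model that predicts "absent" everywhere scores well on per-label accuracy. IoU and F1 ignore the true negatives and therefore expose that failure.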

[CV-79] Heterogeneous Graph-based Framework with Disentangled Representations Learning for Multi-target Cross Domain Recommendation

链接: https://arxiv.org/abs/2407.00909
作者: Xiaopeng Liu,Juan Zhang,Chongqi Ren,Shenghui Xu,Zhaoming Pan,Zhimin Zhang
关键词: data sparsity problem, Cross-Domain Recommendation, recommendation system, CDR, critical solution
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:CDR (Cross-Domain Recommendation), i.e., leveraging information from multiple domains, is a critical solution to the data sparsity problem in recommendation systems. The majority of previous research either focused on single-target CDR (STCDR) by utilizing data from the source domains to improve the model’s performance on the target domain, or applied dual-target CDR (DTCDR) by integrating data from the source and target domains. In addition, multi-target CDR (MTCDR) is a generalization of DTCDR, which is able to capture the link among different domains. In this paper we present HGDR (Heterogeneous Graph-based Framework with Disentangled Representations Learning), an end-to-end heterogeneous network architecture in which graph convolutional layers model relations among different domains, while disentangled representations separate domain-shared and domain-specific information. First, a shared heterogeneous graph is generated by gathering users and items from several domains without any further side information. Second, we use HGDR to compute disentangled representations for users and items in all domains. Experiments on real-world datasets and online A/B tests show that our proposed model can transmit information among domains effectively and reach SOTA performance.

[CV-80] GSO-YOLO: Global Stability Optimization YOLO for Construction Site Detection

链接: https://arxiv.org/abs/2407.00906
作者: Yuming Zhang,Dongzhi Guan,Shouxin Zhang,Junhao Su,Yunzhi Han,Jiabin Liu
关键词: causing economic damage, economic damage due, construction sites, plagued the industry, posing risks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safety issues at construction sites have long plagued the industry, posing risks to worker safety and causing economic damage due to potential hazards. With the advancement of artificial intelligence, particularly in the field of computer vision, the automation of safety monitoring on construction sites has emerged as a solution to this longstanding issue. Despite achieving impressive performance, advanced object detection methods like YOLOv8 still face challenges in handling the complex conditions found at construction sites. To solve these problems, this study presents the Global Stability Optimization YOLO (GSO-YOLO) model to address challenges in complex construction sites. The model integrates the Global Optimization Module (GOM) and Steady Capture Module (SCM) to enhance global contextual information capture and detection stability. The innovative AIoU loss function, which combines CIoU and EIoU, improves detection accuracy and efficiency. Experiments on datasets like SODA, MOCS, and CIS show that GSO-YOLO outperforms existing methods, achieving SOTA performance.

[CV-81] Learning Robust 3D Representation from CLIP via Dual Denoising

链接: https://arxiv.org/abs/2407.00905
作者: Shuqing Luo,Bowen Qu,Wei Gao
关键词: pre-trained vision language, vision language models, under-investigated issue, explore a critical, critical yet under-investigated
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we explore a critical yet under-investigated issue: how to learn robust and well-generalized 3D representation from pre-trained vision language models such as CLIP. Previous works have demonstrated that cross-modal distillation can provide rich and useful knowledge for 3D data. However, like most deep learning models, the resultant 3D learning network is still vulnerable to adversarial attacks especially the iterative attack. In this work, we propose Dual Denoising, a novel framework for learning robust and well-generalized 3D representations from CLIP. It combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training. Additionally, we propose utilizing parallel noise inference to enhance the generalization of point cloud features under cross domain settings. Experiments show that our model can effectively improve the representation learning performance and adversarial robustness of the 3D learning network under zero-shot settings without adversarial training. Our code is available at this https URL.

[CV-82] From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

链接: https://arxiv.org/abs/2407.00902
作者: Nan Xu,Fei Wang,Sheng Zhang,Hoifung Poon,Muhao Chen
关键词: Large Language models, Large Language, multiple image-text pairs, similar ICL abilities, capabilities of Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Motivated by the in-context learning (ICL) capabilities of large language models (LLMs), multimodal LLMs with an additional visual modality have also been shown to exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Considering such modality impact, we further utilize modality-driven demonstration strategies to boost ICL performance. We also identify that demonstration selection is closely related to the models’ ability to capture task inductive biases from multimodal ICL. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks even if those tasks are not seen in or even contradict pretraining data.

[CV-83] Dynamically Modulating Visual Place Recognition Sequence Length For Minimum Acceptable Performance Scenarios

链接: https://arxiv.org/abs/2407.00863
作者: Connor Malone,Ankit Vora,Thierry Peynot,Michael Milford
关键词: Mobile robots, GPS become uncertain, critical position estimates, uncertain or unreliable, robots and autonomous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: DOI TBC

点击查看摘要

Abstract:Mobile robots and autonomous vehicles are often required to function in environments where critical position estimates from sensors such as GPS become uncertain or unreliable. Single image visual place recognition (VPR) provides an alternative for localization but often requires techniques such as sequence matching to improve robustness, which incurs additional computation and latency costs. Even then, the sequence length required to localize at an acceptable performance level varies widely; and simply setting overly long fixed sequence lengths creates unnecessary latency, computational overhead, and can even degrade performance. In these scenarios it is often more desirable to meet or exceed a set target performance at minimal expense. In this paper we present an approach which uses a calibration set of data to fit a model that modulates sequence length for VPR as needed to exceed a target localization performance. We make use of a coarse position prior, which could be provided by any other localization system, and capture the variation in appearance across this region. We use the correlation between appearance variation and sequence length to curate VPR features and fit a multilayer perceptron (MLP) for selecting the optimal length. We demonstrate that this method is effective at modulating sequence length to maximize the number of sections in a dataset which meet or exceed a target performance whilst minimizing the median length used. We show applicability across several datasets and reveal key phenomena like generalization capabilities, the benefits of curating features and the utility of non-state-of-the-art feature extractors with nuanced properties.

[CV-84] SAFE: a SAR Feature Extractor based on self-supervised learning and masked Siamese ViTs

链接: https://arxiv.org/abs/2407.00851
作者: Max Muzeau,Joana Frontera-Pons,Chengfang Ren,Jean-Philippe Ovarlez
关键词: Synthetic Aperture Radar, Aperture Radar imagery, Synthetic Aperture, Aperture Radar, earth monitoring
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Due to its all-weather and day-and-night capabilities, Synthetic Aperture Radar imagery is essential for various applications such as disaster management, earth monitoring, change detection and target recognition. However, the scarcity of labeled SAR data limits the performance of most deep learning algorithms. To address this issue, we propose a novel self-supervised learning framework based on masked Siamese Vision Transformers to create a General SAR Feature Extractor coined SAFE. Our method leverages contrastive learning principles to train a model on unlabeled SAR data, extracting robust and generalizable features. SAFE is applicable across multiple SAR acquisition modes and resolutions. We introduce tailored data augmentation techniques specific to SAR imagery, such as sub-aperture decomposition and despeckling. Comprehensive evaluations on various downstream tasks, including few-shot classification, segmentation, visualization, and pattern detection, demonstrate the effectiveness and versatility of the proposed approach. Our network competes with or surpasses other state-of-the-art methods in few-shot classification and segmentation tasks, even without being trained on the sensors used for the evaluation.

[CV-85] DroBoost: An Intelligent Score and Model Boosting Method for Drone Detection

链接: https://arxiv.org/abs/2407.00830
作者: Ogulcan Eryuksel,Kamil Anil Ozfuttu,Fatih Cagatay Akyon,Kadir Sahin,Efe Buyukborekci,Devrim Cavusoglu,Sinan Altinuc
关键词: small visible objects, complex backgrounds, small visible, task where visibility, visibility conditions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Drone detection is a challenging object detection task where visibility conditions and quality of the images may be unfavorable, and detections might become difficult due to complex backgrounds, small visible objects, and hard to distinguish objects. Both provide high confidence for drone detections, and eliminating false detections requires efficient algorithms and approaches. Our previous work, which uses YOLOv5, uses both real and synthetic data and a Kalman-based tracker to track the detections and increase their confidence using temporal information. Our current work improves on the previous approach by combining several improvements. We used a more diverse dataset combining multiple sources and combined with synthetic samples chosen from a large synthetic dataset based on the error analysis of the base model. Also, to obtain more resilient confidence scores for objects, we introduced a classification component that discriminates whether the object is a drone or not. Finally, we developed a more advanced scoring algorithm for object tracking that we use to adjust localization confidence. Furthermore, the proposed technique won 1st Place in the Drone vs. Bird Challenge (Workshop on Small-Drone Surveillance, Detection and Counteraction Techniques at ICIAP 2021).

[CV-86] Image Classification for Snow Detection to Improve Pedestrian Safety

链接: https://arxiv.org/abs/2407.00818
作者: Ricardo de Deijn,Rajeev Bukralia
关键词: visually impaired individuals, winter-related fall injuries, vision approach aimed, reduce winter-related fall, fall injuries
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 1 figure, 1 table. Included in MWAIS 2024 Conference Proceedings. Chair: Jacob Young

点击查看摘要

Abstract:This study presents a computer vision approach aimed at detecting snow on sidewalks and pavements to reduce winter-related fall injuries, especially among elderly and visually impaired individuals. Leveraging fine-tuned VGG-19 and ResNet50 convolutional neural networks (CNNs), the research focuses on identifying snow presence in pavement images. The dataset comprises 98 images evenly split between snowy and snow-free conditions, evaluated with a separate test set using the F1 score and accuracy metrics. This work builds upon existing research by employing fine-tuned CNN architectures to accurately detect snow on pavements from smartphone-captured images. The methodology incorporates transfer learning and model ensembling techniques to integrate the best predictions from both the VGG19 and ResNet50 architectures. The study yields accuracy and F1 scores of 81.8% and 81.7%, respectively, showcasing the potential of computer vision in addressing winter-related hazards for vulnerable populations.
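
The two metrics reported above can be computed as in the following minimal sketch. The labels are hypothetical and are not the paper's data. The abstract does not say which F1 averaging the authors used, so the plain binary F1 for the "snow" class is an assumption.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the snow class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical test labels: 1 = snow, 0 = no snow.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

With a balanced test set, as described in the abstract, accuracy and F1 tend to land close together, which is consistent with the near-identical scores reported.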

[CV-87] A Deep Learning-based Pest Insect Monitoring System for Ultra-low Power Pocket-sized Drones

链接: https://arxiv.org/abs/2407.00815
作者: Luca Crupi,Luca Butera,Alberto Ferrante,Daniele Palossi
关键词: agriculture represent game-changer, represent game-changer technologies, precision agriculture represent, sustainable agribusiness, Smart farming
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Smart farming and precision agriculture represent game-changer technologies for efficient and sustainable agribusiness. Miniaturized palm-sized drones can act as flexible smart sensors inspecting crops, looking for early signs of potential pest outbreaks. However, achieving such an ambitious goal requires hardware-software codesign to develop accurate deep learning (DL) detection models while keeping memory and computational needs under an ultra-tight budget, i.e., a few MB on-chip memory and a few 100s mW power envelope. This work presents a novel vertically integrated solution featuring two ultra-low power System-on-Chips (SoCs), i.e., the dual-core STM32H74 and a multi-core GWT GAP9, running two State-of-the-Art DL models for detecting the Popillia japonica bug. We fine-tune both models for our image-based detection task, quantize them in 8-bit integers, and deploy them on the two SoCs. On the STM32H74, we deploy a FOMO-MobileNetV2 model, achieving a mean average precision (mAP) of 0.66 and running at 16.1 frame/s within 498 mW. On the GAP9 SoC, we deploy a more complex SSDLite-MobileNetV3, which scores an mAP of 0.79 and peaks at 6.8 frame/s within 33 mW. Compared to a top-notch RetinaNet-ResNet101-FPN full-precision baseline, which requires 14.9x more memory and 300x more operations per inference, our best model drops only 15% in mAP, paving the way toward autonomous palm-sized drones capable of lightweight and precise pest detection.
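
As a rough illustration of the 8-bit quantization step mentioned above, the sketch below applies symmetric per-tensor quantization to a few hypothetical weights. The abstract does not describe the authors' actual quantization scheme or toolchain, so the symmetric scheme and the function names here are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, not taken from any model in the paper.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(r, 4) for r in restored])
```

Each weight is then stored in one byte instead of four, at the cost of a rounding error of at most half the scale per value.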

[CV-88] Controlling Faces Frame generation in StyleGANs latent space operations: Modifying faces to deceive our memory

链接: https://arxiv.org/abs/2407.00803
作者: Agustín Roca,Nicolás Ignacio Britos
关键词: Innocence Project, reducing wrongful convictions, non-profitable organization, Buenos Aires, Laboratorio de Sueño
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Innocence Project is a non-profit organization that works on reducing wrongful convictions. In collaboration with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA), they are studying human memory in the context of face identification. They have a strong hypothesis stating that human memory relies heavily on the face’s frame to recognize faces. If this is proved, it could mean that face recognition in police lineups cannot be trusted, as it may lead to wrongful convictions. This study uses experiments to try to prove this using faces with different properties, such as eye size, while maintaining the frame as much as possible. In this project, we continue the work from a previous project that provided the basic tool to generate realistic faces using StyleGAN2. We take a deep dive into the internals of this tool to make full use of StyleGAN2 functionalities, while also adding more features, such as modifying certain of its attributes, including mouth-opening or eye-opening. As the usage of this tool heavily relies on maintaining the face-frame, we develop a way to identify the face-frame of each image and a function to compare it to the output of the neural network after applying some operations. We conclude that the face-frame is maintained when modifying eye-opening or mouth-opening. Modifying vertical face orientation, gender, age and smile has a considerable impact on frame variation. Finally, the horizontal face orientation shows a major impact on the face-frame. This way, the Lab may apply some operations being confident that the face-frame won’t significantly change, making them viable to be used to deceive subjects’ memories.

[CV-89] InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation

链接: https://arxiv.org/abs/2407.00788
作者: Haofan Wang,Peng Xing,Renyuan Huang,Hao Ai,Qixun Wang,Xu Bai
关键词: inventive process designed, Style, designed to create, maintains the essence, content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report

点击查看摘要

Abstract:Style transfer is an inventive process designed to create an image that maintains the essence of the original while embracing the visual style of another. Although diffusion models have demonstrated impressive generative power in personalized subject-driven or style-driven applications, existing state-of-the-art methods still encounter difficulties in achieving a seamless balance between content preservation and style enhancement. For example, amplifying the style’s influence can often undermine the structural integrity of the content. To address these challenges, we deconstruct the style transfer task into three core elements: 1) Style, focusing on the image’s aesthetic characteristics; 2) Spatial Structure, concerning the geometric arrangement and composition of visual elements; and 3) Semantic Content, which captures the conceptual meaning of the image. Guided by these principles, we introduce InstantStyle-Plus, an approach that prioritizes the integrity of the original content while seamlessly integrating the target style. Specifically, our method accomplishes style injection through an efficient, lightweight process, utilizing the cutting-edge InstantStyle framework. To reinforce the content preservation, we initiate the process with an inverted content latent noise and a versatile plug-and-play tile ControlNet for preserving the original image’s intrinsic layout. We also incorporate a global semantic adapter to enhance the semantic content’s fidelity. To safeguard against the dilution of style information, a style extractor is employed as discriminator for providing supplementary style guidance. Codes will be available at this https URL.

[CV-90] Diffusion Models and Representation Learning: A Survey

链接: https://arxiv.org/abs/2407.00783
作者: Michael Fuest,Pingchuan Ma,Ming Gui,Johannes S. Fischer,Vincent Tao Hu,Bjorn Ommer
关键词: attracting significant attention, Diffusion Models, popular generative modeling, generative modeling methods, attracting significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Github Repo: this https URL

点击查看摘要

Abstract:Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration. Github link: this https URL

[CV-91] Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation

链接: https://arxiv.org/abs/2407.00752
作者: Peng Huang,Xue Gao,Lihong Huang,Jing Jiao,Xiaokang Li,Yuanyuan Wang,Yi Guo
关键词: important implications, diverse and controllable, Stable Diffusion, adapt Stable Diffusion, common stable diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Text-to-image generation has important implications for generation of diverse and controllable images. Several attempts have been made to adapt Stable Diffusion (SD) to the medical domain. However, the large distribution difference between medical reports and natural texts, as well as the high computational complexity of common Stable Diffusion, limit the authenticity and feasibility of the generated medical images. To solve the above problems, we propose a novel light-weight transformer-based diffusion model learning framework, Chest-Diffusion, for report-to-CXR generation. Chest-Diffusion employs a domain-specific text encoder to obtain accurate and expressive text features to guide image generation, improving the authenticity of the generated images. Meanwhile, we introduce a light-weight transformer architecture as the denoising model, reducing the computational complexity of the diffusion model. Experiments demonstrate that our Chest-Diffusion achieves the lowest FID score of 24.456 under a computation budget of 118.918 GFLOPs, which is nearly one-third of the computational complexity of SD.

[CV-92] PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph

链接: https://arxiv.org/abs/2407.00742
作者: Dazhou Yu,Yuntong Hu,Yun Li,Liang Zhao
关键词: building pattern classification, geographic question answering, encompassing tasks, shape coding, building pattern
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Polygon representation learning is essential for diverse applications, encompassing tasks such as shape coding, building pattern classification, and geographic question answering. While recent years have seen considerable advancements in this field, much of the focus has been on single polygons, overlooking the intricate inner- and inter-polygonal relationships inherent in multipolygons. To address this gap, our study introduces a comprehensive framework specifically designed for learning representations of polygonal geometries, particularly multipolygons. Central to our approach is the incorporation of a heterogeneous visibility graph, which seamlessly integrates both inner- and inter-polygonal relationships. To enhance computational efficiency and minimize graph redundancy, we implement a heterogeneous spanning tree sampling method. Additionally, we devise a rotation-translation invariant geometric representation, ensuring broader applicability across diverse scenarios. Finally, we introduce Multipolygon-GNN, a novel model tailored to leverage the spatial and semantic heterogeneity inherent in the visibility graph. Experiments on five real-world and synthetic datasets demonstrate its ability to capture informative representations for polygonal geometries.

[CV-93] Engineering an Efficient Object Tracker for Non-Linear Motion

链接: https://arxiv.org/abs/2407.00738
作者: Momir Adžemović,Predrag Tadić,Andrija Petrović,Mladen Nikolić
关键词: maintaining unique identifiers, video frames, detect and track, scene while maintaining, maintaining unique
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 3 figures, 20 tables

点击查看摘要

Abstract:The goal of multi-object tracking is to detect and track all objects in a scene while maintaining unique identifiers for each, by associating their bounding boxes across video frames. This association relies on matching motion and appearance patterns of detected objects. This task is especially hard in case of scenarios involving dynamic and non-linear motion patterns. In this paper, we introduce DeepMoveSORT, a novel, carefully engineered multi-object tracker designed specifically for such scenarios. In addition to standard methods of appearance-based association, we improve motion-based association by employing deep learnable filters (instead of the most commonly used Kalman filter) and a rich set of newly proposed heuristics. Our improvements to motion-based association methods are severalfold. First, we propose a new transformer-based filter architecture, TransFilter, which uses an object’s motion history for both motion prediction and noise filtering. We further enhance the filter’s performance by careful handling of its motion history and accounting for camera motion. Second, we propose a set of heuristics that exploit cues from the position, shape, and confidence of detected bounding boxes to improve association performance. Our experimental evaluation demonstrates that DeepMoveSORT outperforms existing trackers in scenarios featuring non-linear motion, surpassing state-of-the-art results on three such datasets. We also perform a thorough ablation study to evaluate the contributions of different tracker components which we proposed. Based on our study, we conclude that using a learnable filter instead of the Kalman filter, along with appearance-based association is key to achieving strong general tracking performance.

[CV-94] LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

链接: https://arxiv.org/abs/2407.00737
作者: Mushui Liu,Yuhang Ma,Xinfeng Zhang,Yang Zhen,Zeng Zhao,Zhipeng Hu,Bai Liu,Changjie Fan
关键词: exhibited substantial success, Large Language Models, Diffusion Models, exhibited substantial, substantial success
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 13 figures

点击查看摘要

Abstract:Diffusion Models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts that involve multiple objects, attribute binding, and long descriptions. This paper proposes a framework called LLM4GEN, which enhances the semantic understanding ability of text-to-image diffusion models by leveraging the semantic representation of Large Language Models (LLMs). Through a specially designed Cross-Adapter Module (CAM) that combines the original text features of text-to-image models with LLM features, LLM4GEN can be easily incorporated into various diffusion models as a plug-and-play component and enhances text-to-image generation. Additionally, to facilitate the complex and dense prompts semantic understanding, we develop a LAION-refined dataset, consisting of 1 million (M) text-image pairs with improved image descriptions. We also introduce DensePrompts which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. With just 10% of the training data required by recent ELLA, LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 7.69% and 9.60% in color on T2I-CompBench, respectively. The extensive experiments on DensePrompts also demonstrate that LLM4GEN surpasses existing state-of-the-art models in terms of sample quality, image-text alignment, and human evaluation. The project website is at: this https URL

[CV-95] CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation

链接: https://arxiv.org/abs/2407.00697
作者: Huawei Sun,Hao Feng,Julius Ott,Lorenzo Servadei,Robert Wille
关键词: Depth estimation, scenes accurately, driving for interpreting, critical in autonomous, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: Accepted by IROS 2024

点击查看摘要

Abstract:Depth estimation is critical in autonomous driving for interpreting 3D scenes accurately. Recently, radar-camera depth estimation has become of sufficient interest due to the robustness and low-cost properties of radar. Thus, this paper introduces a two-stage, end-to-end trainable Confidence-aware Fusion Net (CaFNet) for dense depth estimation, combining RGB imagery with sparse and noisy radar point cloud data. The first stage addresses radar-specific challenges, such as ambiguous elevation and noisy measurements, by predicting a radar confidence map and a preliminary coarse depth map. A novel approach is presented for generating the ground truth for the confidence map, which involves associating each radar point with its corresponding object to identify potential projection surfaces. These maps, together with the initial radar input, are processed by a second encoder. For the final depth estimation, we innovate a confidence-aware gated fusion mechanism to integrate radar and image features effectively, thereby enhancing the reliability of the depth map by filtering out radar noise. Our methodology, evaluated on the nuScenes dataset, demonstrates superior performance, improving upon the current leading model by 3.2% in Mean Absolute Error (MAE) and 2.7% in Root Mean Square Error (RMSE).
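
For reference, the two error metrics quoted above can be computed as in this minimal sketch. The depth values are hypothetical, and treating pixels with zero ground truth as invalid is an assumption about the evaluation protocol rather than something stated in the abstract.

```python
import math


def depth_errors(pred, gt):
    """Return (MAE, RMSE) over pixels that have valid ground truth (gt > 0)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    abs_err = [abs(p - g) for p, g in pairs]
    mae = sum(abs_err) / len(pairs)
    rmse = math.sqrt(sum(e * e for e in abs_err) / len(pairs))
    return mae, rmse


# Hypothetical flattened depth maps in metres; the last pixel has no ground truth.
pred = [10.0, 22.0, 31.0, 5.0]
gt = [12.0, 20.0, 30.0, 0.0]
mae, rmse = depth_errors(pred, gt)
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```

RMSE squares each error before averaging, so it penalizes large outliers more heavily than MAE, which is why depth papers commonly report both.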

[CV-96] Multi-Task Learning for Affect Analysis

链接: https://arxiv.org/abs/2407.00679
作者: Fazeel Asim
关键词: Undergraduate Final Year, Final Year dissertation, Undergraduate Final, Final Year, Dimitrios Kollias
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This Project was my Undergraduate Final Year dissertation, supervised by Dimitrios Kollias. This research delves into the realm of affective computing for image analysis, aiming to enhance the efficiency and effectiveness of multi-task learning in the context of emotion recognition. This project investigates two primary approaches: uni-task solutions and a multi-task approach to the same problems. Each approach undergoes testing, exploring various formulations, variations, and initialization strategies to come up with the best configuration. The project utilizes an existing neural network architecture, adapting it for multi-task learning by modifying output layers and loss functions. Tasks encompass 7 basic emotion recognition, action unit detection, and valence-arousal estimation. Comparative analyses involve uni-task models for each individual task, facilitating the assessment of multi-task model performance. Variations within each approach, including loss functions and hyperparameter tuning, undergo evaluation. The impact of different initialization strategies and pre-training techniques on model convergence and accuracy is explored. The research aspires to contribute to the burgeoning field of affective computing, with applications spanning healthcare, marketing, and human-computer interaction. By systematically exploring multi-task learning formulations, this research aims to contribute to the development of more accurate and efficient models for recognizing and understanding emotions in images. The findings hold promise for applications in diverse industries, paving the way for advancements in affective computing.

[CV-97] Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

链接: https://arxiv.org/abs/2407.00676
作者: Yuchuan Tian,Jianhong Han,Hanting Chen,Yuanyuan Xi,Guoyang Zhang,Jie Hu,Chao Xu,Yunhe Wang
关键词: low-level vision, low-level vision tasks, low-level vision models, handful of low-level, intensive computation costs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT – an All-in-One Image Processing Transformer that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we figure out task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at this https URL.
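
The idea of adding a low-rank, task-specific bias to a shared weight matrix can be sketched as follows. This is an illustration under assumptions and not the authors' implementation: the function names, the tiny matrices, and the rank-1 factors are all hypothetical.

```python
def matmul(a, b):
    """Multiply matrix a (m x r) by matrix b (r x n), both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def modulated_weight(base, down, up):
    """Return base + down @ up: shared weights plus a low-rank task bias."""
    delta = matmul(down, up)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


# Shared 2x2 backbone weight and hypothetical rank-1 factors per task.
base = [[1.0, 0.0], [0.0, 1.0]]
task_factors = {
    "denoise": ([[1.0], [0.0]], [[0.5, 0.5]]),
    "derain": ([[0.0], [1.0]], [[-1.0, 2.0]]),
}
for task, (down, up) in task_factors.items():
    print(task, modulated_weight(base, down, up))
```

Storing the two thin factors instead of a full bias matrix keeps the per-task overhead small, which matches the abstract's motivation for low-rank decomposition.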

[CV-98] Resolving Variable Respiratory Motion From Unsorted 4D Computed Tomography

链接: https://arxiv.org/abs/2407.00665
作者: Yuliang Huang,Bjoern Eiben,Kris Thielemans,Jamie R. McClelland
关键词: Computed Tomography, radiotherapy treatment planning, PET and ventilation, downstream clinical applications, clinical applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:4D Computed Tomography (4DCT) is widely used for many clinical applications such as radiotherapy treatment planning, PET and ventilation imaging. However, common 4DCT methods reconstruct multiple breath cycles into a single, arbitrary breath cycle which can lead to various artefacts, impacting the downstream clinical applications. Surrogate driven motion models can estimate continuous variable motion across multiple cycles based on CT segments ‘unsorted’ from 4DCT, but they require respiration surrogate signals with strong correlation to the internal motion, which are not always available. The method proposed in this study eliminates such dependency by adapting the hyper-gradient method to the optimization of surrogate signals as hyper-parameters, while achieving better or comparable performance, as demonstrated on digital phantom simulations and real patient data. Our method produces a high-quality motion-compensated image together with estimates of the motion, including breath-to-breath variability, throughout the image acquisition. Our method has the potential to improve downstream clinical applications, and also enables retrospective analysis of open access 4DCT datasets where no respiration signals are stored. Code is available at this https URL.

[CV-99] SCMIL: Sparse Context-aware Multiple Instance Learning for Predicting Cancer Survival Probability Distribution in Whole Slide Images

链接: https://arxiv.org/abs/2407.00664
作者: Zekang Yang,Hong Liu,Xiangdong Wang
关键词: Slide Image, Cancer survival prediction, Cancer survival, involves analyzing, tumor microenvironment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: MICCAI2024

点击查看摘要

Abstract:Cancer survival prediction is a challenging task that involves analyzing the tumor microenvironment within Whole Slide Images (WSIs). Previous methods cannot effectively capture the intricate interaction features among instances within the local area of a WSI. Moreover, existing methods for cancer survival prediction based on WSIs often fail to provide clinically meaningful predictions. To overcome these challenges, we propose a Sparse Context-aware Multiple Instance Learning (SCMIL) framework for predicting cancer survival probability distributions. SCMIL innovatively segments patches into various clusters based on their morphological features and spatial location information, subsequently leveraging sparse self-attention to discern the relationships between these patches with a context-aware perspective. Considering many patches are irrelevant to the task, we introduce a learnable patch filtering module called SoftFilter, which ensures that only interactions between task-relevant patches are considered. To enhance the clinical relevance of our prediction, we propose a register-based mixture density network to forecast the survival probability distribution for individual patients. We evaluate SCMIL on two public WSI datasets from The Cancer Genome Atlas (TCGA), specifically focusing on lung adenocarcinoma (LUAD) and kidney renal clear cell carcinoma (KIRC). Our experimental results indicate that SCMIL outperforms current state-of-the-art methods for survival prediction, offering more clinically meaningful and interpretable outcomes. Our code is accessible at this https URL.

[CV-100] Tarsier: Recipes for Training and Evaluating Large Video Description Models

链接: https://arxiv.org/abs/2407.00634
作者: Jiawei Wang,Liping Yuan,Yuchen Zhang
关键词: Generating fine-grained video, Generating fine-grained, fundamental challenge, video, Generating
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a +51.4% advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a +12.3% advantage against GPT-4V and a -6.7% disadvantage against Gemini 1.5 Pro. Besides video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is the introduction of a new benchmark for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at this https URL.

[CV-101] DEAR: Disentangled Environment and Agent Representations for Reinforcement Learning without Reconstruction

链接: https://arxiv.org/abs/2407.00633
作者: Ameya Pore,Riccardo Muradore,Diego Dall’Alba
关键词: Reinforcement Learning, amount of data, complex and unstructured, large amount, scene is complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 8 figures, 2 tables. Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

点击查看摘要

Abstract:Reinforcement Learning (RL) algorithms can learn robotic control tasks from visual observations, but they often require a large amount of data, especially when the visual scene is complex and unstructured. In this paper, we explore how the agent’s knowledge of its shape can improve the sample efficiency of visual RL methods. We propose a novel method, Disentangled Environment and Agent Representations (DEAR), that uses the segmentation mask of the agent as supervision to learn disentangled representations of the environment and the agent through feature separation constraints. Unlike previous approaches, DEAR does not require reconstruction of visual observations. These representations are then used as an auxiliary loss to the RL objective, encouraging the agent to focus on the relevant features of the environment. We evaluate DEAR on two challenging benchmarks: Distracting DeepMind control suite and Franka Kitchen manipulation tasks. Our findings demonstrate that DEAR surpasses state-of-the-art methods in sample efficiency, achieving comparable or superior performance with reduced parameters. Our results indicate that integrating agent knowledge into visual RL methods has the potential to enhance their learning efficiency and robustness.

[CV-102] CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations

Link: https://arxiv.org/abs/2407.00632
Authors: Pengying Wu,Yao Mu,Kangjie Zhou,Ji Ma,Junting Chen,Chang Liu
Keywords: Visual navigation tasks, Visual navigation, household service robots, Visual, service robots
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: Accepted to the RSS 2024 Workshop: GROUND

Abstract:Visual navigation tasks are critical for household service robots. As these tasks become increasingly complex, effective communication and collaboration among multiple robots become imperative to ensure successful completion. In recent years, large language models (LLMs) have exhibited remarkable comprehension and planning abilities in the context of embodied agents. However, their application in household scenarios, specifically in the use of multiple agents collaborating to complete complex navigation tasks through communication, remains unexplored. Therefore, this paper proposes a framework for decentralized multi-agent navigation, leveraging LLM-enabled communication and collaboration. By designing the communication-triggered dynamic leadership organization structure, we achieve faster team consensus with fewer communication instances, leading to better navigation effectiveness and collaborative exploration efficiency. With the proposed novel communication scheme, our framework promises to be conflict-free and robust in multi-object navigation tasks, even when there is a surge in team size.

[CV-103] Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness

Link: https://arxiv.org/abs/2407.00623
Authors: Yiquan Li,Zhongzhu Chen,Kun Jin,Jiongxiao Wang,Bo Li,Chaowei Xiao
Keywords: purifying noised images, Denoising Diffusion Probabilistic, Diffusion Probabilistic Model, purified images, Stochastic Diffusion Model
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion Purification, purifying noised images with diffusion models, has been widely used for enhancing certified robustness via randomized smoothing. However, existing frameworks often grapple with the balance between efficiency and effectiveness. While the Denoising Diffusion Probabilistic Model (DDPM) offers an efficient single-step purification, it falls short in ensuring purified images reside on the data manifold. Conversely, the Stochastic Diffusion Model effectively places purified images on the data manifold but demands solving cumbersome stochastic differential equations, while its derivative, the Probability Flow Ordinary Differential Equation (PF-ODE), though solving simpler ordinary differential equations, still requires multiple computational steps. In this work, we demonstrate that an ideal purification pipeline should, for effectiveness, generate purified images that lie on the data manifold and are as semantically aligned with the original images as possible, and, for efficiency, do so in one step. Therefore, we introduce Consistency Purification, a purifier that is Pareto superior in efficiency and effectiveness to previous work. Consistency Purification employs the consistency model, a one-step generative model distilled from PF-ODE, and thus can generate on-manifold purified images with a single network evaluation. However, the consistency model is not designed for purification, so it does not inherently ensure semantic alignment between purified and original images. To resolve this issue, we further refine it through Consistency Fine-tuning with LPIPS loss, which enables more aligned semantic meaning while keeping the purified images on the data manifold. Our comprehensive experiments demonstrate that our Consistency Purification framework achieves state-of-the-art certified robustness and efficiency compared to baseline methods.
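
As a rough illustration of the setting this abstract describes, the sketch below shows the prediction step of randomized smoothing with a one-step purifier in the loop: add Gaussian noise, purify once, classify, and take a majority vote. The names `smoothed_predict`, `toy_purify`, and `toy_classify` are hypothetical stand-ins invented for this example; the real method would use a consistency model as the purifier and also compute a certified radius, which is omitted here.

```python
import random
from collections import Counter
from typing import Callable, List


def smoothed_predict(
    x: List[float],
    purify: Callable[[List[float]], List[float]],
    classify: Callable[[List[float]], int],
    sigma: float,
    n_samples: int,
    seed: int = 0,
) -> int:
    """Majority-vote prediction of the smoothed classifier classify(purify(x + noise))."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_samples):
        # Gaussian perturbation used by randomized smoothing
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        # One purification call per sample (a single network evaluation in the real setting)
        votes[classify(purify(noisy))] += 1
    return votes.most_common(1)[0][0]


def toy_purify(x: List[float]) -> List[float]:
    """Toy stand-in for a one-step purifier: clamp values back to [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in x]


def toy_classify(x: List[float]) -> int:
    """Toy stand-in classifier: class 1 if the mean intensity exceeds 0.5."""
    return int(sum(x) / len(x) > 0.5)


label = smoothed_predict([0.9, 0.8, 0.95, 0.85], toy_purify, toy_classify,
                         sigma=0.25, n_samples=200)
print(label)
```

The cost of each prediction is `n_samples` purifier calls, which is why a single-evaluation purifier matters for efficiency in this setting.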

[CV-104] Learning Granularity-Aware Affordances from Human-Object Interaction for Tool-Based Functional Grasping in Dexterous Robotics

Link: https://arxiv.org/abs/2407.00614
Authors: Fan Yang,Wenrui Chen,Kailun Yang,Haoran Lin,DongSheng Luo,Conghui Tang,Zhiyong Li,Yaonan Wang
Keywords: touching specific areas, specific areas precisely, Affordance, initial step, step is teaching
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: The source code and the established dataset will be made publicly available at this https URL

Abstract:To enable robots to use tools, the initial step is teaching robots to employ dexterous gestures for touching specific areas precisely where tasks are performed. Affordance features of objects serve as a bridge in the functional interaction between agents and objects. However, leveraging these affordance cues to help robots achieve functional tool grasping remains unresolved. To address this, we propose a granularity-aware affordance feature extraction method for locating functional affordance areas and predicting dexterous coarse gestures. We study the intrinsic mechanisms of human tool use. On the one hand, we use fine-grained affordance features of object-functional finger contact areas to locate functional affordance regions. On the other hand, we use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures. Additionally, we introduce a model-based post-processing module that includes functional finger coordinate localization, finger-to-end coordinate transformation, and force feedback-based coarse-to-fine grasping. This forms a complete dexterous robotic functional grasping framework GAAF-Dex, which learns Granularity-Aware Affordances from human-object interaction for tool-based Functional grasping in Dexterous Robotics. Unlike fully-supervised methods that require extensive data annotation, we employ a weakly supervised approach to extract relevant cues from exocentric (Exo) images of hand-object interactions to supervise feature extraction in egocentric (Ego) images. We have constructed a small-scale dataset, FAH, which includes nearly 6K Exo and Ego images of functional hand-object interaction, covering 18 commonly used tools performing 6 tasks. Extensive experiments on the dataset demonstrate our method outperforms state-of-the-art methods. The code will be made publicly available at this https URL.

[CV-105] ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Link: https://arxiv.org/abs/2407.00609
Authors: Quang P.M. Pham,Khoi T.N. Nguyen,Lan C. Ngo,Truong Do,Truong Son Hy
Keywords: understanding tasks due, explicit nature, scene understanding tasks, tasks due, compact and explicit
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.

[CV-106] Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Link: https://arxiv.org/abs/2407.00608
Authors: Shian Du,Xiaotian Cheng,Qi Qian,Henglu Wei,Yi Xu,Xiangyang Ji
Keywords: attracted unprecedented attention, generating highly-personalized images, input concept dataset, textual prompt, input textual prompt
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Personalized text-to-image generation has attracted unprecedented attention in the past few years due to its unique capability of generating highly personalized images from an input concept dataset and a novel textual prompt. However, previous methods focus solely on the performance of the reconstruction task, degrading their ability to combine the concept with different textual prompts. Besides, optimizing in the high-dimensional embedding space usually leads to an unnecessarily time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding not only faithfully reconstructs the input image, but also significantly improves its alignment with a novel input textual prompt. Furthermore, we observe that optimizing in the textual subspace leads to a significant improvement in robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.
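
One way to read the "textual subspace" idea is that the concept embedding is written as a weighted combination of a few basis word embeddings, so only the weights need to be learned rather than the full high-dimensional vector. The sketch below shows just that parameterization. The function name `combine`, the 4-dimensional toy basis, and the weight values are all hypothetical; the paper's basis selection strategy and training loop are not reproduced here.

```python
from typing import List


def combine(basis: List[List[float]], weights: List[float]) -> List[float]:
    """Return the embedding sum_i weights[i] * basis[i]."""
    if len(basis) != len(weights):
        raise ValueError("need exactly one weight per basis vector")
    dim = len(basis[0])
    return [sum(w * b[j] for w, b in zip(weights, basis)) for j in range(dim)]


# Two hypothetical basis word embeddings in a 4-dimensional toy space
basis = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
weights = [0.5, 2.0]  # the only learnable parameters in this parameterization

embedding = combine(basis, weights)
print(embedding)  # [0.5, 2.0, 2.0, 0.0]
```

In this toy case the search space shrinks from 4 numbers to 2. With real text encoders the embedding dimension is typically in the hundreds or more, so the reduction would be much larger.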

[CV-107] Hierarchical Memory for Long Video QA

Link: https://arxiv.org/abs/2407.00603
Authors: Yiqin Wang,Haoji Zhang,Yansong Tang,Yong Liu,Jiashi Feng,Jifeng Dai,Xiaojie Jin
Keywords: Long Video VQA, LOVEU Challenge, Video VQA, paper describes, describes our champion
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper describes our champion solution to the LOVEU Challenge @ CVPR’24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency, while preserving the essential information for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further utilize the video and audio data of MovieChat-1K training set to fine-tune the pretrained weight released by Flash-VStream, achieving 1st place in the challenge. Code is available at project homepage this https URL

[CV-108] GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing

Link: https://arxiv.org/abs/2407.00600
Authors: Yisong Xiao,Aishan Liu,QianJia Cheng,Zhenfei Yin,Siyuan Liang,Jiapeng Li,Jing Shao,Xianglong Liu,Dacheng Tao
Keywords: Large Vision-Language Models, Large Vision-Language, exhibit significant gender, widely adopted, exhibit significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract:Large Vision-Language Models (LVLMs) have been widely adopted in various applications; however, they exhibit significant gender biases. Existing benchmarks primarily evaluate gender bias at the demographic group level, neglecting individual fairness, which emphasizes equal treatment of similar individuals. This research gap limits the detection of discriminatory behaviors, as individual fairness offers a more granular examination of biases that group fairness may overlook. For the first time, this paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in LVLMs using counterfactual visual questions under individual fairness criteria. To construct this benchmark, we first utilize text-to-image diffusion models to generate occupation images and their gender counterfactuals. Subsequently, we generate corresponding textual occupation options by identifying stereotyped occupation pairs with high semantic similarity but opposite gender proportions in real-world statistics. This method enables the creation of large-scale visual question counterfactuals to expose biases in LVLMs, applicable in both multimodal and unimodal contexts through modifying gender attributes in specific modalities. Overall, our GenderBias-VL benchmark comprises 34,581 visual question counterfactual pairs, covering 177 occupations. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs (e.g., LLaVA) and state-of-the-art commercial APIs, including GPT-4o and Gemini-Pro. Our findings reveal widespread gender biases in existing LVLMs. Our benchmark offers: (1) a comprehensive dataset for occupation-related gender bias evaluation; (2) an up-to-date leaderboard on LVLM biases; and (3) a nuanced understanding of the biases presented by these models. The dataset and code are available at this https URL.

[CV-109] Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP

Link: https://arxiv.org/abs/2407.00592
Authors: Ayush Ranjan,Daniel Wen,Karthik Bhat
Keywords: responsible application, CLIP, Discrepancy Analysis Framework, Transformative Caption Analysis, CLIP image comprehension
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Understanding the limitations and weaknesses of state-of-the-art models in artificial intelligence is crucial for their improvement and responsible application. In this research, we focus on CLIP, a model renowned for its integration of vision and language processing. Our objective is to uncover recurring problems and blind spots in CLIP’s image comprehension. By delving into both the commonalities and disparities between CLIP and human image understanding, we augment our comprehension of these models’ capabilities. Through our analysis, we reveal significant discrepancies in CLIP’s interpretation of images compared to human perception, shedding light on areas requiring improvement. Our methodologies, the Discrepancy Analysis Framework (DAF) and the Transformative Caption Analysis for CLIP (TCAC), enable a comprehensive evaluation of CLIP’s performance. We identify 14 systemic faults, including Action vs. Stillness confusion, Failure to identify the direction of movement or positioning of objects in the image, Hallucination of Water-like Features, Misattribution of Geographic Context, among others. By addressing these limitations, we lay the groundwork for the development of more accurate and nuanced image embedding models, contributing to advancements in artificial intelligence.

[CV-110] OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

Link: https://arxiv.org/abs/2407.00574
Authors: Fengyuan Yang,Kerui Gu,Ha Linh Nguyen,Angela Yao
Keywords: motion, camera motion, scale factor, unknown scale factor, Simultaneous Localization
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures, 4 tables

Abstract:Accurate camera motion estimation is critical to estimate human motion in the global space. A standard and widely used method for estimating camera motion is Simultaneous Localization and Mapping (SLAM). However, SLAM only provides a trajectory up to an unknown scale factor. Different from previous attempts that optimize the scale factor, this paper presents Optimization-free Camera Motion Scale Calibration (OfCaM), a novel framework that utilizes prior knowledge from human mesh recovery (HMR) models to directly calibrate the unknown scale factor. Specifically, OfCaM leverages the absolute depth of human-background contact joints from HMR predictions as a calibration reference, enabling the precise recovery of SLAM camera trajectory scale in global space. With this correctly scaled camera motion and HMR’s local motion predictions, we achieve more accurate global human motion estimation. To compensate for scenes where we detect SLAM failure, we adopt a local-to-global motion mapping to fuse with previously derived motion to enhance robustness. Simple yet powerful, our method sets a new standard for global human mesh estimation tasks, reducing global human motion error by 60% over the prior SOTA while also demanding orders of magnitude less inference time compared with optimization-based methods.
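
The calibration idea in this abstract can be illustrated with a small sketch: if an HMR model gives metric depths for contact joints and SLAM gives depths for the same points in its own arbitrary units, their ratio estimates the missing scale factor. Everything below is an assumption made for illustration, including the function names, the sample numbers, and the use of a median over ratios as the aggregation rule; the abstract does not specify how OfCaM aggregates the references.

```python
from statistics import median
from typing import List, Sequence, Tuple


def calibrate_scale(hmr_depths: Sequence[float], slam_depths: Sequence[float]) -> float:
    """Estimate the metric scale as the median ratio of HMR depth to SLAM depth."""
    ratios = [h / s for h, s in zip(hmr_depths, slam_depths) if s > 0]
    if not ratios:
        raise ValueError("no valid contact-joint depth pairs")
    return median(ratios)


def rescale_trajectory(
    positions: Sequence[Tuple[float, float, float]], scale: float
) -> List[Tuple[float, ...]]:
    """Multiply every camera position by the calibrated scale factor."""
    return [tuple(scale * c for c in p) for p in positions]


# Hypothetical depths of three contact joints
hmr_depths = [4.0, 6.1, 5.0]   # metres, from an HMR model
slam_depths = [2.0, 3.0, 2.5]  # same points, in SLAM's arbitrary units

scale = calibrate_scale(hmr_depths, slam_depths)
trajectory = rescale_trajectory([(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)], scale)
print(scale)       # 2.0
print(trajectory)  # [(0.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
```

A closed-form estimate like this needs no iterative optimization, which is consistent with the abstract's claim of much lower inference time than optimization-based methods.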

[CV-111] Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

Link: https://arxiv.org/abs/2407.00569
Authors: Weihong Zhong,Xiaocheng Feng,Liang Zhao,Qiming Li,Lei Huang,Yuxuan Gu,Weitao Ma,Yuan Xu,Bing Qin
Keywords: Large Vision-Language Models, Large Vision-Language, understanding visual information, human languages, generated hallucinations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Main Conference. 21 pages, 20 figures

Abstract:Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs’ subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs’ behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31%, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than 24% of the snowballed multimodal hallucination while maintaining capabilities.
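
The abstract does not give the formula used by Residual Visual Decoding, so the following is only a guess at the general shape of such a training-free correction: blend the next-token logits obtained with the full (possibly hallucinated) conversation with logits obtained from the visual input and the current query alone. The interpolation rule, the weight `alpha`, and the toy logits are all assumptions, not the paper's method.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def blend_distributions(full_ctx: List[float], residual: List[float], alpha: float) -> List[float]:
    """Interpolate the two logit vectors, then normalize (assumed rule, see lead-in)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = [(1.0 - alpha) * f + alpha * r for f, r in zip(full_ctx, residual)]
    return softmax(mixed)


# Toy vocabulary ["yes", "no"]: the hallucinated dialogue pushes the model toward
# "yes", while the image plus the question alone supports "no".
full_ctx_logits = [2.0, 0.0]
residual_logits = [0.0, 2.0]

before = blend_distributions(full_ctx_logits, residual_logits, alpha=0.0)
after = blend_distributions(full_ctx_logits, residual_logits, alpha=0.75)
print(before.index(max(before)))  # 0 -> "yes"
print(after.index(max(after)))    # 1 -> "no"
```

Because the correction only touches the output distribution at decoding time, no model weights change, which matches the "training-free" description.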

[CV-112] Explaining Chest X-ray Pathology Models using Textual Concepts

Link: https://arxiv.org/abs/2407.00557
Authors: Vijay Sadashivaiah,Mannudeep K. Kalra,Pingkun Yan,James A. Hendler
Keywords: Deep learning models, opaque nature poses, nature poses challenges, Deep learning, revolutionized medical imaging
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning models have revolutionized medical imaging and diagnostics, yet their opaque nature poses challenges for clinical adoption and trust. Amongst approaches to improve model interpretability, concept-based explanations aim to provide concise and human understandable explanations of any arbitrary classifier. However, such methods usually require a large amount of manually collected data with concept annotation, which is often scarce in the medical domain. In this paper, we propose Conceptual Counterfactual Explanations for Chest X-ray (CoCoX) that leverage existing vision-language models (VLM) joint embedding space to explain black-box classifier outcomes without the need for annotated datasets. Specifically, we utilize textual concepts derived from chest radiography reports and a pre-trained chest radiography-based VLM to explain three common cardiothoracic pathologies. We demonstrate that the explanations generated by our method are semantically meaningful and faithful to underlying pathologies.

[CV-113] Privacy-Preserving and Trustworthy Deep Learning for Medical Imaging

Link: https://arxiv.org/abs/2407.00538
Authors: Kiarash Sedghighadikolaei,Attila A Yavuz
Keywords: Deep Radiomics, Deep Radiomics pipeline, impacted healthcare systems, Machine Learning, notably impacted healthcare
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:The shift towards efficient and automated data analysis through Machine Learning (ML) has notably impacted healthcare systems, particularly Radiomics. Radiomics leverages ML to analyze medical images accurately and efficiently for precision medicine. Current methods rely on Deep Learning (DL) to improve performance and accuracy (Deep Radiomics). Given the sensitivity of medical images, ensuring privacy throughout the Deep Radiomics pipeline, from data generation and collection to model training and inference, is essential, especially when outsourced. Thus, Privacy-Enhancing Technologies (PETs) are crucial tools for Deep Radiomics. Previous studies and systematization efforts have either broadly overviewed PETs and their applications or mainly focused on subsets of PETs for ML algorithms. In Deep Radiomics, where efficiency, accuracy, and privacy are crucial, many PETs, while theoretically applicable, may not be practical without specialized optimizations or hybrid designs. Additionally, not all DL models are suitable for Radiomics. Consequently, there is a need for specialized studies that investigate and systematize the effective and practical integration of PETs into the Deep Radiomics pipeline. This work addresses this research gap by (1) classifying existing PETs and presenting practical hybrid PET constructions and a taxonomy illustrating their potential integration with the Deep Radiomics pipeline, with comparative analyses detailing assumptions, architectural suitability, and security; (2) offering technical insights and describing means of combining PETs into the Deep Radiomics pipeline, including integration strategies, subtleties, and potential challenges; and (3) proposing potential research directions, identifying challenges, and suggesting solutions to enhance the PETs in Deep Radiomics.

[CV-114] AI-powered multimodal modeling of personalized hemodynamics in aortic stenosis

Link: https://arxiv.org/abs/2407.00535
Authors: Caglar Ozturk,Daniel H. Pak,Luca Rosalia,Debkalpa Goswami,Mary E. Robakowski,Raymond McKay,Christopher T. Nguyen,James S. Duncan,Ellen T. Roche
Keywords: common valvular heart, valvular heart disease, Aortic stenosis, developed countries, common valvular
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: CO and DHP contributed equally to this work. JSD and ETR are corresponding authors

Abstract:Aortic stenosis (AS) is the most common valvular heart disease in developed countries. High-fidelity preclinical models can improve AS management by enabling therapeutic innovation, early diagnosis, and tailored treatment planning. However, their use is currently limited by complex workflows necessitating lengthy expert-driven manual operations. Here, we propose an AI-powered computational framework for accelerated and democratized patient-specific modeling of AS hemodynamics from computed tomography. First, we demonstrate that our automated meshing algorithms can generate task-ready geometries for both computational and benchtop simulations with higher accuracy and 100 times faster than existing approaches. Then, we show that our approach can be integrated with fluid-structure interaction and soft robotics models to accurately recapitulate a broad spectrum of clinical hemodynamic measurements of diverse AS patients. The efficiency and reliability of these algorithms make them an ideal complementary tool for personalized high-fidelity modeling of AS biomechanics, hemodynamics, and treatment planning.

[CV-115] A Medical Low-Back Pain Physical Rehabilitation Dataset for Human Body Movement Analysis

Link: https://arxiv.org/abs/2407.00521
Authors: Sao Mai Nguyen,Maxime Devanne,Olivier Remy-Neris,Mathieu Lempereur,André Thepaut
Keywords: showing encouraging results, non-medical applications, limited use contexts, automatic monitoring, monitoring and coaching
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:While automatic monitoring and coaching of exercises are showing encouraging results in non-medical applications, they still have limitations such as errors and limited use contexts. To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we identify in this article four challenges to address and propose a medical dataset of clinical patients carrying out low-back pain rehabilitation exercises. The dataset includes 3D Kinect skeleton positions and orientations, RGB videos, 2D skeleton data, and medical annotations to assess the correctness of each movement, together with error classification and localisation by body part and timespan. Alongside this dataset, we perform a complete research path, from data collection to processing, and finally a small benchmark. We evaluated two baseline movement recognition algorithms on the dataset, representing two different approaches: the probabilistic approach with a Gaussian Mixture Model (GMM), and the deep learning approach with a Long Short-Term Memory (LSTM). This dataset is valuable because it includes rehabilitation-relevant motions in a clinical setting with patients in their rehabilitation program, using a cost-effective, portable, and convenient sensor, and because it shows the potential for improvement on these challenges. ACM classes: I.5.4; I.4.8. Journal reference: IJCNN 2024.

[CV-116] Toward a Diffusion-Based Generalist for Dense Vision Tasks

Link: https://arxiv.org/abs/2407.00503
Authors: Yue Fan,Yongqin Xian,Xiaohua Zhai,Alexander Kolesnikov,Muhammad Ferjad Naeem,Bernt Schiele,Federico Tombari
Keywords: Building generalized models, Building generalized, intriguing direction, solve many computer, Building
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at CVPR 2024 as a workshop paper

Abstract:Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.

[CV-117] Intrinsic PAPR for Point-level 3D Scene Albedo and Shading Editing

Link: https://arxiv.org/abs/2407.00500
Authors: Alireza Moazeni,Shichong Peng,Ke Li
Keywords: multi-view RGB images, RGB images, multi-view RGB, Intrinsic PAPR, neural rendering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Recent advancements in neural rendering have excelled at novel view synthesis from multi-view RGB images. However, they often lack the capability to edit the shading or colour of the scene at a detailed point-level, while ensuring consistency across different viewpoints. In this work, we address the challenge of point-level 3D scene albedo and shading editing from multi-view RGB images, focusing on detailed editing at the point-level rather than at a part or global level. While prior works based on volumetric representation such as NeRF struggle with achieving 3D consistent editing at the point level, recent advancements in point-based neural rendering show promise in overcoming this challenge. We introduce "Intrinsic PAPR", a novel method based on the recent point-based neural rendering technique Proximity Attention Point Rendering (PAPR). Unlike other point-based methods that model the intrinsic decomposition of the scene, our approach does not rely on complicated shading models or simplistic priors that may not universally apply. Instead, we directly model scene decomposition into albedo and shading components, leading to better estimation accuracy. Comparative evaluations against the latest point-based inverse rendering methods demonstrate that Intrinsic PAPR achieves higher-quality novel view rendering and superior point-level albedo and shading editing.

[CV-118] Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition

Link: https://arxiv.org/abs/2407.00482
Authors: Barproda Halder,Faisal Hamman,Pasan Dissanayake,Qiuyi Zhang,Ilia Sucholutsky,Sanghamitra Dutta
Keywords: Spurious patterns refer, Partial Information Decomposition, unique information, causally related, patterns refer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Information Theory (cs.IT)
Comments: Accepted at ICML 2024 Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Abstract:Spurious patterns refer to a mathematical association between two or more variables in a dataset that are not causally related. However, this notion of spuriousness, which is usually introduced due to sampling biases in the dataset, has classically lacked a formal definition. To address this gap, this work presents the first information-theoretic formalization of spuriousness in a dataset (given a split of spurious and core features) using a mathematical framework called Partial Information Decomposition (PID). Specifically, we disentangle the joint information content that the spurious and core features share about another target variable (e.g., the prediction label) into distinct components, namely unique, redundant, and synergistic information. We propose the use of unique information, with roots in Blackwell Sufficiency, as a novel metric to formally quantify dataset spuriousness and derive its desirable properties. We empirically demonstrate how higher unique information in the spurious features in a dataset could lead a model into choosing the spurious features over the core features for inference, often having low worst-group-accuracy. We also propose a novel autoencoder-based estimator for computing unique information that is able to handle high-dimensional image data. Finally, we also show how this unique information in the spurious feature is reduced across several dataset-based spurious-pattern-mitigation techniques such as data reweighting and varying levels of background mixing, demonstrating a novel tradeoff between unique information (spuriousness) and worst-group-accuracy.
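
For readers unfamiliar with Partial Information Decomposition, the standard two-source decomposition that the abstract refers to can be written as follows. The symbols are chosen here for illustration and are not taken from the paper: $Y$ is the target, $S$ the spurious features, and $C$ the core features.

$$
I(Y; S, C) = \mathrm{Uni}(Y : S \setminus C) + \mathrm{Uni}(Y : C \setminus S) + \mathrm{Red}(Y : S, C) + \mathrm{Syn}(Y : S, C)
$$

$$
I(Y; S) = \mathrm{Uni}(Y : S \setminus C) + \mathrm{Red}(Y : S, C)
$$

Under this notation, the quantity the abstract proposes as a spuriousness measure corresponds to $\mathrm{Uni}(Y : S \setminus C)$, the information about $Y$ that only the spurious features carry.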

[CV-119] Development of an interactive GUI using MATLAB for the detection of type and stage of Breast Tumor

Link: https://arxiv.org/abs/2407.00480
Authors: Poulmi Banerjee,Satadal Saha
Keywords: Breast lumps, Breast, Breast cancer, common types, breast lump image
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Breast cancer is one of the most common types of cancer and is diagnosed mainly in women; females are considerably more prone to it than males. Breast lumps are classified into two main groups: cancerous and non-cancerous. A cancerous lump can spread via the lobules, ducts, areola, and stroma to various organs of the body. Non-cancerous breast lumps are less harmful, but they should be monitored under proper diagnosis to prevent them from becoming cancerous. Breast lumps are diagnosed using mammograms, ultrasonic images, and MRI images. For better diagnosis, doctors sometimes also recommend a biopsy, and any unforeseen anomalies occurring there may give rise to an inaccurate test report. To avoid these discrepancies, processing the mammogram images is considered one of the most reliable methods. In the proposed method, a MATLAB GUI is developed and sample images of breast lumps are placed in the respective axes. With the help of sliders, the actual breast lump image is compared with the stored sample images, and the history of the breast lumps is then generated in real time in the form of a test report.

[CV-120] MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Link: https://arxiv.org/abs/2407.00468
Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
Keywords: Large Multimodal Models, exhibit impressive cross-modal, impressive cross-modal understanding, Multimodal Models, Large Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, code released at this https URL, Homepage at this https URL

Abstract:Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
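
The triplet structure in this abstract suggests a stricter scoring rule than plain per-question accuracy. The sketch below is only an illustration: it assumes that a triplet is credited when the original, perception, and knowledge anchor questions are all answered correctly, which the abstract does not state. The function names and the toy data are hypothetical.

```python
from typing import Sequence, Tuple

# Correctness flags for (original, perception, knowledge anchor) questions.
Triplet = Tuple[bool, bool, bool]


def per_question_accuracy(results: Sequence[Triplet]) -> float:
    """Ordinary accuracy: every question counts on its own."""
    if not results:
        return 0.0
    return sum(sum(t) for t in results) / (3 * len(results))


def triplet_accuracy(results: Sequence[Triplet]) -> float:
    """Assumed stricter metric: a triplet counts only if all three are right."""
    if not results:
        return 0.0
    return sum(all(t) for t in results) / len(results)


# 2,138 triplets of three questions each give the 6,414 questions quoted above.
assert 2138 * 3 == 6414

toy = [
    (True, True, True),     # fully consistent answer
    (True, False, True),    # right answer without perceiving the image
    (True, True, False),    # right answer without the required knowledge
    (False, False, False),
]
print(per_question_accuracy(toy))  # 7 of 12 questions
print(triplet_accuracy(toy))       # 1 of 4 triplets
```

Under this assumption, a model that guesses the original question correctly but fails the supporting questions gains nothing, which is the kind of false positive the benchmark is designed to avoid.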

[CV-121] Characterizing Continual Learning Scenarios and Strategies for Audio Analysis

Link: https://arxiv.org/abs/2407.00465
Authors: Ruchi Bhatt, Pratibha Kumari, Dwarikanath Mahapatra, Abdulmotaleb El Saddik, Mukesh Saini
Keywords: Audio analysis, Audio, analysis, characterize continual learning, audio analysis approaches
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Audio analysis is useful in many application scenarios. The state-of-the-art audio analysis approaches assume that the data distribution at training and deployment time will be the same. However, due to various real-life environmental factors, the data may drift in its distribution or include new classes in the future. Thus, a one-time trained model might not perform adequately. In this paper, we characterize continual learning (CL) approaches in audio analysis, which are intended to tackle the catastrophic forgetting that arises from such drifts. As there is no CL dataset for audio analysis, we use the DCASE 2020 to 2023 datasets to create various CL scenarios for audio-based monitoring tasks. We have investigated the following CL and non-CL approaches: EWC, LwF, SI, GEM, A-GEM, GDumb, Replay, Naive, cumulative, and joint training. The study is very beneficial for researchers and practitioners working in the area of audio analysis for developing adaptive models. We observed that Replay achieved better results than other methods in the DCASE challenge data. It achieved an accuracy of 70.12% for the domain incremental scenario and an accuracy of 96.98% for the class incremental scenario.
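
Replay, the best-performing strategy reported here, keeps a small memory of earlier samples and mixes them into later training batches. The sketch below is a minimal illustration, not the authors' implementation: it assumes reservoir sampling as the retention policy (the abstract does not say which policy was used), and the class name `ReplayBuffer` and the sample labels are hypothetical.

```python
import random
from typing import Any, List, Sequence


class ReplayBuffer:
    """Fixed-size memory filled by reservoir sampling (an assumed policy)."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0  # number of samples offered so far
        self._rng = random.Random(seed)

    def add(self, sample: Any) -> None:
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
            return
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = sample

    def mix(self, new_batch: Sequence[Any], k: int) -> List[Any]:
        """Return the new batch followed by up to k stored samples."""
        k = min(k, len(self.items))
        return list(new_batch) + self._rng.sample(self.items, k)


buffer = ReplayBuffer(capacity=4)
for i in range(10):                      # samples from an earlier task or domain
    buffer.add(f"old-{i}")
batch = buffer.mix(["new-0", "new-1"], k=2)
print(batch)                             # two new samples followed by two replayed ones
```

Training on such mixed batches is what lets a model revisit earlier distributions while it adapts to a new one.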

[CV-122] pFLFE: Cross-silo Personalized Federated Learning via Feature Enhancement on Medical Image Segmentation

Link: https://arxiv.org/abs/2407.00462
Authors: Luyuan Xie, Manqing Lin, Siyuan Liu, ChenMing Xu, Tianyu Luan, Cong Li, Yuejian Fang, Qingni Shen, Zhonghai Wu
Keywords: overcome data scarcity, utilizing varied data, personalized cross-silo federated, cross-silo federated learning, Personalized Federated Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In medical image segmentation, personalized cross-silo federated learning (FL) is becoming popular for utilizing varied data across healthcare settings to overcome data scarcity and privacy concerns. However, existing methods often suffer from client drift, leading to inconsistent performance and delayed training. We propose a new framework, Personalized Federated Learning via Feature Enhancement (pFLFE), designed to mitigate these challenges. pFLFE consists of two main stages: feature enhancement and supervised learning. The first stage improves differentiation between foreground and background features, and the second uses these enhanced features for learning from segmentation masks. We also design an alternative training approach that requires fewer communication rounds without compromising segmentation quality, even with limited communication resources. Through experiments on three medical segmentation tasks, we demonstrate that pFLFE outperforms the state-of-the-art methods.

[CV-123] Diving Deeper Into Pedestrian Behavior Understanding: Intention Estimation, Action Prediction, and Event Risk Assessment

Link: https://arxiv.org/abs/2407.00446
Authors: Amir Rasouli, Iuliia Kotseruba
Keywords: event risk assessment, behavior understanding problem, pedestrian behavior understanding, risk assessment, behavior understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures, 6 tables

Abstract:In this paper, we delve into the pedestrian behavior understanding problem from the perspective of three different tasks: intention estimation, action prediction, and event risk assessment. We first define the tasks and discuss how these tasks are represented and annotated in two widely used pedestrian datasets, JAAD and PIE. We then propose a new benchmark based on these definitions, available annotations, and three new classes of metrics, each designed to assess different aspects of the model performance. We apply the new evaluation approach to examine four SOTA prediction models on each task and compare their performance w.r.t. metrics and input modalities. In particular, we analyze the differences between intention estimation and action prediction tasks by considering various scenarios and contextual factors. Lastly, we examine model agreement across these two tasks to show their complementary role. The proposed benchmark reveals new facts about the role of different data modalities, the tasks, and relevant data properties. We conclude by elaborating on our findings and proposing future research directions.

[CV-124] AI Age Discrepancy: A Novel Parameter for Frailty Assessment in Kidney Tumor Patients

Link: https://arxiv.org/abs/2407.00438
Authors: Jayant Siva, Angelica Bartholomew, Clara Goebel, Gabriel Wallerstein-King, Beatriz López Morato, Nicholas Heller, Jason Scovell, Rebecca Campbell, Andrew Wood, Michal Ozery-Flato, Vesna Barros, Maria Gabrani, Michal Rosen-Zvi, Resha Tejpaul, Vidhyalakshmi Ramesh, Nikolaos Papanikolopoulos, Subodh Regmi, Ryan Ward, Robert Abouassaly, Steven C. Campbell, Erick Remer, Christopher Weight
Keywords: global health concern, optimizing surgical outcomes, Age Discrepancy, Kidney Tumor Segmentation, Kidney cancer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 2 tables

Abstract:Kidney cancer is a global health concern, and accurate assessment of patient frailty is crucial for optimizing surgical outcomes. This paper introduces AI Age Discrepancy, a novel metric derived from machine learning analysis of preoperative abdominal CT scans, as a potential indicator of frailty and postoperative risk in kidney cancer patients. This retrospective study of 599 patients from the 2023 Kidney Tumor Segmentation (KiTS) challenge dataset found that a higher AI Age Discrepancy is significantly associated with longer hospital stays and lower overall survival rates, independent of established factors. This suggests that AI Age Discrepancy may provide valuable insights into patient frailty and could thus inform clinical decision-making in kidney cancer treatment.

[CV-125] Location embedding based pairwise distance learning for fine-grained diagnosis of urinary stones

Link: https://arxiv.org/abs/2407.00431
Authors: Qiangguo Jin, Jiapeng Huang, Changming Sun, Hui Cui, Ping Xuan, Ran Su, Leyi Wei, Yu-Jie Wu, Chia-An Wu, Henry B.L. Duh, Yueh-Hsun Lu
Keywords: effective treatment strategies, devising effective treatment, treatment strategies, crucial for devising, devising effective
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The precise diagnosis of urinary stones is crucial for devising effective treatment strategies. The diagnostic process, however, is often complicated by the low contrast between stones and surrounding tissues, as well as the variability in stone locations across different patients. To address this issue, we propose a novel location embedding based pairwise distance learning network (LEPD-Net) that leverages low-dose abdominal X-ray imaging combined with location information for the fine-grained diagnosis of urinary stones. LEPD-Net enhances the representation of stone-related features through context-aware region enhancement, incorporates critical location knowledge via stone location embedding, and achieves recognition of fine-grained objects with our innovative fine-grained pairwise distance learning. Additionally, we have established an in-house dataset on urinary tract stones to demonstrate the effectiveness of our proposed approach. Comprehensive experiments conducted on this dataset reveal that our framework significantly surpasses existing state-of-the-art methods.

[CV-126] Parametric Primitive Analysis of CAD Sketches with Vision Transformer

Link: https://arxiv.org/abs/2407.00410
Authors: Xiaogang Wang, Liang Wang, Hongyu Wu, Guoqiang Xiao, Kai Xu
Keywords: primarily involving CAD, involving CAD primitives, industrial product design, involving CAD, CAD primitives
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The design and analysis of Computer-Aided Design (CAD) sketches play a crucial role in industrial product design, primarily involving CAD primitives and their inter-primitive constraints. To address challenges related to error accumulation in autoregressive models and the complexities associated with self-supervised model design for this task, we propose a two-stage network framework. This framework consists of a primitive network and a constraint network, transforming the sketch analysis task into a set prediction problem to enhance the effective handling of primitives and constraints. By decoupling target types from parameters, the model gains increased flexibility and optimization while reducing complexity. Additionally, the constraint network incorporates a pointer module to explicitly indicate the relationship between constraint parameters and primitive indices, enhancing interpretability and performance. Qualitative and quantitative analyses on two publicly available datasets demonstrate the superiority of this method.

[CV-127] Query-Efficient Hard-Label Black-Box Attack against Vision Transformers

Link: https://arxiv.org/abs/2407.00389
Authors: Chao Zhou, Xiaowen Shi, Yuan-Gen Wang
Keywords: convolutional neural networks, face similar security, similar security risks, deep convolutional neural, Recent studies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent studies have revealed that vision transformers (ViTs) face similar security risks from adversarial attacks as deep convolutional neural networks (CNNs). However, directly applying attack methodology on CNNs to ViTs has been demonstrated to be ineffective since the ViTs typically work on patch-wise encoding. This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario, and proposes a novel query-efficient hard-label adversarial attack method called AdvViT. Specifically, considering that ViTs are highly sensitive to patch modification, we propose to optimize the adversarial perturbation on the individual patches. To reduce the dimension of perturbation search space, we modify only a handful of low-frequency components of each patch. Moreover, we design a weight mask matrix for all patches to further optimize the perturbation on different regions of a whole image. We test six mainstream ViT backbones on the ImageNet-1k dataset. Experimental results show that compared with the state-of-the-art attacks on CNNs, our AdvViT achieves much lower L_2-norm distortion under the same query budget, sufficiently validating the vulnerability of ViTs against adversarial attacks.

[CV-128] The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention

Link: https://arxiv.org/abs/2407.00377
Authors: Yixin Wan, Di Wu, Haoran Wang, Kai-Wei Chang
Keywords: models depicting individuals, commonly adopted, depicting individuals, Prompt-based, diversity interventions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:Prompt-based “diversity interventions” are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.

[CV-129] SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Link: https://arxiv.org/abs/2407.00367
Authors: Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, Yinda Zhang
Keywords: demonstrated great capabilities, producing impressive monocular, video remains under-explored, video generation model, impressive monocular videos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3D stereoscopic video generation, video diffusion, inpainting

Abstract:Video generation models have demonstrated great capabilities of producing impressive monocular videos; however, the generation of 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The framework leverages the video generation model to inpaint frames observed from different timestamps and views. This effective approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora [4], Lumiere [2], WALT [8], and Zeroscope [42]. The experiments demonstrate that our method has a significant improvement over previous methods. The code will be released at this https URL.

[CV-130] JSCDS: A Core Data Selection Method with Jason-Shannon Divergence for Caries RGB Images-Efficient Learning

Link: https://arxiv.org/abs/2407.00362
Authors: Peiliang Zhang, Yujia Tong, Chenghu Du, Chao Che, Yongjun Zhu
Keywords: preventing oral diseases, Core data selection, Deep learning-based RGB, data selection, data selection methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in KDD 2024 Workshop AIDSH

Abstract:Deep learning-based RGB caries detection improves the efficiency of caries identification and is crucial for preventing oral diseases. The performance of deep learning models depends on high-quality data and requires substantial training resources, making efficient deployment challenging. Core data selection, by eliminating low-quality and confusing data, aims to enhance training efficiency without significantly compromising model performance. However, distance-based data selection methods struggle to distinguish dependencies among high-dimensional caries data. To address this issue, we propose a Core Data Selection Method with Jensen-Shannon Divergence (JSCDS) for efficient caries image learning and caries classification. We describe the core data selection criterion as the distribution of samples in different classes. JSCDS calculates the cluster centers by sample embedding representation in the caries classification network and utilizes Jensen-Shannon Divergence to compute the mutual information between data samples and cluster centers, capturing nonlinear dependencies among high-dimensional data. The average mutual information is calculated to fit the above distribution, serving as the criterion for constructing the core set for model training. Extensive experiments on RGB caries datasets show that JSCDS outperforms other data selection methods in prediction performance and time consumption. Notably, JSCDS exceeds the performance of the full-dataset model with only 50% of the core data, and its advantage becomes more pronounced with 70% of the core data.
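
For readers unfamiliar with the Jensen-Shannon divergence named in this abstract, the sketch below computes it for two discrete probability vectors. It illustrates the standard definition only. How JSCDS turns sample embeddings and cluster centers into distributions is not described in the abstract, so the inputs here are assumed to be already normalized, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: the mean KL divergence to the midpoint distribution."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


sample = [0.7, 0.2, 0.1]   # assumed: a sample embedding normalized to a distribution
center = [0.6, 0.3, 0.1]   # assumed: a cluster center normalized the same way
print(js_divergence(sample, center))
```

Unlike the KL divergence, this quantity is symmetric and, with base-2 logarithms, bounded between 0 and 1, which makes scores comparable across samples.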

[CV-131] Enhancing Accuracy and Parameter-Efficiency of Neural Representations for Network Parameterization

Link: https://arxiv.org/abs/2407.00356
Authors: Hongjun Choi, Jayaraman J. Thiagarajan, Ruben Glatt, Shusen Liu
Keywords: neural network weights, investigate the fundamental, fundamental trade-off, parameter efficiency, parameterization of neural
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we investigate the fundamental trade-off regarding accuracy and parameter efficiency in the parameterization of neural network weights using predictor networks. We present a surprising finding that, when recovering the original model accuracy is the sole objective, it can be achieved effectively through the weight reconstruction objective alone. Additionally, we explore the underlying factors for improving weight reconstruction under parameter-efficiency constraints, and propose a novel training scheme that decouples the reconstruction objective from auxiliary objectives such as knowledge distillation, which leads to significant improvements compared to state-of-the-art approaches. Finally, these results pave the way for more practical scenarios, where one needs to achieve improvements on both model accuracy and predictor network parameter-efficiency simultaneously.

[CV-132] PhyTracker: An Online Tracker for Phytoplankton

Link: https://arxiv.org/abs/2407.00352
Authors: Yang Yu, Qingxuan Lv, Yuezun Li, Zhiqiang Wei, Junyu Dong
Keywords: understand marine ecological, marine ecological processes, requires efficient monitoring, aquatic ecosystems, requires efficient
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 11 figures

Abstract:Phytoplankton, a crucial component of aquatic ecosystems, requires efficient monitoring to understand marine ecological processes and environmental conditions. Traditional phytoplankton monitoring methods, relying on non-in situ observations, are time-consuming and resource-intensive, limiting timely analysis. To address these limitations, we introduce PhyTracker, an intelligent in situ tracking framework designed for automatic tracking of phytoplankton. PhyTracker overcomes significant challenges unique to phytoplankton monitoring, such as constrained mobility within water flow, inconspicuous appearance, and the presence of impurities. Our method incorporates three innovative modules: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. These modules enhance feature capture, differentiate between phytoplankton and impurities, and refine movement characteristics, respectively. Extensive experiments on the PMOT dataset validate the superiority of PhyTracker in phytoplankton tracking, and additional tests on the MOT dataset demonstrate its general applicability, outperforming conventional tracking methods. This work highlights key differences between phytoplankton and traditional objects, offering an effective solution for phytoplankton monitoring.

[CV-133] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Link: https://arxiv.org/abs/2407.00316
Authors: Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, Ehsan Adeli
Keywords: rendering methods require, input video, methods require, fully visible, human
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient and high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in the rendering of occluded humans.

[CV-134] Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck

Link: https://arxiv.org/abs/2407.00315
Authors: Yangzhou Jiang, Yinxin Lin, Yaoming Wang, Teng Li, Bilian Ke, Bingbing Ni
Keywords: Appearance-based supervised methods, made tremendous advances, Appearance-based supervised, recent gaze estimation, full-face image input
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 7 tables

Abstract:Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement inhibits current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, due to the deep coupling between facial and eye features, such frameworks are still deficient in extracting useful gaze features from full-face images. To alleviate the above limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design that successfully encourages the model to pay more attention to gaze direction rather than facial textures only, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss has been designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over the unsupervised state of the art.

[CV-135] Benchmark Evaluation of Image Fusion algorithms for Smartphone Camera Capture

Link: https://arxiv.org/abs/2407.00301
Authors: Lucas N. Kirsten
Keywords: smartphone camera capture, image quality, image fusion techniques, image, fusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the ICMLAI 2024, in Mendoza, Argentina

Abstract:This paper investigates the trade-off between computational resource utilization and image quality in the context of image fusion techniques for smartphone camera capture. The study explores various combinations of fusion methods, fusion weights, number of frames, and stacking (a.k.a. merging) techniques using a proprietary dataset of images captured with Motorola smartphones. The objective was to identify optimal configurations that balance computational efficiency with image quality. Our results indicate that multi-scale methods and their single-scale fusion counterparts return similar image quality measures and runtime, but single-scale ones have lower memory usage. Furthermore, we identified that fusion methods operating in the YUV color space yield better performance in terms of image quality, resource utilization, and runtime. The study also shows that fusion weights have an overall small impact on image quality, runtime, and memory. Moreover, our results reveal that increasing the number of highly exposed input frames does not necessarily improve image quality and comes with a corresponding increase in computational resources usage and runtime; and that stacking methods, although reducing memory usage, may compromise image quality. Finally, our work underscores the importance of thoughtful configuration selection for image fusion techniques in constrained environments and offers insights for future image fusion method development, particularly in the realm of smartphone applications.

[CV-136] SolarSAM: Building-scale Photovoltaic Potential Assessment Based on Segment Anything Model (SAM) and Remote Sensing for Emerging City

Link: https://arxiv.org/abs/2407.00296
Authors: Guohao Wang
Keywords: renewable energy source, promising renewable energy, Driven by advancements, energy source, renewable energy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Driven by advancements in photovoltaic (PV) technology, solar energy has emerged as a promising renewable energy source, due to its ease of integration onto building rooftops, facades, and windows. For emerging cities, the lack of detailed street-level data presents a challenge for effectively assessing the potential of building-integrated photovoltaic (BIPV). To address this, this study introduces SolarSAM, a novel BIPV evaluation method that leverages remote sensing imagery and deep learning techniques, and an emerging city in northern China is utilized to validate the model performance. During the process, SolarSAM segmented various building rooftops using text prompt guided semantic segmentation. Separate PV models were then developed for Rooftop PV, Facade-integrated PV, and PV windows systems, using this segmented data and local climate information. The potential for BIPV installation, solar power generation, and city-wide power self-sufficiency were assessed, revealing that the annual BIPV power generation potential surpassed the city’s total electricity consumption by a factor of 2.5. Economic and environmental analyses were also conducted, including levelized cost of electricity and carbon reduction calculations, comparing different BIPV systems across various building categories. These findings demonstrated the model’s performance and revealed the potential of BIPV power generation in the future.

[CV-137] A deep neural network framework for dynamic multi-valued mapping estimation and its applications

Link: https://arxiv.org/abs/2407.00295
Authors: Geng Li, Di Qiu, Lok Ming Lui
Keywords: estimating dynamic multi-valued, dynamic multi-valued mappings, dynamic multi-valued, multi-valued mappings, estimating dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper addresses the problem of modeling and estimating dynamic multi-valued mappings. While most mathematical models provide a unique solution for a given input, real-world applications often lack deterministic solutions. In such scenarios, estimating dynamic multi-valued mappings is necessary to suggest different reasonable solutions for each input. This paper introduces a deep neural network framework incorporating a generative network and a classification component. The objective is to model the dynamic multi-valued mapping between the input and output by providing a reliable uncertainty measurement. Generating multiple solutions for a given input involves utilizing a discrete codebook comprising finite variables. These variables are fed into a generative network along with the input, producing various output possibilities. The discreteness of the codebook enables efficient estimation of the output’s conditional probability distribution for any given input using a classifier. By jointly optimizing the discrete codebook and its uncertainty estimation during training using a specially designed loss function, a highly accurate approximation is achieved. The effectiveness of our proposed framework is demonstrated through its application to various imaging problems, using both synthetic and real imaging data. Experimental results show that our framework accurately estimates the dynamic multi-valued mapping with uncertainty estimation.

[CV-138] PerAct2: A Perceiver Actor Framework for Bimanual Manipulation Tasks

Link: https://arxiv.org/abs/2407.00278
Authors: Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox
Keywords: temporal coordination required, challenging due, due to precise, precise spatial, spatial and temporal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark comprising 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extended several state-of-the-art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent, PerAct2, which enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals. Project website with code is available at: this http URL

[CV-139] Learning a Clinically-Relevant Concept Bottleneck for Lesion Detection in Breast Ultrasound

Link: https://arxiv.org/abs/2407.00267
Authors: Arianna Bunnell,Yannik Glaser,Dustin Valdez,Thomas Wolfgruber,Aleen Altamirano,Carol Zamora González,Brenda Y. Hernandez,Peter Sadowski,John A. Shepherd
Keywords: Detecting and classifying, Radiology Breast Imaging, breast ultrasound images, artificial intelligence, access to mammography
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted version of manuscript accepted at MICCAI 2024. This preprint has not undergone peer review or any post-submission improvements or corrections

Abstract:Detecting and classifying lesions in breast ultrasound images is a promising application of artificial intelligence (AI) for reducing the burden of cancer in regions with limited access to mammography. Such AI systems are more likely to be useful in a clinical setting if their predictions can be explained to a radiologist. This work proposes an explainable AI model that provides interpretable predictions using a standard lexicon from the American College of Radiology’s Breast Imaging and Reporting Data System (BI-RADS). The model is a deep neural network featuring a concept bottleneck layer in which known BI-RADS features are predicted before making a final cancer classification. This enables radiologists to easily review the predictions of the AI system and potentially fix errors in real time by modifying the concept predictions. In experiments, a model is developed on 8,854 images from 994 women with expert annotations and histological cancer labels. The model outperforms state-of-the-art lesion detection frameworks with 48.9 average precision on the held-out testing set, and for cancer classification, concept intervention is shown to increase performance from 0.876 to 0.885 area under the receiver operating characteristic curve. Training and evaluation code is available at this https URL.
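
The concept-bottleneck idea in this abstract (predict interpretable concepts first, classify from them, and let an expert overwrite a concept) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the concept names, weights, and bias are hypothetical, and a plain logistic layer stands in for whatever classification head the authors actually use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify_from_concepts(concepts, weights, bias):
    """Final score computed only from the concept probabilities (the bottleneck)."""
    z = bias + sum(weights[name] * p for name, p in concepts.items())
    return sigmoid(z)

def intervene(concepts, corrections):
    """Return a copy of the concept predictions with expert corrections applied."""
    updated = dict(concepts)
    updated.update(corrections)
    return updated

# Hypothetical concept names, weights, and predictions (not taken from the paper).
weights = {"irregular_shape": 2.0, "non_parallel_orientation": 1.5, "posterior_shadowing": 1.0}
bias = -2.5
predicted = {"irregular_shape": 0.2, "non_parallel_orientation": 0.7, "posterior_shadowing": 0.4}

before = classify_from_concepts(predicted, weights, bias)
# A radiologist disagrees with the shape concept and sets it to 1.0.
corrected = intervene(predicted, {"irregular_shape": 1.0})
after = classify_from_concepts(corrected, weights, bias)
print(round(before, 3), round(after, 3))
```

Because the final score depends on the input only through the concepts, editing one concept value changes the prediction in a way a reviewer can trace.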

[CV-140] From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

Link: https://arxiv.org/abs/2407.00263
Authors: Mehar Bhatia,Sahithya Ravi,Aditya Chinchure,Eunjeong Hwang,Vered Shwartz
Keywords: non-western cultures due, performance remains suboptimal, training datasets, recent advancements, remains suboptimal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under peer review

Abstract:Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models’ cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures – underscoring the necessity for enhancing multicultural understanding in vision-language models.

[CV-141] Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review

Link: https://arxiv.org/abs/2407.00252
Authors: Moseli Mots’oehli
Keywords: acquiring high-quality annotated, achieved significant success, high-quality annotated data, annotated data remains, acquiring high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: Accepted IEEE ETNCC 2024, 9 pages

Abstract:While supervised learning has achieved significant success in computer vision tasks, acquiring high-quality annotated data remains a bottleneck. This paper explores both scholarly and non-scholarly works in AI-assistive deep learning image annotation systems that provide textual suggestions, captions, or descriptions of the input image to the annotator. This potentially results in higher annotation efficiency and quality. Our exploration covers annotation for a range of computer vision tasks including image classification, object detection, regression, instance and semantic segmentation, and pose estimation. We review various datasets and how they contribute to the training and evaluation of AI-assistive annotation systems. We also examine methods leveraging neuro-symbolic learning, deep active learning, and self-supervised learning algorithms that enable semantic image understanding and generate free-text output. These include image captioning, visual question answering, and multi-modal reasoning. Despite the promising potential, there is limited publicly available work on AI-assistive image annotation with textual output capabilities. We conclude by suggesting future research directions to advance this field, emphasizing the need for more publicly accessible datasets and collaborative efforts between academia and industry.

[CV-142] Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription

Link: https://arxiv.org/abs/2407.00250
Authors: Jaydeep Borkar,David A. Smith
Keywords: illegible text resulting, documents frequently suffer, storage damage, frequently suffer, illegible text
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ICDAR 2024 Workshop on Computational Paleography

Abstract:Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.
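
The abstract notes that the log probability of a transcription can flag lacunae and other errors without inspecting the image. A minimal sketch of that idea follows, assuming per-token log probabilities are already available from some OCR model; the line ids, values, and threshold are hypothetical, and averaging is only one possible aggregation.

```python
def mean_log_prob(token_log_probs):
    """Average log probability of the tokens in one transcribed line."""
    return sum(token_log_probs) / len(token_log_probs)

def flag_lines(lines, threshold):
    """Return ids of lines whose mean token log probability falls below the threshold."""
    return [line_id for line_id, lps in lines.items() if mean_log_prob(lps) < threshold]

# Hypothetical per-token log probabilities for two line images.
lines = {
    "line_1": [-0.05, -0.10, -0.02],        # confident transcription
    "line_2": [-0.04, -2.90, -3.40, -0.08], # two very uncertain tokens
}
flagged = flag_lines(lines, threshold=-1.0)
print(flagged)
```

In practice the threshold would need to be calibrated on lines with known damage; the value used here is arbitrary.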

[CV-143] Prompt Refinement with Image Pivot for Text-to-Image Generation

Link: https://arxiv.org/abs/2407.00247
Authors: Jingtao Zhan,Qingyao Ai,Yiqun Liu,Yingwei Pan,Ting Yao,Jiaxin Mao,Shaoping Ma,Tao Mei
Keywords: automatically refining user-provided, refining user-provided natural, keyword-enriched prompts favored, user-provided natural language, automatically refining
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2024

Abstract:For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from “user languages” into “system languages”. However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement model. Inspired by zero-shot machine translation techniques, we introduce Prompt Refinement with Image Pivot (PRIP). PRIP innovatively uses the latent representation of a user-preferred image as an intermediary “pivot” between the user and system languages. It decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and subsequently translating image representations into system languages. Thus, it can leverage abundant data for training. Extensive experiments show that PRIP substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.

[CV-144] Methodology to Deploy CNN-Based Computer Vision Models on Immersive Wearable Devices

Link: https://arxiv.org/abs/2407.00233
Authors: Kaveh Malek(1),Fernando Moreu(2), ((1) Department of Mechanical Engineering, University of New Mexico, New Mexico, (2) Department of Civil, Construction and Environmental Engineering, University of New Mexico, New Mexico)
Keywords: Convolutional Neural Network, Convolutional Neural, Neural Network, Augmented Reality, addressed by Augmented
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 4300 words

Abstract:Convolutional Neural Network (CNN) models often lack the ability to incorporate human input, which can be addressed by Augmented Reality (AR) headsets. However, current AR headsets face limitations in processing power, which has prevented researchers from performing real-time, complex image recognition tasks using CNNs in AR headsets. This paper presents a method to deploy CNN models on AR headsets by training them on computers and transferring the optimized weight matrices to the headset. The approach transforms the image data and CNN layers into a one-dimensional format suitable for the AR platform. We demonstrate this method by training the LeNet-5 CNN model on the MNIST dataset using PyTorch and deploying it on a HoloLens AR headset. The results show that the model maintains an accuracy of approximately 98%, similar to its performance on a computer. This integration of CNN and AR enables real-time image processing on AR headsets, allowing for the incorporation of human input into AI models.
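
The abstract mentions transforming image data and CNN layers into a one-dimensional format. One plausible reading, sketched below under that assumption, is to store the image and kernel as row-major flat arrays and compute a convolution with index arithmetic. The function name and layout are illustrative and are not taken from the paper.

```python
def conv2d_flat(image, height, width, kernel, k):
    """Valid cross-correlation of a row-major flattened image with a flattened k x k kernel."""
    out_h, out_w = height - k + 1, width - k + 1
    out = [0.0] * (out_h * out_w)
    for r in range(out_h):
        for c in range(out_w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Pixel (r + i, c + j) lives at index (r + i) * width + (c + j).
                    acc += image[(r + i) * width + (c + j)] * kernel[i * k + j]
            out[r * out_w + c] = acc
    return out

# A 3 x 3 image and a 2 x 2 kernel, both stored as flat lists.
image = [1, 2, 3,
         4, 5, 6,
         7, 8, 9]
kernel = [1, 0,
          0, 1]
result = conv2d_flat(image, 3, 3, kernel, 2)
print(result)  # [6.0, 8.0, 12.0, 14.0]
```

Only flat arrays and integer indexing are needed, which is why a layout like this can be convenient on platforms without a tensor library.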

[CV-145] SemUV: Deep Learning based semantic manipulation over UV texture map of virtual human heads

Link: https://arxiv.org/abs/2407.00229
Authors: Anirban Mukherjee,Venkat Suprabath Bitra,Vignesh Bondugula,Tarun Reddy Tallapureddy,Dinesh Babu Jayagopi
Keywords: manipulating virtual human, virtual human heads, Designing and manipulating, interaction and VFX, human heads
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVIP 2024 Preprint

Abstract:Designing and manipulating virtual human heads is essential across various applications, including AR, VR, gaming, human-computer interaction and VFX. Traditional graphic-based approaches require manual effort and resources to achieve accurate representation of human heads. While modern deep learning techniques can generate and edit highly photorealistic images of faces, their focus remains predominantly on 2D facial images. This limitation makes them less suitable for 3D applications. Recognizing the vital role of editing within the UV texture space as a key component in the 3D graphics pipeline, our work focuses on this aspect to benefit graphic designers by providing enhanced control and precision in appearance manipulation. Research on existing methods within the UV texture space is limited, complex, and poses challenges. In this paper, we introduce SemUV: a simple and effective approach using the FFHQ-UV dataset for semantic manipulation directly within the UV texture space. We train a StyleGAN model on the publicly available FFHQ-UV dataset, and subsequently train a boundary for interpolation and semantic feature manipulation. Through experiments comparing our method with a 2D manipulation technique, we demonstrate its superior ability to preserve identity while effectively modifying semantic features such as age, gender, and facial hair. Our approach is simple, agnostic to other 3D components such as structure, lighting, and rendering, and also enables seamless integration into standard 3D graphics pipelines without demanding extensive domain expertise, time, or resources.
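
Training "a boundary for interpolation and semantic feature manipulation" is commonly realized by moving a latent code along the unit normal of a separating hyperplane. The sketch below assumes that reading, with a hyperplane through the origin; the latent vector and normal are hypothetical 3-dimensional values, whereas real StyleGAN latents are far larger.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def signed_distance(latent, boundary_normal):
    """Signed distance from a latent code to a hyperplane through the origin."""
    n = unit(boundary_normal)
    return sum(w * d for w, d in zip(latent, n))

def edit_latent(latent, boundary_normal, alpha):
    """Move the latent code by alpha along the boundary's unit normal."""
    n = unit(boundary_normal)
    return [w + alpha * d for w, d in zip(latent, n)]

# Hypothetical latent code and boundary normal for one attribute (for example, age).
latent = [0.5, -1.0, 2.0]
normal = [3.0, 0.0, 4.0]
edited = edit_latent(latent, normal, alpha=2.0)
print(signed_distance(latent, normal), signed_distance(edited, normal))
```

The step size alpha controls how strongly the attribute changes; components orthogonal to the normal are left untouched, which is the intuition behind preserving identity.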

[CV-146] Transformer-based Image and Video Inpainting: Current Challenges and Future Directions

Link: https://arxiv.org/abs/2407.00226
Authors: Omar Elharrouss,Rafat Damseh,Abdelkader Nasreddine Belkacem,Elarbi Badidi,Abderrahmane Lakas
Keywords: image or video, video inpainting, hot topic, video, Image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper has been submitted to the Artificial Intelligence Review journal

Abstract:Image inpainting is currently a hot topic within the field of computer vision. It offers a viable solution for various applications, including photographic restoration, video editing, and medical imaging. Deep learning advancements, notably convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly enhanced the inpainting task with an improved capability to fill missing or damaged regions in an image or video through the incorporation of contextually appropriate details. These advancements have improved other aspects, including efficiency, information preservation, and achieving both realistic textures and structures. Recently, visual transformers have been exploited and offer some improvements to image or video inpainting. The advent of transformer-based architectures, which were initially designed for natural language processing, has also been integrated into computer vision tasks. These methods utilize self-attention mechanisms that excel in capturing long-range dependencies within data; therefore, they are particularly effective for tasks requiring a comprehensive understanding of the global context of an image or video. In this paper, we provide a comprehensive review of the current image or video inpainting approaches, with a specific focus on transformer-based techniques, with the goal to highlight the significant improvements and provide a guideline for new researchers in the field of image or video inpainting using visual transformers. We categorized the transformer-based techniques by their architectural configurations, types of damage, and performance metrics. Furthermore, we present an organized synthesis of the current challenges, and suggest directions for future research in the field of image or video inpainting.

[CV-147] Multimodal Prototyping for cancer survival prediction

Link: https://arxiv.org/abs/2407.00224
Authors: Andrew H. Song,Richard J. Chen,Guillaume Jaume,Anurag J. Vaidya,Alexander S. Baras,Faisal Mahmood
Keywords: histology whole-slide images, combining gigapixel histology, gigapixel histology whole-slide, survival methods combining, methods combining gigapixel
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: ICML 2024

Abstract:Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches involve tokenizing the WSIs into smaller patches (10,000 patches) and transcriptomics into gene groups, which are then integrated using a Transformer for predicting outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituting tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.
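
Condensing many patch tokens into a small, fixed set of prototype tokens can be sketched as follows. This is a deliberate simplification: it uses hard nearest-prototype assignment followed by averaging, which may differ from the aggregation the authors use, and the tokens and prototypes are hypothetical 2-dimensional values.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def summarize(tokens, prototypes):
    """Average the tokens assigned to each prototype (hard nearest-prototype assignment)."""
    dim = len(prototypes[0])
    sums = [[0.0] * dim for _ in prototypes]
    counts = [0] * len(prototypes)
    for t in tokens:
        k = min(range(len(prototypes)), key=lambda i: sq_dist(t, prototypes[i]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += t[d]
    summary = []
    for row, count, proto in zip(sums, counts, prototypes):
        # A prototype with no assigned tokens falls back to its own value.
        summary.append([s / count for s in row] if count else list(proto))
    return summary

tokens = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
prototypes = [[0.0, 0.0], [1.0, 1.0]]
summary = summarize(tokens, prototypes)
print(len(tokens), "tokens ->", len(summary), "prototype tokens")
```

The output size depends only on the number of prototypes, so downstream attention cost no longer grows with the number of patches.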

[CV-148] PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration

Link: https://arxiv.org/abs/2407.00203
Authors: Yuxuan Sun,Yunlong Zhang,Yixuan Si,Chenglu Zhu,Zhongyi Shui,Kai Zhang,Jingxiong Li,Xingheng Lyu,Tao Lin,Lin Yang
Keywords: Vision Language Models, Slide Image, attracted substantial attention, Vision Language, serving as backbones
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 3 figures

Abstract:Vision Language Models (VLMs) like CLIP have attracted substantial attention in pathology, serving as backbones for applications such as zero-shot image classification and Whole Slide Image (WSI) analysis. Additionally, they can function as vision encoders when combined with large language models (LLMs) to support broader capabilities. Current efforts to train pathology VLMs rely on pathology image-text pairs from platforms like PubMed, YouTube, and Twitter, which provide limited, unscalable data with generally suboptimal image quality. In this work, we leverage large-scale WSI datasets like TCGA to extract numerous high-quality image patches. We then train a large multimodal model to generate captions for these images, creating PathGen-1.6M, a dataset containing 1.6 million high-quality image-caption pairs. Our approach involves multiple agent models collaborating to extract representative WSI patches, generating and refining captions to obtain high-quality image-text pairs. Extensive experiments show that integrating these generated pairs with existing datasets to train a pathology-specific CLIP model, PathGen-CLIP, significantly enhances its ability to analyze pathological images, with substantial improvements across nine pathology-related zero-shot image classification tasks and three whole-slide image tasks. Furthermore, we construct 200K instruction-tuning data based on PathGen-1.6M and integrate PathGen-CLIP with the Vicuna LLM to create more powerful multimodal models through instruction tuning. Overall, we provide a scalable pathway for high-quality data generation in pathology, paving the way for next-generation general pathology models.

[CV-149] SMPLOlympics: Sports Environments for Physically Simulated Humanoids

Link: https://arxiv.org/abs/2407.00187
Authors: Zhengyi Luo,Jiashun Wang,Kangni Liu,Haotian Zhang,Chen Tessler,Jingbo Wang,Ye Yuan,Jinkun Cao,Zihui Lin,Fengyi Wang,Jessica Hodgins,Kris Kitani
Keywords: variety of Olympic, physically simulated environments, Olympic sports, physically simulated, physically demanding nature
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:We present SMPLOlympics, a collection of physically simulated environments that allow humanoids to compete in a variety of Olympic sports. Sports simulation offers a rich and standardized testing ground for evaluating and improving the capabilities of learning algorithms due to the diversity and physically demanding nature of athletic activities. As humans have been competing in these sports for many years, there is also a plethora of existing knowledge on the preferred strategy to achieve better performance. To leverage these existing human demonstrations from videos and motion capture, we design our humanoid to be compatible with the widely-used SMPL and SMPL-X human models from the vision and graphics community. We provide a suite of individual sports environments, including golf, javelin throw, high jump, long jump, and hurdling, as well as competitive sports, including both 1v1 and 2v2 games such as table tennis, tennis, fencing, boxing, soccer, and basketball. Our analysis shows that combining strong motion priors with simple rewards can result in human-like behavior in various sports. By providing a unified sports benchmark and baseline implementation of state and reward designs, we hope that SMPLOlympics can help the control and animation communities achieve human-like and performant behaviors.

[CV-150] The impact of model size on catastrophic forgetting in Online Continual Learning

Link: https://arxiv.org/abs/2407.00176
Authors: Eunhae Lee
Keywords: Continual Learning performance, Continual Learning, Online Continual Learning, Continual Learning efficacy, Learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This study investigates the impact of model size on Online Continual Learning performance, with a focus on catastrophic forgetting. Employing ResNet architectures of varying sizes, the research examines how network depth and width affect model performance in class-incremental learning using the SplitCIFAR-10 dataset. Key findings reveal that larger models do not guarantee better Continual Learning performance; in fact, they often struggle more in adapting to new tasks, particularly in online settings. These results challenge the notion that larger models inherently mitigate catastrophic forgetting, highlighting the nuanced relationship between model size and Continual Learning efficacy. This study contributes to a deeper understanding of model scalability and its practical implications in Continual Learning scenarios.

[CV-151] Localizing Anomalies via Multiscale Score Matching Analysis

Link: https://arxiv.org/abs/2407.00148
Authors: Ahsan Mahmood,Junier Oliva,Martin Styner
Keywords: remain critical challenges, Multiscale Score Matching, Score Matching Analysis, imaging remain critical, challenges in healthcare
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Anomaly detection and localization in medical imaging remain critical challenges in healthcare. This paper introduces Spatial-MSMA (Multiscale Score Matching Analysis), a novel unsupervised method for anomaly localization in volumetric brain MRIs. Building upon the MSMA framework, our approach incorporates spatial information and conditional likelihoods to enhance anomaly detection capabilities. We employ a flexible normalizing flow model conditioned on patch positions and global image features to estimate patch-wise anomaly scores. The method is evaluated on a dataset of 1,650 T1- and T2-weighted brain MRIs from typically developing children, with simulated lesions added to the test set. Spatial-MSMA significantly outperforms existing methods, including reconstruction-based, generative-based, and interpretation-based approaches, in lesion detection and segmentation tasks. Our model achieves superior performance in both distance-based metrics (99th percentile Hausdorff Distance: 7.05 ± 0.61, Mean Surface Distance: 2.10 ± 0.43) and component-wise metrics (True Positive Rate: 0.83 ± 0.01, Positive Predictive Value: 0.96 ± 0.01). These results demonstrate Spatial-MSMA’s potential for accurate and interpretable anomaly localization in medical imaging, with implications for improved diagnosis and treatment planning in clinical settings. Our code is available at this https URL.

[CV-152] InfoNCE: Identifying the Gap Between Theory and Practice

Link: https://arxiv.org/abs/2407.00143
Authors: Evgenia Rusak,Patrik Reizinger,Attila Juhos,Oliver Bringmann,Roland S. Zimmermann,Wieland Brendel
Keywords: learned representations uncover, contrastive learning, work on contrastive, learned representations, ground-truth latent factors
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Abstract:Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.
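
For readers unfamiliar with the baseline objective, here is a minimal sketch of the standard InfoNCE loss for a single anchor. It assumes dot-product similarity on already-normalized embeddings and a temperature of 0.1; both are illustrative choices. It does not implement AnInfoNCE, whose anisotropic formulation is defined in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: negative log-softmax of the positive pair's similarity."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]
# Positive aligned with the anchor: the loss is close to zero.
aligned = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive orthogonal to the anchor while a negative is aligned: the loss is large.
misaligned = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned, misaligned)
```

The loss is small only when the positive is more similar to the anchor than every negative, which is what pulls positive pairs together in representation space.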

[CV-153] Analyzing Quality Bias and Performance in Text-to-Image Generative Models

Link: https://arxiv.org/abs/2407.00138
Authors: Nila Masrourisaadat,Nazanin Sedaghatkish,Fatemeh Sarshartehrani,Edward A. Fox
Keywords: Advances in generative, demonstrating the ability, text prompts, led to significant, significant interest
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures

Abstract:Advances in generative models have led to significant interest in image synthesis, demonstrating the ability to generate high-quality images for a diverse range of text prompts. Despite this progress, most studies ignore the presence of bias. In this paper, we examine several text-to-image models not only by qualitatively assessing their performance in generating accurate images of human faces, groups, and specified numbers of objects but also by presenting a social bias analysis. As expected, models with larger capacity generate higher-quality images. However, we also document the inherent gender or social biases these models possess, offering a more complete understanding of their impact and limitations.

[CV-154] RepAct: The Re-parameterizable Adaptive Activation Function

Link: https://arxiv.org/abs/2407.00131
Authors: Xian Wu,Qingchuan Tao,Shuang Wang
Keywords: efficient artificial intelligence, Addressing the imperative, study presents RepAct, re-parameterizable adaptive activation, activation function tailored
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Addressing the imperative need for efficient artificial intelligence in IoT and edge computing, this study presents RepAct, a re-parameterizable adaptive activation function tailored for optimizing lightweight neural networks within the computational limitations of edge devices. By employing a multi-branch structure with learnable adaptive weights, RepAct enriches feature processing and enhances cross-layer interpretability. When evaluated on tasks such as image classification and object detection, RepAct notably surpassed conventional activation functions in lightweight networks, delivering up to a 7.92% accuracy boost on MobileNetV3-Small for the ImageNet100 dataset, while maintaining computational complexity on par with HardSwish. This innovative approach not only maximizes model parameter efficiency but also significantly improves the performance and understanding capabilities of lightweight neural networks, demonstrating its potential for real-time edge computing applications.

[CV-155] Multi-Species Object Detection in Drone Imagery for Population Monitoring of Endangered Animals

Link: https://arxiv.org/abs/2407.00127
Authors: Sowmya Sankaran
Keywords: Animal populations worldwide, accurately count endangered, count endangered species, rapidly declining, populations worldwide
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Animal populations worldwide are rapidly declining, and a technology that can accurately count endangered species could be vital for monitoring population changes over several years. This research focused on fine-tuning object detection models for drone images to create accurate counts of animal species. Hundreds of images taken using a drone and large, openly available drone-image datasets were used to fine-tune machine learning models with the baseline YOLOv8 architecture. We trained 30 different models, with the largest having 43.7 million parameters and 365 layers, and used hyperparameter tuning and data augmentation techniques to improve accuracy. While the state-of-the-art YOLOv8 baseline had only 0.7% accuracy on a dataset of safari animals, our models had 95% accuracy on the same dataset. Finally, we deployed the models on the Jetson Orin Nano for demonstration of low-power real-time species detection for easy inference on drones.

[CV-156] Automated Web-Based Malaria Detection System with Machine Learning and Deep Learning Techniques

Link: https://arxiv.org/abs/2407.00120
Authors: Abraham G Taye,Sador Yemane,Eshetu Negash,Yared Minwuyelet,Moges Abebe,Melkamu Hunegnaw Asmare
Keywords: global health burden, causing widespread suffering, significant global health, Malaria parasites pose, health burden
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Malaria parasites pose a significant global health burden, causing widespread suffering and mortality. Detecting malaria infection accurately is crucial for effective treatment and control. However, existing automated detection techniques have shown limitations in terms of accuracy and generalizability. Many studies have focused on specific features without exploring more comprehensive approaches. In our case, we formulate a deep learning technique for malaria-infected cell classification using traditional CNNs and transfer learning models, notably VGG19, InceptionV3, and Xception. The models were trained using NIH datasets and tested using different performance metrics such as accuracy, precision, recall, and F1-score. The test results showed that deep CNNs achieved the highest accuracy (97%), followed by Xception with an accuracy of 95%. InceptionV3 achieved an accuracy of 94%, while an SVM machine learning model achieved an accuracy of 83%. Furthermore, the system can be accessed through a web interface, where users can upload blood smear images for malaria detection.
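
The evaluation metrics named in this abstract (accuracy, precision, recall, F1-score) follow directly from the confusion-matrix counts. The sketch below computes them for a binary task; the labels are hypothetical toy data, not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = infected, 0 = uninfected)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical toy labels: 3 true positives, 1 false negative, 3 true negatives, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters for screening tasks, where a missed infection and a false alarm carry different costs.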

[CV-157] AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Link: https://arxiv.org/abs/2407.00104
Authors: Iván Matas,Carmen Serrano,Francisca Silva,Amalia Serrano,Tomás Toledo-Pastrana,Begoña Acha
Keywords: optimizing resource utilization, BCC, provide interpretable support, BCC dermoscopic features, BCC dermoscopic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Comments: 8 pages, 4 figures, 4 tables, under review

Abstract:An AI tool has been developed to provide interpretable support for the diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing resource utilization. The interpretability is provided in two ways. First, the main BCC dermoscopic patterns are found in the image to justify the BCC/non-BCC classification. Second, based on the common visual XAI method Grad-CAM, a clinically inspired visual explanation is developed where the relevant features for diagnosis are located. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnosis of four dermatologists using an Expectation Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy rate of 90%. For Clinically-inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. As for the Clinically-inspired Visual XAI results, the mean of the Grad-CAM normalized value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the model concentrates its attention on the regions of the BCC patterns. These results prove the ability of the AI tool to provide a useful explanation.
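
The inside/outside comparison reported for the Grad-CAM maps can be reproduced in spirit with a few lines. The sketch assumes min-max normalization of the heatmap and a binary expert mask of the same shape, both flattened to lists; the values are hypothetical and the normalization the authors use may differ.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mean_inside_outside(heatmap, mask):
    """Mean normalized activation inside and outside a binary mask (both regions non-empty)."""
    inside = [h for h, m in zip(heatmap, mask) if m]
    outside = [h for h, m in zip(heatmap, mask) if not m]
    return sum(inside) / len(inside), sum(outside) / len(outside)

# Hypothetical flattened 2 x 2 raw heatmap and expert segmentation mask.
raw_heatmap = [0.0, 2.0, 4.0, 1.0]
mask = [0, 1, 1, 0]
inside_mean, outside_mean = mean_inside_outside(min_max_normalize(raw_heatmap), mask)
print(inside_mean, outside_mean)  # 0.75 0.125
```

A clearly higher mean inside the mask than outside it suggests the explanation overlaps the clinically annotated region.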

[CV-158] LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild

Link: https://arxiv.org/abs/2407.00024
Authors: Lang He, Kai Chen, Junnan Zhao, Yimeng Wang, Ercheng Pei, Haifeng Chen, Jiewei Jiang, Shiqing Zhang, Jie Zhang, Zhongmin Wang, Tao He, Prayag Tiwari
Keywords: including their personal, social functioning, academic and work, significantly impact, impact many aspects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Abstract:Depression can significantly impact many aspects of an individual’s life, including their personal and social functioning, academic and work performance, and overall quality of life. Many researchers within the field of affective computing are adopting deep learning technology to explore potential patterns related to the detection of depression. However, because of concerns about protecting subjects’ privacy, data in this area is still scarce, presenting a challenge for the deep discriminative models used in detecting depression. To navigate these obstacles, a large-scale multimodal vlog dataset (LMVD) for depression recognition in the wild is built. LMVD contains 1823 samples totaling 214 hours from 1475 participants, captured from four multimedia platforms (Sina Weibo, Bilibili, TikTok, and YouTube). A novel architecture termed MDDformer is proposed to learn the non-verbal behaviors of individuals. Extensive validations are performed on the LMVD dataset, demonstrating superior performance for depression detection. We anticipate that the LMVD will contribute a valuable function to the depression detection community. The data and code will be released at this https URL.

[CV-159] Neural Graphics Texture Compression Supporting Random Access

Link: https://arxiv.org/abs/2407.00021
Authors: Farzad Farhadzadeh, Qiqi Hou, Hoang Le, Amir Said, Randall Rauwendaal, Alex Bourd, Fatih Porikli
Keywords: tremendous growth, including resolution, Neural Image Compression, matched by advances, Advances
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments: ECCV submission

Abstract:Advances in rendering have led to tremendous growth in texture assets, including resolution, complexity, and novel texture components, but this growth in data volume has not been matched by advances in its compression. Meanwhile, Neural Image Compression (NIC) has advanced significantly and shown promising results, but the proposed methods cannot be directly adapted to neural texture compression. First, texture compression requires on-demand and real-time decoding with random access during parallel rendering (e.g. block texture decompression on GPUs). Additionally, NIC does not support multi-resolution reconstruction (mip-levels), nor does it have the ability to efficiently jointly compress different sets of texture channels. In this work, we introduce a novel approach to texture set compression that integrates traditional GPU texture representation and NIC techniques, designed to enable random access and support many-channel texture sets. To achieve this goal, we propose an asymmetric auto-encoder framework that employs a convolutional encoder to capture detailed information in a bottleneck-latent space, and on the decoder side we utilize a fully connected network whose inputs are sampled latent features plus positional information for a given texture coordinate and mip level. This latent data is defined to enable simplified access to multi-resolution data by simply changing the scanning strides. Experimental results demonstrate that this approach provides much better results than conventional texture compression, and significant improvement over the latest method using neural networks.

[CV-160] Visual Language Model based Cross-modal Semantic Communication Systems

Link: https://arxiv.org/abs/2407.00020
Authors: Feibo Jiang, Chuanguo Tang, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan
Keywords: Shannon physical capacity, physical capacity limits, Cross-modal Semantic Communication, transcending the Shannon, Shannon physical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures

Abstract:Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

[CV-161] NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks

Link: https://arxiv.org/abs/2406.06305
Authors: Yuqi Ma, Huamin Wang, Hangchi Shen, Xuemei Chen, Shukai Duan, Shiping Wen
Keywords: brain-inspired spiking neural, spiking neural networks, attracted great research, great research attention, research attention owing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 32 pages, 4 figures, 4 tables

Abstract:Recently, brain-inspired spiking neural networks (SNNs) have attracted great research attention owing to their inherent bio-interpretability, event-triggered properties and powerful perception of spatiotemporal information, which is beneficial to handling event-based neuromorphic datasets. In contrast to conventional static image datasets, event-based neuromorphic datasets present heightened complexity in feature extraction due to their distinctive time series and sparsity characteristics, which influences their classification accuracy. To overcome this challenge, a novel approach termed Neuromorphic Momentum Contrast Learning (NeuroMoCo) for SNNs is introduced in this paper by extending the benefits of self-supervised pre-training to SNNs to effectively stimulate their potential. This is the first time that self-supervised learning (SSL) based on momentum contrastive learning is realized in SNNs. In addition, we devise a novel loss function named MixInfoNCE tailored to their temporal characteristics to further increase the classification accuracy of neuromorphic datasets, which is verified through rigorous ablation experiments. Finally, experiments on DVS-CIFAR10, DVS128Gesture and N-Caltech101 have shown that NeuroMoCo of this paper establishes new state-of-the-art (SOTA) benchmarks: 83.6% (Spikformer-2-256), 98.62% (Spikformer-2-256), and 84.4% (SEW-ResNet-18), respectively.
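
For readers unfamiliar with momentum contrastive learning, the sketch below shows the two generic ingredients that such methods usually combine: an InfoNCE loss over one positive and several negative keys, and an exponential-moving-average update of the key encoder. It is a plain-Python illustration of the standard technique, not the MixInfoNCE loss of the paper, whose exact form is not given in the abstract. The function names, the temperature, and the momentum value are assumptions.

```python
import math


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Standard InfoNCE loss for one query: -log softmax score of the positive key."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # The positive logit goes first, followed by one logit per negative key.
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, key) / temperature for key in negative_keys]

    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


def momentum_update(key_params, query_params, momentum=0.999):
    """Move key-encoder parameters slowly toward the query-encoder parameters."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


# Toy usage with 2D embeddings (hypothetical values).
loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], temperature=1.0)
new_key_params = momentum_update([1.0], [0.0], momentum=0.9)
print(loss, new_key_params)
```

The loss shrinks as the query aligns with its positive key and moves away from the negatives, while the slow key-encoder update keeps the set of keys consistent across training steps.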

[CV-162] xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

Link: https://arxiv.org/abs/2407.01530
Authors: Tianrun Chen, Chaotao Ding, Lanyun Zhu, Tao Xu, Deyi Ji, Ying Zang, Zejian Li
Keywords: dependencies remains constrained, Vision Transformers, Natural Language Processing, Convolutional Neural Networks, manage long-range dependencies
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have been pivotal in biomedical image segmentation, yet their ability to manage long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report, we first propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation. xLSTM was recently proposed as the successor of Long Short-Term Memory (LSTM) networks and has demonstrated superior performance compared to Transformers and State Space Models (SSMs) like Mamba in Natural Language Processing (NLP) and image classification (as demonstrated in the Vision-LSTM, or ViL, implementation). Here, the xLSTM-UNet we designed extends this success to the biomedical image segmentation domain. By integrating the local feature extraction strengths of convolutional layers with the long-range dependency capturing abilities of xLSTM, xLSTM-UNet offers a robust solution for comprehensive image analysis. We validate the efficacy of xLSTM-UNet through experiments. Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks on multiple biomedical segmentation datasets, including organs in abdomen MRI, instruments in endoscopic images, and cells in microscopic images. With comprehensive experiments performed, this technical report highlights the potential of xLSTM-based architectures in advancing biomedical image analysis in both 2D and 3D. The code, models, and datasets are publicly available at this http URL

[CV-163] Centerline Boundary Dice Loss for Vascular Segmentation

Link: https://arxiv.org/abs/2407.01517
Authors: Pengcheng Shi, Jiesi Hu, Yanwu Yang, Zilve Gao, Wei Liu, Ting Ma
Keywords: medical imaging plays, functional assessments, medical imaging, imaging plays, plays a crucial
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: accepted by MICCAI 2024

Abstract:Vascular segmentation in medical imaging plays a crucial role in analysing morphological and functional assessments. Traditional methods, like the centerline Dice (clDice) loss, ensure topology preservation but falter in capturing geometric details, especially under translation and deformation. The combination of clDice with traditional Dice loss can lead to diameter imbalance, favoring larger vessels. Addressing these challenges, we introduce the centerline boundary Dice (cbDice) loss function, which harmonizes topological integrity and geometric nuances, ensuring consistent segmentation across various vessel sizes. cbDice enriches the clDice approach by including boundary-aware aspects, thereby improving geometric detail recognition. It matches the performance of the boundary difference over union (B-DoU) loss through a mask-distance-based approach, enhancing translation sensitivity. Crucially, cbDice incorporates radius information from vascular skeletons, enabling uniform adaptation to vascular diameter changes and maintaining balance in branch growth and fracture impacts. Furthermore, we conducted a theoretical analysis of clDice variants (cl-X-Dice). We validated cbDice’s efficacy on three diverse vascular segmentation datasets, encompassing both 2D and 3D, and binary and multi-class segmentation. Particularly, the method integrated with cbDice demonstrated outstanding performance on the MICCAI 2023 TopCoW Challenge dataset. Our code is made publicly available at: this https URL.
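
As background for the losses named in this abstract, the sketch below implements only the traditional (volume-based) Dice coefficient on flattened binary masks. It is not clDice or cbDice. The toy masks are hypothetical and are chosen to show how a score driven by pixel counts can stay high even when a thin vessel is missed entirely.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Plain Dice overlap between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(pred, target):
    """Dice loss is one minus the Dice coefficient."""
    return 1.0 - dice_coefficient(pred, target)


# Hypothetical ground truth: a thick vessel (90 pixels) and a thin one (10 pixels).
target = [1] * 90 + [1] * 10
# Prediction that captures the thick vessel but misses the thin vessel completely.
pred = [1] * 90 + [0] * 10

score = dice_coefficient(pred, target)
print(round(score, 3))  # about 0.947 despite losing a whole branch
```

Because every pixel counts equally, large structures dominate the score, which is one reason topology-aware and radius-aware terms are studied for vessel segmentation.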

[CV-164] Neurovascular Segmentation in sOCT with Deep Learning and Synthetic Training Data

Link: https://arxiv.org/abs/2407.01419
Authors: Etienne Chollet, Yaël Balbastre, Chiara Mauri, Caroline Magnain, Bruce Fischl, Hui Wang
Keywords: Microvascular anatomy, neurological disorders, Microvascular, comprehensive three-dimensional vascular, three-dimensional vascular network
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 10 figures

Abstract:Microvascular anatomy is known to be involved in various neurological disorders. However, understanding these disorders is hindered by the lack of imaging modalities capable of capturing the comprehensive three-dimensional vascular network structure at microscopic resolution. With a lateral resolution of ≤ 20 µm and the ability to reconstruct large tissue blocks up to tens of cubic centimeters, serial-section optical coherence tomography (sOCT) is well suited for this task. This method uses intrinsic optical properties to visualize the vessels and therefore does not possess a specific contrast, which complicates the extraction of accurate vascular models. The performance of traditional vessel segmentation methods is heavily degraded in the presence of substantial noise and imaging artifacts and is sensitive to domain shifts, while convolutional neural networks (CNNs) require extensive labeled data and are also sensitive to the precise intensity characteristics of the data that they are trained on. Building on the emerging field of synthesis-based training, this study demonstrates a synthesis engine for neurovascular segmentation in sOCT images. Characterized by minimal priors and high variance sampling, our highly generalizable method, tested on five distinct sOCT acquisitions, eliminates the need for manual annotations while attaining human-level precision. Our approach comprises two phases: label synthesis and label-to-image transformation. We demonstrate the efficacy of the former by comparing it to several more realistic sets of training labels, and the latter by an ablation study of synthetic noise and artifact models.

[CV-165] Cross-Slice Attention and Evidential Critical Loss for Uncertainty-Aware Prostate Cancer Detection

Link: https://arxiv.org/abs/2407.01146
Authors: Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao, Kaifeng Pang, Demetri Terzopoulos, Kyunghyun Sung
Keywords: Current deep learning-based, albeit disregarding volumetric, typically analyze medical, analyze medical images, disregarding volumetric information
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Current deep learning-based models typically analyze medical images in either 2D or 3D, the former disregarding volumetric information and the latter suffering sub-optimal performance due to the anisotropic resolution of MR data. Furthermore, providing an accurate uncertainty estimation is beneficial to clinicians, as it indicates how confident a model is about its prediction. We propose a novel 2.5D cross-slice attention model that utilizes both global and local information, along with an evidential critical loss, to perform evidential deep learning for the detection in MR images of prostate cancer, one of the most common cancers and a leading cause of cancer-related death in men. We perform extensive experiments with our model on two different datasets and achieve state-of-the-art performance in prostate cancer detection along with improved epistemic uncertainty estimation. The implementation of the model is available at this https URL.

[CV-166] Learning 3D Gaussians for Extremely Sparse-View Cone-Beam CT Reconstruction

Link: https://arxiv.org/abs/2407.01090
Authors: Yiqun Lin, Hualiang Wang, Jixiang Chen, Xiaomeng Li
Keywords: Cone-Beam Computed Tomography, Computed Tomography, exposure raises concerns, radiation exposure raises, Cone-Beam Computed
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2024. Project link: this https URL

Abstract:Cone-Beam Computed Tomography (CBCT) is an indispensable technique in medical imaging, yet the associated radiation exposure raises concerns in clinical practice. To mitigate these risks, sparse-view reconstruction has emerged as an essential research direction, aiming to reduce the radiation dose by utilizing fewer projections for CT reconstruction. Although implicit neural representations have been introduced for sparse-view CBCT reconstruction, existing methods primarily focus on local 2D features queried from sparse projections, which is insufficient to process more complicated anatomical structures, such as the chest. To this end, we propose a novel reconstruction framework, namely DIF-Gaussian, which leverages 3D Gaussians to represent the feature distribution in the 3D space, offering additional 3D spatial information to facilitate the estimation of attenuation coefficients. Furthermore, we incorporate test-time optimization during inference to further improve the generalization capability of the model. We evaluate DIF-Gaussian on two public datasets, showing significantly better reconstruction performance than previous state-of-the-art methods.

[CV-167] Analysis of Modern Computer Vision Models for Blood Cell Classification

Link: https://arxiv.org/abs/2407.00759
Authors: Alexander Kim (1), Ryan Kim (2) ((1) University of Illinois Urbana-Champaign, (2) William Fremd High School)
Keywords: white blood cells, related blood components, white blood, blood cells, related blood
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures

Abstract:The accurate classification of white blood cells and related blood components is crucial for medical diagnoses. While traditional manual examinations and automated hematology analyzers have been widely used, they are often slow and prone to errors. Recent advancements in deep learning have shown promise for addressing these limitations. Earlier studies have demonstrated the viability of convolutional neural networks such as DenseNet, ResNet, and VGGNet for this task. Building on these foundations, our work employs more recent and efficient models to achieve rapid and accurate results. Specifically, this study used state-of-the-art architectures, including MaxVit, EfficientVit, EfficientNet, EfficientNetV2, and MobileNetV3. This study aimed to evaluate the performance of these models in WBC classification, potentially offering a more efficient and reliable alternative to current methods. Our approach not only addresses the speed and accuracy concerns of traditional techniques but also explores the applicability of innovative deep learning models in hematological analysis.

[CV-168] ASPS: Augmented Segment Anything Model for Polyp Segmentation

Link: https://arxiv.org/abs/2407.00718
Authors: Huiqian Li, Dingwen Zhang, Jieru Yao, Longfei Han, Zhongyu Li, Junwei Han
Keywords: colorectal cancer diagnosis, Polyp segmentation, Polyp segmentation plays, cancer diagnosis, Polyp
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2024

Abstract:Polyp segmentation plays a pivotal role in colorectal cancer diagnosis. Recently, the emergence of the Segment Anything Model (SAM) has introduced unprecedented potential for polyp segmentation, leveraging its powerful pre-training capability on large-scale datasets. However, due to the domain gap between natural and endoscopy images, SAM encounters two limitations in achieving effective performance in polyp segmentation. Firstly, its Transformer-based structure prioritizes global and low-frequency information, potentially overlooking local details, and introducing bias into the learned features. Secondly, when applied to endoscopy images, its poor out-of-distribution (OOD) performance results in substandard predictions and biased confidence output. To tackle these challenges, we introduce a novel approach named Augmented SAM for Polyp Segmentation (ASPS), equipped with two modules: Cross-branch Feature Augmentation (CFA) and Uncertainty-guided Prediction Regularization (UPR). CFA integrates a trainable CNN encoder branch with a frozen ViT encoder, enabling the integration of domain-specific knowledge while enhancing local features and high-frequency details. Moreover, UPR ingeniously leverages SAM’s IoU score to mitigate uncertainty during the training procedure, thereby improving OOD performance and domain generalization. Extensive experimental results demonstrate the effectiveness and utility of the proposed method in improving SAM’s performance in polyp segmentation. Our code is available at this https URL.

[CV-169] A Review of Image Processing Methods in Prostate Ultrasound

Link: https://arxiv.org/abs/2407.00678
Authors: Haiqiao Wang, Hong Wu, Zhuoyuan Wang, Peiyan Yue, Dong Ni, Pheng-Ann Heng, Yi Wang
Keywords: reducing mortality rates, poses a significant, men health, mortality rates, Prostate cancer
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Prostate cancer (PCa) poses a significant threat to men’s health, with early diagnosis being crucial for improving prognosis and reducing mortality rates. Transrectal ultrasound (TRUS) plays a vital role in the diagnosis and image-guided intervention of PCa. To facilitate physicians with more accurate and efficient computer-assisted diagnosis and interventions, many image processing algorithms in TRUS have been proposed and achieved state-of-the-art performance in several tasks, including prostate gland segmentation, prostate image registration, PCa classification and detection, and interventional needle detection. The rapid development of these algorithms over the past two decades necessitates a comprehensive summary. In consequence, this survey provides a systematic analysis of this field, outlining the evolution of image processing methods in the context of TRUS image analysis and meanwhile highlighting their relevant contributions. Furthermore, this survey discusses current challenges and suggests future research directions to possibly advance this field further.

[CV-170] HATs: Hierarchical Adaptive Taxonomy Segmentation for Panoramic Pathology Image Analysis

Link: https://arxiv.org/abs/2407.00596
Authors: Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Juming Xiong, Shunxing Bao, Hao Li, Mengmeng Yin, Yu Wang, Shilin Zhao, Yucheng Tang, Haichun Yang, Yuankai Huo
Keywords: variably scaled anatomy, remarkable challenge due, computational pathology presents, Panoramic image segmentation, scaled anatomy
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2402.19286

Abstract:Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel Hierarchical Adaptive Taxonomy Segmentation (HATs) method, which is designed to thoroughly segment panoramic views of kidney structures by leveraging detailed anatomical insights. Our approach entails (1) the innovative HATs technique which translates spatial relationships among 15 distinct object classes into a versatile “plug-and-play” loss function that spans across regions, functional units, and cells, (2) the incorporation of anatomical hierarchies and scale considerations into a unified simple matrix representation for all panoramic entities, (3) the adoption of the latest AI foundation model (EfficientSAM) as a feature extraction tool to boost the model’s adaptability, yet eliminating the need for manual prompt generation in conventional segment anything model (SAM). Experimental findings demonstrate that the HATs method offers an efficient and effective strategy for integrating clinical insights and imaging precedents into a unified segmentation model across more than 15 categories. The official implementation is publicly available at this https URL.

[CV-171] Fully invertible hyperbolic neural networks for segmenting large-scale surface and sub-surface data

Link: https://arxiv.org/abs/2407.00595
Authors: Bas Peters, Eldad Haber, Keegan Lensink
Keywords: fully invertible networks, invertible networks, surface data segmentation, fully invertible, invertible
Categories: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 13 figures

Abstract:The large spatial/temporal/frequency scale of geoscience and remote-sensing datasets causes memory issues when using convolutional neural networks for (sub-) surface data segmentation. Recently developed fully reversible or fully invertible networks can mostly avoid memory limitations by recomputing the states during the backward pass through the network. This results in a low and fixed memory requirement for storing network states, as opposed to the typical linear memory growth with network depth. This work focuses on a fully invertible network based on the telegraph equation. While reversibility saves the major amount of memory used in deep networks by the data, the convolutional kernels can take up most memory if fully invertible networks contain multiple invertible pooling/coarsening layers. We address the explosion of the number of convolutional kernels by combining fully invertible networks with layers that contain the convolutional kernels in a compressed form directly. A second challenge is that invertible networks output a tensor the same size as its input. This property prevents the straightforward application of invertible networks to applications that map between different input-output dimensions, need to map to outputs with more channels than present in the input data, or desire outputs that decrease/increase the resolution compared to the input data. However, we show that by employing invertible networks in a non-standard fashion, we can still use them for these tasks. Examples in hyperspectral land-use classification, airborne geophysical surveying, and seismic imaging illustrate that we can input large data volumes in one chunk and do not need to work on small patches, use dimensionality reduction, or employ methods that classify a patch to a single central pixel.
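
One simple way to see why an invertible network needs no stored activations is a leapfrog-style update, in which each new state is built from the two previous ones and can therefore be undone exactly. The sketch below uses that generic scheme as an assumption. It is not the telegraph-equation network of the paper, and `layer_fn`, the step size `h`, and the scalar weights are hypothetical stand-ins for learned convolutional layers.

```python
def layer_fn(state, weight):
    """Placeholder for a learned layer; it does not need to be invertible itself."""
    return [-weight * v for v in state]


def step_forward(y_prev, y_curr, weight, h=0.1):
    """Leapfrog step: y_next = 2*y_curr - y_prev + h^2 * f(y_curr)."""
    f = layer_fn(y_curr, weight)
    y_next = [2 * c - p + h * h * fv for c, p, fv in zip(y_curr, y_prev, f)]
    return y_curr, y_next


def step_backward(y_curr, y_next, weight, h=0.1):
    """Exact inverse of step_forward: y_prev = 2*y_curr - y_next + h^2 * f(y_curr)."""
    f = layer_fn(y_curr, weight)
    y_prev = [2 * c - n + h * h * fv for c, n, fv in zip(y_curr, y_next, f)]
    return y_prev, y_curr


weights = [1.0, 2.0, 3.0]            # one hypothetical weight per layer
initial = ([1.0, -2.0], [0.5, 0.25])  # the two starting states

# Forward pass: only the two most recent states are kept in memory.
states = initial
for w in weights:
    states = step_forward(*states, w)

# Backward pass: earlier states are recomputed instead of being read from storage.
recovered = states
for w in reversed(weights):
    recovered = step_backward(*recovered, w)

print(recovered)  # matches `initial` up to floating-point rounding
```

Memory for network states stays constant in the number of layers here, at the price of one extra evaluation of each layer during the backward pass.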

[CV-172] Accelerating Longitudinal MRI using Prior Informed Latent Diffusion

Link: https://arxiv.org/abs/2407.00537
Authors: Yonatan Urman, Zachary Shah, Ashwin Kumar, Bruno P. Soares, Kawin Setsompop
Keywords: soft-tissue imaging modality, ionization-free soft-tissue imaging, imaging modality, widely used ionization-free, ionization-free soft-tissue
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:MRI is a widely used ionization-free soft-tissue imaging modality, often employed repeatedly over a patient’s lifetime. However, prolonged scanning durations, among other issues, can limit availability and accessibility. In this work, we aim to substantially reduce scan times by leveraging prior scans of the same patient. These prior scans typically contain considerable shared information with the current scan, thereby enabling higher acceleration rates when appropriately utilized. We propose a prior informed reconstruction method with a trained diffusion model in conjunction with data-consistency steps. Our method can be trained with unlabeled image data, eliminating the need for a dataset of either k-space measurements or paired longitudinal scans as is required of other learning-based methods. We demonstrate superiority of our method over previously suggested approaches in effectively utilizing prior information without over-biasing prior consistency, which we validate on both an open-source dataset of healthy patients as well as several longitudinal cases of clinical interest.

[CV-173] UADSN: Uncertainty-Aware Dual-Stream Network for Facial Nerve Segmentation

Link: https://arxiv.org/abs/2407.00297
Authors: Guanghao Zhu, Lin Liu, Jing Zhang, Xiaohui Du, Ruqian Hao, Juanxiu Liu
Keywords: cochlear implantation surgery, preoperative path planning, Facial nerve, implantation surgery, crucial for preoperative
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Facial nerve segmentation is crucial for preoperative path planning in cochlear implantation surgery. Recently, researchers have proposed some segmentation methods, such as atlas-based and deep learning-based methods. However, since the facial nerve is a tubular organ with a diameter of only 1.0-1.5 mm, it is challenging to locate and segment the facial nerve in CT scans. In this work, we propose an uncertainty-aware dual-stream network (UADSN). UADSN consists of a 2D segmentation stream and a 3D segmentation stream. Predictions from two streams are used to identify uncertain regions, and a consistency loss is employed to supervise the segmentation of these regions. In addition, we introduce channel squeeze spatial excitation modules into the skip connections of U-shaped networks to extract meaningful spatial information. In order to consider topology preservation, a clDice loss is introduced into the supervised loss function. Experimental results on the facial nerve dataset demonstrate the effectiveness of UADSN and our submodules.
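
The idea of using two streams to find uncertain regions can be sketched as follows. The abstract does not say how disagreement is measured or which consistency loss is used, so this example makes two assumptions: a voxel counts as uncertain when the thresholded 2D and 3D predictions disagree, and the consistency loss is the mean squared difference of the probabilities over those voxels. All names and values are hypothetical.

```python
def uncertain_mask(prob_2d, prob_3d, threshold=0.5):
    """Mark voxels where the two streams give different binary decisions."""
    return [(a >= threshold) != (b >= threshold)
            for a, b in zip(prob_2d, prob_3d)]


def consistency_loss(prob_2d, prob_3d, mask):
    """Mean squared difference between the streams, restricted to uncertain voxels."""
    diffs = [(a - b) ** 2 for a, b, m in zip(prob_2d, prob_3d, mask) if m]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Flattened foreground probabilities from each stream (toy values).
prob_2d = [0.9, 0.6, 0.2, 0.4]
prob_3d = [0.8, 0.3, 0.1, 0.7]

mask = uncertain_mask(prob_2d, prob_3d)
loss = consistency_loss(prob_2d, prob_3d, mask)
print(mask, loss)
```

Restricting the loss to disagreeing voxels focuses the extra supervision on the thin, hard-to-segment parts of the structure rather than on regions both streams already agree on.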

[CV-174] IVCA: Inter-Relation-Aware Video Complexity Analyzer

Link: https://arxiv.org/abs/2407.00280
Authors: Junqi Liao, Yao Li, Zhuoyuan Li, Li Li, Dong Liu
Keywords: video streaming applications, real-time analysis requirements, video complexity analyzer, video streaming, streaming applications
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The report for the solution of second prize winner in ICIP 2024 Grand Challenge on Video Complexity (Team: USTC-iVC_Team1, USTC-iVC_Team2)

Abstract:To meet the real-time analysis requirements of video streaming applications, we propose an inter-relation-aware video complexity analyzer (IVCA) as an extension to VCA. The IVCA addresses the limitation of VCA by considering inter-frame relations, namely motion and reference structure. First, we enhance the accuracy of temporal features by introducing feature-domain motion estimation into the IVCA. Next, drawing inspiration from the hierarchical reference structure in codecs, we design layer-aware weights to adjust the majorities of frame complexity in different layers. Additionally, we expand the scope of temporal features by considering the frames that are referred to, rather than relying solely on the previous frame. Experimental results show a significant improvement in complexity estimation accuracy achieved by IVCA, with a minimal increase in time complexity.

[CV-175] Generative Iris Prior Embedded Transformer for Iris Restoration

Link: https://arxiv.org/abs/2407.00261
Authors: Yubo Huang, Jia Wang, Peipei Li, Liuyu Xiang, Peigang Li, Zhaofeng He
Keywords: complexly degraded iris, aiming to improve, challenging problem, Iris, degraded iris images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Our code is available at this https URL

Abstract:Iris restoration from complexly degraded iris images, aiming to improve iris recognition performance, is a challenging problem. Due to the complex degradation, directly training a convolutional neural network (CNN) without a prior cannot yield satisfactory results. In this work, we propose a generative iris prior embedded Transformer model (Gformer), in which we build a hierarchical encoder-decoder network employing Transformer blocks and a generative iris prior. First, we tame Transformer blocks to model long-range dependencies in target images. Second, we pretrain an iris generative adversarial network (GAN) to obtain the rich iris prior, and incorporate it into the iris restoration process with our iris feature modulator. Our experiments demonstrate that the proposed Gformer outperforms state-of-the-art methods. In addition, iris recognition performance is significantly improved after applying Gformer.

[CV-176] DCSM 2.0: Deep Conditional Shape Models for Data Efficient Segmentation

Link: https://arxiv.org/abs/2407.00186
Authors: Athira J Jacob,Puneet Sharma,Daniel Rueckert
Keywords: image analyses workflows, medical image analyses, Conditional Shape Models, Deep Conditional Shape, analyses workflows
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Best oral paper award at ISBI 2024

Click to view abstract

Abstract:Segmentation is often the first step in many medical image analyses workflows. Deep learning approaches, while giving state-of-the-art accuracies, are data intensive and do not scale well to low data regimes. We introduce Deep Conditional Shape Models 2.0, which uses an edge detector, along with an implicit shape function conditioned on edge maps, to leverage cross-modality shape information. The shape function is trained exclusively on a source domain (contrasted CT) and applied to the target domain of interest (3D echocardiography). We demonstrate data efficiency in the target domain by varying the amounts of training data used in the edge detection stage. We observe that DCSM 2.0 outperforms the baseline at all data levels in terms of Hausdorff distances, and while using 50% or less of the training data in terms of average mesh distance, and at 10% or less of the data with the dice coefficient. The method scales well to low data regimes, with gains of up to 5% in dice coefficient, 2.58 mm in average surface distance and 21.02 mm in Hausdorff distance when using just 2% (22 volumes) of the training data.

[CV-177] Scalable Trustworthy Generative Model for Virtual Multi-Staining from HE Whole Slide Images

Link: https://arxiv.org/abs/2407.00098
Authors: Mehdi Ounissi,Ilias Sarbout,Jean-Pierre Hugot,Christine Martinez-Vinson,Dominique Berrebi,Daniel Racoceanu
Keywords: require extensive time, raise environmental concerns, expensive chemicals, Chemical staining methods, extensive time
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Chemical staining methods are dependable but require extensive time, expensive chemicals, and raise environmental concerns. These challenges highlight the need for alternative solutions like virtual staining, which accelerates the diagnostic process and enhances stain application flexibility. Generative AI technologies are pivotal in addressing these issues. However, the high-stakes nature of healthcare decisions, especially in computational pathology, complicates the adoption of these tools due to their opaque processes. Our work introduces the use of generative AI for virtual staining, aiming to enhance performance, trustworthiness, scalability, and adaptability in computational pathology. The methodology centers on a singular HE encoder supporting multiple stain decoders. This design focuses on critical regions in the latent space of HE, enabling precise synthetic stain generation. Our method, tested to generate 8 different stains from a single HE slide, offers scalability by loading only necessary model components during production. We integrate label-free knowledge in training, using loss functions and regularization to minimize artifacts, thus improving paired/unpaired virtual staining accuracy. To build trust, we use real-time self-inspection with discriminators for each stain type, providing pathologists with confidence heat-maps. Automatic quality checks on new HE slides ensure conformity to the trained distribution, ensuring accurate synthetic stains. Recognizing pathologists’ challenges with new technologies, we have developed an open-source, cloud-based system, that allows easy virtual staining of HE slides through a browser, addressing hardware/software issues and facilitating real-time user feedback. We also curated a novel dataset of 8 paired HE/stains related to pediatric Crohn’s disease, comprising 480 WSIs to further stimulate computational pathology research.

[CV-178] Comparing fine-grained and coarse-grained object detection for ecology

Link: https://arxiv.org/abs/2407.00018
Authors: Jess Tam,Justin Kay
Keywords:
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)
Comments: 6 pages, 4 figures, accepted to be presented as a poster presentation at a conference workshop (11th Fine-Grained Visual Categorisation 2024)

Click to view abstract

[CV-179] Odd-One-Out: Anomaly Detection by Comparing with Neighbors

Link: https://arxiv.org/abs/2406.20099
Authors: Ankan Bhunia,Changjian Li,Hakan Bilen
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Codes Dataset at this https URL

Click to view abstract

Machine Learning

[LG-0] Sparse Diffusion Policy: A Sparse Reusable and Flexible Policy for Robot Learning

Link: https://arxiv.org/abs/2407.01531
Authors: Yixiao Wang,Yifei Zhang,Mingxiao Huo,Ran Tian,Xiang Zhang,Yichen Xie,Chenfeng Xu,Pengliang Ji,Wei Zhan,Mingyu Ding,Masayoshi Tomizuka
Keywords: demands efficient strategies, increasing complexity, Sparse Diffusion Policy, Diffusion Policy, tasks
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulations and real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and codes can be found in this https URL.

[LG-1] On the Abuse and Detection of Polyglot Files

Link: https://arxiv.org/abs/2407.01529
Authors: Luke Koch,Sean Oesch,Amul Chaulagain,Jared Dixon,Matthew Dixon,Mike Huettal,Amir Sadovnik,Cory Watson,Brian Weber,Jacob Hartman,Richard Patulski
Keywords: Polyglot files, polyglot, files, detection, wild
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 18 pages, 11 figures

Click to view abstract

Abstract:A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild – the first of its kind – we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20 % for polyglot detection and 99.47 % for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100 % of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
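To make the definition in the abstract above concrete, here is a minimal sketch of one well-known polyglot shape: a file that starts like a JPEG and also carries a ZIP archive at its end. This is a naive signature check written for illustration only. It is not PolyConv or ImSan, the function and constant names are hypothetical, and the toy sample uses a stub that only mimics JPEG start and end markers. The abstract itself reports that existing signature-style tools fail to reliably detect polyglots found in the wild, so treat this as a picture of the problem rather than a defense.

```python
import io
import zipfile

JPEG_MAGIC = b"\xff\xd8\xff"   # JPEG files begin with the SOI marker
ZIP_EOCD = b"PK\x05\x06"       # ZIP "End of Central Directory" signature

# Stub bytes that only mimic a JPEG's first and last markers (not a real image).
JPEG_STUB = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9"


def looks_like_jpeg_zip_polyglot(data: bytes) -> bool:
    """Naive heuristic: JPEG signature at the start plus a ZIP trailer anywhere."""
    starts_as_jpeg = data.startswith(JPEG_MAGIC)
    # ZIP readers locate the archive from the end of the file, so a ZIP
    # appended after image data can still be opened as an archive.
    has_zip_trailer = data.rfind(ZIP_EOCD) != -1
    return starts_as_jpeg and has_zip_trailer


def build_sample_polyglot() -> bytes:
    """Build a toy sample: the JPEG stub followed by a small in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("payload.txt", "hello")
    return JPEG_STUB + buffer.getvalue()
```

A router that picks a detector from the leading bytes alone would treat `build_sample_polyglot()` as an image and never inspect the archive, which is the routing problem the abstract describes.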

[LG-2] Scalable Nested Optimization for Deep Learning

Link: https://arxiv.org/abs/2407.01526
Authors: Jonathan Lorraine
Keywords: updating a single, single loss, Gradient-based optimization, single set, minimize a single
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: View more research details at this https URL

Click to view abstract

Abstract:Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this, where we have a bilevel or nested optimization of which subsets of parameters update on different objectives nested inside each other. We focus on motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when we look at solving these nested problems on a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.

[LG-3] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

Link: https://arxiv.org/abs/2407.01521
Authors: Bingliang Zhang,Wenda Chu,Julius Berner,Chenlin Meng,Anima Anandkumar,Yang Song
Keywords: solving Bayesian inverse, learned data priors, Bayesian inverse problems, solving Bayesian, recently achieved success
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems. For example, we achieve a PSNR of 30.72dB on the FFHQ 256 dataset for phase retrieval, which is an improvement of 9.12dB compared to existing methods.

[LG-4] Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Link: https://arxiv.org/abs/2407.01518
Authors: Hao Dong,Eleni Chatzi,Olga Fink
Keywords: Multimodal Open-Set Domain, open-set domain generalization, open-set domain, involves recognizing, Multimodal Jigsaw Puzzles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ECCV 2024, code: this https URL

Click to view abstract

Abstract:The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging with multiple modalities as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of multimodal representative features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to tackle also the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, especially in scenarios where unlabeled data from the target domain is available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at this https URL.

[LG-5] Open-TeleVision: Teleoperation with Immersive Active Visual Feedback

Link: https://arxiv.org/abs/2407.01512
Authors: Xuxin Cheng,Jialong Li,Shiqi Yang,Ge Yang,Xiaolong Wang
Keywords: on-robot data essential, collecting on-robot data, powerful method, Teleoperation serves, teleoperation system
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Website: this https URL

Click to view abstract

Abstract:Teleoperation serves as a powerful method for collecting on-robot data essential for robot learning from demonstrations. The intuitiveness and ease of use of the teleoperation system are crucial for ensuring high-quality, diverse, and scalable data. To achieve this, we propose an immersive teleoperation system Open-TeleVision that allows operators to actively perceive the robot’s surroundings in a stereoscopic manner. Additionally, the system mirrors the operator’s arm and hand movements on the robot, creating an immersive experience as if the operator’s mind is transmitted to a robot embodiment. We validate the effectiveness of our system by collecting data and training imitation learning policies on four long-horizon, precise tasks (Can Sorting, Can Insertion, Folding, and Unloading) for 2 different humanoid robots and deploy them in the real world. The system is open-sourced at: this https URL

[LG-6] AI Agents That Matter

Link: https://arxiv.org/abs/2407.01502
Authors: Sayash Kapoor,Benedikt Stroebl,Zachary S. Siegel,Nitya Nadgir,Arvind Narayanan
Keywords: research direction, exciting new research, accuracy, agents, agent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, and sometimes none at all. This has led to agents that are fragile because they take shortcuts and overfit to the benchmark in various ways. We prescribe a principled framework for avoiding overfitting. Finally, there is a lack of standardization in evaluation practices, leading to a pervasive lack of reproducibility. We hope that the steps we introduce for addressing these shortcomings will spur the development of agents that are useful in the real world and not just accurate on benchmarks.

[LG-7] Online Learning of Temporal Dependencies for Sustainable Foraging Problem

Link: https://arxiv.org/abs/2407.01501
Authors: John Payne,Aishwaryaprajna,Peter R. Lewis
Keywords: dynamic environment testbed, Long Short-Term Memory, dynamic environment, environment testbed, testbed for exploring
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 6 pages, 13 figures, Submitted to the Second International Workshop on Sustainability and Scalability of Self-Organisation (SaSSO 2024) decision pending

Click to view abstract

Abstract:The sustainable foraging problem is a dynamic environment testbed for exploring the forms of agent cognition in dealing with social dilemmas in a multi-agent setting. The agents need to resist the temptation of individual rewards through foraging and choose the collective long-term goal of sustainability. We investigate methods of online learning in Neuro-Evolution and Deep Recurrent Q-Networks to enable agents to attempt the problem one-shot as is often required by wicked social problems. We further explore if learning temporal dependencies with Long Short-Term Memory may be able to aid the agents in developing sustainable foraging strategies in the long term. It was found that the integration of Long Short-Term Memory assisted agents in developing sustainable strategies for a single agent, however failed to assist agents in managing the social dilemma that arises in the multi-agent scenario.

[LG-8] Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting

Link: https://arxiv.org/abs/2407.01499
Authors: Scott H. Hawley
Keywords: balance output quality, featuring diverse architectures, witnessed significant progress, Hourglass Diffusion Transformer, Recent years
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 6 pages text + 2 pages references, 10 figures

Click to view abstract

Abstract:Recent years have witnessed significant progress in generative models for music, featuring diverse architectures that balance output quality, diversity, speed, and user control. This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images. To enhance note generation in specified areas, masked regions can be “repainted” with extra noise. The non-latent HDiT's linear scaling with pixel count allows efficient generation in pixel space, providing intuitive and interpretable controls such as masking throughout the network and removing the need to operate in compressed latent spaces such as those provided by pretrained autoencoders. We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help increase note density yielding musical structures closely matching user specifications such as rising, falling, or diverging melody and/or accompaniment, even when these lie outside the typical training data distribution. We achieve performance on par with prior results while operating at longer context windows, with no autoencoder, and can enable complex geometries for inpainting masks, increasing the options for machine-assisted composers to control the generated music.

[LG-9] Fast Iterative Solver For Neural Network Method: II. 1D Diffusion-Reaction Problems And Data Fitting

Link: https://arxiv.org/abs/2407.01496
Authors: Zhiqiang Cai,Anastassia Doktorova,Robert D. Falgout,César Herrera
Keywords: data fitting problems, least-squares data fitting, mass matrix, method introduced recently, damped block Newton
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper expands the damped block Newton (dBN) method introduced recently in [4] for 1D diffusion-reaction equations and least-squares data fitting problems. To determine the linear parameters (the weights and bias of the output layer) of the neural network (NN), the dBN method requires solving systems of linear equations involving the mass matrix. While the mass matrix for local hat basis functions is tri-diagonal and well-conditioned, the mass matrix for NNs is dense and ill-conditioned. For example, the condition number of the NN mass matrix for quasi-uniform meshes is at least $\mathcal{O}(n^4)$. We present a factorization of the mass matrix that enables solving the systems of linear equations in $\mathcal{O}(n)$ operations. To determine the non-linear parameters (the weights and bias of the hidden layer), one step of a damped Newton method is employed at each iteration. A Gauss-Newton method is used in place of Newton for the instances in which the Hessian matrices are singular. This modified dBN is referred to as dBGN. For both methods, the computational cost per iteration is $\mathcal{O}(n)$. Numerical results demonstrate the ability of dBN and dBGN to efficiently achieve accurate results and outperform BFGS for select examples.

[LG-10] LLM See LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Link: https://arxiv.org/abs/2407.01490
Authors: Luísa Shimabucoro,Sebastian Ruder,Julia Kreutzer,Marzieh Fadaee,Sara Hooker
Keywords: synthetic data, synthetic data raises, data, synthetic, widespread adoption
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to-date of how the source of synthetic data shapes models’ internal biases, calibration and generations’ textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear “neutral”, which invites the question whether this sensitivity can be exploited for good. Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvement in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse range of instructions, means this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.

[LG-11] Agentless: Demystifying LLM-based Software Engineering Agents

Link: https://arxiv.org/abs/2407.01489
Authors: Chunqiu Steven Xia,Yinlin Deng,Soren Dunn,Lingming Zhang
Keywords: including code synthesis, large language models, software development tasks, Recent advancements, software development
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic two-phase process of localization followed by repair, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (27.33%) and lowest cost ($0.34) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

[LG-12] Tree Search for Language Model Agents

Link: https://arxiv.org/abs/2407.01476
Authors: Jing Yu Koh,Stephen McAleer,Daniel Fried,Ruslan Salakhutdinov
Keywords: Autonomous agents powered, Autonomous agents, perform decision-making tasks, demonstrated promise, search
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages. Models and code available at this https URL

Click to view abstract

Abstract:Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at this https URL.
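As a rough illustration of the best-first tree search named in the abstract above, the sketch below runs a generic best-first search over a toy integer state space using Python's `heapq`. It is not the authors' implementation: `expand`, `value`, and `is_goal` are hypothetical callbacks standing in for "take an action in the environment", "score a state", and "check task success", and the expansion budget is an assumed stand-in for a test-time compute limit.

```python
import heapq
from itertools import count


def best_first_search(start, expand, value, is_goal, budget=20):
    """Expand the highest-valued frontier state first, up to `budget` expansions.

    Returns the path to a goal state if one is reached, otherwise the path to
    the best-scoring state that was expanded.
    """
    tie = count()  # tie-breaker so the heap never compares states or paths
    # heapq is a min-heap, so scores are negated to pop the best state first.
    frontier = [(-value(start), next(tie), start, [start])]
    seen = {start}
    best_score, best_path = value(start), [start]

    for _step in range(budget):
        if not frontier:
            break
        neg_score, _order, state, path = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_score, best_path = -neg_score, path
        if is_goal(state):
            return path
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), next(tie), nxt, path + [nxt]))
    return best_path


# Toy problem (hypothetical): reach 10 from 1 using "+1" or "*2" actions.
TARGET = 10
path = best_first_search(
    start=1,
    expand=lambda n: [n + 1, n * 2],
    value=lambda n: -abs(TARGET - n),   # closer to the target scores higher
    is_goal=lambda n: n == TARGET,
)
```

Raising `budget` lets the search expand more states before it gives up, which loosely mirrors the abstract's observation that performance scales with test-time compute.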

[LG-13] Exploring FPGA designs for MX and beyond

Link: https://arxiv.org/abs/2407.01475
Authors: Ebby Samson,Naveen Mellempudi,Wayne Luk,George A. Constantinides
Keywords: Open Compute Project, Compute Project, companies recently worked, Open Compute, low-precision computation
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures

Click to view abstract

Abstract:A number of companies recently worked together to release the new Open Compute Project MX standard for low-precision computation, aimed at efficient neural network implementation. In this paper, we describe and evaluate the first open-source FPGA implementation of the arithmetic defined in the standard. Our designs fully support all the standard’s concrete formats for conversion into and out of MX formats and for the standard-defined arithmetic operations, as well as arbitrary fixed-point and floating-point formats. Certain elements of the standard are left as implementation-defined, and we present the first concrete FPGA-inspired choices for these elements, which we outline in the paper. Our library of optimized hardware components is available open source, and can be used to build larger systems. For this purpose, we also describe and release an open-source Pytorch library for quantization into the new standard, integrated with the Brevitas library so that the community can develop novel neural network designs quantized with MX formats in mind. We demonstrate the usability and efficacy of our libraries via the implementation of example neural networks such as ResNet-18 on the ImageNet ILSVRC12 dataset. Our testing shows that MX is very effective for formats such as INT5 or FP6 which are not natively supported on GPUs. This gives FPGAs an advantage as they have the flexibility to implement a custom datapath and take advantage of the smaller area footprints offered by these formats.

[LG-14] The Balanced-Pairwise-Affinities Feature Transform

Link: https://arxiv.org/abs/2407.01467
Authors: Daniel Shalam,Simon Korman
Keywords: facilitate downstream matching, grouping related tasks, designed to upgrade, items to facilitate, facilitate downstream
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2204.03065

Click to view abstract

Abstract:The Balanced-Pairwise-Affinities (BPA) feature transform is designed to upgrade the features of a set of input items to facilitate downstream matching or grouping related tasks. The transformed set encodes a rich representation of high order relations between the input features. A particular min-cost-max-flow fractional matching problem, whose entropy regularized version can be approximated by an optimal transport (OT) optimization, leads to a transform which is efficient, differentiable, equivariant, parameterless and probabilistically interpretable. While the Sinkhorn OT solver has been adapted extensively in many contexts, we use it differently by minimizing the cost between a set of features to itself and using the transport plan’s rows as the new representation. Empirically, the transform is highly effective and flexible in its use and consistently improves networks it is inserted into, in a variety of tasks and training schemes. We demonstrate state-of-the-art results in few-shot classification, unsupervised image clustering and person re-identification. Code is available at this http URL.
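The idea in the abstract above, running Sinkhorn on the cost between a feature set and itself and reading off the rows of the transport plan, can be sketched in a few lines of plain Python. This is an illustrative approximation, not the authors' code: the squared Euclidean cost, the regularization value `reg`, the iteration count, and the choice to forbid self-matches by zeroing the kernel diagonal are all assumptions made for this sketch.

```python
import math


def sinkhorn_self_transport(features, reg=0.5, n_iters=200):
    """Return an (approximately) doubly stochastic plan between a set and itself.

    Row i of the result can be read as a new representation of item i:
    how much of its unit mass is matched to every other item.
    """
    n = len(features)
    # Pairwise squared Euclidean cost between all items (assumed cost function).
    cost = [[sum((a - b) ** 2 for a, b in zip(fi, fj)) for fj in features]
            for fi in features]
    # Gibbs kernel of the entropy-regularized problem. The diagonal is zeroed
    # (assumption of this sketch) so an item cannot trivially match itself.
    kernel = [[0.0 if i == j else math.exp(-cost[i][j] / reg) for j in range(n)]
              for i in range(n)]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns towards unit marginals.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(n)]


# Two tight clusters of 2-D points (toy data).
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
plan = sinkhorn_self_transport(points)
```

On this toy input, each row puts almost all of its mass on the other member of the same cluster, so items that belong together end up with similar, easily separable rows.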

[LG-15] Graph Neural Network as Computationally Efficient Emulator of Ice-sheet and Sea-level System Model (ISSM)

Link: https://arxiv.org/abs/2407.01464
Authors: Younghyun Koo,Maryam Rahnemoonfar
Keywords: Sea-level System Model, Stokes equations relevant, System Model, Ice-sheet and Sea-level, Sea-level System
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 5 pages, 4 figures, submitted to the 2024 IEEE International Geoscience and Remote Sensing Symposium. arXiv admin note: text overlap with arXiv:2402.05291

Click to view abstract

Abstract:The Ice-sheet and Sea-level System Model (ISSM) provides solutions for Stokes equations relevant to ice sheet dynamics by employing finite element and fine mesh adaption. However, since its finite element method is compatible only with Central Processing Units (CPU), the ISSM has limits on further economizing computational time. Thus, by taking advantage of Graphics Processing Units (GPUs), we design a graph convolutional network (GCN) as a fast emulator for ISSM. The GCN is trained and tested using the 20-year transient ISSM simulations in the Pine Island Glacier (PIG). The GCN reproduces ice thickness and velocity with a correlation coefficient greater than 0.998, outperforming the traditional convolutional neural network (CNN). Additionally, GCN shows 34 times faster computational speed than the CPU-based ISSM modeling. The GPU-based GCN emulator allows us to predict how the PIG will change in the future under different melting rate scenarios with high fidelity and much faster computational time.

[LG-16] On Implications of Scaling Laws on Feature Superposition

Link: https://arxiv.org/abs/2407.01459
Authors: Pavan Katta
Keywords: theoretical note argues, scaling laws, simultaneously true, results from scaling, theoretical note
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 figure

Abstract:Using results from scaling laws, this theoretical note argues that the following two statements cannot be simultaneously true: 1. Superposition hypothesis where sparse features are linearly represented across a layer is a complete theory of feature representation. 2. Features are universal, meaning two models trained on the same data and achieving equal performance will learn identical features.

[LG-17] Contractual Reinforcement Learning: Pulling Arms with Invisible Hands

Link: https://arxiv.org/abs/2407.01458
Authors: Jibang Wu,Siyu Chen,Mengdi Wang,Huazheng Wang,Haifeng Xu
Keywords: enforce data collection, today large scale, large scale machine, direct content creation, machine learning tasks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
Comments:

Abstract:The agency problem emerges in today’s large scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning economic interests of different stakeholders in the online learning problems through contract design. The problem, termed contractual reinforcement learning, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent’s action policy for their common interests through a set of payment rules contingent on the realization of next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms to untangle the challenges from robust design of contracts to the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\tilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\tilde{O}(T^{2/3})$ for the general problem that improves the existing analysis in online contract design with mild technical assumptions.

[LG-18] Information-Theoretic Foundations for Neural Scaling Laws

Link: https://arxiv.org/abs/2407.01456
Authors: Hong Jun Jeon,Benjamin Van Roy
Keywords: Neural scaling laws, scaling laws, scaling laws aim, training dataset size, Neural scaling
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2212.01365

Abstract:Neural scaling laws aim to characterize how out-of-sample error behaves as a function of model and training dataset size. Such scaling laws guide allocation of computational resources between model and data processing to minimize error. However, existing theoretical support for neural scaling laws lacks rigor and clarity, entangling the roles of information and optimization. In this work, we develop rigorous information-theoretic foundations for neural scaling laws. This allows us to characterize scaling laws for data generated by a two-layer neural network of infinite width. We observe that the optimal relation between data and model size is linear, up to logarithmic factors, corroborating large-scale empirical investigations. Concise yet general results of the kind we establish may bring clarity to this topic and inform future investigations.

[LG-19] FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources

Link: https://arxiv.org/abs/2407.01445
Authors: Xiyuan Wei,Fanjiang Ye,Ori Yonay,Xingyu Chen,Baixi Sun,Dingwen Tao,Tianbao Yang
Keywords: Contrastive Language-Image Pretraining, large batch size, Existing studies, Language-Image Pretraining, data involve hundreds
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages

Abstract:Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds of or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been demonstrated effective for removing the requirement of large batch size, their performance on large-scale data remains underexplored and not optimized. To bridge the gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques while designed and optimized for the distributed setting. Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rules of the temperature parameter and the model parameters, respectively. Experiments on different strategies for each component shed light on how to conduct CLIP training more efficiently. Finally, we benchmark the performance of FastCLIP and the state-of-the-art training baseline (OpenCLIP) on different compute scales up to 32 GPUs on 8 nodes, and three data scales ranging from 2.7 million, 9.1 million to 315 million image-text pairs to demonstrate the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at this https URL .

[LG-20] GAT-Steiner: Rectilinear Steiner Minimal Tree Prediction Using GNNs

Link: https://arxiv.org/abs/2407.01440
Authors: Bugra Onal,Eren Dogan,Muhammad Hadir Khan,Matthew R. Guthaus
Keywords: Steiner Minimum Tree, Rectilinear Steiner Minimum, Minimum Tree, Rectilinear Steiner, Steiner Minimum
Subjects: Machine Learning (cs.LG)
Comments: Preprint for The 2024 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2024)

Abstract:The Rectilinear Steiner Minimum Tree (RSMT) problem is a fundamental problem in VLSI placement and routing and is known to be NP-hard. Traditional RSMT algorithms either spend a significant amount of time finding Steiner points to reduce the total wire length or use approximation heuristics that produce sub-optimal results. We show that Graph Neural Networks (GNNs) can be used to predict optimal Steiner points in RSMTs with high accuracy and can be parallelized on GPUs. In this paper, we propose GAT-Steiner, a graph attention network model that correctly predicts 99.846% of the nets in the ISPD19 benchmark with an average increase in wire length of only 0.480% on suboptimal wire length nets. On randomly generated benchmarks, GAT-Steiner correctly predicts 99.942% with an average increase in wire length of only 0.420% on suboptimal wire length nets.

[LG-21] Needle in the Haystack for Memory Based Large Language Models

Link: https://arxiv.org/abs/2407.01437
Authors: Subhajit Chaudhury,Soham Dan,Payel Das,Georgios Kollias,Elliot Nelson
Keywords: augmented Large Language, Large Language Model, Large Language, memory augmented Large, augmented Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages

Abstract:In this paper, we demonstrate the benefits of using a memory-augmented Large Language Model (LLM) architecture in improving the recall abilities of facts from a potentially long context. As a case study we test LARIMAR, a recently proposed LLM architecture which augments an LLM decoder with an external associative memory, on several long-context recall tasks, including passkey and needle-in-the-haystack tests. We demonstrate that the external memory can be adapted at test time to handle contexts much longer than those seen during training, while keeping readouts from the memory recognizable to the trained decoder and without increasing GPU memory footprint. Compared to alternative architectures for long-context recall tasks with models of a comparable parameter count, LARIMAR is able to maintain strong performance without any task-specific training.

[LG-22] POST: Email Archival Processing and Flagging Stack for Incident Responders

Link: https://arxiv.org/abs/2407.01433
Authors: Jeffrey Fairbanks
Keywords: points of compromise, main points, security and awareness, awareness being estimated, Natural Language Processing
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. For further information or questions please reach out to fairbanks6@llnl.gov

Abstract:Phishing is one of the main points of compromise, with email security and awareness being estimated at $50-100B in 2022. There is great need for email forensics capability to quickly search for malicious content. A novel solution POST is proposed. POST is an API driven serverless email archival, processing, and flagging workflow for both large and small organizations that collects and parses all email, flags emails using state of the art Natural Language Processing and Machine Learning, allows full email searching on every aspect of an email, and provides a cost savings of up to 68.6%.

[LG-23] FairLay-ML: Intuitive Debugging of Fairness in Data-Driven Social-Critical Software

Link: https://arxiv.org/abs/2407.01423
Authors: Normen Yu,Luciana Carreon,Gang Tan,Saeid Tizpaz-Niari
Keywords: Data-driven software solutions, Data-driven software, data-driven software developers, data-driven solutions, significant socio-economic
Subjects: Software Engineering (cs.SE); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Under Review in the ICSME 2024 Tool Demonstration Track

Abstract:Data-driven software solutions have significantly been used in critical domains with significant socio-economic, legal, and ethical implications. The rapid adoptions of data-driven solutions, however, pose major threats to the trustworthiness of automated decision-support software. A diminished understanding of the solution by the developer and historical/current biases in the data sets are primary challenges. To aid data-driven software developers and end-users, we present FairLay-ML, a debugging tool to test and explain the fairness implications of data-driven solutions. FairLay-ML visualizes the logic of datasets, trained models, and decisions for a given data point. In addition, it trains various models with varying fairness-accuracy trade-offs. Crucially, FairLay-ML incorporates counterfactual fairness testing that finds bugs beyond the development datasets. We conducted two studies through FairLay-ML that allowed us to measure false positives/negatives in prevalent counterfactual testing and understand the human perception of counterfactual test cases in a class survey. FairLay-ML and its benchmarks are publicly available at this https URL. The live version of the tool is available at https://fairlayml-v2.streamlit.app/. We provide a video demo of the tool at this https URL

[LG-24] RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing

Link: https://arxiv.org/abs/2407.01418
Authors: Bo Ai,Stephen Tian,Haochen Shi,Yixuan Wang,Cheston Tan,Yunzhu Li,Jiajun Wu
Keywords: feedback is critical, critical for understanding, rigid and deformable, tactile-informed dynamics model, Tactile feedback
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Robotics: Science and Systems (RSS), 2024. Project page: this https URL

Abstract:Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states, including particles and object-level latent physics information, from historical visuo-tactile observations and to perform future state predictions. Our tactile-informed dynamics model, learned from real-world data, can solve downstream robotics tasks with model-predictive control. We demonstrate our approach on a real robot equipped with a compliant Soft-Bubble tactile sensor on non-prehensile manipulation and dense packing tasks, where the robot must infer the physics properties of objects from direct and indirect interactions. Trained on only an average of 30 minutes of real-world interaction data per task, our model can perform online adaptation and make touch-informed predictions. Through extensive evaluations in both long-horizon dynamics prediction and real-world manipulation, our method demonstrates superior effectiveness compared to previous learning-based and physics-based simulation systems.

[LG-25] Semantic Compositions Enhance Vision-Language Contrastive Learning

Link: https://arxiv.org/abs/2407.01408
Authors: Maxwell Aladago,Lorenzo Torresani,Soroush Vosoughi
Keywords: vision-language contrastive learning, leverage within-batch non-matching, within-batch non-matching pairs, contrastive learning, matched image-caption pairs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
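
The composite-example step described above (fuse the captions, take 50% of each image) might look roughly like the following sketch. The abstract does not specify how the halves are chosen or how captions are joined, so the vertical split and the " and " joiner are assumptions, and `compose_pair` is a hypothetical name. Images are plain nested lists (rows of pixels) to keep the example dependency-free.

```python
def compose_pair(image_a, caption_a, image_b, caption_b):
    """Build one composite image-caption pair from two dataset instances."""
    width = len(image_a[0])
    half = width // 2
    # Assumption: left half comes from image A, right half from image B.
    image = [row_a[:half] + row_b[half:]
             for row_a, row_b in zip(image_a, image_b)]
    # Assumption: captions are fused with a simple conjunction.
    caption = f"{caption_a} and {caption_b}"
    return image, caption
```

Because the composite replaces or accompanies ordinary samples inside an existing batch, a step like this adds no model parameters, which matches the claim in the abstract.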

[LG-26] Optimization of Retrieval-Augmented Generation Context with Outlier Detection

Link: https://arxiv.org/abs/2407.01403
Authors: Vitaly Bulgakov
Keywords: prompt context required, Large Language Model, retrieved LLM responses, question-answering systems, reduce the size
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Attempts to increase the number of retrieved chunked documents and thereby enlarge the context related to the query can significantly complicate the processing and decrease the performance of a Large Language Model (LLM) when generating responses to queries. It is well known that a large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features that utilize the distances of embedding vectors, retrieved from the vector database, to both the centroid and the query vectors. The methods were evaluated by comparing the similarities of the retrieved LLM responses to ground-truth answers obtained using the OpenAI GPT-4o model. It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
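
As a rough illustration of the distance-based outlier filtering described above, the sketch below scores each retrieved embedding by its distance to the centroid of the retrieved set plus its distance to the query, then drops high-scoring documents. The additive score, the z-score cut-off, and the names `keep_relevant` and `z_cut` are assumptions made for illustration; the paper evaluates several methods whose details are not given in the abstract.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def keep_relevant(doc_vecs, query_vec, z_cut=1.0):
    """Return indices of retrieved documents that are not flagged as outliers."""
    dim = len(query_vec)
    centroid = [sum(v[d] for v in doc_vecs) / len(doc_vecs) for d in range(dim)]
    # Assumed feature: distance to centroid plus distance to query.
    scores = [euclidean(v, centroid) + euclidean(v, query_vec) for v in doc_vecs]
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All documents look alike, so nothing is discarded.
        return list(range(len(doc_vecs)))
    # Assumed rule: a document is an outlier if its score is far above average.
    return [i for i, s in enumerate(scores) if (s - mu) / sigma <= z_cut]
```

Only the surviving documents would then be placed in the prompt context, which shrinks it without another call to the embedding model.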

[LG-27] Superconstant Inapproximability of Decision Tree Learning

Link: https://arxiv.org/abs/2407.01402
Authors: Caleb Koch,Carmen Strassle,Li-Yang Tan
Keywords: properly PAC learning, PAC learning decision, properly PAC, PAC learning, decision trees
Subjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 29 pages, 5 figures, COLT 2024

Abstract:We consider the task of properly PAC learning decision trees with queries. Recent work of Koch, Strassle, and Tan showed that the strictest version of this task, where the hypothesis tree T is required to be optimally small, is NP-hard. Their work leaves open the question of whether the task remains intractable if T is only required to be close to optimal, say within a factor of 2, rather than exactly optimal. We answer this affirmatively and show that the task indeed remains NP-hard even if T is allowed to be within any constant factor of optimal. More generally, our result allows for a smooth tradeoff between the hardness assumption and the inapproximability factor. As Koch et al.'s techniques do not appear to be amenable to such a strengthening, we first recover their result with a new and simpler proof, which we couple with a new XOR lemma for decision trees. While there is a large body of work on XOR lemmas for decision trees, our setting necessitates parameters that are extremely sharp, and are not known to be attainable by existing XOR lemmas. Our work also carries new implications for the related problem of Decision Tree Minimization.

[LG-28] Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

Link: https://arxiv.org/abs/2407.01394
Authors: Pooya Fayyazsanavi,Antonios Anastasopoulos,Jana Košecká
Keywords: spoken text presents, text presents unique, presents unique challenges, unique challenges owing, expression nuances
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function that exploits gloss translation ambiguities, significantly improving the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.

[LG-29] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Link: https://arxiv.org/abs/2407.01392
Authors: Boyuan Chen,Diego Marti Monso,Yilun Du,Max Simchowitz,Russ Tedrake,Vincent Sitzmann
Keywords: per-token noise levels, presents Diffusion Forcing, independent per-token noise, paper presents Diffusion, Diffusion Forcing
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling-out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing’s variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution. Project website: https://boyuan.space/diffusion-forcing
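
To make "independent per-token noise levels" concrete, here is a small sketch of the noising step only; it does not include the causal denoising model or its training loop. It assumes scalar tokens, a uniform draw of the noise level for each token, and a linear signal schedule. The names `noise_tokens` and `num_levels` are hypothetical.

```python
import math
import random


def noise_tokens(tokens, num_levels=10, rng=None):
    """Corrupt each token with its own independently sampled noise level."""
    rng = rng or random.Random(0)
    noisy, levels = [], []
    for x in tokens:
        # Each token draws its level independently of its neighbours.
        k = rng.randint(0, num_levels)
        # Assumed schedule: fraction of signal kept shrinks linearly with k.
        keep = 1.0 - k / num_levels
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(keep) * x + math.sqrt(1.0 - keep) * eps)
        levels.append(k)
    return noisy, levels
```

In contrast, full-sequence diffusion would use a single level `k` for every token in the sequence, and next-token prediction corresponds to past tokens at level 0 with the next token fully noised.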

[LG-30] Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Link: https://arxiv.org/abs/2407.01378
Authors: Wenchen Han,Shay Vargaftik,Michael Mitzenmacher,Brad Karp,Ran Ben Basat
Keywords: today large-scale distributed, large-scale distributed machine, distributed machine learning, aggregation has long, long been identified
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 9 pages, 3 figures

Abstract:Gradient aggregation has long been identified as a major bottleneck in today’s large-scale distributed machine learning training systems. One promising solution to mitigate such bottlenecks is gradient compression, directly reducing communicated gradient data volume. However, in practice, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. In this work, we identify several common issues in previous gradient compression systems and evaluation methods. These issues include excessive computational overheads; incompatibility with all-reduce; and inappropriate evaluation metrics, such as not using an end-to-end metric or using a 32-bit baseline instead of a 16-bit baseline. We propose several general design and evaluation techniques to address these issues and provide guidelines for future work. Our preliminary evaluation shows that our techniques enhance the system’s performance and provide a clearer understanding of the end-to-end utility of gradient compression methods.

[LG-31] Badllama 3: removing safety finetuning from Llama 3 in minutes

Link: https://arxiv.org/abs/2407.01376
Authors: Dmitrii Volkov
Keywords: extensive LLM safety, LLM safety fine-tuning, extensive LLM, model weights, LLM safety
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

[LG-32] Binary Losses for Density Ratio Estimation

Link: https://arxiv.org/abs/2407.01371
Authors: Werner Zellinger
Keywords: learning and statistics, loss functions, central problem, problem in machine, machine learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Estimating the ratio of two probability densities from finitely many observations of the densities is a central problem in machine learning and statistics. A large class of methods constructs estimators from binary classifiers which distinguish observations from the two densities. However, the error of these constructions depends on the choice of the binary loss function, raising the question of which loss function to choose based on desired error properties. In this work, we start from prescribed error measures in a class of Bregman divergences and characterize all loss functions that lead to density ratio estimators with a small error. Our characterization provides a simple recipe for constructing loss functions with certain properties, such as loss functions that prioritize an accurate estimation of large values. This contrasts with classical loss functions, such as the logistic loss or boosting loss, which prioritize accurate estimation of small values. We provide numerical illustrations with kernel methods and test their performance in applications of parameter selection for deep domain adaptation.
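
For readers unfamiliar with the classifier-based construction, the standard identity behind it can be written out as follows. This is textbook background rather than the paper's contribution. It assumes equal numbers of samples from the two densities $p$ (label $y = 1$) and $q$ (label $y = 0$), and writes $\eta$ for the Bayes-optimal class probability and $f$ for its logit, as recovered by the logistic loss; these symbols are chosen here for illustration.

```latex
\eta(x) = \Pr(y = 1 \mid x) = \frac{p(x)}{p(x) + q(x)}
\quad\Longrightarrow\quad
r(x) = \frac{p(x)}{q(x)} = \frac{\eta(x)}{1 - \eta(x)} = e^{f(x)},
\qquad f(x) = \log\frac{\eta(x)}{1 - \eta(x)}.
```

A classifier that only approximates $\eta$ yields an approximate ratio, and the way the approximation error is distributed over small and large values of $r$ depends on the loss used to train it, which is the question the abstract addresses.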

[LG-33] PARAFAC2: Tracking evolving patterns in (incomplete) temporal data

Link: https://arxiv.org/abs/2407.01356
Authors: Christos Chatzis,Carla Schenker,Max Pfeffer,Evrim Acar
Keywords: Tensor factorizations, task of uncovering, temporal smoothness regularization, Alternating Direction Method, uncovering patterns
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 11 figures

Abstract:Tensor factorizations have been widely used for the task of uncovering patterns in various domains. Often, the input is time-evolving, shifting the goal to tracking the evolution of underlying patterns instead. To adapt to this more complex setting, existing methods incorporate temporal regularization but they either have overly constrained structural requirements or lack uniqueness which is crucial for interpretation. In this paper, in order to capture the underlying evolving patterns, we introduce t(emporal)PARAFAC2 which utilizes temporal smoothness regularization on the evolving factors. We propose an algorithmic framework that employs Alternating Optimization (AO) and the Alternating Direction Method of Multipliers (ADMM) to fit the model. Furthermore, we extend the algorithmic framework to the case of partially observed data. Our numerical experiments on both simulated and real datasets demonstrate the effectiveness of the temporal smoothness regularization, in particular, in the case of data with missing entries. We also provide an extensive comparison of different approaches for handling missing data within the proposed framework.

[LG-34] Coordination Failure in Cooperative Offline MARL

Link: https://arxiv.org/abs/2407.01343
Authors: Callum Rhys Tilbury,Claude Formanek,Louise Beyers,Jonathan P. Shock,Arnu Pretorius
Keywords: optimal multi-agent control, learn optimal multi-agent, leverages static datasets, multi-agent reinforcement learning, experience to learn
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at the Workshop on Aligning Reinforcement Learning Experimentalists and Theorists (ARLET) at the International Conference on Machine Learning, 2024

Abstract:Offline multi-agent reinforcement learning (MARL) leverages static datasets of experience to learn optimal multi-agent control. However, learning from static data presents several unique challenges to overcome. In this paper, we focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data, focusing on a common setting we refer to as the ‘Best Response Under Data’ (BRUD) approach. By using two-player polynomial games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms, which can lead to catastrophic coordination failure in the offline setting. Building on these insights, we propose an approach to mitigate such failure, by prioritising samples from the dataset based on joint-action similarity during policy learning and demonstrate its effectiveness in detailed experiments. More generally, however, we argue that prioritised dataset sampling is a promising area for innovation in offline MARL that can be combined with other effective approaches such as critic and policy regularisation. Importantly, our work shows how insights drawn from simplified, tractable games can lead to useful, theoretically grounded insights that transfer to more complex contexts. A core dimension of our offering is an interactive notebook, from which almost all of our results can be reproduced, in a browser.
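
One way to picture "prioritising samples based on joint-action similarity" is the sketch below, which turns the distance between each stored joint action and the current policy's joint action into a sampling probability. The exponential weighting, the Euclidean distance, and the names `sampling_weights` and `temperature` are assumptions for illustration only; the abstract does not state the paper's actual priority function.

```python
import math


def sampling_weights(dataset_joint_actions, policy_joint_action, temperature=1.0):
    """Assign higher sampling probability to joint actions near the current policy's."""
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(joint_action, policy_joint_action)))
        for joint_action in dataset_joint_actions
    ]
    # Assumed priority: exponentially decaying in distance.
    raw = [math.exp(-d / temperature) for d in dists]
    total = sum(raw)
    # Normalise so the weights form a probability distribution over samples.
    return [r / total for r in raw]
```

The resulting weights could be passed to a sampler such as `random.choices(dataset, weights=...)` when drawing minibatches for the policy update.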

[LG-35] Deep Reinforcement Learning for Adverse Garage Scenario Generation

Link: https://arxiv.org/abs/2407.01333
Authors: Kai Li
Keywords: billion miles, ensure their safety, miles to ensure, simulation testing, Autonomous vehicles
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages, 17 figures

Abstract:Autonomous vehicles need to travel over 11 billion miles to ensure their safety. Therefore, the importance of simulation testing before real-world testing is self-evident. In recent years, the release of 3D simulators for autonomous driving, represented by Carla and CarSim, marks the transition of autonomous driving simulation testing environments from simple 2D overhead views to complex 3D models. During simulation testing, experimenters need to build static scenes and dynamic traffic flows, pedestrian flows, and other experimental elements to construct experimental scenarios. When building static scenes in 3D simulators, experimenters often need to manually construct 3D models, set parameters and attributes, which is time-consuming and labor-intensive. This thesis proposes an automated program generation framework. Based on deep reinforcement learning, this framework can generate different 2D ground script codes, on which 3D model files and map model files are built. The generated 3D ground scenes are displayed in the Carla simulator, where experimenters can use this scene for navigation algorithm simulation testing.

[LG-36] Restyling Unsupervised Concept Based Interpretable Networks with Generative Models

Link: https://arxiv.org/abs/2407.01331
Authors: Jayneel Parekh,Quentin Bouniot,Pavlo Mozharovskyi,Alasdair Newson,Florence d’Alché-Buc
Keywords: Developing inherently interpretable, Developing inherently, inherently interpretable models, recent years, gained prominence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page available at this https URL

Abstract:Developing inherently interpretable models for prediction has gained prominence in recent years. A subclass of these models, wherein the interpretable network relies on learning high-level concepts, is valued because of the closeness of concept representations to human communication. However, the visualization and understanding of the learnt unsupervised dictionary of concepts encounters major limitations, especially for large-scale images. We propose here a novel method that relies on mapping the concept features to the latent space of a pretrained generative model. The use of a generative model enables high quality visualization, and naturally lays out an intuitive and interactive procedure for better interpretation of the learnt concepts. Furthermore, leveraging pretrained generative models has the additional advantage of making the training of the system more efficient. We quantitatively ascertain the efficacy of our method in terms of accuracy of the interpretable prediction network, fidelity of reconstruction, as well as faithfulness and consistency of learnt concepts. The experiments are conducted on multiple image recognition benchmarks for large-scale images. Project page available at this https URL

[LG-37] Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks

Link: https://arxiv.org/abs/2407.01327
Authors: Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescós
Keywords: unsupervised domain adaptation, significant class imbalance, class imbalance remains, addressing the challenge, open issue
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: In unsupervised domain adaptation (UDA), where models are trained on source data (e.g., synthetic) and adapted to target data (e.g., real-world) without target annotations, addressing the challenge of significant class imbalance remains an open issue. Despite considerable progress in bridging the domain gap, existing methods often experience performance degradation when confronted with highly imbalanced dense prediction visual tasks like semantic and panoptic segmentation. This discrepancy becomes especially pronounced due to the lack of equivalent priors between the source and target domains, rendering class-imbalance techniques used in other areas (e.g., image classification) ineffective in UDA scenarios. This paper proposes a class-imbalance mitigation strategy that incorporates class weights into the UDA learning losses, but with the novelty of estimating these weights dynamically through the loss gradient, defining Gradient-based class weighting (GBW) learning. GBW naturally increases the contribution of classes whose learning is hindered by highly represented classes, and has the advantage of being able to automatically and quickly adapt to the training outcomes of each iteration, avoiding the explicitly curricular learning patterns common in loss-weighting strategies. Extensive experimentation validates the effectiveness of GBW across architectures (convolutional and transformer), UDA strategies (adversarial, self-training and entropy minimization), tasks (semantic and panoptic segmentation), and datasets (GTA and Synthia). Analysing the source of advantage, GBW consistently increases the recall of under-represented classes.

[LG-38] Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning

Link: https://arxiv.org/abs/2407.01320
Authors: Haobo Song, Hao Zhao, Soumajit Majumder, Tao Lin
Keywords: large pre-trained foundation, Fine-tuning large pre-trained, pre-trained foundation models, large pre-trained, pre-trained foundation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2024. Code at this https URL

Abstract: Fine-tuning large pre-trained foundation models, such as the 175B GPT-3, has attracted more attention for downstream tasks recently. While parameter-efficient fine-tuning methods have been proposed and proven effective without retraining all model parameters, their performance is limited by the capacity of incremental modules, especially under constrained parameter budgets. To overcome this challenge, we propose CapaBoost, a simple yet effective strategy that enhances model capacity by leveraging low-rank updates through parallel weight modules in target layers. By applying static random masks to the shared weight matrix, CapaBoost constructs a diverse set of weight matrices, effectively increasing the rank of incremental weights without adding parameters. Notably, our approach can be seamlessly integrated into various existing parameter-efficient fine-tuning methods. We extensively validate the efficacy of CapaBoost through experiments on diverse downstream tasks, including natural language understanding, question answering, and image classification. Our results demonstrate significant improvements over baselines, without incurring additional computation or storage costs. Our code is available at this https URL.
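The rank argument in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a LoRA-style update `B @ A`, and the names `boosted_update`, `masked` and `random_mask` are hypothetical. The same shared factors are masked several times and the masked products are summed, so the summed update can reach a higher rank than a single product while storing no extra trainable values. Exact arithmetic with `Fraction` keeps the rank computation free of floating-point tolerance issues.

```python
from fractions import Fraction
import random


def matmul(X, Y):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Exact rank via Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    rows, cols, r = len(M), len(M[0]), 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            if M[i][c] != 0:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r


def masked(M, mask):
    """Element-wise product of a matrix with a 0/1 mask of the same shape."""
    return [[v * m for v, m in zip(rv, rm)] for rv, rm in zip(M, mask)]


def random_mask(n_rows, n_cols, rng, keep=0.5):
    """A static 0/1 mask, drawn once and then frozen (assumed Bernoulli)."""
    return [[1 if rng.random() < keep else 0 for _ in range(n_cols)]
            for _ in range(n_rows)]


def boosted_update(B, A, masks):
    """Sum of masked low-rank products that all share the same B and A."""
    total = [[0] * len(A[0]) for _ in range(len(B))]
    for mask_b, mask_a in masks:
        prod = matmul(masked(B, mask_b), masked(A, mask_a))
        total = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(total, prod)]
    return total


if __name__ == "__main__":
    # Hand-picked masks for reproducibility; the abstract describes random ones.
    B = [[1], [1], [1], [1]]          # 4 x 1 factor
    A = [[1, 1, 1, 1]]                # 1 x 4 factor
    masks = [([[1], [1], [0], [0]], [[1, 1, 0, 0]]),
             ([[0], [0], [1], [1]], [[0, 0, 1, 1]])]
    print(rank(matmul(B, A)), rank(boosted_update(B, A, masks)))
```

With the hand-picked masks the plain product has rank 1 and the summed masked update has rank 2. In general the rank of the sum is bounded by the number of masks times the inner dimension.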

[LG-39] Evaluating Model Performance Under Worst-case Subpopulations

Link: https://arxiv.org/abs/2407.01316
Authors: Mike Li, Hongseok Namkoong, Shangzhou Xia
Keywords: training population, models degrades, performance, robustness, assessing distributional robustness
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments: Earlier version appeared in the proceedings of Advances in Neural Information Processing Systems 34 (NeurIPS 2021): this https URL

Abstract:The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
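The quantity studied here, the worst average loss over any subpopulation of at least a given proportion, can be sketched on a finite sample. The code assumes the first stage has already produced an estimate of the loss conditional on Z for each example. The function name `worst_case_subpopulation_loss` is hypothetical, and the paper's actual two-stage estimator and its guarantees are not reproduced. On a finite sample, the worst subpopulation of proportion `alpha` is simply the `ceil(alpha * n)` examples with the largest conditional loss.

```python
import math


def worst_case_subpopulation_loss(conditional_losses, alpha):
    """Mean of the worst ceil(alpha * n) conditional losses.

    conditional_losses: estimated loss given the core attributes Z, one per example
                        (assumed to come from a separately fitted model).
    alpha: minimum subpopulation proportion, in (0, 1].
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    if not conditional_losses:
        raise ValueError("need at least one example")
    k = math.ceil(alpha * len(conditional_losses))
    worst = sorted(conditional_losses, reverse=True)[:k]
    return sum(worst) / k


if __name__ == "__main__":
    losses = [0.1, 0.2, 0.9, 1.0]
    # Average loss over everyone versus over the worst half.
    print(worst_case_subpopulation_loss(losses, 1.0))
    print(worst_case_subpopulation_loss(losses, 0.5))
```

A model whose average loss looks acceptable can still show a much larger value for small `alpha`, which is the kind of gap the abstract's robustness notion is meant to expose.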

[LG-40] Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces

Link: https://arxiv.org/abs/2407.01310
Authors: Perusha Moodley, Pramod Kaushik, Dhillu Thambi, Mark Trovinger, Praveen Paruchuri, Xia Hong, Benjamin Rosman
Keywords: Decision Transformer architectures, Decision Transformer, enhanced Decision Transformer, existing Decision Transformer, multi-discrete action spaces
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Decision Transformers, in their vanilla form, struggle to perform on image-based environments with multi-discrete action spaces. Although enhanced Decision Transformer architectures have been developed to improve performance, these methods have not specifically addressed this problem of multi-discrete action spaces which hampers existing Decision Transformer architectures from learning good representations. To mitigate this, we propose Multi-State Action Tokenisation (M-SAT), an approach for tokenising actions in multi-discrete action spaces that enhances the model’s performance in such environments. Our approach involves two key changes: disentangling actions to the individual action level and tokenising the actions with auxiliary state information. These two key changes also improve individual action level interpretability and visibility within the attention layers. We demonstrate the performance gains of M-SAT on challenging ViZDoom environments with multi-discrete action spaces and image-based state spaces, including the Deadly Corridor and My Way Home scenarios, where M-SAT outperforms the baseline Decision Transformer without any additional data or heavy computational overheads. Additionally, we find that removing positional encoding does not adversely affect M-SAT’s performance and, in some cases, even improves it.

[LG-41] Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability

Link: https://arxiv.org/abs/2407.01306
Authors: Chenxi Li, Abhinav Kumar, Zhen Guo, Jie Hou, Reza Tourani
Keywords: deep learning applications, address privacy vulnerabilities, personalized data underscore, Membership Inference Attacks, privacy vulnerabilities
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 20 pages, 10 figures, 4 tables

Abstract:The increasing prominence of deep learning applications and reliance on personalized data underscore the urgent need to address privacy vulnerabilities, particularly Membership Inference Attacks (MIAs). Despite numerous MIA studies, significant knowledge gaps persist, particularly regarding the impact of hidden features (in isolation) on attack efficacy and insufficient justification for the root causes of attacks based on raw data features. In this paper, we aim to address these knowledge gaps by first exploring statistical approaches to identify the most informative neurons and quantifying the significance of the hidden activations from the selected neurons on attack accuracy, in isolation and combination. Additionally, we propose an attack-driven explainable framework by integrating the target and attack models to identify the most influential features of raw data that lead to successful membership inference attacks. Our proposed MIA shows an improvement of up to 26% on state-of-the-art MIA.

[LG-42] Collaborative Performance Prediction for Large Language Models

Link: https://arxiv.org/abs/2407.01300
Authors: Qiyuan Zhang, Fuyuan Lyu, Xue Liu, Chen Ma
Keywords: NLP research, Comprehensively understanding, challenge in NLP, large language models, diverse downstream tasks
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. The pioneering scaling law on downstream works demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, they tend to overlook the similarities between model families and only consider design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect a collaborative data sourced from online platforms containing both historical performance and additional design factors. With the support of the collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.

[LG-43] A Collaborative Human-Centred Taxonomy of AI Algorithmic and Automation Harms

Link: https://arxiv.org/abs/2407.01294
Authors: Gavin Abercrombie, Djalel Benbouzid, Paolo Giudici, Delaram Golpayegani, Julio Hernandez, Pierre Noro, Harshvardhan Pandit, Eva Paraschou, Charlie Pownall, Jyoti Prajapati, Mark A. Sayre, Ushnish Sengupta, Arthit Suriyawongful, Ruby Thelot, Sofia Vei, Laura Waltersdorfer
Keywords: introduces a collaborative, algorithmic and automation, paper introduces, human-centered taxonomy, existing taxonomies
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:This paper introduces a collaborative, human-centered taxonomy of AI, algorithmic and automation harms. We argue that existing taxonomies, while valuable, can be narrow, unclear, typically cater to practitioners and government, and often overlook the needs of the wider public. Drawing on existing taxonomies and a large repository of documented incidents, we propose a taxonomy that is clear and understandable to a broad set of audiences, as well as being flexible, extensible, and interoperable. Through iterative refinement with topic experts and crowdsourced annotation testing, we propose a taxonomy that can serve as a powerful tool for civil society organisations, educators, policymakers, product teams and the general public. By fostering a greater understanding of the real-world harms of AI and related technologies, we aim to increase understanding, empower NGOs and individuals to identify and report violations, inform policy discussions, and encourage responsible technology development and deployment.

[LG-44] Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Link: https://arxiv.org/abs/2407.01291
Authors: Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima
Keywords: demonstrated high fidelity, based on large-scale, demonstrated high, high fidelity, fidelity in reproducing
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 3 figures, accepted to INTERSPEECH 2024

Abstract:The advancements in zero-shot text-to-speech (TTS) methods, based on large-scale models, have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of parameters at 1.9 times faster inference speed. Audio samples are available on our demo page (this https URL).

[LG-45] Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space

Link: https://arxiv.org/abs/2407.01290
Authors: Menglin Yang, Harshit Verma, Delvin Ce Zhang, Jiahong Liu, Irwin King, Rex Ying
Keywords: modeling complex structured, hyperbolic Transformer, shown significant potential, Hyperbolic, complex structured data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: KDD 2024

Abstract: Hyperbolic geometry has shown significant potential in modeling complex structured data, particularly those with underlying tree-like and hierarchical structures. Despite the impressive performance of various hyperbolic neural networks across numerous domains, research on adapting the Transformer to hyperbolic space remains limited. Previous attempts have mainly focused on modifying self-attention modules in the Transformer. However, these efforts have fallen short of developing a complete hyperbolic Transformer. This stems primarily from: (i) the absence of well-defined modules in hyperbolic space, including linear transformation layers, LayerNorm layers, activation functions, dropout operations, etc.; (ii) the quadratic time complexity of the existing hyperbolic self-attention module w.r.t. the number of input tokens, which hinders its scalability. To address these challenges, we propose Hypformer, a novel hyperbolic Transformer based on the Lorentz model of hyperbolic geometry. In Hypformer, we introduce two foundational blocks that define the essential modules of the Transformer in hyperbolic space. Furthermore, we develop a linear self-attention mechanism in hyperbolic space, enabling the hyperbolic Transformer to process billion-scale graph data and long-sequence inputs for the first time. Our experimental results confirm the effectiveness and efficiency of Hypformer across various datasets, demonstrating its potential as an effective and scalable solution for large-scale data representation and large models.

[LG-46] We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Link: https://arxiv.org/abs/2407.01284
Authors: Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
Keywords: Large Multimodal Models, Multimodal Models, Large Multimodal, received widespread attention, Visual mathematical reasoning
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: Work in progress

Abstract:Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at this https URL.

[LG-47] Energy-Aware Decentralized Learning with Intermittent Model Training

Link: https://arxiv.org/abs/2407.01283
Authors: Akash Dhasade, Paolo Dini, Elia Guerra, Anne-Marie Kermarrec, Marco Miozzo, Rafael Pires, Rishi Sharma, Martijn de Vos
Keywords: offers a powerful, central server, powerful framework, sharing raw data, Decentralized learning
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Decentralized learning (DL) offers a powerful framework where nodes collaboratively train models without sharing raw data and without the coordination of a central server. In the iterative rounds of DL, models are trained locally, shared with neighbors in the topology, and aggregated with other models received from neighbors. Sharing and merging models contribute to convergence towards a consensus model that generalizes better across the collective data captured at training time. In addition, the energy consumption while sharing and merging model parameters is negligible compared to the energy spent during the training phase. Leveraging this fact, we present SkipTrain, a novel DL algorithm, which minimizes energy consumption in decentralized learning by strategically skipping some training rounds and substituting them with synchronization rounds. These training-silent periods, besides saving energy, also allow models to better mix and finally produce models with superior accuracy than typical DL algorithms that train at every round. Our empirical evaluations with 256 nodes demonstrate that SkipTrain reduces energy consumption by 50% and increases model accuracy by up to 12% compared to D-PSGD, the conventional DL algorithm.
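The round structure described in this abstract can be sketched with scalar "models" on a ring of nodes. This is a toy illustration, not SkipTrain itself. The alternating schedule, the learning rate, the quadratic local objective and the names `train_round`, `sync_round` and `run` are all assumptions. The number of training rounds serves as a crude energy proxy, following the abstract's observation that sharing and merging cost little compared with training.

```python
def train_round(models, targets, lr=0.5):
    """One local gradient step on the toy objective 0.5 * (m - target)^2 per node."""
    return [m - lr * (m - t) for m, t in zip(models, targets)]


def sync_round(models):
    """Each node averages its model with its two ring neighbours (no training)."""
    n = len(models)
    return [(models[i - 1] + models[i] + models[(i + 1) % n]) / 3 for i in range(n)]


def run(models, targets, schedule):
    """Run rounds of kind 'train' (train, then share) or 'sync' (share only).

    Returns the final models and the number of training rounds executed.
    """
    training_rounds = 0
    for kind in schedule:
        if kind == "train":
            models = train_round(models, targets)
            training_rounds += 1
        elif kind != "sync":
            raise ValueError(f"unknown round kind: {kind}")
        models = sync_round(models)
    return models, training_rounds


if __name__ == "__main__":
    targets = [0.0, 3.0, 6.0, 9.0]           # each node's local optimum
    start = [0.0, 0.0, 0.0, 0.0]
    every_round = ["train"] * 6
    skipping = ["train", "sync"] * 3          # hypothetical schedule
    for name, schedule in [("train every round", every_round), ("skip half", skipping)]:
        models, cost = run(start, targets, schedule)
        print(name, "training rounds:", cost, "spread:", max(models) - min(models))
```

The synchronization step preserves the average of the node models while shrinking their disagreement, which is the mixing effect the abstract credits for the accuracy gain.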

[LG-48] Bridging Smoothness and Approximation: Theoretical Insights into Over-Smoothing in Graph Neural Networks

Link: https://arxiv.org/abs/2407.01281
Authors: Guangrui Yang, Jianfei Li, Ming Li, Han Feng, Ding-Xuan Zhou
Keywords: Graph Convolutional Networks, approximation, approximation theory, functions defined, GCNs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA)

Abstract: In this paper, we explore the approximation theory of functions defined on graphs. Our study builds upon the approximation results derived from the K-functional. We establish a theoretical framework to assess the lower bounds of approximation for target functions using Graph Convolutional Networks (GCNs) and examine the over-smoothing phenomenon commonly observed in these networks. Initially, we introduce the concept of a K-functional on graphs, establishing its equivalence to the modulus of smoothness. We then analyze a typical type of GCN to demonstrate how the high-frequency energy of the output decays, an indicator of over-smoothing. This analysis provides theoretical insights into the nature of over-smoothing within GCNs. Furthermore, we establish a lower bound for the approximation of target functions by GCNs, which is governed by the modulus of smoothness of these functions. This finding offers a new perspective on the approximation capabilities of GCNs. In our numerical experiments, we analyze several widely applied GCNs and observe the phenomenon of energy decay. These observations corroborate our theoretical results on exponential decay order.
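The energy decay mentioned in this abstract can be observed in a toy setting. The sketch below is a deliberate simplification and not the paper's analysis: it applies only the linear propagation step of a GCN (normalized adjacency with self-loops, no weights or nonlinearity) on a ring graph, and uses the Dirichlet energy as a stand-in for "high-frequency energy". The function names are hypothetical.

```python
def propagate(x):
    """One step of x <- A_hat x on a ring, where A_hat = (A + I) / 3."""
    n = len(x)
    return [(x[i - 1] + x[i] + x[(i + 1) % n]) / 3 for i in range(n)]


def dirichlet_energy(x):
    """x^T (I - A_hat) x, which on a ring equals sum over edges of (x_i - x_j)^2 / 3."""
    n = len(x)
    return sum((x[i] - x[(i + 1) % n]) ** 2 for i in range(n)) / 3


def energy_trace(x, layers):
    """Dirichlet energy of the signal before and after each propagation step."""
    trace = [dirichlet_energy(x)]
    for _ in range(layers):
        x = propagate(x)
        trace.append(dirichlet_energy(x))
    return trace


if __name__ == "__main__":
    signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # highest-frequency signal on a 6-ring
    for depth, energy in enumerate(energy_trace(signal, 5)):
        print(depth, energy)
```

For this alternating signal each propagation step multiplies the energy by 1/9, the square of the eigenvalue -1/3 of the propagation matrix, so the decay is geometric in the number of layers.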

[LG-49] Complementary Fusion of Deep Network and Tree Model for ETA Prediction

Link: https://arxiv.org/abs/2407.01262
Authors: YuRui Huang, Jie Zhang, HengDa Bao, Yang Yang, Jian Yang
Keywords: Estimated time, time of arrival, intelligent transportation systems, important factor, transportation system
Subjects: Machine Learning (cs.LG)

Abstract: Estimated time of arrival (ETA) is a very important factor in the transportation system. It has attracted increasing attention and has been widely used as a basic service in navigation systems and intelligent transportation systems. In this paper, we propose a novel solution to the ETA estimation problem, which is an ensemble of tree models and neural networks. We proved the accuracy and robustness of the solution on the A/B list and finally won first place in the SIGSPATIAL 2021 GISCUP competition.

[LG-50] Introducing a Physics-informed Deep Learning Framework for Bridge Scour Prediction

Link: https://arxiv.org/abs/2407.01258
Authors: Negin Yousefpour, Bo Wang
Keywords: neural network algorithms, physics-informed neural network, paper introduces scour, introduces scour physics-informed, historical scour monitoring
Subjects: Machine Learning (cs.LG)

Abstract:This paper introduces scour physics-informed neural network algorithms (SPINNs), a hybrid physics-data-driven framework for bridge scour prediction using deep learning developed based on historical scour monitoring data. SPINNs integrate physics-based empirical equations into neural networks as supplementary loss components. We examined three architectures: LSTM, CNN, and NLinear. While CNN and LSTM have shown competitive real-time scour forecasting in previous studies, NLinear with a simple architecture demonstrated the highest accuracy and significantly lower computational cost. Despite varying error reduction margins across different base models and bridges, SPINNs showed promising scour prediction and generally outperformed pure data-driven models. In some bridge cases, SPINN reduced forecasting errors by up to 50%. In this study, we also explored generalised models for bridge clusters by aggregating training datasets from multiple bridges in Alaska. Bridge/site-specific SPINNs incorporating HEC18 and time-dependent empirical equations provided more accurate predictions than SPINNs with generalised time-dependent equations. The three empirical equations derived from SPINN training in this study showed reasonable accuracy in estimating maximum scour depth. These deep learning derived empirical models can provide more accurate and reliable scour predictions than traditional HEC-18, particularly in scenarios lacking site-specific scour data. Comparing the HEC-18 model with both SPINNs and pure deep learning models highlights a substantial improvement in scour prediction accuracy, indicating a promising future for these hybrid machine learning methodologies for bridge scour design and maintenance.

[LG-51] Metric-Entropy Limits on Nonlinear Dynamical System Learning

Link: https://arxiv.org/abs/2407.01250
Authors: Yang Pan, Clemens Hutter, Helmut Bölcskei
Keywords: input-output traces, paper is concerned, fundamental limits, learning nonlinear systems, dynamical system learning
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Dynamical Systems (math.DS)

Abstract:This paper is concerned with the fundamental limits of nonlinear dynamical system learning from input-output traces. Specifically, we show that recurrent neural networks (RNNs) are capable of learning nonlinear systems that satisfy a Lipschitz property and forget past inputs fast enough in a metric-entropy optimal manner. As the sets of sequence-to-sequence maps realized by the dynamical systems we consider are significantly more massive than function classes generally considered in deep neural network approximation theory, a refined metric-entropy characterization is needed, namely in terms of order, type, and generalized dimension. We compute these quantities for the classes of exponentially-decaying and polynomially-decaying Lipschitz fading-memory systems and show that RNNs can achieve them.

[LG-52] Bayesian grey-box identification of nonlinear convection effects in heat transfer dynamics

Link: https://arxiv.org/abs/2407.01226
Authors: Wouter M. Kouw, Caspar Gruijthuijsen, Lennart Blanken, Enzo Evers, Timothy Rogers
Keywords: heat transfer dynamics, Gaussian process, transfer dynamics, propose a computational, heat transfer
Subjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures. Published as part of the proceedings of the IEEE Conference on Control Technology and Applications 2024

Abstract:We propose a computational procedure for identifying convection in heat transfer dynamics. The procedure is based on a Gaussian process latent force model, consisting of a white-box component (i.e., known physics) for the conduction and linear convection effects and a Gaussian process that acts as a black-box component for the nonlinear convection effects. States are inferred through Bayesian smoothing and we obtain approximate posterior distributions for the kernel covariance function’s hyperparameters using Laplace’s method. The nonlinear convection function is recovered from the Gaussian process states using a Bayesian regression model. We validate the procedure by simulation error using the identified nonlinear convection function, on both data from a simulated system and measurements from a physical assembly.

[LG-53] Revisiting Random Walks for Learning on Graphs

Link: https://arxiv.org/abs/2407.01214
Authors: Jinwoo Kim, Olga Zaghen, Ayhan Suleymanzade, Youngmin Ryou, Seunghoon Hong
Keywords: directly make vertex-level, random walk neural, walk neural networks, random walk, walk neural
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages, 11 figures

Abstract:We revisit a simple idea for machine learning on graphs, where a random walk on a graph produces a machine-readable record, and this record is processed by a deep neural network to directly make vertex-level or graph-level predictions. We refer to these stochastic machines as random walk neural networks, and show that we can design them to be isomorphism invariant while capable of universal approximation of graph functions in probability. A useful finding is that almost any kind of record of random walk guarantees probabilistic invariance as long as the vertices are anonymized. This enables us to record random walks in plain text and adopt a language model to read these text records to solve graph tasks. We further establish a parallelism to message passing neural networks using tools from Markov chain theory, and show that over-smoothing in message passing is alleviated by construction in random walk neural networks, while over-squashing manifests as probabilistic under-reaching. We show that random walk neural networks based on pre-trained language models can solve several hard problems on graphs, such as separating strongly regular graphs where the 3-WL test fails, counting substructures, and transductive classification on arXiv citation network without training. Code is available at this https URL.
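The "anonymized record in plain text" idea from this abstract can be sketched in a few lines. This is an illustration under assumptions rather than the authors' code: vertices are renamed by order of first visit, the record format (`1-2-1-3`) is made up here, and the function names are hypothetical. The resulting string is what a language model would read in the setup the abstract describes.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Uniform random walk of `length` steps on an adjacency-list graph."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(adjacency[walk[-1]]))
    return walk


def anonymize(walk):
    """Rename vertices 1, 2, 3, ... in order of first appearance in the walk."""
    names = {}
    record = []
    for vertex in walk:
        if vertex not in names:
            names[vertex] = len(names) + 1
        record.append(names[vertex])
    return record


def to_text(walk):
    """Plain-text record of an anonymized walk (format is an assumption)."""
    return "-".join(str(name) for name in anonymize(walk))


if __name__ == "__main__":
    triangle_with_tail = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
    rng = random.Random(0)
    print(to_text(random_walk(triangle_with_tail, "a", 8, rng)))
```

Because the record depends only on the pattern of revisits and not on the original vertex labels, relabeling the graph leaves the record of the corresponding walk unchanged, which is the invariance the abstract refers to.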

[LG-54] Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model

Link: https://arxiv.org/abs/2407.01211
Authors: Zongshuo Li, Ding Huo, Markus Meurer, Thomas Bergs
Keywords: final geometric precision, wear conditions impact, Tool wear conditions, Tool wear, geometric precision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Tool wear conditions impact the surface quality of the workpiece and its final geometric precision. In this research, we propose an efficient tool wear segmentation approach based on Segment Anything Model, which integrates U-Net as an automated prompt generator to streamline the processes of tool wear detection. Our evaluation covered three Point-of-Interest generation methods and further investigated the effects of variations in training dataset sizes and U-Net training intensities on resultant wear segmentation outcomes. The results consistently highlight our approach’s advantage over U-Net, emphasizing its ability to achieve accurate wear segmentation even with limited training datasets. This feature underscores its potential applicability in industrial scenarios where datasets may be limited.

[LG-55] Deep Learning Approach for Enhanced Transferability and Learning Capacity in Tool Wear Estimation

Link: https://arxiv.org/abs/2407.01200
Authors: Zongshuo Li, Markus Meurer, Thomas Bergs
Keywords: monitoring systems obtain, systems obtain valuable, obtain valuable information, contemporary manufacturing, monitoring systems
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:As an integral part of contemporary manufacturing, monitoring systems obtain valuable information during machining to oversee the condition of both the process and the machine. Recently, diverse algorithms have been employed to detect tool wear using single or multiple sources of measurements. In this study, a deep learning approach is proposed for estimating tool wear, considering cutting parameters. The model’s accuracy and transferability in tool wear estimation were assessed with milling experiments conducted under varying cutting parameters. The results indicate that the proposed method outperforms conventional methods in terms of both transferability and rapid learning capabilities.

[LG-56] Deep Learning Based Tool Wear Estimation Considering Cutting Conditions

Link: https://arxiv.org/abs/2407.01199
Authors: Zongshuo Li, Markus Meurer, Thomas Bergs
Keywords: tool wear estimation, wear conditions impact, wear estimation accuracy, Tool wear conditions, Tool wear
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Tool wear conditions impact the final quality of the workpiece. In this study, we propose a deep learning approach based on a convolutional neural network that incorporates cutting conditions as extra model inputs, aiming to improve tool wear estimation accuracy and fulfill industrial demands for zero-shot transferability. Through a series of milling experiments under various cutting parameters, we evaluate the model’s performance in terms of tool wear estimation accuracy and its transferability to new fixed or variable cutting parameters. The results consistently highlight our approach’s advantage over conventional models that omit cutting conditions, maintaining superior performance irrespective of the stability of the wear development or the limitation of the training dataset. This finding underscores its potential applicability in industrial scenarios.

[LG-57] A Learned Generalized Geodesic Distance Function-Based Approach for Node Feature Augmentation on Graphs

Link: https://arxiv.org/abs/2407.01194
Authors: Amitoz Azad,Yuan Fang
Keywords: Generalized Geodesic Distances, Learned Generalized Geodesic, Generalized Geodesic, Geodesic distances, computer vision
Categories: Machine Learning (cs.LG)
Comments: Accepted at KDD 2024 Research Track

Click to view abstract

Abstract:Geodesic distances on manifolds have numerous applications in image processing, computer graphics and computer vision. In this work, we introduce an approach called 'LGGD' (Learned Generalized Geodesic Distances). This method involves generating node features by learning a generalized geodesic distance function through a training pipeline that incorporates training data, graph topology and the node content features. The strength of this method lies in the proven robustness of the generalized geodesic distances to noise and outliers. Our contributions encompass improved performance in node classification tasks, competitive results with state-of-the-art methods on real-world graph datasets, the demonstration of the learnability of parameters within the generalized geodesic equation on graph, and dynamic inclusion of new labels.

[LG-58] Memory³: Language Modeling with Explicit Memory

Link: https://arxiv.org/abs/2407.01178
Authors: Hongkang Yang,Zehao Lin,Wenjin Wang,Hao Wu,Zhiyu Li,Bo Tang,Wenqiang Wei,Jinbo Wang,Zeyun Tang,Shichao Song,Chenyang Xi,Yu Yu,Kai Chen,Feiyu Xiong,Linpeng Tang,Weinan E
Keywords: large language models, meaningful computation, large language, costly process, process that transports
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining “abstract knowledge”. As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named Memory³, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.

[LG-59] Neural Conditional Probability for Inference

Link: https://arxiv.org/abs/2407.01171
Authors: Vladimir R. Kostic,Karim Lounici,Gregoire Pacreau,Pietro Novelli,Giacomo Turri,Massimiliano Pontil
Keywords: Neural Conditional Probability, introduce NCP, learning conditional distributions, inference tasks, Conditional
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We introduce NCP (Neural Conditional Probability), a novel operator-theoretic approach for learning conditional distributions with a particular focus on inference tasks. NCP can be used to build conditional confidence regions and extract important statistics like conditional quantiles, mean, and covariance. It offers streamlined learning through a single unconditional training phase, facilitating efficient inference without the need for retraining even when conditioning changes. By tapping into the powerful approximation capabilities of neural networks, our method efficiently handles a wide variety of complex probability distributions, effectively dealing with nonlinear relationships between input and output variables. Theoretical guarantees ensure both optimization consistency and statistical accuracy of the NCP method. Our experiments show that our approach matches or beats leading methods using a simple Multi-Layer Perceptron (MLP) with two hidden layers and GELU activations. This demonstrates that a minimalistic architecture with a theoretically grounded loss function can achieve competitive results without sacrificing performance, even in the face of more complex architectures.

[LG-60] Benchmarking Predictive Coding Networks – Made Simple

Link: https://arxiv.org/abs/2407.01163
Authors: Luca Pinchetti,Chang Qi,Oleh Lokshyn,Gaspard Olivers,Cornelius Emde,Mufeng Tang,Amine M’Charrak,Simon Frieder,Bayar Menzat,Rafal Bogacz,Thomas Lukasiewicz,Tommaso Salvatori
Keywords: predictive coding networks, machine learning, predictive coding, coding networks, networks in machine
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 25 figures

Click to view abstract

Abstract:In this work, we tackle the problems of efficiency and scalability for predictive coding networks in machine learning. To do so, we first propose a library called PCX, whose focus lies on performance and simplicity, and provides a user-friendly, deep-learning oriented interface. Second, we use PCX to implement a large set of benchmarks for the community to use for their experiments. As most works propose their own tasks and architectures, do not compare one against each other, and focus on small-scale tasks, a simple and fast open-source library adopted by the whole community would address all of these concerns. Third, we perform extensive benchmarks using multiple algorithms, setting new state-of-the-art results in multiple tasks and datasets, as well as highlighting limitations inherent to PC that should be addressed. Thanks to the efficiency of PCX, we are able to analyze larger architectures than commonly used, providing baselines to galvanize community efforts towards one of the main open problems in the field: scalability. The code for PCX is available at this https URL.

[LG-61] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Link: https://arxiv.org/abs/2407.01157
Authors: Shaeke Salman,Md Montasir Bin Shams,Xiuwen Liu
Keywords: unprecedented zero-shot capabilities, exhibit unprecedented zero-shot, shared embedding space, models exhibit unprecedented, zero-shot capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2401.15568, arXiv:2402.08473

Click to view abstract

Abstract:Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

[LG-62] CPT: Consistent Proxy Tuning for Black-box Optimization

Link: https://arxiv.org/abs/2407.01155
Authors: Yuanyang He,Zitong Huang,Xinxing Xu,Rick Siow Mong Goh,Salman Khan,Wangmeng Zuo,Yong Liu,Chun-Mei Feng
Keywords: attracted recent attention, recent attention due, advanced proprietary models, attracted recent, recent attention
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 2 figures plus supplementary materials

Click to view abstract

Abstract:Black-box tuning has attracted recent attention because the structure or inner parameters of advanced proprietary models are not accessible. Proxy-tuning provides a test-time output adjustment for tuning black-box language models. It applies the difference of the output logits before and after tuning a smaller white-box “proxy” model to improve the black-box model. However, this technique serves only as a decoding-time algorithm, leading to an inconsistency between training and testing which potentially limits overall performance. To address this problem, we introduce Consistent Proxy Tuning (CPT), a simple yet effective black-box tuning method. Different from Proxy-tuning, CPT additionally exploits the frozen large black-box model and another frozen small white-box model, ensuring consistency between training-stage optimization objective and test-time proxies. This consistency benefits Proxy-tuning and enhances model performance. Note that our method focuses solely on logit-level computation, which makes it model-agnostic and applicable to any task involving logit classification. Extensive experimental results demonstrate the superiority of our CPT in both black-box tuning of Large Language Models (LLMs) and Vision-Language Models (VLMs) across various datasets. The code is available at this https URL.
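
The test-time adjustment that the abstract attributes to Proxy-tuning (the baseline, not CPT itself) can be sketched roughly as follows. This is a minimal illustration only: the function names, the three-token vocabulary, and the logit values are hypothetical and are not taken from the paper or its code.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_logits(large_logits, small_tuned_logits, small_base_logits):
    # Shift the black-box model's logits by the (tuned - untuned) difference
    # of the small white-box proxy, as described in the abstract.
    return [
        large + (tuned - base)
        for large, tuned, base in zip(large_logits, small_tuned_logits, small_base_logits)
    ]

# Hypothetical next-token logits over a 3-token vocabulary (assumed values).
large = [2.0, 1.0, 0.0]        # frozen black-box model
small_base = [1.0, 1.0, 1.0]   # small proxy before tuning
small_tuned = [0.0, 3.0, 1.0]  # small proxy after tuning

adjusted = proxy_tuned_logits(large, small_tuned, small_base)
probs = softmax(adjusted)
```

Only output logits are needed from the large model here, which is why the adjustment works without access to its parameters.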

[LG-63] Wind Estimation in Unmanned Aerial Vehicles with Causal Machine Learning

Link: https://arxiv.org/abs/2407.01154
Authors: Abdulaziz Alwalan,Miguel Arana-Catania
Keywords: causal machine learning, machine learning approach, machine learning, combines machine learning, machine learning times
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 11 pages, 9 figures, 10 tables. To be presented in 15th International Conference on Mechanical and Aerospace Engineering (ICMAE)

Click to view abstract

Abstract:In this work we demonstrate the possibility of estimating the wind environment of a UAV without specialised sensors, using only the UAV’s trajectory, applying a causal machine learning approach. We implement the causal curiosity method which combines machine learning time series classification and clustering with a causal framework. We analyse three distinct wind environments: constant wind, shear wind, and turbulence, and explore different optimisation strategies for optimal UAV manoeuvres to estimate the wind conditions. The proposed approach can be used to design optimal trajectories in challenging weather conditions, and to avoid specialised sensors that add to the UAV’s weight and compromise its functionality.

[LG-64] Calibrated Large Language Models for Binary Question Answering

Link: https://arxiv.org/abs/2407.01122
Authors: Patrizio Giovannotti,Alexander Gammerman
Keywords: large language models, binary text classification, text classification tasks, classification tasks remains, remains a challenge
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to COPA 2024 (13th Symposium on Conformal and Probabilistic Prediction with Applications)

Click to view abstract

Abstract:Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model’s predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn–Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.
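
For context, the temperature scaling baseline mentioned in the abstract can be sketched as below. This shows only the baseline, not the IVAP method the paper proposes. The function name, the logit values, and the temperature are assumptions made for illustration; in practice the temperature would be fitted on held-out data.

```python
import math

def binary_prob(yes_logit, no_logit, temperature=1.0):
    # Softmax over two label-token logits reduces to a sigmoid of their
    # temperature-scaled difference.
    z = (yes_logit - no_logit) / temperature
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits for the "yes" and "no" label tokens (assumed values).
p_raw = binary_prob(3.0, 1.0)                      # uncalibrated probability
p_scaled = binary_prob(3.0, 1.0, temperature=2.0)  # softened by T = 2
```

A temperature above 1 pulls the probability toward 0.5 without changing which label is predicted.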

[LG-65] Enabling Mixed Effects Neural Networks for Diverse Clustered Data Using Monte Carlo Methods

Link: https://arxiv.org/abs/2407.01115
Authors: Andrej Tschalzev,Paul Nitschke,Lukas Kirchdorfer,Stefan Lüdtke,Christian Bartelt,Heiner Stuckenschmidt
Keywords: disregarding correlations arising, effects neural networks, input data samples, Neural networks, mixed effects neural
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Neural networks often assume independence among input data samples, disregarding correlations arising from inherent clustering patterns in real-world datasets (e.g., due to different sites or repeated measurements). Recently, mixed effects neural networks (MENNs) which separate cluster-specific ‘random effects’ from cluster-invariant ‘fixed effects’ have been proposed to improve generalization and interpretability for clustered data. However, existing methods only allow for approximate quantification of cluster effects and are limited to regression and binary targets with only one clustering feature. We present MC-GMENN, a novel approach employing Monte Carlo methods to train Generalized Mixed Effects Neural Networks. We empirically demonstrate that MC-GMENN outperforms existing mixed effects deep learning models in terms of generalization performance, time complexity, and quantification of inter-cluster variance. Additionally, MC-GMENN is applicable to a wide range of datasets, including multi-class classification tasks with multiple high-cardinality categorical features. For these datasets, we show that MC-GMENN outperforms conventional encoding and embedding methods, simultaneously offering a principled methodology for interpreting the effects of clustering patterns.

[LG-66] Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Link: https://arxiv.org/abs/2407.01111
Authors: Hao Wang,Zhichao Chen,Yuan Shen,Jiajun Fan,Zhaoran Liu,Degui Yang,Xinggao Liu,Haoxuan Li
Keywords: Heterogeneous treatment effect, poses significant challenges, significant challenges due, Heterogeneous treatment, observational data poses
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Code is available at https://anonymous.4open.science/status/ncr-B697

Click to view abstract

Abstract:Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective, exacerbated by limited data availability for HTE estimation, we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.

[LG-67] SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest

Link: https://arxiv.org/abs/2407.01110
Authors: Christoforus Yoga Haryanto,Minh Hieu Vu,Trung Duc Nguyen,Emily Lomempow,Yulia Nurliana,Sona Taheri
Keywords: Australia critical technologies, technologies offers transformative, offers transformative opportunities, unique security challenges, advancement of Generative
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 9 tables, submitted to the 2024 11th International Conference on Soft Computing Machine Intelligence (ISCMI 2024)

Click to view abstract

Abstract:The rapid advancement of Generative AI (GenAI) technologies offers transformative opportunities within Australia’s critical technologies of national interest while introducing unique security challenges. This paper presents SecGenAI, a comprehensive security framework for cloud-based GenAI applications, with a focus on Retrieval-Augmented Generation (RAG) systems. SecGenAI addresses functional, infrastructure, and governance requirements, integrating end-to-end security analysis to generate specifications emphasizing data privacy, secure deployment, and shared responsibility models. Aligned with Australian Privacy Principles, AI Ethics Principles, and guidelines from the Australian Cyber Security Centre and Digital Transformation Agency, SecGenAI mitigates threats such as data leakage, adversarial attacks, and model inversion. The framework’s novel approach combines advanced machine learning techniques with robust security measures, ensuring compliance with Australian regulations while enhancing the reliability and trustworthiness of GenAI systems. This research contributes to the field of intelligent systems by providing actionable strategies for secure GenAI implementation in industry, fostering innovation in AI applications, and safeguarding national interests.

[LG-68] Eliminating Position Bias of Language Models: A Mechanistic Approach

Link: https://arxiv.org/abs/2407.01100
Authors: Ziqi Wang,Hanlin Zhang,Xiner Li,Kuan-Hao Huang,Chi Han,Shuiwang Ji,Sham M. Kakade,Hao Peng,Heng Ji
Keywords: Position bias, prevalent issue, issue of modern, Position, bias
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures

Click to view abstract

Abstract:Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset. 

[LG-69] Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

Link: https://arxiv.org/abs/2407.01092
Authors: Ivan Drokin
Keywords: sparked significant interest, Kolmogorov-Arnold Networks, scientific community, sparked significant, significant interest
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as Wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions, and achieving state-of-the-art results on BUSI, GlaS, and CVC datasets. We summarized all of our findings in a preliminary design guide of KAN convolutional models for computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code and implementations of convolutional layers and models, pre-trained on ImageNet1k weights, are available on GitHub via this https URL

[LG-70] Rethinking LLM-based Preference Evaluation

Link: https://arxiv.org/abs/2407.01085
Authors: Zhengyu Hu,Linxin Song,Jieyu Zhang,Zheyuan Xiao,Jingang Wang,Zhenyu Chen,Jieyu Zhao,Hui Xiong
Keywords: large language model, based preference evaluation, large language, widely adopted, adopted to compare
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Recently, large language model (LLM)-based preference evaluation has been widely adopted to compare pairs of model responses. However, a severe bias towards lengthy responses has been observed, raising concerns about the reliability of this evaluation method. In this work, we designed a series of controlled experiments to study the major impacting factors of the metric of LLM-based preference evaluation, i.e., win rate, and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. We find that length impacts the existing evaluations by influencing information mass. However, a reliable evaluation metric should not only assess content quality but also ensure that the assessment is not confounded by extraneous factors such as response length. Therefore, we propose a simple yet effective adjustment, AdapAlpaca, to the existing practice of win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model’s answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation.

[LG-71] Human-like object concept representations emerge naturally in multimodal large language models

Link: https://arxiv.org/abs/2407.01067
Authors: Changde Du,Kaicheng Fu,Bincheng Wen,Yi Sun,Jie Peng,Wei Wei,Ying Gao,Shengpei Wang,Chuncheng Zhang,Jinpeng Li,Shuang Qiu,Le Chang,Huiguang He
Keywords: offering crucial insights, Large Language Models, intrigued cognitive scientists, long intrigued cognitive, scientists and neuroscientists
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.

[LG-72] Improve ROI with Causal Learning and Conformal Prediction

Link: https://arxiv.org/abs/2407.01065
Authors: Meng Ai,Zhuo Chen,Jibin Wang,Jing Shang,Tao Tao,Zhen Li
Keywords: intelligent decision-making utilizing, Direct ROI Prediction, Treatment Assignment Problem, Cost-aware Binary Treatment, Binary Treatment Assignment
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICDE 2024; Link: this https URL

Click to view abstract

Abstract:In the commercial sphere, such as operations and maintenance, advertising, and marketing recommendations, intelligent decision-making utilizing data mining and neural network technologies is crucial, especially in resource allocation to optimize ROI. This study delves into the Cost-aware Binary Treatment Assignment Problem (C-BTAP) across different industries, with a focus on the state-of-the-art Direct ROI Prediction (DRP) method. However, the DRP model confronts issues like covariate shift and insufficient training data, hindering its real-world effectiveness. Addressing these challenges is essential for ensuring dependable and robust predictions in varied operational contexts. This paper presents a robust Direct ROI Prediction (rDRP) method, designed to address challenges in real-world deployment of neural network-based uplift models, particularly under conditions of covariate shift and insufficient training data. The rDRP method, enhancing the standard DRP model, does not alter the model’s structure or require retraining. It utilizes conformal prediction and Monte Carlo dropout for interval estimation, adapting to model uncertainty and data distribution shifts. A heuristic calibration method, inspired by a Kaggle competition, combines point and interval estimates. The effectiveness of these approaches is validated through offline tests and online A/B tests in various settings, demonstrating significant improvements in target rewards compared to the state-of-the-art method.
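
To illustrate the general idea of conformal interval estimation named in the abstract, here is a minimal sketch of generic split conformal prediction for a regression output. It is not the rDRP procedure: the abstract does not specify the paper's exact construction, and the function name, calibration data, and miscoverage level below are all hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    # Finite-sample split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th
    # smallest calibration score, or infinity when that rank exceeds n.
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(scores)[rank - 1]

# Hypothetical held-out calibration targets and model predictions (assumed values).
calib_true = [1.0, 2.0, 3.0, 4.0, 5.0]
calib_pred = [1.5, 1.0, 3.25, 6.0, 3.5]

# Nonconformity score: absolute residual on the calibration set.
scores = [abs(y - p) for y, p in zip(calib_true, calib_pred)]
q = conformal_quantile(scores, alpha=0.25)

# Symmetric interval around a new point prediction.
new_pred = 2.0
interval = (new_pred - q, new_pred + q)
```

The model itself is never retrained here; the interval width comes only from residuals on held-out data, which matches the abstract's remark that the base model's structure is left unchanged.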

[LG-73] Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks

Link: https://arxiv.org/abs/2407.01054
Authors: Beatrice Alessandra Motetti,Matteo Risso,Alessio Burrello,Enrico Macii,Massimo Poncino,Daniele Jahier Pagliari
Keywords: pose significant challenges, deep neural networks, pose significant, edge devices, resource requirements
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which lead to latency and memory occupation improvements. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we are able to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8 and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with a significantly lowered training time. In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.

[LG-74] SE(3)-Hyena Operator for Scalable Equivariant Learning

Link: https://arxiv.org/abs/2407.01049
Authors: Artem Moskalev,Mangal Prakash,Rui Liao,Tommaso Mansi
Keywords: crucial for accurate, accurate predictions, Hyena operator, Hyena, global geometric context
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modeling global geometric context while maintaining equivariance is crucial for accurate predictions in many fields such as biology, chemistry, or vision. Yet, this is challenging due to the computational demands of processing high-dimensional data at scale. Existing approaches such as equivariant self-attention or distance-based message passing suffer from quadratic complexity with respect to sequence length, while localized methods sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, in this work, we introduce the SE(3)-Hyena operator, an equivariant long-convolutional model based on the Hyena operator. The SE(3)-Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on equivariant associative recall and n-body modeling, SE(3)-Hyena matches or outperforms equivariant self-attention while requiring significantly less memory and computational resources for long sequences. Our model processes the geometric context of 20k tokens 3.5 times faster than the equivariant transformer and allows a 175 times longer context within the same memory budget.

[LG-75] Neural Networks Trained by Weight Permutation are Universal Approximators

Link: https://arxiv.org/abs/2407.01033
Authors: Yongqiang Cai,Gaohang Chen,Zhonghua Qiao
Keywords: universal approximation property, universal approximation, approximation property, property is fundamental, success of neural
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:The universal approximation property is fundamental to the success of neural networks, and has traditionally been achieved by training networks without any constraints on their parameters. However, recent experimental research proposed a novel permutation-based training method, which exhibited a desired classification performance without modifying the exact weight values. In this paper, we provide a theoretical guarantee of this permutation training method by proving its ability to guide a ReLU network to approximate one-dimensional continuous functions. Our numerical results further validate this method’s efficiency in regression tasks with various initializations. The notable observations during weight permutation suggest that permutation training can provide an innovative tool for describing network learning behavior.

[LG-76] Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Link: https://arxiv.org/abs/2407.01032
Authors: Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, Paul F Jaeger
Keywords: reject low-confidence predictions, promises reliable translation, machine-learning based classification, based classification systems, low-confidence predictions
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Comments:

Abstract:Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
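
To make the "average risk of undetected failures" reading more concrete, here is a minimal Python sketch. The abstract does not give the formula, so the definition used here is an assumption: at each acceptance cut-off, the generalized risk is taken to be the fraction of the whole data set that is both accepted and misclassified, and the area is approximated by averaging over all cut-offs. The function name `augrc_sketch` is hypothetical, and ties in confidence are simply broken by input order. This is not the authors' reference implementation.

```python
import numpy as np

def augrc_sketch(confidence, failure):
    """Average generalized risk over all acceptance cut-offs (assumed definition).

    confidence: one confidence score per sample (higher means more certain).
    failure:    1 if the prediction was wrong, 0 if it was correct.
    """
    confidence = np.asarray(confidence, dtype=float)
    failure = np.asarray(failure, dtype=float)
    n = len(failure)
    # Accept samples from most to least confident.
    order = np.argsort(-confidence, kind="stable")
    # Fraction of ALL samples that are accepted and wrong when the
    # k most confident samples are accepted, for k = 1..n.
    generalized_risk = np.cumsum(failure[order]) / n
    return float(generalized_risk.mean())

# A scorer that ranks failures last gets a lower (better) value.
good = augrc_sketch([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
bad = augrc_sketch([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
print(good, bad)
```

Under this assumed definition, a confidence function that pushes misclassified samples to the low-confidence end yields the smaller area.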

[LG-77] PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs

Link: https://arxiv.org/abs/2407.01031
Authors: Dan Peng, Zhihui Fu, Jun Wang
Keywords: Recent advancements, large language models, impressive capabilities, advancements in large, large language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to the ACL 2024 Workshop on Privacy in Natural Language Processing (PrivateNLP)

Abstract:Recent advancements in large language models (LLMs) have showcased their impressive capabilities. On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly due to the memory-intensive nature of derivative-based optimization, which requires saving gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLMs, even on memory-limited mobile devices. Empirical results demonstrate that, with derivative-free optimization, the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone using around 4GB and 6.5GB of memory, respectively. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy.
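
The memory argument can be illustrated with a small Python sketch. The abstract does not name the specific derivative-free method, so the two-point random-perturbation estimator below (in the spirit of SPSA-style zeroth-order optimization) is an assumption, as are the function name `zeroth_order_step` and its parameters. The point it shows is that each update needs only two forward evaluations of the loss, with no stored gradients or optimizer state.

```python
import numpy as np

def zeroth_order_step(params, loss_fn, lr=0.02, eps=1e-3, seed=0):
    """One derivative-free update using two forward passes (assumed method)."""
    rng = np.random.default_rng(seed)
    # Random direction; it can be regenerated from the seed instead of stored.
    z = rng.standard_normal(params.shape)
    loss_plus = loss_fn(params + eps * z)
    loss_minus = loss_fn(params - eps * z)
    # Finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    return params - lr * projected_grad * z

# Toy usage: minimise a quadratic without ever computing a gradient.
target = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
loss = lambda p: float(np.sum((p - target) ** 2))
params = np.zeros(5)
for step in range(300):
    params = zeroth_order_step(params, loss, seed=step)
print(loss(params))
```

The trade-off is that each step carries far less information than a true gradient, so many more steps are typically needed than with backpropagation.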

[LG-78] DistML.js: Installation-free Distributed Deep Learning Framework for Web Browsers

Link: https://arxiv.org/abs/2407.01023
Authors: Masatoshi Hidaka, Tomohiro Hashimoto, Yuto Nishizawa, Tatsuya Harada
Keywords: web browsers, library designed, machine learning models, model training, learning
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present “DistML.js”, a library designed for training and inference of machine learning models within web browsers. Not only does DistML.js facilitate model training on local devices, but it also supports distributed learning through communication with servers. Its design and define-by-run API for deep learning model construction resemble PyTorch, thereby reducing the learning curve for prototyping. Matrix computations involved in model training and inference are executed on the backend utilizing WebGL, enabling high-speed calculations. We provide a comprehensive explanation of DistML.js’s design, API, and implementation, alongside practical applications including data parallelism in learning. The source code is publicly available at this https URL.

[LG-79] Swish-T: Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance

Link: https://arxiv.org/abs/2407.01012
Authors: Youngmin Seo, Jinha Kim, Unsang Park
Keywords: activation function Swish, existing non-monotonic activation, original Swish, original Swish function, Swish-T
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures

Abstract:We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification creates a family of Swish-T variants, each designed to excel in different tasks, showcasing specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T_C function, while Swish-T and Swish-T_B, byproducts of Swish-T_C, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T_C as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at this https URL.
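
A minimal Python sketch of what "adding a Tanh bias to the original Swish function" could look like. The abstract does not state the formula, so the additive form `x * sigmoid(beta * x) + alpha * tanh(x)`, the parameter names `alpha` and `beta`, and the default values are assumptions. The B and C variants mentioned in the abstract are not covered here.

```python
import numpy as np

def swish(x, beta=1.0):
    """Original Swish: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

def swish_t(x, beta=1.0, alpha=0.1):
    """Swish plus a scaled tanh term (assumed form of the Tanh bias)."""
    return swish(x, beta) + alpha * np.tanh(x)

xs = np.linspace(-3.0, 3.0, 7)
print(swish(xs))
print(swish_t(xs))
```

With a positive `alpha`, negative inputs produce slightly more negative outputs than plain Swish, which is one way to read the abstract's "broader acceptance of negative values". Setting `alpha=0` recovers Swish.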

[LG-80] MARLP: Time-series Forecasting Control for Agricultural Managed Aquifer Recharge

Link: https://arxiv.org/abs/2407.01005
Authors: Yuning Chen, Kang Yang, Zhiyu An, Brady Holder, Luke Paloutzian, Khaled Bali, Wan Du
Keywords: sustainable agriculture, rapid decline, decline in groundwater, world poses, poses a significant
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted by KDD 2024

Abstract:The rapid decline in groundwater around the world poses a significant challenge to sustainable agriculture. To address this issue, agricultural managed aquifer recharge (Ag-MAR) is proposed to recharge the aquifer by artificially flooding agricultural lands using surface water. Ag-MAR requires a carefully selected flooding schedule to avoid affecting the oxygen absorption of crop roots. However, current Ag-MAR scheduling does not take into account complex environmental factors such as weather and soil oxygen, resulting in crop damage and insufficient recharging amounts. This paper proposes MARLP, the first end-to-end data-driven control system for Ag-MAR. We first formulate Ag-MAR as an optimization problem. To that end, we analyze four-year in-field datasets, which reveal the multi-periodicity feature of the soil oxygen level trends and the opportunity to use external weather forecasts and flooding proposals as exogenous clues for soil oxygen prediction. Then, we design a two-stage forecasting framework. In the first stage, it extracts both the cross-variate dependency and the periodic patterns from historical data to conduct preliminary forecasting. In the second stage, it uses weather-soil and flooding-soil causality to facilitate an accurate prediction of soil oxygen levels. Finally, we conduct model predictive control (MPC) for Ag-MAR flooding. To address the challenge of large action spaces, we devise a heuristic planning module to reduce the number of flooding proposals to enable the search for optimal solutions. Real-world experiments show that MARLP reduces the oxygen deficit ratio by 86.8% while improving the recharging amount in unit time by 35.8%, compared with the previous four years.

[LG-81] CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect

Link: https://arxiv.org/abs/2407.01004
Authors: Jiehui Zhou, Linxiao Yang, Xingyu Liu, Xinyue Gu, Liang Sun, Wei Chen
Keywords: estimating heterogeneous treatment, heterogeneous treatment effects, estimating heterogeneous, personalized advertising, critical for identifying
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 12 pages, 3 figures

Abstract:In causal inference, estimating heterogeneous treatment effects (HTE) is critical for identifying how different subgroups respond to interventions, with broad applications in fields such as precision medicine and personalized advertising. Although HTE estimation methods aim to improve accuracy, how to provide explicit subgroup descriptions remains unclear, hindering data interpretation and strategic intervention management. In this paper, we propose CURLS, a novel rule learning method leveraging HTE, which can effectively describe subgroups with significant treatment effects. Specifically, we frame causal rule learning as a discrete optimization problem, finely balancing treatment effect with variance and considering the rule interpretability. We design an iterative procedure based on the minorize-maximization algorithm and solve a submodular lower bound as an approximation for the original. Quantitative experiments and qualitative case studies verify that compared with state-of-the-art methods, CURLS can find subgroups where the estimated and true effects are 16.1% and 13.8% higher and the variance is 12.0% smaller, while maintaining similar or better estimation accuracy and rule interpretability. Code is available at this https URL.

[LG-82] Flood Prediction Using Classical and Quantum Machine Learning Models

Link: https://arxiv.org/abs/2407.01001
Authors: Marek Grzesiak, Param Thakkar
Keywords: Germany Wupper River, QML models offer, competitive training times, offer competitive training, scalability results show
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph); Quantum Physics (quant-ph)
Comments:

Abstract:This study investigates the potential of quantum machine learning (QML) to improve flood forecasting. We focus on daily flood events along Germany’s Wupper River in 2023. Our approach combines classical machine learning techniques with QML techniques. This hybrid model leverages quantum properties like superposition and entanglement to achieve better accuracy and efficiency. Classical and QML models are compared based on training time, accuracy, and scalability. Results show that QML models offer competitive training times and improved prediction accuracy. This research signifies a step towards utilizing quantum technologies for climate change adaptation. We emphasize collaboration and continuous innovation to implement this model in real-world flood management, ultimately enhancing global resilience against floods.

[LG-83] Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?

Link: https://arxiv.org/abs/2407.00996
Authors: Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani
Keywords: Small Language Models, large language models, Language Models, Small Language, large language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Small Language Models (SLMs) are generally considered to be more compact versions of large language models (LLMs), typically having fewer than 7 billion parameters. This study investigates the ability of small language models to learn, retain, and subsequently eliminate noise that is typically not found on the internet, where most pretraining datasets are sourced. For this, four pre-trained SLMs were utilized: Olmo 1B, Qwen1.5 1.8B, Gemma 2B, and Phi2 2.7B. The models were instruction-tuned without noise and tested for task execution with in-context learning. Afterward, noise patterns were introduced to evaluate the models’ learning and unlearning capabilities. We evaluated the models’ performance at various training levels. Phi consistently excelled with word-level noise but performed the worst with character-level noise. Despite being the smallest with approximately 1 billion parameters, Olmo performed consistently well on tasks.

[LG-84] Hybrid RAG-empowered Multi-modal LLM for Secure Healthcare Data Management: A Diffusion-based Contract Theory Approach

Link: https://arxiv.org/abs/2407.00978
Authors: Cheng Su, Jinbo Wen, Jiawen Kang, Yonghua Wang, Hudan Pan, M. Shamim Hossain
Keywords: Large Language Models, Multi-modal Large Language, evolving healthcare landscape, rapidly evolving healthcare, data
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Secure data management and effective data sharing have become paramount in the rapidly evolving healthcare landscape. The advancement of generative artificial intelligence has positioned Multi-modal Large Language Models (MLLMs) as crucial tools for managing healthcare data. MLLMs can support multi-modal inputs and generate diverse types of content by leveraging large-scale training on vast amounts of multi-modal data. However, critical challenges persist in developing medical MLLMs, including healthcare data security and freshness issues, affecting the output quality of MLLMs. In this paper, we propose a hybrid Retrieval-Augmented Generation (RAG)-empowered medical MLLMs framework for healthcare data management. This framework leverages a hierarchical cross-chain architecture to facilitate secure data training. Moreover, it enhances the output quality of MLLMs through hybrid RAG, which employs multi-modal metrics to filter various unimodal RAG results and incorporates these retrieval results as additional inputs to MLLMs. Additionally, we employ age of information to indirectly evaluate the data freshness impact of MLLMs and utilize contract theory to incentivize healthcare data holders to share fresh data, mitigating information asymmetry in data sharing. Finally, we utilize a generative diffusion model-based reinforcement learning algorithm to identify the optimal contract for efficient data sharing. Numerical results demonstrate the effectiveness of the proposed schemes, which achieve secure and efficient healthcare data management.

[LG-85] How Does Overparameterization Affect Features?

Link: https://arxiv.org/abs/2407.00968
Authors: Ahmet Cagri Duzgun, Samy Jelassi, Yuanzhi Li
Keywords: training loss, deep learning, fit their training, crucial factor, success of deep
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Overparameterization, the condition where models have more parameters than necessary to fit their training loss, is a crucial factor for the success of deep learning. However, the characteristics of the features learned by overparameterized networks are not well understood. In this work, we explore this question by comparing models with the same architecture but different widths. We first examine the expressivity of the features of these models, and show that the feature space of overparameterized networks cannot be spanned by concatenating many underparameterized features, and vice versa. This reveals that both overparameterized and underparameterized networks acquire some distinctive features. We then evaluate the performance of these models, and find that overparameterized networks outperform underparameterized networks, even when many of the latter are concatenated. We corroborate these findings using a VGG-16 and ResNet18 on CIFAR-10 and a Transformer on the MNLI classification dataset. Finally, we propose a toy setting to explain how overparameterized networks can learn some important features that the underparameterized networks cannot learn.

[LG-86] Smoothed Analysis for Learning Concepts with Low Intrinsic Dimension

Link: https://arxiv.org/abs/2407.00966
Authors: Gautam Chandrasekaran, Adam Klivans, Vasilis Kontonis, Raghu Meka, Konstantinos Stavropoulos
Keywords: arbitrary joint distribution, arbitrary joint, output a hypothesis, random Gaussian perturbation, fitting concept
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC)
Comments: To appear in COLT 2024

Abstract:In traditional models of supervised learning, the goal of a learner – given examples from an arbitrary joint distribution on $\mathbb{R}^d \times \{\pm 1\}$ – is to output a hypothesis that is competitive (to within $\epsilon$) of the best fitting concept from some class. In order to escape strong hardness results for learning even simple concept classes, we introduce a smoothed-analysis framework that requires a learner to compete only with the best classifier that is robust to small random Gaussian perturbation. This subtle change allows us to give a wide array of learning results for any concept that (1) depends on a low-dimensional subspace (aka multi-index model) and (2) has a bounded Gaussian surface area. This class includes functions of halfspaces and (low-dimensional) convex sets, cases that are only known to be learnable in non-smoothed settings with respect to highly structured distributions such as Gaussians. Surprisingly, our analysis also yields new results for traditional non-smoothed frameworks such as learning with margin. In particular, we obtain the first algorithm for agnostically learning intersections of $k$-halfspaces in time $k^{\mathrm{poly}(\frac{\log k}{\epsilon \gamma})}$, where $\gamma$ is the margin parameter. Before our work, the best-known runtime was exponential in $k$ (Arriaga and Vempala, 1999).

[LG-87] Universal Approximation Theory: The basic theory for large language models

Link: https://arxiv.org/abs/2407.00958
Authors: Wei Wang, Qing Li
Keywords: artificial intelligence, innovations like ChatGPT, area of focus, focus in artificial, introduction of groundbreaking
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains: what are the theoretical foundations of large language models (LLMs)? What makes the Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs’ ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality of pruning LLMs? To address these critical questions and explore the technological strategies within LLMs, we leverage the Universal Approximation Theory (UAT) to offer a theoretical backdrop, shedding light on the mechanisms that underpin these advancements.

[LG-88] A Closer Look at Deep Learning on Tabular Data

Link: https://arxiv.org/abs/2407.00956
Authors: Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan
Keywords: Deep Neural Network, deep tabular methods, Tabular, deep tabular, methods
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Tabular data is prevalent across various domains in machine learning. Although Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones, in-depth evaluation of these methods is challenging due to varying performance ranks across diverse datasets. In this paper, we propose a comprehensive benchmark comprising 300 tabular datasets, covering a wide range of task types, size distributions, and domains. We perform an extensive comparison between state-of-the-art deep tabular methods and tree-based methods, revealing the average rank of all methods and highlighting the key factors that influence the success of deep tabular methods. Next, we analyze deep tabular methods based on their training dynamics, including changes in validation metrics and other statistics. For each dataset-method pair, we learn a mapping from both the meta-features of datasets and the first part of the validation curve to the final validation set performance and even the evolution of validation curves. This mapping extracts essential meta-features that influence prediction accuracy, helping the analysis of tabular methods from novel aspects. Based on the performance of all methods on this large benchmark, we identify two subsets of 45 datasets each. The first subset contains datasets that favor either tree-based methods or DNN-based methods, serving as effective analysis tools to evaluate strategies (e.g., attribute encoding strategies) for improving deep tabular models. The second subset contains datasets where the ranks of methods are consistent with the overall benchmark, acting as a probe for tabular analysis. These “tiny tabular benchmarks” will facilitate further studies on tabular data.

[LG-89] SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

Link: https://arxiv.org/abs/2407.00952
Authors: Zheng Lin, Xuanjie Hu, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Ang Li, Praneeth Vepakomma, Yue Gao
Keywords: LLM fine-tuning, LLM fine-tuning paradigm, LLM, large language models, handling high-complexity models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 9 pages, 3 figures

Abstract:The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm has recently been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activations and activation gradients, which have smaller data sizes than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at this https URL.
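
The exchange of activations and activation gradients in split learning can be sketched in a few lines of Python. This is a toy illustration with two linear layers and a squared-error loss, not SplitLoRA itself: the function names, the layer shapes, and the loss are all assumptions, and LoRA adapters and the federated aggregation step are left out.

```python
import numpy as np

def client_forward(x, w_client):
    """Client-side layer; its output activations are sent to the server."""
    return x @ w_client

def server_step(h, y, w_server):
    """Server-side layer: finish the forward pass and start the backward pass."""
    err = h @ w_server - y
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_pred = err / len(y)
    grad_w_server = h.T @ grad_pred   # stays on the server
    grad_h = grad_pred @ w_server.T   # activation gradient sent back to the client
    return loss, grad_w_server, grad_h

def client_backward(x, grad_h):
    """Client finishes backpropagation using only the returned activation gradient."""
    return x.T @ grad_h

rng = np.random.default_rng(0)
x, y = rng.standard_normal((8, 4)), rng.standard_normal((8, 2))
w_client, w_server = rng.standard_normal((4, 3)), rng.standard_normal((3, 2))

h = client_forward(x, w_client)                      # client -> server
loss, grad_w_server, grad_h = server_step(h, y, w_server)
grad_w_client = client_backward(x, grad_h)           # server -> client
print(loss, grad_w_client.shape, grad_w_server.shape)
```

Only `h` and `grad_h` cross the network, and neither raw inputs nor full model weights are transmitted, which is the communication pattern the abstract describes.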

[LG-90] Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals

Link: https://arxiv.org/abs/2407.00950
Authors: Ziyi Liu, Idan Attias, Daniel M. Roy
Keywords: multi-armed bandit problems, presence or absence, causal bandits, multi-armed bandit, causal bandits demonstrates
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to ICML 2024

Abstract:In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables d-separate the action from the reward, existing work in causal bandits demonstrates that one can achieve strictly better (minimax) rates of regret (Lu et al., 2020). Our goal is to adapt to this favorable “conditionally benign” structure, if it is present in the environment, while simultaneously recovering worst-case minimax regret, if it is not. Notably, the learner has no prior knowledge of whether the favorable structure holds. In this paper, we establish the Pareto optimal frontier of adaptive rates. We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments, resolving an open question raised by Bilodeau et al. (2022). Furthermore, we are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting. Finally, we examine the common assumption that the marginal distributions of the post-action contexts are known and show that a nontrivial estimate is necessary for better-than-worst-case minimax rates.

[LG-91] The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

Link: https://arxiv.org/abs/2407.00948
Authors: Tanush Chopra, Michael Li
Keywords: large language models, language models, large language, evaluating strategic deception, fair play
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research conducted at the Deception Detection Hackathon 2024 hosted by Apart Apollo Research

Abstract:We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because neither the action space nor the strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the “house.” Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.
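
One simple way to compare dealt cards against an expected fair-play distribution is a Pearson chi-square goodness-of-fit test. The Python sketch below is an illustration under stated assumptions: the abstract does not say which statistic the authors use, and the uniform-over-13-ranks model (draws with replacement), the function names, and the 5% significance level are all choices made here for the example.

```python
from collections import Counter

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
# Chi-square critical value for 12 degrees of freedom at the 5% level.
CHI2_CRIT_DF12_P05 = 21.026

def chi_square_vs_uniform(dealt_cards):
    """Pearson chi-square statistic of observed rank counts vs. a uniform deal."""
    counts = Counter(dealt_cards)
    expected = len(dealt_cards) / len(RANKS)
    return sum((counts.get(rank, 0) - expected) ** 2 / expected for rank in RANKS)

def deviates_from_fair_play(dealt_cards):
    """True if the rank distribution is unlikely under the assumed fair model."""
    return chi_square_vs_uniform(dealt_cards) > CHI2_CRIT_DF12_P05

fair_deal = RANKS * 10        # every rank dealt equally often
rigged_deal = ["10"] * 130    # a dealer that always deals tens
print(deviates_from_fair_play(fair_deal), deviates_from_fair_play(rigged_deal))
```

A real evaluation would also need to account for dealing without replacement from a finite shoe and for outcome-level statistics such as win rates, which this sketch ignores.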

[LG-92] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

Link: https://arxiv.org/abs/2407.00945
Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
Keywords: processing power, energy consumption, posing significant deployment, rapid advancement, billions to trillions
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks. EEP can be used to reduce both the total number of experts (thus saving GPU memory) and the number of active experts (thus accelerating inference). For example, we demonstrate that pruning up to 75% of experts in Mixtral 8×7B-Instruct results in a substantial reduction in parameters with minimal performance loss. Remarkably, we observe improved performance on certain tasks, such as a significant increase in accuracy on the SQuAD dataset (from 53.4% to 75.4%), when pruning half of the experts. With these results, EEP not only lowers the barrier to deploying SMoE models, but also challenges the conventional understanding of model pruning by showing that fewer experts can lead to better task-specific performance without any fine-tuning. Code is available at this https URL.

[LG-93] Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining

Link: https://arxiv.org/abs/2407.00935
Authors: Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
Keywords: masked SSL, SSL, generative self-supervised learning, autoregressive SSL, generative SSL paradigms
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other’s strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at this https URL.

[LG-94] FoldGPT: Simple and Effective Large Language Model Compression Scheme

Link: https://arxiv.org/abs/2407.00928
Authors: Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen
Keywords: escalating data security, data security concerns, deploying large language, mobile devices continues, large language models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:The demand for deploying large language models (LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and find that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of the LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing. This strategy consists of three parts: (1) Based on the learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks. Then we delete some redundant layers based on the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these blocks, we “cure” the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve the performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.
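
The observation that the outputs of most layers are highly similar can be checked with a simple measurement. The Python sketch below assumes cosine similarity between the hidden states of consecutive layers as the measure. The abstract does not specify the metric, and the function name and array layout (one array of shape `(tokens, hidden_dim)` per layer) are hypothetical.

```python
import numpy as np

def consecutive_layer_similarity(hidden_states):
    """Mean per-token cosine similarity between each pair of consecutive layers."""
    sims = []
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        dot = np.sum(prev * curr, axis=-1)
        norms = np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1)
        sims.append(float(np.mean(dot / norms)))
    return sims

# Toy usage with random "layer outputs" that drift only slightly per layer.
rng = np.random.default_rng(0)
states = [rng.standard_normal((16, 32))]
for _ in range(4):
    states.append(states[-1] + 0.05 * rng.standard_normal((16, 32)))
print(consecutive_layer_similarity(states))
```

Values close to 1 for many consecutive pairs would suggest depth-wise redundancy of the kind that block removal and parameter sharing aim to exploit.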

[LG-95] Learnability of Parameter-Bounded Bayes Nets

Link: https://arxiv.org/abs/2407.00927
Authors: Arnab Bhattacharyya, Davin Choo, Sutanu Gayen, Dimitrios Myrisiotis
Keywords: Bayes net, parameter-bounded Bayes net, capture dependency relations, efficiently represent joint, represent joint probability
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Comments: 15 pages, 2 figures

Abstract:Bayes nets are extensively used in practice to efficiently represent joint probability distributions over a set of random variables and capture dependency relations. In a seminal paper, Chickering et al. (JMLR 2004) showed that given a distribution P, that is defined as the marginal distribution of a Bayes net, it is NP-hard to decide whether there is a parameter-bounded Bayes net that represents P. They called this problem LEARN. In this work, we extend the NP-hardness result of LEARN and prove the NP-hardness of a promise search variant of LEARN, whereby the Bayes net in question is guaranteed to exist and one is asked to find such a Bayes net. We complement our hardness result with a positive result about the sample complexity that is sufficient to recover a parameter-bounded Bayes net that is close (in TV distance) to a given distribution P, that is represented by some parameter-bounded Bayes net, generalizing a degree-bounded sample complexity result of Brustle et al. (EC 2020).

[LG-96] Robust and Reliable Early-Stage Website Fingerprinting Attacks via Spatial-Temporal Distribution Analysis

Link: https://arxiv.org/abs/2407.00918
Authors: Xinhao Deng, Qi Li, Ke Xu
Keywords: compromising user privacy, Website Fingerprinting, traffic, compromising user, user privacy
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

Abstract:Website Fingerprinting (WF) attacks identify the websites visited by users by performing traffic analysis, compromising user privacy. Particularly, DL-based WF attacks demonstrate impressive attack performance. However, the effectiveness of DL-based WF attacks relies on the collected complete and pure traffic during the page loading, which impacts the practicality of these attacks. The WF performance is rather low under dynamic network conditions and various WF defenses, particularly when the analyzed traffic is only a small part of the complete traffic. In this paper, we propose Holmes, a robust and reliable early-stage WF attack. Holmes utilizes temporal and spatial distribution analysis of website traffic to effectively identify websites in the early stages of page loading. Specifically, Holmes develops adaptive data augmentation based on the temporal distribution of website traffic and utilizes a supervised contrastive learning method to extract the correlations between the early-stage traffic and the pre-collected complete traffic. Holmes accurately identifies traffic in the early stages of page loading by computing the correlation of the traffic with the spatial distribution information, which ensures robust and reliable detection according to early-stage traffic. We extensively evaluate Holmes using six datasets. Compared to nine existing DL-based WF attacks, Holmes improves the F1-score of identifying early-stage traffic by an average of 169.18%. Furthermore, we replay the traffic of visiting real-world dark web websites. Holmes successfully identifies dark web websites when the ratio of page loading on average is only 21.71%, with an average precision improvement of 169.36% over the existing WF attacks.

[LG-97] Learnability in Online Kernel Selection with Memory Constraint via Data-dependent Regret Analysis

Link: https://arxiv.org/abs/2407.00916
Authors: Junfan Li, Shizhong Liao
Keywords: Online kernel selection, kernel selection, memory constraint, Online kernel, online kernel methods
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Online kernel selection is a fundamental problem of online kernel methods. In this paper, we study online kernel selection with memory constraint in which the memory of kernel selection and online prediction procedures is limited to a fixed budget. An essential question is what is the intrinsic relationship among online learnability, memory constraint and data complexity? To answer the question, it is necessary to show the trade-offs between regret bound and memory constraint. Previous work gives a worst-case lower bound depending on the data size, and shows learning is impossible within a small memory constraint. In contrast, we present a different result by providing data-dependent upper bounds depending on two data complexities, namely kernel alignment and the cumulative losses of competitive hypothesis. We propose an algorithmic framework giving data-dependent upper bounds for two types of loss functions. For the hinge loss function, our algorithm achieves an expected upper bound depending on kernel alignment. For smooth loss functions, our algorithm achieves a high-probability upper bound depending on the cumulative losses of competitive hypothesis. We also prove a matching lower bound for smooth loss functions. Our results show that if the two data complexities are sub-linear, then learning is possible within a small memory constraint. Our algorithmic framework depends on a new buffer maintaining framework and a reduction from online kernel selection to prediction with expert advice. Finally, we empirically verify the prediction performance of our algorithms on benchmark datasets.
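
The abstract mentions a reduction to prediction with expert advice. As background only, here is a minimal sketch of the standard Hedge (exponential weights) update, treating each candidate kernel as one expert. This is the textbook algorithm, not the paper's method; the learning rate `eta` and the toy loss values are assumptions.

```python
import math


def hedge_update(weights, losses, eta):
    """One Hedge step: downweight each expert by exp(-eta * loss), then renormalize."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def run_hedge(loss_rounds, eta=0.5):
    """Start from uniform weights and apply one Hedge step per round of losses."""
    n_experts = len(loss_rounds[0])
    weights = [1.0 / n_experts] * n_experts
    for losses in loss_rounds:
        weights = hedge_update(weights, losses, eta)
    return weights


# Toy losses (assumption): three experts, losses in [0, 1] over three rounds.
rounds = [[0.1, 0.9, 0.5], [0.2, 0.8, 0.5], [0.0, 1.0, 0.5]]
final_weights = run_hedge(rounds)
```

After a few rounds the weight concentrates on the expert with the smallest cumulative loss, which is the behaviour a kernel-selection reduction relies on.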

[LG-98] SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures

Link: https://arxiv.org/abs/2407.00913
Authors: Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali
Keywords: Advancements in DeepFake, voice authentication systems, authentication systems, leading to unauthorized, spread of misinformation
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 4 figures, Proc. INTERSPEECH 2024

Abstract:Advancements in DeepFake (DF) audio models pose a significant threat to voice authentication systems, leading to unauthorized access and the spread of misinformation. We introduce a defense mechanism, SecureSpectra, addressing DF threats by embedding orthogonal, irreversible signatures within audio. SecureSpectra leverages the inability of DF models to replicate high-frequency content, which we empirically identify across diverse datasets and DF models. Integrating differential privacy into the pipeline protects signatures from reverse engineering and strikes a delicate balance between enhanced security and minimal performance compromises. Our evaluations on Mozilla Common Voice, LibriSpeech, and VoxCeleb datasets showcase SecureSpectra’s superior performance, outperforming recent works by up to 71% in detection accuracy. We open-source SecureSpectra to benefit the research community.

[LG-99] Deep Image-to-Recipe Translation

Link: https://arxiv.org/abs/2407.00911
Authors: Jiangqin Ma, Bilal Mawji, Franz Williams
Keywords: profound level, reflecting the intricate, intricate connection, Eat, cherished food memories
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The modern saying, “You Are What You Eat” resonates on a profound level, reflecting the intricate connection between our identities and the food we consume. Our project, Deep Image-to-Recipe Translation, is an intersection of computer vision and natural language generation that aims to bridge the gap between cherished food memories and the art of culinary creation. Our primary objective involves predicting ingredients from a given food image. For this task, we first develop a custom convolutional network and then compare its performance to a model that leverages transfer learning. We pursue an additional goal of generating a comprehensive set of recipe steps from a list of ingredients. We frame this process as a sequence-to-sequence task and develop a recurrent neural network that utilizes pre-trained word embeddings. We address several challenges of deep learning including imbalanced datasets, data cleaning, overfitting, and hyperparameter selection. Our approach emphasizes the importance of metrics such as Intersection over Union (IoU) and F1 score in scenarios where accuracy alone might be misleading. For our recipe prediction model, we employ perplexity, a commonly used and important metric for language models. We find that transfer learning via pre-trained ResNet-50 weights and GloVe embeddings provide an exceptional boost to model performance, especially when considering training resource constraints. Although we have made progress on the image-to-recipe translation, there is an opportunity for future exploration with advancements in model architectures, dataset scalability, and enhanced user interaction.
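
The three metrics named in this abstract have standard definitions that are easy to sketch. The version below treats ingredient prediction as comparing a predicted set against a ground-truth set, which is an assumption about the evaluation setup; the function names and example ingredients are hypothetical.

```python
import math


def set_iou(predicted, actual):
    """Intersection over Union of two label sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0


def set_f1(predicted, actual):
    """Harmonic mean of precision and recall over two label sets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)
```

With predicted ingredients `{"flour", "egg"}` against true ingredients `{"egg", "milk"}`, one of three distinct ingredients is shared, so IoU is 1/3 while precision and recall are both 1/2. A model that assigns probability 0.25 to every true token has perplexity 4.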

[LG-100] GSO-YOLO: Global Stability Optimization YOLO for Construction Site Detection

Link: https://arxiv.org/abs/2407.00906
Authors: Yuming Zhang, Dongzhi Guan, Shouxin Zhang, Junhao Su, Yunzhi Han, Jiabin Liu
Keywords: causing economic damage, economic damage due, construction sites, plagued the industry, posing risks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Safety issues at construction sites have long plagued the industry, posing risks to worker safety and causing economic damage due to potential hazards. With the advancement of artificial intelligence, particularly in the field of computer vision, the automation of safety monitoring on construction sites has emerged as a solution to this longstanding issue. Despite achieving impressive performance, advanced object detection methods like YOLOv8 still face challenges in handling the complex conditions found at construction sites. To solve these problems, this study presents the Global Stability Optimization YOLO (GSO-YOLO) model to address challenges in complex construction sites. The model integrates the Global Optimization Module (GOM) and Steady Capture Module (SCM) to enhance global contextual information capture and detection stability. The innovative AIoU loss function, which combines CIoU and EIoU, improves detection accuracy and efficiency. Experiments on datasets like SODA, MOCS, and CIS show that GSO-YOLO outperforms existing methods, achieving SOTA performance.
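
The abstract does not give the AIoU formula, so it is not reproduced here. CIoU and EIoU both start from the plain box IoU and add penalty terms, so the sketch below shows only that shared base quantity. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.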

[LG-101] From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Link: https://arxiv.org/abs/2407.00902
Authors: Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Keywords: Large Language models, Large Language, multiple image-text pairs, similar ICL abilities, capabilities of Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Motivated by in-context learning (ICL) capabilities of Large Language models (LLMs), multimodal LLMs with additional visual modality are also exhibited with similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively less work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Considering such modality impact, we further utilize modality-driven demonstration strategies to boost ICL performance. We also identify that demonstration selection is closely related to the models’ ability to capture task inductive biases from multimodal ICL. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks even if those tasks are not seen in or even contradict pretraining data.

[LG-102] ZeroDDI: A Zero-Shot Drug-Drug Interaction Event Prediction Method with Semantic Enhanced Learning and Dual-Modal Uniform Alignment

Link: https://arxiv.org/abs/2407.00891
Authors: Ziyan Wang, Zhankun Xiong, Feng Huang, Xuan Liu, Wen Zhang
Keywords: Drug-drug interactions, DDI events, DDIE, Drug-drug, DDIE representation learning
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Accepted by IJCAI2024

Abstract:Drug-drug interactions (DDIs) can result in various pharmacological changes, which can be categorized into different classes known as DDI events (DDIEs). In recent years, previously unobserved/unseen DDIEs have been emerging, posing a new classification task when unseen classes have no labelled instances in the training stage, which is formulated as a zero-shot DDIE prediction (ZS-DDIE) task. However, existing computational methods are not directly applicable to ZS-DDIE, which has two primary challenges: obtaining suitable DDIE representations and handling the class imbalance issue. To overcome these challenges, we propose a novel method named ZeroDDI for the ZS-DDIE task. Specifically, we design a biological semantic enhanced DDIE representation learning module, which emphasizes the key biological semantics and distills discriminative molecular substructure-related semantics for DDIE representation learning. Furthermore, we propose a dual-modal uniform alignment strategy to distribute drug pair representations and DDIE semantic representations uniformly in a unit sphere and align the matched ones, which can mitigate the issue of class imbalance. Extensive experiments showed that ZeroDDI surpasses the baselines and indicate that it is a promising tool for detecting unseen DDIEs. Our code has been released in this https URL.

[LG-103] Papez: Resource-Efficient Speech Separation with Auditory Working Memory

Link: https://arxiv.org/abs/2407.00888
Authors: Hyunseok Oh, Juheon Yi, Youngki Lee
Keywords: Transformer-based models recently, extreme computational load, computational load makes, single-channel speech separation, models recently reached
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages. Accepted by ICASSP 2023

Abstract:Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; however, their extreme computational load makes it difficult to deploy them in resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. We first replace the inter-chunk Transformer with small-sized auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters through the recurrent transformer. Our extensive evaluation shows that Papez achieves the best resource and accuracy tradeoffs with a large margin. We publicly share our source code at this https URL

[LG-104] Mechanistic Interpretation through Contextual Decomposition in Transformers

Link: https://arxiv.org/abs/2407.00886
Authors: Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Anobel Y. Odisho, Peter R. Carroll, Bin Yu
Keywords: black boxes due, complex nonlinear relationships, regarded as black, black boxes, boxes due
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Transformers exhibit impressive capabilities but are often regarded as black boxes due to challenges in understanding the complex nonlinear relationships between features. Interpreting machine learning models is of paramount importance to mitigate risks, and mechanistic interpretability is in particular of current interest as it opens up a window for guiding manual modifications and reverse-engineering solutions. In this work, we introduce contextual decomposition for transformers (CD-T), extending a prior work on CD for RNNs and CNNs, to address mechanistic interpretation computationally efficiently. CD-T is a flexible interpretation method for transformers. It can capture contributions of combinations of input features or source internal components (e.g. attention heads, feed-forward networks) to (1) final predictions or (2) the output of any target internal component. Using CD-T, we propose a novel algorithm for circuit discovery. On a real-world pathology report classification task: we show CD-T distills a more faithful circuit of attention heads with improved computational efficiency (speed up 2x) than a prior benchmark, path patching. As a versatile interpretation method, CD-T also exhibits exceptional capabilities for local interpretations. CD-T is shown to reliably find words and phrases of contrasting sentiment/topic on SST-2 and AGNews datasets. Through human experiments, we demonstrate CD-T enables users to identify the more accurate of two models and to better trust a model’s outputs compared to alternative interpretation methods such as SHAP and LIME.

[LG-105] A Robust Power Model Training Framework for Cloud Native Runtime Energy Metric Exporter

Link: https://arxiv.org/abs/2407.00878
Authors: Sunyanan Choochotkaew, Chen Wang, Huamin Chen, Tatsuhiro Chiba, Marcelo Amaral, Eun Kyung Lee, Tamar Eilam
Keywords: Estimating power consumption, modern Cloud environments, Estimating power, power consumption, power
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: This is a full-version (8-page) paper of our previous publication in IEEE MASCOTS 2023, which has been accepted as a 4-page short paper (this https URL)

Abstract:Estimating power consumption in modern Cloud environments is essential for carbon quantification toward green computing. Specifically, it is important to properly account for the power consumed by each of the running applications, which are packaged as containers. This paper examines multiple challenges associated with this goal. The first challenge is that multiple customers are sharing the same hardware platform (multi-tenancy), where information on the physical servers is mostly obscured. The second challenge is the overhead in power consumption that the Cloud platform control plane induces. This paper addresses these challenges and introduces a novel pipeline framework for power model training. This allows versatile power consumption approximation of individual containers on the basis of available performance counters and other metrics. The proposed model utilizes machine learning techniques to predict the power consumed by the control plane and associated processes, and uses it for isolating the power consumed by the user containers, from the server power consumption. To determine how well the prediction results in an isolation, we introduce a metric termed isolation goodness. Applying the proposed power model does not require online power measurements, nor does it need information on the physical servers, configuration, or information on other tenants sharing the same machine. The results of cross-workload, cross-platform experiments demonstrated the higher accuracy of the proposed model when predicting power consumption of unseen containers on unknown platforms, including on virtual machines.

[LG-106] Silver Linings in the Shadows: Harnessing Membership Inference for Machine Unlearning

Link: https://arxiv.org/abs/2407.00866
Authors: Nexhi Sula, Abhinav Kumar, Jie Hou, Han Wang, Reza Tourani
Keywords: ensuring user privacy, machine learning, machine learning models, secure machine learning, machine learning framework
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 14 figures, 6 tables

Abstract:With the continued advancement and widespread adoption of machine learning (ML) models across various domains, ensuring user privacy and data security has become a paramount concern. In compliance with data privacy regulations, such as GDPR, a secure machine learning framework should not only grant users the right to request the removal of their contributed data used for model training but also facilitate the elimination of sensitive data fingerprints within machine learning models to mitigate potential attacks, a process referred to as machine unlearning. In this study, we present a novel unlearning mechanism designed to effectively remove the impact of specific data samples from a neural network while considering the performance of the unlearned model on the primary task. In achieving this goal, we crafted a novel loss function tailored to eliminate privacy-sensitive information from weights and activation values of the target model by combining target classification loss and membership inference loss. Our adaptable framework can easily incorporate various privacy leakage approximation mechanisms to guide the unlearning process. We provide empirical evidence of the effectiveness of our unlearning approach with a theoretical upper-bound analysis through a membership inference mechanism as a proof of concept. Our results showcase the superior performance of our approach in terms of unlearning efficacy and latency as well as the fidelity of the primary task, across four datasets and four deep learning architectures.

[LG-107] Towards Understanding Sensitive and Decisive Patterns in Explainable AI: A Case Study of Model Interpretation in Geometric Deep Learning

Link: https://arxiv.org/abs/2407.00849
Authors: Jiajun Zhu, Siqi Miao, Rex Ying, Pan Li
Keywords: gained increasing attention, patterns, decisive patterns, increasing attention, accountability are crucial
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The interpretability of machine learning models has gained increasing attention, particularly in scientific domains where high precision and accountability are crucial. This research focuses on distinguishing between two critical data patterns – sensitive patterns (model-related) and decisive patterns (task-related) – which are commonly used as model interpretations but often lead to confusion. Specifically, this study compares the effectiveness of two main streams of interpretation methods: post-hoc methods and self-interpretable methods, in detecting these patterns. Recently, geometric deep learning (GDL) has shown superior predictive performance in various scientific applications, creating an urgent need for principled interpretation methods. Therefore, we conduct our study using several representative GDL applications as case studies. We evaluate thirteen interpretation methods applied to three major GDL backbone models, using four scientific datasets to assess how well these methods identify sensitive and decisive patterns. Our findings indicate that post-hoc methods tend to provide interpretations better aligned with sensitive patterns, whereas certain self-interpretable methods exhibit strong and stable performance in detecting decisive patterns. Additionally, our study offers valuable insights into improving the reliability of these interpretation methods. For example, ensembling post-hoc interpretations from multiple models trained on the same task can effectively uncover the task’s decisive patterns.

[LG-108] MUSE-Net: Missingness-aware mUlti-branching Self-attention Encoder for Irregular Longitudinal Electronic Health Records

Link: https://arxiv.org/abs/2407.00840
Authors: Zekai Wang, Tieming Liu, Bing Yao
Keywords: clinical decision making, enhance clinical decision, electronic health records, made vast amounts, clinical data readily
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The era of big data has made vast amounts of clinical data readily available, particularly in the form of electronic health records (EHRs), which provides unprecedented opportunities for developing data-driven diagnostic tools to enhance clinical decision making. However, the application of EHRs in data-driven modeling faces challenges such as irregularly spaced multi-variate time series, issues of incompleteness, and data imbalance. Realizing the full data potential of EHRs hinges on the development of advanced analytical models. In this paper, we propose a novel Missingness-aware mUlti-branching Self-attention Encoder (MUSE-Net) to cope with the challenges in modeling longitudinal EHRs for data-driven disease prediction. The MUSE-Net leverages a multi-task Gaussian process (MGP) with missing value masks for data imputation, a multi-branching architecture to address the data imbalance problem, and a time-aware self-attention encoder to account for the irregularly spaced time interval in longitudinal EHRs. We evaluate the proposed MUSE-Net using both synthetic and real-world datasets. Experimental results show that our MUSE-Net outperforms existing methods that are widely used to investigate longitudinal signals.

[LG-109] Kernel Neural Operators (KNOs) for Scalable Memory-efficient Geometrically-flexible Operator Learning

Link: https://arxiv.org/abs/2407.00809
Authors: Matthew Lowery, John Turnage, Zachary Morrow, John D. Jakeman, Akil Narayan, Shandian Zhe, Varun Shankar
Keywords: Kernel Neural Operator, kernel-based integral operators, deep kernel-based integral, operator learning, existing neural operators
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 10 pages + 5 page appendix, 8 figures

Abstract:This paper introduces the Kernel Neural Operator (KNO), a novel operator learning technique that uses deep kernel-based integral operators in conjunction with quadrature for function-space approximation of operators (maps from functions to functions). KNOs use parameterized, closed-form, finitely-smooth, and compactly-supported kernels with trainable sparsity parameters within the integral operators to significantly reduce the number of parameters that must be learned relative to existing neural operators. Moreover, the use of quadrature for numerical integration endows the KNO with geometric flexibility that enables operator learning on irregular geometries. Numerical results demonstrate that on existing benchmarks the training and test accuracy of KNOs is higher than popular operator learning techniques while using at least an order of magnitude fewer trainable parameters. KNOs thus represent a new paradigm of low-memory, geometrically-flexible, deep operator learning, while retaining the implementation simplicity and transparency of traditional kernel methods from both scientific computing and machine learning.
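
The core idea in this abstract, a kernel integral operator evaluated by quadrature, can be sketched for a single 1D layer. Everything concrete below is an assumption for illustration: the Wendland C2 kernel as the compactly-supported kernel, the trapezoid rule as the quadrature, and `support` standing in for a trainable sparsity parameter. The abstract does not state which kernels or quadrature rules the authors use.

```python
def wendland_c2(r, support):
    """Wendland C2 kernel: smooth, closed-form, and exactly zero for r >= support."""
    q = r / support
    if q >= 1.0:
        return 0.0
    return (1.0 - q) ** 4 * (4.0 * q + 1.0)


def trapezoid_rule(n, a, b):
    """Nodes and weights of the composite trapezoid rule with n points on [a, b]."""
    h = (b - a) / (n - 1)
    nodes = [a + i * h for i in range(n)]
    weights = [h] * n
    weights[0] = weights[-1] = h / 2.0
    return nodes, weights


def apply_kernel_operator(u, query_points, nodes, weights, support):
    """Approximate (Ku)(x) = integral of k(|x - y|) u(y) dy by a weighted sum over nodes."""
    return [
        sum(w * wendland_c2(abs(x - y), support) * u(y) for y, w in zip(nodes, weights))
        for x in query_points
    ]


nodes, weights = trapezoid_rule(101, 0.0, 1.0)
values = apply_kernel_operator(lambda y: 1.0, [0.5, 0.05], nodes, weights, support=0.2)
```

Because the operator needs only quadrature nodes and weights, the same code would work on irregular point sets given suitable weights, which is the geometric flexibility the abstract refers to. For the constant input used here, the interior value approximates the kernel's integral, 2 * support / 3.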

[LG-110] Benchmarks for Reinforcement Learning with Biased Offline Data and Imperfect Simulators

Link: https://arxiv.org/abs/2407.00806
Authors: Ori Linial, Guy Tennenholtz, Uri Shalit
Keywords: autonomous vehicles, true for autonomous, Offline Reinforcement Learning, reinforcement learning, agent act
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In many reinforcement learning (RL) applications one cannot easily let the agent act in the world; this is true for autonomous vehicles, healthcare applications, and even some recommender systems, to name a few examples. Offline RL provides a way to train agents without real-world exploration, but is often faced with biases due to data distribution shifts, limited coverage, and incomplete representation of the environment. To address these issues, practical applications have tried to combine simulators with grounded offline data, using so-called hybrid methods. However, constructing a reliable simulator is in itself often challenging due to intricate system complexities as well as missing or incomplete information. In this work, we outline four principal challenges for combining offline data with imperfect simulators in RL: simulator modeling error, partial observability, state and action discrepancies, and hidden confounding. To help drive the RL community to pursue these problems, we construct “Benchmarks for Mechanistic Offline Reinforcement Learning” (B4MRL), which provide dataset-simulator benchmarks for the aforementioned challenges. Our results suggest the key necessity of such benchmarks for future research.

[LG-111] Model-Free Active Exploration in Reinforcement Learning

Link: https://arxiv.org/abs/2407.00801
Authors: Alessio Russo, Alexandre Proutiere
Keywords: Reinforcement Learning, Learning and present, instance-specific lower bound, lower bound, Reinforcement
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. We adopt an information-theoretical viewpoint and start from the instance-specific lower bound of the number of samples that have to be collected to identify a nearly-optimal policy. Deriving this lower bound along with the optimal exploration strategy entails solving an intricate optimization problem and requires a model of the system. In turn, most existing sample optimal exploration algorithms rely on estimating the model. We derive an approximation of the instance-specific lower bound that only involves quantities that can be inferred using model-free approaches. Leveraging this approximation, we devise an ensemble-based model-free exploration strategy applicable to both tabular and continuous Markov decision processes. Numerical results demonstrate that our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.

[LG-112] Enhancing Travel Decision-Making: A Contrastive Learning Approach for Personalized Review Rankings in Accommodations

Link: https://arxiv.org/abs/2407.00787
Authors: Reda Igebaria, Eran Fainman, Sarai Mizrachi, Moran Beladev, Fengjun Wang
Keywords: User-generated reviews significantly, influence consumer decisions, significantly influence consumer, reviews significantly influence, User-generated reviews
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:User-generated reviews significantly influence consumer decisions, particularly in the travel domain when selecting accommodations. This paper's contribution comprises two main elements. Firstly, we present a novel dataset of authentic guest reviews sourced from a prominent online travel platform, totaling over two million reviews from 50,000 distinct accommodations. Secondly, we propose an innovative approach for personalized review ranking. Our method employs contrastive learning to intricately capture the relationship between a review and the contextual information of its respective reviewer. Through a comprehensive experimental study, we demonstrate that our approach surpasses several baselines across all reported metrics. Augmented by a comparative analysis, we showcase the efficacy of our method in elevating personalized review ranking. The implications of our research extend beyond the travel domain, with potential applications in other sectors where personalized review ranking is paramount, such as online e-commerce platforms.

[LG-113] Towards Faster Matrix Diagonalization with Graph Isomorphism Networks and the AlphaZero Framework

Link: https://arxiv.org/abs/2407.00779
Authors: Geigh Zollicoffer, Kshitij Bhatta, Manish Bhattarai, Phil Romero, Christian F. A. Negre, Anders M. N. Niklasson, Adetokunbo Adedoyin
Keywords: Markov Decision Process, Semi-Markov Decision Process, Markov Decision, introduce innovative approaches, Decision Process
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Accepted to Deployable RL: From Research to Practice workshop @ RLC conference

Abstract:In this paper, we introduce innovative approaches for accelerating the Jacobi method for matrix diagonalization, specifically through the formulation of large matrix diagonalization as a Semi-Markov Decision Process and small matrix diagonalization as a Markov Decision Process. Furthermore, we examine the potential of utilizing scalable architecture between different-sized matrices. During a short training period, our method discovered a significant reduction in the number of steps required for diagonalization and exhibited efficient inference capabilities. Importantly, this approach demonstrated possible scalability to large-sized matrices, indicating its potential for wide-ranging applicability. Upon training completion, we obtain action-state probabilities and transition graphs, which depict transitions between different states. These outputs not only provide insights into the diagonalization process but also pave the way for cost savings pertinent to large-scale matrices. The advancements made in this research enhance the efficacy and scalability of matrix diagonalization, pushing for new possibilities for deployment in practical applications in scientific and engineering domains.
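
For readers unfamiliar with the baseline being accelerated, here is a minimal sketch of the classical cyclic Jacobi eigenvalue method for a symmetric matrix. It visits pivots in a fixed row-by-row order; the paper instead treats the choice of pivot as a decision problem, and that learned policy is not reproduced here. Function names, the tolerance, and the sweep limit are assumptions.

```python
import math


def off_diagonal_norm(a):
    """Frobenius norm of the off-diagonal entries."""
    n = len(a)
    return math.sqrt(sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j))


def jacobi_rotate(a, p, q):
    """Apply one Jacobi rotation in place so that a[p][q] and a[q][p] become zero."""
    if a[p][q] == 0.0:
        return
    tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
    # Smaller-magnitude root of t^2 + 2*tau*t - 1 = 0, where t = tan(theta).
    t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = t * c
    apq = a[p][q]
    a[p][p] -= t * apq
    a[q][q] += t * apq
    a[p][q] = a[q][p] = 0.0
    for r in range(len(a)):
        if r != p and r != q:
            arp, arq = a[r][p], a[r][q]
            a[r][p] = a[p][r] = c * arp - s * arq
            a[r][q] = a[q][r] = s * arp + c * arq


def jacobi_eigenvalues(matrix, tol=1e-12, max_sweeps=50):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi sweeps, sorted ascending."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        if off_diagonal_norm(a) <= tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                jacobi_rotate(a, p, q)
    return sorted(a[i][i] for i in range(n))
```

Each rotation zeroes one off-diagonal pair but can reintroduce nonzeros elsewhere, so the order of pivots affects how many rotations are needed. That ordering is the degree of freedom a learned policy can exploit.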

[LG-114] Structured and Balanced Multi-component and Multi-layer Neural Networks

Link: https://arxiv.org/abs/2407.00765
Authors: Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou
Keywords: multi-layer neural network, computation cost, efficiency in terms, terms of degrees, degrees of freedom
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: Our codes and implementation details are available at this https URL

Abstract: In this work, we propose a balanced multi-component and multi-layer neural network (MMNN) structure to approximate functions with complex features with both accuracy and efficiency in terms of degrees of freedom and computation cost. The main idea is a multi-component and multi-layer decomposition of a complex function in a “divide-and-conquer” strategy, where each component can be approximated effectively by a single-layer network. Although MMNNs are an easy modification of fully connected neural networks (FCNNs) or multi-layer perceptrons (MLPs) through the introduction of balanced multi-component structures in the network, they achieve a significant reduction of training parameters, a much more efficient training process, and a much improved accuracy compared to FCNNs or MLPs. Extensive numerical experiments are presented to illustrate the effectiveness of MMNNs in approximating highly oscillatory functions and their automatic adaptivity in capturing localized features.

[LG-115] Improving the performance of Stein variational inference through extreme sparsification of physically-constrained neural network models

Link: https://arxiv.org/abs/2407.00761
Authors: Govinda Anantha Padmanabha, Jan Niklas Fuhg, Cosmin Safta, Reese E. Jones, Nikolaos Bouklas
Keywords: scientific machine learning, neural networks involve, networks involve hundreds, machine learning, thousands of parameters
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 30 pages, 11 figures

Abstract: Most scientific machine learning (SciML) applications of neural networks involve hundreds to thousands of parameters, and hence, uncertainty quantification for such models is plagued by the curse of dimensionality. Using physical applications, we show that L_0 sparsification prior to Stein variational gradient descent (L_0+SVGD) is a more robust and efficient means of uncertainty quantification, in terms of computational cost and performance, than the direct application of SVGD or projected SVGD methods. Specifically, L_0+SVGD demonstrates superior resilience to noise, the ability to perform well in extrapolated regions, and a faster convergence rate to an optimal solution.

[LG-116] Improved Graph-based semi-supervised learning Schemes

Link: https://arxiv.org/abs/2407.00760
Authors: Farid Bozorgnia
Keywords: Random Fields Learning, Gaussian Random Fields, address the classification, classification of large, Poisson Learning algorithms
Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
Comments:

Abstract:In this work, we improve the accuracy of several known algorithms to address the classification of large datasets when few labels are available. Our framework lies in the realm of graph-based semi-supervised learning. With novel modifications on Gaussian Random Fields Learning and Poisson Learning algorithms, we increase the accuracy and create more robust algorithms. Experimental results demonstrate the efficiency and superiority of the proposed methods over conventional graph-based semi-supervised techniques, especially in the context of imbalanced datasets.
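
As background for the Gaussian Random Fields baseline mentioned above, the following is a minimal sketch of standard harmonic label propagation on a weighted graph: each unlabeled node repeatedly takes the weighted average of its neighbors while labeled nodes stay fixed. It illustrates only the well-known baseline, not the paper's modifications; the function name `harmonic_labels` and the dense adjacency-list input format are assumptions made for the example.

```python
def harmonic_labels(weights, labels, iters=500):
    """Harmonic-function label propagation (Gauss-Seidel style updates).

    weights: symmetric n x n list of lists with nonnegative edge weights
             and a zero diagonal (assumed input format).
    labels:  dict mapping labeled node index -> score in [0, 1].
    Returns a list of scores, one per node.
    """
    n = len(weights)
    f = [labels.get(i, 0.5) for i in range(n)]  # unlabeled nodes start at 0.5
    for _ in range(iters):
        for i in range(n):
            if i in labels:
                continue  # labeled nodes are clamped
            degree = sum(weights[i])
            if degree > 0:
                f[i] = sum(weights[i][j] * f[j] for j in range(n)) / degree
    return f
```

On a four-node path graph with the endpoints labeled 0 and 1, the interior scores interpolate linearly, which is the behavior the harmonic solution is expected to show.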

[LG-117] Self-consistent Deep Geometric Learning for Heterogeneous Multi-source Spatial Point Data Prediction

Link: https://arxiv.org/abs/2407.00748
Authors: Dazhou Yu, Xiaoyun Gong, Yun Li, Meikang Qiu, Liang Zhao
Keywords: holistic environmental understanding, natural resource management, environmental understanding, resource management, environmental monitoring
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Multi-source spatial point data prediction is crucial in fields like environmental monitoring and natural resource management, where integrating data from various sensors is the key to achieving a holistic environmental understanding. Existing models in this area often fall short due to their domain-specific nature and lack a strategy for integrating information from various sources in the absence of ground truth labels. Key challenges include evaluating the quality of different data sources and modeling spatial relationships among them effectively. Addressing these issues, we introduce an innovative multi-source spatial point data prediction framework that adeptly aligns information from varied sources without relying on ground truth labels. A unique aspect of our method is the ‘fidelity score,’ a quantitative measure for evaluating the reliability of each data source. Furthermore, we develop a geo-location-aware graph neural network tailored to accurately depict spatial relationships between data points. Our framework has been rigorously tested on two real-world datasets and one synthetic dataset. The results consistently demonstrate its superior performance over existing state-of-the-art methods.

[LG-118] Posterior Sampling with Denoising Oracles via Tilted Transport

Link: https://arxiv.org/abs/2407.00745
Authors: Joan Bruna, Jiequn Han
Keywords: Score-based diffusion models, Score-based diffusion, significantly advanced high-dimensional, advanced high-dimensional data, high-dimensional data generation
Subjects: Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO); Machine Learning (stat.ML)
Comments:

Abstract: Score-based diffusion models have significantly advanced high-dimensional data generation across various domains, by learning a denoising oracle (or score) from datasets. From a Bayesian perspective, they offer a realistic modeling of data priors and facilitate solving inverse problems through posterior sampling. Although many heuristic methods have been developed recently for this purpose, they lack the quantitative guarantees needed in many scientific applications. In this work, we introduce the tilted transport technique, which leverages the quadratic structure of the log-likelihood in linear inverse problems in combination with the prior denoising oracle to transform the original posterior sampling problem into a new ‘boosted’ posterior that is provably easier to sample from. We quantify the conditions under which this boosted posterior is strongly log-concave, highlighting the dependencies on the condition number of the measurement matrix and the signal-to-noise ratio. The resulting posterior sampling scheme is shown to reach the computational threshold predicted for sampling Ising models [Kunisky’23] with a direct analysis, and is further validated on high-dimensional Gaussian mixture models and scalar field $\varphi^4$ models.

[LG-119] Disentangled Representations for Causal Cognition

Link: https://arxiv.org/abs/2407.00744
Authors: Filippo Torresan, Manuel Baltieri
Keywords: Complex adaptive agents, combined agent-environment systems, Complex adaptive, adaptive agents consistently, agents consistently achieve
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 49 pages, 9 figures

Abstract: Complex adaptive agents consistently achieve their goals by solving problems that seem to require an understanding of causal information, information pertaining to the causal relationships that exist among elements of combined agent-environment systems. Causal cognition studies and describes the main characteristics of causal learning and reasoning in human and non-human animals, offering a conceptual framework to discuss cognitive performances based on the level of apparent causal understanding of a task. Despite the use of formal intervention-based models of causality, including causal Bayesian networks, psychological and behavioural research on causal cognition does not yet offer a computational account that operationalises how agents acquire a causal understanding of the world. Machine and reinforcement learning research on causality, especially involving disentanglement as a candidate process to build causal representations, represents a concrete attempt at designing causal artificial agents that can shed light on the inner workings of natural causal cognition. In this work, we connect these two areas of research to build a unifying framework for causal cognition that will offer a computational perspective on studies of animal cognition, and provide insights into the development of new algorithms for causal reinforcement learning in AI.

[LG-120] PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph

Link: https://arxiv.org/abs/2407.00742
Authors: Dazhou Yu, Yuntong Hu, Yun Li, Liang Zhao
Keywords: building pattern classification, geographic question answering, encompassing tasks, shape coding, building pattern
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Polygon representation learning is essential for diverse applications, encompassing tasks such as shape coding, building pattern classification, and geographic question answering. While recent years have seen considerable advancements in this field, much of the focus has been on single polygons, overlooking the intricate inner- and inter-polygonal relationships inherent in multipolygons. To address this gap, our study introduces a comprehensive framework specifically designed for learning representations of polygonal geometries, particularly multipolygons. Central to our approach is the incorporation of a heterogeneous visibility graph, which seamlessly integrates both inner- and inter-polygonal relationships. To enhance computational efficiency and minimize graph redundancy, we implement a heterogeneous spanning tree sampling method. Additionally, we devise a rotation-translation invariant geometric representation, ensuring broader applicability across diverse scenarios. Finally, we introduce Multipolygon-GNN, a novel model tailored to leverage the spatial and semantic heterogeneity inherent in the visibility graph. Experiments on five real-world and synthetic datasets demonstrate its ability to capture informative representations for polygonal geometries.

[LG-121] Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints

Link: https://arxiv.org/abs/2407.00741
Authors: Jianuo Huang
Keywords: Multi-agent Reinforcement Learning, Multi-agent Reinforcement, Reinforcement Learning, advancements in Multi-agent, safety-critical scenarios
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures

Abstract:In recent advancements in Multi-agent Reinforcement Learning (MARL), its application has extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action. Our framework is grounded in the Centralized Training with Decentralized Execution (CTDE) architecture, augmented by a Diffusion Model for prediction trajectory generation. Additionally, we incorporate a specialized algorithm to further ensure operational safety. We evaluate our model against baselines on the DSRL benchmark. Experiment results demonstrate that our model not only adheres to stringent safety constraints but also achieves superior performance compared to existing methodologies. This underscores the potential of our approach in advancing the safety and efficacy of MARL in real-world applications.

[LG-122] Locate&Edit: Energy-based Text Editing for Efficient, Flexible, and Faithful Controlled Text Generation

Link: https://arxiv.org/abs/2407.00740
Authors: Hye Ryung Son, Jay-Yoon Lee
Keywords: Recent approaches, base language models, decoding time, base, approaches to controlled
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 2 figures

Abstract: Recent approaches to controlled text generation (CTG) often involve manipulating the weights or logits of base language models (LMs) at decoding time. However, these methods are inapplicable to the latest black-box LMs and ineffective at preserving the core semantics of the base LM’s original generations. In this work, we propose Locate&Edit (LE), an efficient and flexible energy-based approach to CTG, which edits text outputs from a base LM using off-the-shelf energy models. Given text outputs from the base LM, LE first locates spans that are most relevant to constraints (e.g., toxicity) utilizing energy models, and then edits these spans by replacing them with more suitable alternatives. Importantly, our method is compatible with black-box LMs, as it requires only the text outputs. Also, since LE doesn’t mandate a specific architecture for its component models, it can work with a diverse combination of available off-the-shelf models. Moreover, LE preserves the base LM’s original generations, by selectively modifying constraint-related aspects of the texts and leaving others unchanged. These targeted edits also ensure that LE operates efficiently. Our experiments confirm that LE achieves superior semantic preservation of the base LM generations and speed, while simultaneously obtaining competitive or improved constraint satisfaction. Furthermore, we analyze how the granularity of energy distribution impacts CTG performance and find that fine-grained, regression-based energy models improve constraint satisfaction, compared to conventional binary classifier energy models.

[LG-123] Engineering an Efficient Object Tracker for Non-Linear Motion

Link: https://arxiv.org/abs/2407.00738
Authors: Momir Adžemović, Predrag Tadić, Andrija Petrović, Mladen Nikolić
Keywords: maintaining unique identifiers, video frames, detect and track, scene while maintaining, maintaining unique
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 3 figures, 20 tables

Abstract:The goal of multi-object tracking is to detect and track all objects in a scene while maintaining unique identifiers for each, by associating their bounding boxes across video frames. This association relies on matching motion and appearance patterns of detected objects. This task is especially hard in case of scenarios involving dynamic and non-linear motion patterns. In this paper, we introduce DeepMoveSORT, a novel, carefully engineered multi-object tracker designed specifically for such scenarios. In addition to standard methods of appearance-based association, we improve motion-based association by employing deep learnable filters (instead of the most commonly used Kalman filter) and a rich set of newly proposed heuristics. Our improvements to motion-based association methods are severalfold. First, we propose a new transformer-based filter architecture, TransFilter, which uses an object’s motion history for both motion prediction and noise filtering. We further enhance the filter’s performance by careful handling of its motion history and accounting for camera motion. Second, we propose a set of heuristics that exploit cues from the position, shape, and confidence of detected bounding boxes to improve association performance. Our experimental evaluation demonstrates that DeepMoveSORT outperforms existing trackers in scenarios featuring non-linear motion, surpassing state-of-the-art results on three such datasets. We also perform a thorough ablation study to evaluate the contributions of different tracker components which we proposed. Based on our study, we conclude that using a learnable filter instead of the Kalman filter, along with appearance-based association is key to achieving strong general tracking performance.

[LG-124] Large Language Models Struggle in Token-Level Clinical Named Entity Recognition

Link: https://arxiv.org/abs/2407.00731
Authors: Qiuhao Lu, Rui Li, Andrew Wen, Jinlian Wang, Liwei Wang, Hongfang Liu
Keywords: Large Language Models, Large Language, Named Entity Recognition, Language Models, token-level NER
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AMIA 2024 Annual Symposium Proceedings

Abstract:Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPT for token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.

[LG-125] A Whole-Process Certifiably Robust Aggregation Method Against Backdoor Attacks in Federated Learning

Link: https://arxiv.org/abs/2407.00719
Authors: Anqi Zhou, Yezheng Liu, Yidong Chai, Hongyi Zhu, Xinyue Ge, Yuanchun Jiang, Meng Wang
Keywords: Federated Learning, garnered widespread adoption, garnered widespread, widespread adoption, Federated
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 14 pages

Abstract:Federated Learning (FL) has garnered widespread adoption across various domains such as finance, healthcare, and cybersecurity. Nonetheless, FL remains under significant threat from backdoor attacks, wherein malicious actors insert triggers into trained models, enabling them to perform certain tasks while still meeting FL’s primary objectives. In response, robust aggregation methods have been proposed, which can be divided into three types: ex-ante, ex-durante, and ex-post methods. Given the complementary nature of these methods, combining all three types is promising yet unexplored. Such a combination is non-trivial because it requires leveraging their advantages while overcoming their disadvantages. Our study proposes a novel whole-process certifiably robust aggregation (WPCRA) method for FL, which enhances robustness against backdoor attacks across three phases: ex-ante, ex-durante, and ex-post. Moreover, since the current geometric median estimation method fails to consider differences among clients, we propose a novel weighted geometric median estimation algorithm (WGME). This algorithm estimates the geometric median of model updates from clients based on each client’s weight, further improving the robustness of WPCRA against backdoor attacks. We also theoretically prove that WPCRA offers improved certified robustness guarantees with a larger certified radius. We evaluate the advantages of our methods based on the task of loan status prediction. Comparison with baselines shows that our methods significantly improve FL’s robustness against backdoor attacks. This study contributes to the literature with a novel WPCRA method and a novel WGME algorithm. Our code is available at this https URL.
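
To make the idea of a weighted geometric median concrete, here is a minimal sketch using the standard Weiszfeld iteration with per-point weights. This is a generic textbook procedure, not the paper's WGME algorithm, and how client weights would be chosen is not covered. The function name `weighted_geometric_median` and its parameters are illustrative assumptions.

```python
import math


def weighted_geometric_median(points, weights, iters=200, eps=1e-9):
    """Weighted geometric median via the standard Weiszfeld iteration.

    points:  list of equal-length coordinate tuples (e.g. model updates).
    weights: list of positive weights, one per point.
    Minimizes sum_i weights[i] * ||y - points[i]||_2 over y.
    """
    dim = len(points[0])
    total = sum(weights)
    # Start from the weighted mean.
    y = [sum(w * p[d] for p, w in zip(points, weights)) / total
         for d in range(dim)]
    for _ in range(iters):
        numerator = [0.0] * dim
        denominator = 0.0
        for p, w in zip(points, weights):
            # Clamp the distance to avoid division by zero at a data point.
            dist = max(math.dist(p, y), eps)
            coef = w / dist
            denominator += coef
            for d in range(dim):
                numerator[d] += coef * p[d]
        y = [v / denominator for v in numerator]
    return y
```

Unlike the mean, the result stays close to the bulk of the points when a single update is far away, which is the property robust aggregation relies on.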

[LG-126] Learning System Dynamics without Forgetting

Link: https://arxiv.org/abs/2407.00717
Authors: Xikun Zhang, Dongjin Song, Yushan Jiang, Yixin Chen, Dacheng Tao
Keywords: dynamics, governing rules, including physics, physics and biology, systems
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract: Predicting the trajectories of systems with unknown dynamics (i.e., the governing rules) is crucial in various research fields, including physics and biology. This challenge has gathered significant attention from diverse communities. Most existing works focus on learning fixed system dynamics within one single system. However, real-world applications often involve multiple systems with different types of dynamics or evolving systems with non-stationary dynamics (dynamics shifts). When data from those systems are continuously collected and sequentially fed to machine learning models for training, these models tend to be biased toward the most recently learned dynamics, leading to catastrophic forgetting of previously observed/learned system dynamics. To this end, we aim to learn system dynamics via continual learning. Specifically, we present a novel framework of Mode-switching Graph ODE (MS-GODE), which can continually learn varying dynamics and encode the system-specific dynamics into binary masks over the model parameters. During the inference stage, the model can select the most confident mask based on the observational data to identify the system and predict future trajectories accordingly. Empirically, we systematically investigate the task configurations and compare the proposed MS-GODE with state-of-the-art techniques. More importantly, we construct a novel benchmark of biological dynamic systems, featuring diverse systems with disparate dynamics and significantly enriching the research field of machine learning for dynamic systems.

[LG-127] Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data

Link: https://arxiv.org/abs/2407.00710
Authors: Tuan L. Vo, Uyen Dang, Thu Nguyen
Keywords: Artificial Intelligence, Linear Discriminant Analysis, Discriminant Analysis, real-life applications, Linear Discriminant
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: As Artificial Intelligence (AI) models are gradually being adopted in real-life applications, the explainability of the model used is critical, especially in high-stakes areas such as medicine and finance. Among the commonly used models, Linear Discriminant Analysis (LDA) is a widely used classification tool that is also explainable thanks to its ability to model class distributions and maximize class separation through linear feature combinations. Nevertheless, real-world data is frequently incomplete, presenting significant challenges for classification tasks and model explanations. In this paper, we propose a novel approach to LDA under missing data, termed Weighted missing Linear Discriminant Analysis (WLDA), which classifies observations in data that contains missing values without imputation by estimating the parameters directly on the missing data and using a weight matrix to penalize missing entries during classification. Furthermore, we also analyze the theoretical properties and examine the explainability of the proposed technique in a comprehensive manner. Experimental results demonstrate that WLDA outperforms conventional methods by a significant margin, particularly in scenarios where missing values are present in both training and test sets.

[LG-128] Heterogeneous Graph Contrastive Learning with Spectral Augmentation

Link: https://arxiv.org/abs/2407.00708
Authors: Jing Zhang, Xiaoqian Jiang, Yingjie Xie, Cangqi Zhou
Keywords: heterogeneous graph, complex entity relationships, heterogeneous graph representation, graph, graph structure information
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Heterogeneous graphs can well describe the complex entity relationships in the real world. For example, online shopping networks contain multiple physical types of consumers and products, as well as multiple relationship types such as purchasing and favoriting. More and more scholars pay attention to this research because heterogeneous graph representation learning shows strong application potential in real-world scenarios. However, the existing heterogeneous graph models use data augmentation techniques to enhance the use of graph structure information, which only captures the graph structure information from the spatial topology, ignoring the information displayed in the spectrum dimension of the graph structure. To address the issue that heterogeneous graph representation learning methods fail to model spectral information, this paper introduces a spectral-enhanced graph contrastive learning model (SHCL) and proposes a spectral augmentation algorithm for the first time in heterogeneous graph neural networks. The proposed model learns an adaptive topology augmentation scheme through the heterogeneous graph itself, disrupting the structural information of the heterogeneous graph in the spectrum dimension, and ultimately improving the learning effect of the model. Experimental results on multiple real-world datasets demonstrate substantial advantages of the proposed model.

[LG-129] Sum-of-norms regularized Nonnegative Matrix Factorization

Link: https://arxiv.org/abs/2407.00706
Authors: Andersen Ang, Waqas Bin Hamed, Hans De Sterck
Keywords: nonnegative matrix factorization, applying nonnegative matrix, rank, parameter is unknown, NMF
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 22 pages, 12 figures

Abstract: When applying nonnegative matrix factorization (NMF), generally the rank parameter is unknown. Such rank in NMF, called the nonnegative rank, is usually estimated heuristically since computing its exact value is NP-hard. In this work, we propose an approximation method to estimate such rank while solving NMF on-the-fly. We use sum-of-norm (SON), a group-lasso structure that encourages pairwise similarity, to reduce the rank of a factor matrix where the rank is overestimated at the beginning. On various datasets, SON-NMF is able to reveal the correct nonnegative rank of the data without any prior knowledge or tuning. SON-NMF is a nonconvex, nonsmooth, non-separable, non-proximable problem, so solving it is nontrivial. First, as rank estimation in NMF is NP-hard, the proposed approach does not enjoy a lower computational complexity. Using a graph-theoretic argument, we prove that the complexity of SON-NMF is almost irreducible. Second, the per-iteration cost of any algorithm solving SON-NMF is possibly high, which motivated us to propose a first-order BCD algorithm that approximately solves SON-NMF with a low per-iteration cost by using the proximal average operator. Lastly, we propose a simple greedy method for post-processing. SON-NMF exhibits favourable features for applications. Besides the ability to automatically estimate the rank from data, SON-NMF can deal with rank-deficient data matrices and can detect weak components with small energy. Furthermore, in the application of hyperspectral imaging, SON-NMF handles the issue of spectral variability naturally.
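
The following minimal sketch illustrates how a sum-of-norms penalty encourages pairwise similarity. It assumes the penalty is the sum of Euclidean distances between all pairs of columns of a factor matrix, which is one common form of SON regularization; the abstract does not give the exact formula. The names `son_penalty` and `count_distinct_columns` are hypothetical helpers for illustration.

```python
import math
from itertools import combinations


def son_penalty(columns):
    """Sum of Euclidean distances over all pairs of factor columns.

    Identical columns contribute zero, so minimizing this term pushes
    columns to merge (assumed form of the SON regularizer).
    """
    return sum(math.dist(u, v) for u, v in combinations(columns, 2))


def count_distinct_columns(columns, tol=1e-6):
    """Count columns that differ by more than `tol` from all kept ones.

    A simple greedy stand-in for reading off the estimated rank once
    redundant columns have collapsed onto each other.
    """
    kept = []
    for col in columns:
        if all(math.dist(col, k) > tol for k in kept):
            kept.append(col)
    return len(kept)
```

For example, three columns of which two coincide give a penalty that counts only the distances to the third column, and the distinct-column count drops from three to two.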

[LG-130] Tackling Long-Horizon Tasks with Model-based Offline Reinforcement Learning

Link: https://arxiv.org/abs/2407.00699
Authors: Kwanyoung Park, Youngwoon Lee
Keywords: generating imaginary trajectories, offline reinforcement learning, Lower Expectile Q-learning, reinforcement learning, compelling approach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract: Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, it falls short in solving long-horizon tasks due to high bias in value estimation from model rollouts. In this paper, we introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ), which enhances long-horizon task performance by mitigating the high bias in model-based value estimation via expectile regression of $\lambda$-returns. Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches. Our experiments demonstrate that expectile regression, $\lambda$-returns, and critic training on offline data are all crucial for addressing long-horizon tasks. Additionally, LEQ achieves performance comparable to the state-of-the-art model-based and model-free offline RL methods on the NeoRL benchmark and the D4RL MuJoCo Gym tasks.
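
The two ingredients named in the abstract can be sketched in a few lines. The code below shows the standard recursive definition of λ-returns and the standard asymmetric squared (expectile) loss; it is a generic illustration under those textbook definitions, not the LEQ implementation, and the function names and default hyperparameters are assumptions.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one rollout.

    rewards[t] is the reward at step t and next_values[t] is the value
    estimate V(s_{t+1}). Uses the recursion
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    bootstrapping from the final value estimate.
    """
    g = next_values[-1]
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        out[t] = g
    return out


def expectile_loss(target, prediction, tau=0.1):
    """Asymmetric squared loss whose minimizer is the tau-expectile.

    With tau < 0.5, predictions above the target are penalized more than
    predictions below it, which yields a conservative (lower) estimate.
    """
    diff = target - prediction
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff
```

Choosing `tau` below 0.5 is what makes the estimate a lower expectile: overestimating a return costs more than underestimating it by the same amount.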

[LG-131] NourishNet: Proactive Severity State Forecasting of Food Commodity Prices for Global Warning Systems

Link: https://arxiv.org/abs/2407.00698
Authors: Sydney Balboni, Grace Ivey, Brett Storoe, John Cisler, Tyge Plater, Caitlyn Grant, Ella Bruce, Benjamin Paulson
Keywords: critical signal indicating, signal indicating potential, indicating potential disruptions, global food commodities, critical signal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); General Economics (econ.GN); Numerical Analysis (math.NA)
Comments: MICS 2024 1st Place Paper, MSOE AI-Club Research Group

Abstract:Price volatility in global food commodities is a critical signal indicating potential disruptions in the food market. Understanding forthcoming changes in these prices is essential for bolstering food security, particularly for nations at risk. The Food and Agriculture Organization of the United Nations (FAO) previously developed sophisticated statistical frameworks for the proactive prediction of food commodity prices, aiding in the creation of global early warning systems. These frameworks utilize food security indicators to produce accurate forecasts, thereby facilitating preparations against potential food shortages. Our research builds on these foundations by integrating robust price security indicators with cutting-edge deep learning (DL) methodologies to reveal complex interdependencies. DL techniques examine intricate dynamics among diverse factors affecting food prices. Through sophisticated time-series forecasting models coupled with a classification model, our approach enhances existing models to better support communities worldwide in advancing their food security initiatives.

[LG-132] Graph in Graph Neural Network

Link: https://arxiv.org/abs/2407.00696
Authors: Jiongshu Wang, Jing Yang, Jiankang Deng, Hatice Gunes, Siyang Song
Keywords: Existing Graph Neural, Graph Neural Networks, GIG, describe complex objects, GIG sample
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Existing Graph Neural Networks (GNNs) are limited to processing graphs whose vertices are each represented by a vector or a single value, which limits their capability to describe complex objects. In this paper, we propose the first GNN (called Graph in Graph Neural (GIG) Network) which can process graph-style data (called GIG sample) whose vertices are further represented by graphs. Given a set of graphs or a data sample whose components can be represented by a set of graphs (called multi-graph data sample), our GIG network starts with a GIG sample generation (GSG) module which encodes the input as a GIG sample, where each GIG vertex includes a graph. Then, a set of GIG hidden layers are stacked, with each consisting of: (1) a GIG vertex-level updating (GVU) module that individually updates the graph in every GIG vertex based on its internal information; and (2) a global-level GIG sample updating (GGU) module that updates graphs in all GIG vertices based on their relationships, making the updated GIG vertices global context-aware. This way, both internal cues within the graph contained in each GIG vertex and the relationships among GIG vertices can be utilized for downstream tasks. Experimental results demonstrate that our GIG network generalizes well to not only various generic graph analysis tasks but also real-world multi-graph data analysis (e.g., human skeleton video-based action recognition), achieving new state-of-the-art results on 13 out of 14 evaluated datasets. Our code is publicly available at this https URL.

[LG-133] BAPO: Base-Anchored Preference Optimization for Personalized Alignment in Large Language Models

Link: https://arxiv.org/abs/2407.00693
Authors: Gihun Lee, Minchan Jeong, Yujin Kim, Hojung Jung, Jaehoon Oh, Sangmook Kim, Se-Young Yun
Keywords: align Large Language, Large Language Models, Large Language, shown remarkable success, align Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: under review

Abstract:While learning to align Large Language Models (LLMs) with human preferences has shown remarkable success, aligning these models to meet diverse user preferences presents further challenges in preserving previous knowledge. This paper examines the impact of personalized preference optimization on LLMs, revealing that the extent of knowledge loss varies significantly with preference heterogeneity. Although previous approaches have utilized the KL constraint between the reference model and the policy model, we observe that they fail to maintain general knowledge and alignment when facing personalized preferences. To this end, we introduce Base-Anchored Preference Optimization (BAPO), a simple yet effective approach that utilizes the initial responses of the reference model to mitigate forgetting while accommodating personalized alignment. BAPO effectively adapts to diverse user preferences while minimally affecting global knowledge or general alignment. Our experiments demonstrate the efficacy of BAPO in various setups.

[LG-134] TEAL: New Selection Strategy for Small Buffers in Experience Replay Class Incremental Learning

Link: https://arxiv.org/abs/2407.00673
Authors: Shahar Shaul-Ariel, Daphna Weinshall
Keywords: Continual Learning, called Catastrophic Forgetting, unresolved challenge, modern applications, relevance increases
Categories: Machine Learning (cs.LG)

Abstract:Continual Learning is an unresolved challenge, whose relevance increases when considering modern applications. Unlike the human brain, trained deep neural networks suffer from a phenomenon called Catastrophic Forgetting, where they progressively lose previously acquired knowledge upon learning new tasks. To mitigate this problem, numerous methods have been developed, many relying on replaying past exemplars during new task training. However, as the memory allocated for replay decreases, the effectiveness of these approaches diminishes. On the other hand, maintaining a large memory for the purpose of replay is inefficient and often impractical. Here we introduce TEAL, a novel approach to populate the memory with exemplars, that can be integrated with various experience-replay methods and significantly enhance their performance on small memory buffers. We show that TEAL improves the average accuracy of the SOTA method XDER as well as ER and ER-ACE on several image recognition benchmarks, with a small memory buffer of 1-3 exemplars per class in the final task. This confirms the hypothesis that when memory is scarce, it is best to prioritize the most typical data.

[LG-135] Establishing Deep InfoMax as an effective self-supervised learning methodology in materials informatics

Link: https://arxiv.org/abs/2407.00671
Authors: Michael Moran, Vladimir V. Gusev, Michael W. Gaultois, Dmytro Antypov, Matthew J. Rosseinsky
Keywords: Deep InfoMax pretraining, Crystallographic Information File, Deep InfoMax, property labels remains, property labels
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:The scarcity of property labels remains a key challenge in materials informatics, whereas materials data without property labels are abundant in comparison. By pretraining supervised property prediction models on self-supervised tasks that depend only on the “intrinsic information” available in any Crystallographic Information File (CIF), there is potential to leverage the large amount of crystal data without property labels to improve property prediction results on small datasets. We apply Deep InfoMax as a self-supervised machine learning framework for materials informatics that explicitly maximises the mutual information between a point set (or graph) representation of a crystal and a vector representation suitable for downstream learning. This allows the pretraining of supervised models on large materials datasets without the need for property labels and without requiring the model to reconstruct the crystal from a representation vector. We investigate the benefits of Deep InfoMax pretraining implemented on the Site-Net architecture to improve the performance of downstream property prediction models with small amounts (10^3) of data, a situation relevant to experimentally measured materials property databases. Using a property label masking methodology, where we perform self-supervised learning on larger supervised datasets and then train supervised models on a small subset of the labels, we isolate Deep InfoMax pretraining from the effects of distributional shift. We demonstrate performance improvements in the contexts of representation learning and transfer learning on the tasks of band gap and formation energy prediction. Having established the effectiveness of Deep InfoMax pretraining in a controlled environment, our findings provide a foundation for extending the approach to address practical challenges in materials informatics.

[LG-136] Improving Real-Time Music Accompaniment Separation with MMDenseNet

Link: https://arxiv.org/abs/2407.00657
Authors: Chun-Hsiang Wang, Chung-Che Wang, Jun-You Wang, Jyh-Shing Roger Jang, Yen-Hsun Chu
Keywords: separate polyphonic music, Music source separation, Music source, polyphonic music, source separation aims
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Music source separation aims to separate polyphonic music into different types of sources. Most existing methods focus on enhancing the quality of separated results by using a larger model structure, rendering them unsuitable for deployment on edge devices. Moreover, these methods may produce low-quality output when the input duration is short, making them impractical for real-time applications. Therefore, the goal of this paper is to enhance a lightweight model, MMDenseNet, to strike a balance between separation quality and latency for real-time applications. Different directions of improvement are explored or proposed in this paper, including complex ideal ratio mask, self-attention, band-merge-split method, and feature look back. Source-to-distortion ratio, real-time factor, and optimal latency are employed to evaluate the performance. To align with our application requirements, the evaluation process in this paper focuses on the separation performance of the accompaniment part. Experimental results demonstrate that our improvement achieves low real-time factor and optimal latency while maintaining acceptable separation quality.
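
Two of the metrics named in the abstract can be written down compactly. The sketch below assumes the plain energy-ratio form of source-to-distortion ratio (the paper may use a fuller variant such as the BSS Eval decomposition) and the common definition of real-time factor as processing time divided by audio duration. The function names are illustrative and do not come from the paper.

```python
import math

def sdr_db(reference, estimate):
    """Simple SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal is silent; SDR is undefined")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)

def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; lower is faster."""
    return processing_seconds / audio_seconds
```

A real-time factor below 1 means the model processes audio faster than it plays back, which is the minimum requirement for streaming use.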

[LG-137] HASNAS: A Hardware-Aware Spiking Neural Architecture Search Framework for Neuromorphic Compute-in-Memory Systems

Link: https://arxiv.org/abs/2407.00641
Authors: Rachmad Vidya Wicaksana Putra, Muhammad Shafique
Keywords: Artificial Neural Networks, Spiking Neural Networks, machine learning tasks, SNN, solving diverse machine
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 9 pages, 13 figures, 2 tables

Abstract:Spiking Neural Networks (SNNs) have shown capabilities for solving diverse machine learning tasks with ultra-low-power/energy computation. To further improve the performance and efficiency of SNN inference, the Compute-in-Memory (CIM) paradigm with emerging device technologies such as resistive random access memory is employed. However, most SNN architectures are developed without considering constraints from the application and the underlying CIM hardware (e.g., memory, area, latency, and energy consumption). Moreover, most SNN designs are derived from Artificial Neural Networks, whose network operations are different from those of SNNs. These limitations hinder SNNs from reaching their full potential in accuracy and efficiency. To address this, we propose HASNAS, a novel hardware-aware spiking neural architecture search (NAS) framework for neuromorphic CIM systems that finds an SNN that offers high accuracy under the given memory, area, latency, and energy constraints. To achieve this, HASNAS employs the following key steps: (1) optimizing SNN operations to achieve high accuracy, (2) developing an SNN architecture that facilitates an effective learning process, and (3) devising a systematic hardware-aware search algorithm to meet the constraints. The experimental results show that HASNAS quickly finds an SNN that maintains high accuracy relative to the state-of-the-art, with up to 11x speed-up, and meets the given constraints: 4x10^6 parameters of memory, 100 mm^2 of area, 400 ms of latency, and 120 uJ of energy consumption for CIFAR10 and CIFAR100, while the state-of-the-art fails to meet the constraints. In this manner, HASNAS can enable efficient design automation for providing high-performance and energy-efficient neuromorphic CIM systems for diverse applications.

[LG-138] Tarsier: Recipes for Training and Evaluating Large Video Description Models

Link: https://arxiv.org/abs/2407.00634
Authors: Jiawei Wang, Liping Yuan, Yuchen Zhang
Keywords: Generating fine-grained video, Generating fine-grained, fundamental challenge, video, Generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a +51.4% advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a +12.3% advantage against GPT-4V and a -6.7% disadvantage against Gemini 1.5 Pro. Besides video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is the introduction of a new benchmark for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at this https URL.

[LG-139] TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets

Link: https://arxiv.org/abs/2407.00631
Authors: Jintai Chen, Yaojun Hu, Yue Wang, Yingzhou Lu, Xu Cao, Miao Lin, Hongxia Xu, Jian Wu, Cao Xiao, Jimeng Sun, Lucas Glass, Kexin Huang, Marinka Zitnik, Tianfan Fu
Keywords: waste immense efforts, immense efforts spanning, clinical trial design, clinical trial, trial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Clinical trials are pivotal for developing new medical treatments, yet they typically pose some risks such as patient mortality, adverse events, and enrollment failure that waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to forecast or simulate key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition requiring medical expertise and a deep understanding of trial designs have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design, encompassing prediction of trial duration, patient dropout rate, serious adverse event, mortality rate, trial approval outcome, trial failure reason, drug dose finding, and design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development. The curated dataset, metrics, and basic models are publicly available at this https URL.

[LG-140] Maximum Entropy Inverse Reinforcement Learning of Diffusion Models with Energy-Based Models

Link: https://arxiv.org/abs/2407.00626
Authors: Sangwoong Yoon, Himchan Hwang, Dohyun Kwon, Yung-Kyun Noh, Frank C. Park
Keywords: Maximum Entropy IRL, diffusion generative models, diffusion model, maximum entropy inverse, diffusion
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code is released at this https URL

Abstract:We present a maximum entropy inverse reinforcement learning (IRL) approach for improving the sample quality of diffusion generative models, especially when the number of generation time steps is small. Similar to how IRL trains a policy based on the reward function learned from expert demonstrations, we train (or fine-tune) a diffusion model using the log probability density estimated from training data. Since we employ an energy-based model (EBM) to represent the log density, our approach boils down to the joint training of a diffusion model and an EBM. Our IRL formulation, named Diffusion by Maximum Entropy IRL (DxMI), is a minimax problem that reaches equilibrium when both models converge to the data distribution. The entropy maximization plays a key role in DxMI, facilitating the exploration of the diffusion model and ensuring the convergence of the EBM. We also propose Diffusion by Dynamic Programming (DxDP), a novel reinforcement learning algorithm for diffusion models, as a subroutine in DxMI. DxDP makes the diffusion model update in DxMI efficient by transforming the original problem into an optimal control formulation where value functions replace back-propagation in time. Our empirical studies show that diffusion models fine-tuned using DxMI can generate high-quality samples in as few as 4 and 10 steps. Additionally, DxMI enables the training of an EBM without MCMC, stabilizing EBM training dynamics and enhancing anomaly detection performance.

[LG-141] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Link: https://arxiv.org/abs/2407.00617
Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
Keywords: achieved great success, aligning large language, large language models, Human Feedback, Reinforcement Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)

Abstract:Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
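
The limitation of the Bradley-Terry (BT) assumption mentioned in the abstract can be illustrated in a few lines. Under BT, the probability that response a is preferred over response b is a logistic function of the difference of two scalar rewards, so preferences are always transitive. The sketch below is an illustration and not code from the paper; the cyclic preference table and all names are made up for the example.

```python
import math

def bradley_terry_prob(reward_a, reward_b):
    """P(a preferred over b) under Bradley-Terry with scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# A cyclic ("rock-paper-scissors") preference: each option beats the next
# one with probability 0.8. These numbers are hypothetical.
cyclic_pref = {("a", "b"): 0.8, ("b", "c"): 0.8, ("c", "a"): 0.8}

def bt_agrees_with_cycle(rewards):
    """True only if BT puts every pair in the cycle above 0.5."""
    return all(
        bradley_terry_prob(rewards[x], rewards[y]) > 0.5
        for (x, y) in cyclic_pref
    )
```

Because BT gives a probability above 0.5 exactly when the first reward is larger, no assignment of scalar rewards can satisfy all three pairs of the cycle at once. A general preference framework, as explored in the paper, does not impose this restriction.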

[LG-142] DADEE: Well-calibrated uncertainty quantification in neural networks for barriers-based robot safety

Link: https://arxiv.org/abs/2407.00616
Authors: Masoud Ataei, Vikas Dhiman
Keywords: Control Barrier Functions, safety critical applications, Uncertainty-aware controllers, safety critical, Control Barrier
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Uncertainty-aware controllers that guarantee safety are critical for safety critical applications. Among such controllers, Control Barrier Functions (CBFs) based approaches are popular because they are fast, yet safe. However, most such works depend on Gaussian Processes (GPs) or MC-Dropout for learning and uncertainty estimation, and both approaches come with drawbacks: GPs are non-parametric methods that are slow, while MC-Dropout does not capture aleatoric uncertainty. On the other hand, modern Bayesian learning algorithms have shown promise in uncertainty quantification. The application of modern Bayesian learning methods to CBF-based controllers has not yet been studied. We aim to fill this gap by surveying uncertainty quantification algorithms and evaluating them on CBF-based safe controllers. We find that model variance-based algorithms (for example, Deep ensembles, MC-dropout, etc.) and direct estimation-based algorithms (such as DEUP) have complementary strengths. Algorithms in the former category can only estimate uncertainty accurately out-of-domain, while those in the latter category can only do so in-domain. We combine the two approaches to obtain more accurate uncertainty estimates both in- and out-of-domain. As measured by the failure rate of a simulated robot, this results in a safer CBF-based robot controller.

[LG-143] GC-Bench: An Open and Unified Benchmark for Graph Condensation

Link: https://arxiv.org/abs/2407.00615
Authors: Qingyun Sun, Ziying Chen, Beining Yang, Cheng Ji, Xingcheng Fu, Sheng Zhou, Hao Peng, Jianxin Li, Philip S. Yu
Keywords: recently garnered considerable, garnered considerable attention, considerable attention due, reduce large-scale graph, Graph condensation
Categories: Machine Learning (cs.LG)
Comments: The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Preprint, under review)

Abstract:Graph condensation (GC) has recently garnered considerable attention due to its ability to reduce large-scale graph datasets while preserving their essential properties. The core concept of GC is to create a smaller, more manageable graph that retains the characteristics of the original graph. Despite the proliferation of graph condensation methods developed in recent years, there is no comprehensive evaluation and in-depth analysis, which creates a great obstacle to understanding the progress in this field. To fill this gap, we develop a comprehensive Graph Condensation Benchmark (GC-Bench) to analyze the performance of graph condensation in different scenarios systematically. Specifically, GC-Bench systematically investigates the characteristics of graph condensation in terms of the following dimensions: effectiveness, transferability, and complexity. We comprehensively evaluate 12 state-of-the-art graph condensation algorithms in node-level and graph-level tasks and analyze their performance in 12 diverse graph datasets. Further, we have developed an easy-to-use library for training and evaluating different GC methods to facilitate reproducible research. The GC-Bench library is available at this https URL.

[LG-144] A Linear Programming Enhanced Genetic Algorithm for Hyperparameter Tuning in Machine Learning

Link: https://arxiv.org/abs/2407.00613
Authors: Ankur Sinha, Paritosh Pankaj
Keywords: linear program, linear program enhancement, bilevel program, program, hyperparameter tuning problem
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 8 pages; this https URL

Abstract:In this paper, we formulate the hyperparameter tuning problem in machine learning as a bilevel program. The bilevel program is solved using a micro genetic algorithm that is enhanced with a linear program. While the genetic algorithm searches over discrete hyperparameters, the linear program enhancement allows hyper local search over continuous hyperparameters. The major contribution in this paper is the formulation of a linear program that supports fast search over continuous hyperparameters, and can be integrated with any hyperparameter search technique. It can also be applied directly on any trained machine learning or deep learning model for the purpose of fine-tuning. We test the performance of the proposed approach on two datasets, MNIST and CIFAR-10. Our results clearly demonstrate that using the linear program enhancement offers significant promise when incorporated with any population-based approach for hyperparameter tuning.

[LG-145] Diff-BBO: Diffusion-Based Inverse Modeling for Black-Box Optimization

Link: https://arxiv.org/abs/2407.00610
Authors: Dongxia Wu, Nikki Lijing Kuang, Ruijia Niu, Yi-An Ma, Rose Yu
Keywords: aims to optimize, iteratively querying, objective function, function, black-box oracle
Categories: Machine Learning (cs.LG)

Abstract:Black-box optimization (BBO) aims to optimize an objective function by iteratively querying a black-box oracle. This process demands sample-efficient optimization due to the high computational cost of function evaluations. While prior studies focus on forward approaches to learn surrogates for the unknown objective function, they struggle with high-dimensional inputs where valid inputs form a small subspace (e.g., valid protein sequences), which is common in real-world tasks. Recently, diffusion models have demonstrated impressive capability in learning the high-dimensional data manifold. They have shown promising performance in black-box optimization tasks but only in offline settings. In this work, we propose diffusion-based inverse modeling for black-box optimization (Diff-BBO), the first inverse approach leveraging diffusion models for online BBO problems. Diff-BBO distinguishes itself from forward approaches through the design of its acquisition function. Instead of proposing candidates in the design space, Diff-BBO employs a novel acquisition function, Uncertainty-aware Exploration (UaE), to propose objective function values, which leverages the uncertainty of a conditional diffusion model to generate samples in the design space. Theoretically, we prove that using UaE leads to optimal optimization outcomes. Empirically, we redesign experiments on the Design-Bench benchmark for online settings and show that Diff-BBO achieves state-of-the-art performance.

[LG-146] ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Link: https://arxiv.org/abs/2407.00609
Authors: Quang P.M. Pham, Khoi T.N. Nguyen, Lan C. Ngo, Truong Do, Truong Son Hy
Keywords: understanding tasks due, explicit nature, scene understanding tasks, tasks due, compact and explicit
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.

[LG-147] Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules

Link: https://arxiv.org/abs/2407.00599
Authors: Xinglin Pan, Wenxiang Lin, Shaohuai Shi, Xiaowen Chu, Weinong Sun, Bo Li
Keywords: found practical applications, large-scale foundation models, found practical, practical applications, applications in enlarging
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Sparsely-activated Mixture-of-Expert (MoE) layers have found practical applications in enlarging the model size of large-scale foundation models, with only a sub-linear increase in computation demands. Despite the wide adoption of hybrid parallel paradigms like model parallelism, expert parallelism, and expert-sharding parallelism (i.e., MP+EP+ESP) to support MoE model training on GPU clusters, the training efficiency is hindered by communication costs introduced by these parallel paradigms. To address this limitation, we propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks. The proposed schedules eliminate redundant computations and communications and enable overlaps between intra-node and inter-node communications, ultimately reducing the overall training time. As the two schedules are not mutually exclusive, we provide comprehensive theoretical analyses and derive an automatic and accurate solution to determine which schedule should be applied in different scenarios. Experimental results on an 8-GPU server and a 32-GPU cluster demonstrate that Parm outperforms the state-of-the-art MoE training system, DeepSpeed-MoE, achieving 1.13× to 5.77× speedup on 1296 manually configured MoE layers and approximately 3× improvement on two real-world MoE models based on BERT and GPT-2.

[LG-148] Hyperparameter Optimization for Randomized Algorithms: A Case Study for Random Features

Link: https://arxiv.org/abs/2407.00584
Authors: Oliver R. A. Dunbar, Nicholas H. Nelsen, Maya Mutic
Keywords: reduce computational complexity, algorithms exploit stochasticity, computational complexity, exploit stochasticity, stochasticity to reduce
Categories: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)

Abstract:Randomized algorithms exploit stochasticity to reduce computational complexity. One important example is random feature regression (RFR) that accelerates Gaussian process regression (GPR). RFR approximates an unknown function with a random neural network whose hidden weights and biases are sampled from a probability distribution. Only the final output layer is fit to data. In randomized algorithms like RFR, the hyperparameters that characterize the sampling distribution greatly impact performance, yet are not directly accessible from samples. This makes optimization of hyperparameters via standard (gradient-based) optimization tools inapplicable. Inspired by Bayesian ideas from GPR, this paper introduces a random objective function that is tailored for hyperparameter tuning of vector-valued random features. The objective is minimized with ensemble Kalman inversion (EKI). EKI is a gradient-free particle-based optimizer that is scalable to high-dimensions and robust to randomness in objective functions. A numerical study showcases the new black-box methodology to learn hyperparameter distributions in several problems that are sensitive to the hyperparameter selection: two global sensitivity analyses, integrating a chaotic dynamical system, and solving a Bayesian inverse problem from atmospheric dynamics. The success of the proposed EKI-based algorithm for RFR suggests its potential for automated optimization of hyperparameters arising in other randomized algorithms.
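
The structure of random feature regression described in the abstract (hidden weights sampled from a distribution, only the output layer fit to data) can be sketched with standard random Fourier features. This is a minimal illustration, not the paper's method: it assumes a Gaussian sampling distribution whose single hyperparameter is a `lengthscale`, fits the output layer by ridge regression, and does not include the EKI-based hyperparameter tuning. All names are chosen for the example.

```python
import numpy as np

def fit_random_feature_regression(x, y, n_features, lengthscale, ridge, seed=0):
    """Fit y ~ f(x) with random Fourier features; returns a predictor."""
    rng = np.random.default_rng(seed)
    dim = x.shape[1]
    # Hidden weights and biases are sampled once and never trained.
    # The lengthscale is a hyperparameter of the sampling distribution.
    weights = rng.normal(0.0, 1.0 / lengthscale, size=(dim, n_features))
    biases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def features(z):
        return np.sqrt(2.0 / n_features) * np.cos(z @ weights + biases)

    # Only the final linear layer is fit to data (ridge regression).
    phi = features(x)
    gram = phi.T @ phi + ridge * np.eye(n_features)
    coef = np.linalg.solve(gram, phi.T @ y)
    return lambda z: features(z) @ coef

# Toy usage: learn sin(x) on [-3, 3].
x_train = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y_train = np.sin(x_train[:, 0])
predict = fit_random_feature_regression(
    x_train, y_train, n_features=100, lengthscale=1.0, ridge=1e-4
)
```

In this sketch the fit quality depends on `lengthscale`, yet the value only enters through the random draw of `weights`. That is the difficulty the abstract points to when it says such hyperparameters are not directly accessible from samples.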

[LG-149] Learning to Control Unknown Strongly Monotone Games

Link: https://arxiv.org/abs/2407.00575
Authors: Siddharth Chandak, Ilai Bistritz, Nicholas Bambos
Keywords: linear constraints, dimensional linear constraints, linear, algorithm, constraints
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to IEEE Transactions on Automatic Control

Abstract:Consider N players each with a d-dimensional action set. Each of the players' utility functions includes their reward function and a linear term for each dimension, with coefficients that are controlled by the manager. We assume that the game is strongly monotone, so if each player runs gradient descent, the dynamics converge to a unique Nash equilibrium (NE). The NE is typically inefficient in terms of global performance. The resulting global performance of the system can be improved by imposing K-dimensional linear constraints on the NE. We therefore want the manager to pick the controlled coefficients that impose the desired constraint on the NE. However, this requires knowing the players' reward functions and their action sets. Obtaining this game structure information is infeasible in a large-scale network and violates the users' privacy. To overcome this, we propose a simple algorithm that learns to shift the NE of the game to meet the linear constraints by adjusting the controlled coefficients online. Our algorithm only requires the linear constraints violation as feedback and does not need to know the reward functions or the action sets. We prove that our algorithm, which is based on two time-scale stochastic approximation, guarantees convergence with probability 1 to the set of NE that meet target linear constraints. We then provide a mean square convergence rate of O(t^{-1/4}) for our algorithm. This is the first such bound for two time-scale stochastic approximation where the slower time-scale is a fixed point iteration with a non-expansive mapping. We demonstrate how our scheme can be applied to optimizing a global quadratic cost at NE and load balancing in resource allocation games. We provide simulations of our algorithm for these scenarios.
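
The two time-scale idea in the abstract can be illustrated with a toy simulation. The sketch below is a deterministic, noise-free simplification and not the paper's algorithm: it assumes three players with scalar, unconstrained actions, a hypothetical quadratic utility, a single controlled coefficient `alpha` shared by all players, and one linear constraint on the sum of actions. The manager never sees the reward parameters, only the constraint violation.

```python
def simulate_manager_control(target_sum, n_steps=5000,
                             player_step=0.1, manager_step=0.01):
    base_rewards = [1.0, 2.0, 3.0]  # hypothetical reward coefficients b_i
    coupling = 0.2                  # hypothetical interaction strength
    actions = [0.0, 0.0, 0.0]
    alpha = 0.0                     # coefficient controlled by the manager
    for _ in range(n_steps):
        total = sum(actions)
        # Fast time scale: each player takes a gradient step on
        # u_i = -x_i^2 + (b_i + alpha) * x_i - coupling * x_i * sum_{j != i} x_j
        actions = [
            x + player_step * (-2.0 * x + b + alpha - coupling * (total - x))
            for x, b in zip(actions, base_rewards)
        ]
        # Slow time scale: the manager observes only the constraint violation
        # and nudges the controlled coefficient against it.
        violation = sum(actions) - target_sum
        alpha -= manager_step * violation
    return actions, alpha
```

With these made-up parameters the game is strongly monotone, the players settle quickly for any fixed `alpha`, and the slower manager update drives the equilibrium toward the target sum. The paper's setting additionally includes stochasticity, d-dimensional action sets, and K-dimensional constraints.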

[LG-150] Adversarial Online Learning with Temporal Feedback Graphs

Link: https://arxiv.org/abs/2407.00571
Authors: Khashayar Gatmiry, Jon Schneider
Keywords: learner action, visible at time, study a variant, variant of prediction, prediction with expert
Categories: Machine Learning (cs.LG)

Abstract:We study a variant of prediction with expert advice where the learner’s action at round t is only allowed to depend on losses on a specific subset of the rounds (where the structure of which rounds’ losses are visible at time t is provided by a directed “feedback graph” known to the learner). We present a novel learning algorithm for this setting based on a strategy of partitioning the losses across sub-cliques of this graph. We complement this with a lower bound that is tight in many practical settings, and which we conjecture to be within a constant factor of optimal. For the important class of transitive feedback graphs, we prove that this algorithm is efficiently implementable and obtains the optimal regret bound (up to a universal constant).
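
To make the setting concrete, the sketch below shows a naive exponential-weights (Hedge) learner whose distribution at a given round is computed only from the rounds the feedback graph makes visible. This is a simple baseline for illustration, not the sub-clique partitioning algorithm proposed in the paper; the function name, the learning rate `eta`, and the data layout are assumptions.

```python
import math

def hedge_weights(loss_history, visible_rounds, n_experts, eta=0.5):
    """Distribution over experts using only losses from visible rounds."""
    cumulative = [0.0] * n_experts
    for s in visible_rounds:
        for i in range(n_experts):
            cumulative[i] += loss_history[s][i]
    # Exponential weights on the visible cumulative losses.
    unnormalized = [math.exp(-eta * c) for c in cumulative]
    normalizer = sum(unnormalized)
    return [w / normalizer for w in unnormalized]
```

When `visible_rounds` contains every past round this reduces to standard Hedge; when it is empty the learner falls back to the uniform distribution.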

[LG-151] Divide And Conquer: Learning Chaotic Dynamical Systems With Multistep Penalty Neural Ordinary Differential Equations

Link: https://arxiv.org/abs/2407.00568
Authors: Dibyajyoti Chakraborty,Seung Whan Chung,Romit Maulik
Keywords: Forecasting high-dimensional dynamical, Neural Ordinary Differential, high-dimensional dynamical systems, chaotic dynamical systems, Ordinary Differential Equations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 Figures, submitted to Journal of Computational Physics

Abstract:Forecasting high-dimensional dynamical systems is a fundamental challenge in various fields, such as the geosciences and engineering. Neural Ordinary Differential Equations (NODEs), which combine the power of neural networks and numerical solvers, have emerged as a promising algorithm for forecasting complex nonlinear dynamical systems. However, classical techniques used for NODE training are ineffective for learning chaotic dynamical systems. In this work, we propose a novel NODE-training approach that allows for robust learning of chaotic dynamical systems. Our method addresses the challenges of non-convexity and exploding gradients associated with underlying chaotic dynamics. Training data trajectories from such systems are split into multiple, non-overlapping time windows. In addition to the deviation from the training data, the optimization loss term further penalizes the discontinuities of the predicted trajectory between the time windows. The window size is selected based on the fastest Lyapunov time scale of the system. The multi-step penalty (MP) method is first demonstrated on the Lorenz equation to illustrate how it improves the loss landscape, thereby accelerating optimization convergence. The MP method can optimize chaotic systems in a manner similar to least-squares shadowing with significantly lower computational costs. Our proposed algorithm, denoted the Multistep Penalty NODE (MP-NODE), is applied to chaotic systems such as the Kuramoto-Sivashinsky equation and the two-dimensional Kolmogorov flow. It is observed that MP-NODE provides viable performance for such chaotic systems, not only for short-term trajectory predictions but also for invariant statistics that are hallmarks of the chaotic nature of these dynamics.
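
The windowed loss described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar state, the forward-Euler integrator, the penalty weight `mu`, and the names `euler_rollout` and `multistep_penalty_loss` are all hypothetical choices made for the example.

```python
def euler_rollout(f, x0, dt, steps):
    """Integrate dx/dt = f(x) with forward Euler; returns steps + 1 states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return xs


def multistep_penalty_loss(f, data, window_starts, window_len, dt, mu):
    """Data mismatch inside each window plus a penalty on jumps between windows.

    data: observed scalar trajectory (assumed length: n_windows * window_len + 1)
    window_starts: one trainable initial state per non-overlapping window
    mu: weight of the discontinuity penalty (an assumed hyperparameter)
    """
    data_loss = 0.0
    penalty = 0.0
    prev_end = None
    for k, x0 in enumerate(window_starts):
        start = k * window_len
        target = data[start:start + window_len + 1]
        pred = euler_rollout(f, x0, dt, len(target) - 1)
        # Deviation of the predicted window from the training data.
        data_loss += sum((p - y) ** 2 for p, y in zip(pred, target))
        # Discontinuity between the end of the previous window and this start.
        if prev_end is not None:
            penalty += (x0 - prev_end) ** 2
        prev_end = pred[-1]
    return data_loss + mu * penalty
```

In this sketch, setting `mu = 0` decouples the windows entirely, while larger values of `mu` push the window start states toward a single continuous trajectory.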

[LG-152] A Contextual Combinatorial Bandit Approach to Negotiation

Link: https://arxiv.org/abs/2407.00567
Authors: Yexin Li,Zhancun Mu,Siyuan Qi
Keywords: Learning effective negotiation, Learning effective, effective negotiation strategies, negotiation strategies poses, large action spaces
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Learning effective negotiation strategies poses two key challenges: the exploration-exploitation dilemma and dealing with large action spaces. However, there is an absence of learning-based approaches that effectively address these challenges in negotiation. This paper introduces a comprehensive formulation to tackle various negotiation problems. Our approach leverages contextual combinatorial multi-armed bandits, with the bandits resolving the exploration-exploitation dilemma and the combinatorial nature handling large action spaces. Building upon this formulation, we introduce NegUCB, a novel method that also handles common issues such as partial observations and complex reward functions in negotiation. NegUCB is contextual and tailored for full-bandit feedback without constraints on the reward functions. Under mild assumptions, it ensures a sub-linear regret upper bound. Experiments conducted on three negotiation tasks demonstrate the superiority of our approach.

[LG-153] Cooperative Advisory Residual Policies for Congestion Mitigation

Link: https://arxiv.org/abs/2407.00553
Authors: Aamir Hasan,Neeloy Chakraborty,Haonan Chen,Jung-Hoon Cho,Cathy Wu,Katherine Driggs-Campbell
Keywords: autonomous vehicle fleets, simple actions, improving many socioeconomic, socioeconomic factors, mitigate traffic congestion
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fleets of autonomous vehicles can mitigate traffic congestion through simple actions, thus improving many socioeconomic factors such as commute time and gas costs. However, these approaches are limited in practice as they assume precise control over autonomous vehicle fleets, incur extensive installation costs for a centralized sensor ecosystem, and also fail to account for uncertainty in driver behavior. To this end, we develop a class of learned residual policies that can be used in cooperative advisory systems and only require the use of a single vehicle with a human driver. Our policies advise drivers to behave in ways that mitigate traffic congestion while accounting for diverse driver behaviors, particularly drivers’ reactions to instructions, to provide an improved user experience. To realize such policies, we introduce an improved reward function that explicitly addresses congestion mitigation and driver attitudes to advice. We show that our residual policies can be personalized by conditioning them on an inferred driver trait that is learned in an unsupervised manner with a variational autoencoder. Our policies are trained in simulation with our novel instruction adherence driver model, and evaluated in simulation and through a user study (N=16) to capture the sentiments of human drivers. Our results show that our approaches successfully mitigate congestion while adapting to different driver behaviors, with up to 20% and 40% improvement as measured by a combination metric of speed and deviations in speed across time over baselines in our simulation tests and user study, respectively. Our user study further shows that our policies are human-compatible and personalize to drivers.

[LG-154] Detecting and Identifying Selection Structure in Sequential Data

Link: https://arxiv.org/abs/2407.00529
Authors: Yujia Zheng,Zeyu Tang,Yiwen Qiu,Bernhard Schölkopf,Kun Zhang
Keywords: data points based, practical situations, selective inclusion, points based, objectives is common
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: ICML 2024

Abstract:We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences. Since this selection process often distorts statistical analysis, previous work primarily views it as a bias to be corrected and proposes various methods to mitigate its effect. However, while controlling this bias is crucial, selection also offers an opportunity to provide a deeper insight into the hidden generation process, as it is a fundamental mechanism underlying what we observe. In particular, overlooking selection in sequential data can lead to an incomplete or overcomplicated inductive bias in modeling, such as assuming a universal autoregressive structure for all dependencies. Therefore, rather than merely viewing it as a bias, we explore the causal structure of selection in sequential data to delve deeper into the complete causal process. Specifically, we show that selection structure is identifiable without any parametric assumptions or interventional experiments. Moreover, even in cases where selection variables coexist with latent confounders, we still establish the nonparametric identifiability under appropriate structural conditions. Meanwhile, we also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies. The framework has been validated empirically on both synthetic data and real-world music.

[LG-155] Real-Time Energy Measurement for Non-Intrusive Well-Being Monitoring of Elderly People – a Case Study

Link: https://arxiv.org/abs/2407.00524
Authors: Mateusz Brzozowski,Artur Janicki
Keywords: case study demonstrating, article presents, presents a case, elderly people, Abstract
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 6 pages, 4 figures

Abstract:This article presents a case study demonstrating a non-intrusive method for the well-being monitoring of elderly people. It is based on our real-time energy measurement system, which uses tiny beacons attached to electricity meters. Four participants aged 67-82 years took part in our study. We observed their electric power consumption for approx. a month, and then we analyzed them, taking into account the participants’ notes on their activities. We created typical daily usage profiles for each participant and used anomaly detection to find unusual energy consumption. We found out that real-time energy measurement can give significant insight into someone’s daily activities and, consequently, bring invaluable information to caregivers about the well-being of an elderly person, while being discreet and entirely non-intrusive.
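
The idea of building a typical daily usage profile and flagging unusual consumption can be sketched as below. The abstract does not say which anomaly detector the authors used, so this example assumes a simple per-hour z-score test; the function names, the hourly resolution, and the threshold of 3 standard deviations are all hypothetical.

```python
from statistics import mean, pstdev


def build_hourly_profile(days):
    """days: list of 24-element lists of hourly energy readings (e.g. in Wh).

    Returns a (mean, standard deviation) pair for each hour of the day.
    """
    profile = []
    for hour in range(24):
        values = [day[hour] for day in days]
        profile.append((mean(values), pstdev(values)))
    return profile


def find_anomalous_hours(profile, day, z_threshold=3.0):
    """Return the hours whose reading deviates strongly from the profile."""
    flagged = []
    for hour, reading in enumerate(day):
        mu, sigma = profile[hour]
        if sigma == 0:
            # No variation was observed in the history: any change is unusual.
            if reading != mu:
                flagged.append(hour)
        elif abs(reading - mu) / sigma > z_threshold:
            flagged.append(hour)
    return flagged
```

A caregiver-facing tool might call `find_anomalous_hours` once per day and report only the flagged hours, which keeps the monitoring discreet.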

[LG-156] A Medical Low-Back Pain Physical Rehabilitation Dataset for Human Body Movement Analysis

Link: https://arxiv.org/abs/2407.00521
Authors: Sao Mai Nguyen,Maxime Devanne,Olivier Remy-Neris,Mathieu Lempereur,André Thepaut
Keywords: showing encouraging results, non-medical applications, limited use contexts, automatic monitoring, monitoring and coaching
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While automatic monitoring and coaching of exercises are showing encouraging results in non-medical applications, they still have limitations such as errors and limited use contexts. To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we identify in this article four challenges to address and propose a medical dataset of clinical patients carrying out low-back pain rehabilitation exercises. The dataset includes 3D Kinect skeleton positions and orientations, RGB videos, 2D skeleton data, and medical annotations to assess the correctness, and error classification and localisation of body part and timespan. Alongside this dataset, we perform a complete research path, from data collection to processing, and finally a small benchmark. We evaluated on the dataset two baseline movement recognition algorithms, pertaining to two different approaches: the probabilistic approach with a Gaussian Mixture Model (GMM), and the deep learning approach with a Long Short-Term Memory (LSTM). This dataset is valuable because it includes rehabilitation-relevant motions in a clinical setting with patients in their rehabilitation program, using a cost-effective, portable, and convenient sensor, and because it shows the potential for improvement on these challenges. ACM classes: I.5.4; I.4.8. Journal reference: IJCNN 2024.

[LG-157] Stochastic stem bucking using mixture density neural networks

Link: https://arxiv.org/abs/2407.00510
Authors: Simon Schmiedel
Keywords: Poor bucking decisions, bucking decisions made, bucking decisions, Poor bucking, bucking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Poor bucking decisions made by forest harvesters can have a negative effect on the products that are generated from the logs. Making the right bucking decisions is not an easy task because harvesters must rely on predictions of the stem profile for the part of the stems that is not yet measured. The goal of this project is to improve the bucking decisions made by forest harvesters with a stochastic bucking method. We developed a Long Short-Term Memory (LSTM) neural network that predicted the parameters of a Gaussian distribution conditioned on the known part of the stem, enabling the creation of multiple samples of stem profile predictions for the unknown part of the stem. The bucking decisions could then be optimized using a novel stochastic bucking algorithm which used all the stem profiles generated to choose the logs to generate from the stem. The stochastic bucking algorithm was compared to two benchmark models: A polynomial model that could not condition its predictions on more than one diameter measurement, and a deterministic LSTM neural network. All models were evaluated on stem profiles of four coniferous species prevalent in eastern Canada. In general, the best bucking decisions were taken by the stochastic LSTM models, demonstrating the usefulness of the method. The second-best results were mostly obtained by the deterministic LSTM model and the worst results by the polynomial model, corroborating the usefulness of conditioning the stem curve predictions on multiple measurements.

[LG-158] ShapG: new feature importance method based on the Shapley value

Link: https://arxiv.org/abs/2407.00506
Authors: Chi Zhao,Jing Liu,Elena Parilina
Keywords: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, make decisions, Intelligence
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:With the wide application of Artificial Intelligence (AI), it has become particularly important to make the decisions of AI systems explainable and transparent. In this paper, we propose a new Explainable Artificial Intelligence (XAI) method called ShapG (Explanations based on Shapley value for Graphs) for measuring feature importance. ShapG is a model-agnostic global explanation method. In the first stage, it defines an undirected graph based on the dataset, where nodes represent features and edges are added based on the correlation coefficients between features. In the second stage, it calculates an approximated Shapley value by sampling the data, taking into account this graph structure. The sampling approach of ShapG makes it possible to calculate the importance of features efficiently, i.e., to reduce computational complexity. Comparison of ShapG with other existing XAI methods shows that it provides more accurate explanations for the two examined datasets. We also compared the running time of ShapG with that of other XAI methods based on cooperative game theory, and the results show that ShapG has a clear advantage in running time, which further demonstrates its efficiency. In addition, extensive experiments demonstrate a wide range of applicability of the ShapG method for explaining complex models. We find ShapG an important tool for improving the explainability and transparency of AI systems and believe it can be widely used in various fields.
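
For readers unfamiliar with sampled Shapley values, the sketch below shows the generic permutation-sampling estimator. It is not ShapG's graph-guided sampling, which the abstract does not specify in enough detail to reproduce; the function name `sampled_shapley` and the `value` callback are assumptions made for illustration.

```python
import random


def sampled_shapley(value, n_features, n_permutations=200, seed=0):
    """Monte Carlo Shapley estimate using random feature orderings.

    value: function mapping a frozenset of feature indices to a payoff
           (for example, a model's validation score using only those features).
    """
    rng = random.Random(seed)
    phi = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = set()
        previous = value(frozenset(coalition))
        for feature in order:
            # Marginal contribution of this feature given those already added.
            coalition.add(feature)
            current = value(frozenset(coalition))
            phi[feature] += current - previous
            previous = current
    return [total / n_permutations for total in phi]
```

Each permutation costs `n_features + 1` evaluations of `value`, so the total cost grows linearly with the number of sampled permutations instead of exponentially with the number of features.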

[LG-159] Deep Frequency Derivative Learning for Non-stationary Time Series Forecasting

Link: https://arxiv.org/abs/2407.00502
Authors: Wei Fan,Kun Yi,Hangting Ye,Zhiyuan Ning,Qi Zhang,Ning An
Keywords: time series, time series forecasting, inevitable for models, models to face, series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI 2024

Abstract:Because most time series are non-stationary, models inevitably face the distribution shift issue in time series forecasting. Existing solutions manipulate statistical measures (usually mean and std.) to adjust the time series distribution. However, these operations can be theoretically seen as a transformation towards the zero-frequency component of the spectrum, which cannot reveal full distribution information and further leads to an information utilization bottleneck in normalization, thus hindering forecasting performance. To address this problem, we propose to utilize the whole frequency spectrum to transform time series to make full use of data distribution from the frequency perspective. We present a deep frequency derivative learning framework, DERITS, for non-stationary time series forecasting. Specifically, DERITS is built upon a novel reversible transformation, namely Frequency Derivative Transformation (FDT), which derives signals in the frequency domain to acquire more stationary frequency representations. Then, we propose the Order-adaptive Fourier Convolution Network to conduct adaptive frequency filtering and learning. Furthermore, we organize DERITS as a parallel-stacked architecture for the multi-order derivation and fusion for forecasting. Finally, we conduct extensive experiments on several datasets, which show consistent superiority in both time series forecasting and shift alleviation.

[LG-160] Aeroengine performance prediction using a physical-embedded data-driven method

Link: https://arxiv.org/abs/2407.00501
Authors: Tong Mo,Shiran Dai,An Fu,Xiaomeng Zhu,Shuxiao Li
Keywords: Accurate and efficient, optimization endeavours, paramount importance, Accurate, efficient prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Accurate and efficient prediction of aeroengine performance is of paramount importance for engine design, maintenance, and optimization endeavours. However, existing methodologies often struggle to strike an optimal balance among predictive accuracy, computational efficiency, modelling complexity, and data dependency. To address these challenges, we propose a strategy that synergistically combines domain knowledge from both the aeroengine and neural network realms to enable real-time prediction of engine performance parameters. Leveraging aeroengine domain knowledge, we judiciously design the network structure and regulate the internal information flow. Concurrently, drawing upon neural network domain expertise, we devise four distinct feature fusion methods and introduce an innovative loss function formulation. To rigorously evaluate the effectiveness and robustness of our proposed strategy, we conduct comprehensive validation across two distinct datasets. The empirical results demonstrate: (1) the evident advantages of our tailored loss function; (2) our model’s ability to maintain equal or superior performance with a reduced parameter count; (3) our model’s reduced data dependency compared to generalized neural network architectures; (4) our model’s greater interpretability compared to traditional black-box machine learning methods.

[LG-161] Intrinsic PAPR for Point-level 3D Scene Albedo and Shading Editing

Link: https://arxiv.org/abs/2407.00500
Authors: Alireza Moazeni,Shichong Peng,Ke Li
Keywords: multi-view RGB images, RGB images, multi-view RGB, Intrinsic PAPR, neural rendering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in neural rendering have excelled at novel view synthesis from multi-view RGB images. However, they often lack the capability to edit the shading or colour of the scene at a detailed point-level, while ensuring consistency across different viewpoints. In this work, we address the challenge of point-level 3D scene albedo and shading editing from multi-view RGB images, focusing on detailed editing at the point-level rather than at a part or global level. While prior works based on volumetric representation such as NeRF struggle with achieving 3D consistent editing at the point level, recent advancements in point-based neural rendering show promise in overcoming this challenge. We introduce “Intrinsic PAPR”, a novel method based on the recent point-based neural rendering technique Proximity Attention Point Rendering (PAPR). Unlike other point-based methods that model the intrinsic decomposition of the scene, our approach does not rely on complicated shading models or simplistic priors that may not universally apply. Instead, we directly model scene decomposition into albedo and shading components, leading to better estimation accuracy. Comparative evaluations against the latest point-based inverse rendering methods demonstrate that Intrinsic PAPR achieves higher-quality novel view rendering and superior point-level albedo and shading editing.

[LG-162] ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

Link: https://arxiv.org/abs/2407.00499
Authors: Zhiyuan Wang,Jinhao Duan,Lu Cheng,Yue Zhang,Qingni Wang,Hengtao Shen,Xiaofeng Zhu,Xiaoshuang Shi,Kaidi Xu
Keywords: natural language generation, recent large language, large language models, open-ended NLG tasks, language generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, 6 tables

Abstract:Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of the recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model’s unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains, while the small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
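
As background for the conformal prediction step mentioned in this abstract, here is a minimal sketch of the standard split-conformal threshold. It is not the ConU criterion itself: the nonconformity scores are assumed to come from some uncertainty measure, and the names `conformal_threshold` and `prediction_set` are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Finite-sample quantile used by split conformal prediction.

    calibration_scores: nonconformity score of the correct answer for each
                        calibration example (higher means more uncertain).
    alpha: target error rate; under exchangeability the resulting sets
           contain the correct answer with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose score does not exceed the threshold."""
    return [answer for answer, score in candidate_scores.items()
            if score <= threshold]
```

For an LLM, `candidate_scores` could map each sampled answer to an uncertainty score; a smaller `alpha` yields a larger threshold and therefore larger prediction sets.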

[LG-163] A Two-stage Reinforcement Learning-based Approach for Multi-entity Task Allocation

Link: https://arxiv.org/abs/2407.00496
Authors: Aicheng Gong,Kai Yang,Jiafei Lyu,Xiu Li
Keywords: key combinatorial optimization, combinatorial optimization problem, crucial for modern, resource scheduling, key combinatorial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Task allocation is a key combinatorial optimization problem, crucial for modern applications such as multi-robot cooperation and resource scheduling. Decision makers must allocate entities to tasks reasonably across different scenarios. However, traditional methods assume static attributes and numbers of tasks and entities, often relying on dynamic programming and heuristic algorithms for solutions. In reality, task allocation resembles Markov decision processes, with dynamically changing task and entity attributes. Thus, algorithms must dynamically allocate tasks based on their states. To address this issue, we propose a two-stage task allocation algorithm based on similarity, utilizing reinforcement learning to learn allocation strategies. The proposed pre-assign strategy allows entities to preselect appropriate tasks, effectively avoiding local optima and thereby better finding the optimal allocation. We also introduce an attention mechanism and a hyperparameter network structure to adapt to the changing number and attributes of entities and tasks, enabling our network structure to generalize to new tasks. Experimental results across multiple environments demonstrate that our algorithm effectively addresses the challenges of dynamic task allocation in practical applications. Compared to heuristic algorithms like genetic algorithms, our reinforcement learning approach better solves dynamic allocation problems and achieves zero-shot generalization to new tasks with good performance. The code is available at this https URL.

[LG-164] A Bayesian Solution To The Imitation Gap

Link: https://arxiv.org/abs/2407.00495
Authors: Risto Vuorio,Mattie Fellows,Cong Lu,Clémence Grislain,Shimon Whiteson
Keywords: expert demonstrations, imitation gap, expert, real-world settings, act in environments
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In many real-world settings, an agent must learn to act in environments where no reward signal can be specified, but a set of expert demonstrations is available. Imitation learning (IL) is a popular framework for learning policies from such demonstrations. However, in some cases, differences in observability between the expert and the agent can give rise to an imitation gap such that the expert’s policy is not optimal for the agent and a naive application of IL can fail catastrophically. In particular, if the expert observes the Markov state and the agent does not, then the expert will not demonstrate the information-gathering behavior needed by the agent but not the expert. In this paper, we propose a Bayesian solution to the Imitation Gap (BIG), first using the expert demonstrations, together with a prior specifying the cost of exploratory behavior that is not demonstrated, to infer a posterior over rewards with Bayesian inverse reinforcement learning (IRL). BIG then uses the reward posterior to learn a Bayes-optimal policy. Our experiments show that BIG, unlike IL, allows the agent to explore at test time when presented with an imitation gap, whilst still learning to behave optimally using expert demonstrations when no such gap exists.

[LG-165] Graph Neural Networks Gone Hogwild

Link: https://arxiv.org/abs/2407.00494
Authors: Olga Solodova,Nick Richardson,Deniz Oktay,Ryan P. Adams
Keywords: Message passing graph, passing graph neural, generate catastrophically incorrect, catastrophically incorrect predictions, nodes update asynchronously
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Message passing graph neural networks (GNNs) would appear to be powerful tools to learn distributed algorithms via gradient descent, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excludes these architectures from many potential applications, such as learning local communication policies between resource-constrained agents in, e.g., robotic swarms or sensor networks. In this work we explore why this failure occurs in common GNN architectures, and identify “implicitly-defined” GNNs as a class of architectures which is provably robust to partially asynchronous “hogwild” inference, adapting convergence guarantees from work in asynchronous and distributed optimization, e.g., Bertsekas (1982); Niu et al. (2011). We then propose a novel implicitly-defined GNN architecture, which we call an energy GNN. We show that this architecture outperforms other GNNs from this class on a variety of synthetic tasks inspired by multi-agent systems, and achieves competitive performance on real-world datasets.

[LG-166] Fast Gibbs sampling for the local and global trend Bayesian exponential smoothing model

Link: https://arxiv.org/abs/2407.00492
Authors: Xueying Long,Daniel F. Schmidt,Christoph Bergmeir,Slawek Smyl
Keywords: Bayesian exponential smoothing, trend Bayesian exponential, global trend Bayesian, Smyl, exponential smoothing
Categories: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments:

Abstract:In Smyl et al. [Local and global trend Bayesian exponential smoothing models. International Journal of Forecasting, 2024.], a generalised exponential smoothing model was proposed that is able to capture strong trends and volatility in time series. This method achieved state-of-the-art performance in many forecasting tasks, but its fitting procedure, which is based on the NUTS sampler, is very computationally expensive. In this work, we propose several modifications to the original model, as well as a bespoke Gibbs sampler for posterior exploration; these changes improve sampling time by an order of magnitude, thus rendering the model much more practically relevant. The new model, and sampler, are evaluated on the M3 dataset and are shown to be competitive, or superior, in terms of accuracy to the original method, while being substantially faster to run.

[LG-167] Toward Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixture Models

Link: https://arxiv.org/abs/2407.00490
Authors: Weihang Xu,Maryam Fazel,Simon S. Du
Keywords: Gaussian Mixture Models, truth Gaussian distribution, ground truth Gaussian, single ground truth, Mixture Models
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 25 pages

Abstract:We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with n > 1 components learns from data that are generated by a single ground truth Gaussian distribution. While results for the special case of 2-Gaussian mixtures are well-known, a general global convergence analysis for arbitrary n remains unresolved and faces several new technical barriers since the convergence becomes sub-linear and non-monotonic. To address these challenges, we construct a novel likelihood-based convergence analysis framework and rigorously prove that gradient EM converges globally with a sublinear rate O(1/\sqrt{t}). This is the first global convergence result for Gaussian mixtures with more than 2 components. The sublinear convergence rate is due to the algorithmic nature of learning over-parameterized GMM with gradient EM. We also identify a new emerging technical challenge for learning general over-parameterized GMM: the existence of bad local regions that can trap gradient EM for an exponential number of steps.
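
To make the gradient EM update concrete, the sketch below performs one step for a one-dimensional mixture. The simplifications are assumptions made for illustration and may differ from the paper's exact setting: equal mixing weights, unit variances, scalar data, and a fixed `step_size`; the function names are hypothetical.

```python
import math


def responsibilities(x, means):
    """E-step for an equal-weight, unit-variance 1-D mixture.

    Returns the posterior probability of each component given x.
    """
    logits = [-0.5 * (x - m) ** 2 for m in means]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(weights)
    return [w / total for w in weights]


def gradient_em_step(data, means, step_size):
    """One gradient EM update: a gradient-ascent step on the EM surrogate Q."""
    grads = [0.0] * len(means)
    for x in data:
        resp = responsibilities(x, means)
        for k, m in enumerate(means):
            # Gradient of Q with respect to mean k, averaged over the data.
            grads[k] += resp[k] * (x - m) / len(data)
    return [m + step_size * g for m, g in zip(means, grads)]
```

Unlike standard EM, which maximizes the surrogate exactly in the M-step, this update takes only a single gradient step per iteration.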

[LG-168] Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition

Link: https://arxiv.org/abs/2407.00482
Authors: Barproda Halder,Faisal Hamman,Pasan Dissanayake,Qiuyi Zhang,Ilia Sucholutsky,Sanghamitra Dutta
Keywords: Spurious patterns refer, Partial Information Decomposition, unique information, causally related, patterns refer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Information Theory (cs.IT)
Comments: Accepted at ICML 2024 Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Abstract:Spurious patterns refer to a mathematical association between two or more variables in a dataset that are not causally related. However, this notion of spuriousness, which is usually introduced due to sampling biases in the dataset, has classically lacked a formal definition. To address this gap, this work presents the first information-theoretic formalization of spuriousness in a dataset (given a split of spurious and core features) using a mathematical framework called Partial Information Decomposition (PID). Specifically, we disentangle the joint information content that the spurious and core features share about another target variable (e.g., the prediction label) into distinct components, namely unique, redundant, and synergistic information. We propose the use of unique information, with roots in Blackwell Sufficiency, as a novel metric to formally quantify dataset spuriousness and derive its desirable properties. We empirically demonstrate how higher unique information in the spurious features in a dataset could lead a model into choosing the spurious features over the core features for inference, often having low worst-group-accuracy. We also propose a novel autoencoder-based estimator for computing unique information that is able to handle high-dimensional image data. Finally, we also show how this unique information in the spurious feature is reduced across several dataset-based spurious-pattern-mitigation techniques such as data reweighting and varying levels of background mixing, demonstrating a novel tradeoff between unique information (spuriousness) and worst-group-accuracy.

[LG-169] Knowledge-Aware Parsimony Learning: A Perspective from Relational Graphs

Link: https://arxiv.org/abs/2407.00478
Authors: Quanming Yao,Yongqi Zhang,Yaqing Wang,Nan Yin,James Kwok,Qiang Yang
Keywords: strategy that involves, involves the brute-force, training dataset, dataset and learnable, prevalent approach
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The scaling law, a strategy that involves the brute-force scaling of the training dataset and learnable parameters, has become a prevalent approach for developing stronger learning models. In this paper, we examine its rationale in terms of learning from relational graphs. We demonstrate that directly adhering to such a scaling law does not necessarily yield stronger models due to architectural incompatibility and representation bottlenecks. To tackle this challenge, we propose a novel framework for learning from relational graphs via knowledge-aware parsimony learning. Our method draws inspiration from the duality between data and knowledge inherent in these graphs. Specifically, we first extract knowledge (like symbolic logic and physical laws) during the learning process, and then apply combinatorial generalization to the task at hand. This extracted knowledge serves as the “building blocks” for achieving parsimony learning. By applying this philosophy to architecture, parameters, and inference, we can effectively achieve versatile, sample-efficient, and interpretable learning. Experimental results show that our proposed framework surpasses methods that strictly follow the traditional scaling-up roadmap. This highlights the importance of incorporating knowledge in the development of next-generation learning technologies.

[LG-170] MH-pFLGB: Model Heterogeneous personalized Federated Learning via Global Bypass for Medical Image Analysis

Link: https://arxiv.org/abs/2407.00474
Authors: Luyuan Xie,Manqing Lin,ChenMing Xu,Tianyu Luan,Zhipeng Zeng,Wenjun Qian,Cong Li,Yuejian Fang,Qingni Shen,Zhonghai Wu
Keywords: training data privacy, protect training data, medical artificial intelligence, federated learning, artificial intelligence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2405.06822

Click to view abstract

Abstract:In the evolving application of medical artificial intelligence, federated learning is notable for its ability to protect training data privacy. Federated learning facilitates collaborative model development without the need to share local data from healthcare institutions. Yet, the statistical and system heterogeneity among these institutions poses substantial challenges, which affects the effectiveness of federated learning and hampers the exchange of information between clients. To address these issues, we introduce a novel approach, MH-pFLGB, which employs a global bypass strategy to mitigate the reliance on public datasets and navigate the complexities of non-IID data distributions. Our method enhances traditional federated learning by integrating a global bypass model, which not only shares information among the clients but also serves as part of the network to enhance the performance on each client. Additionally, MH-pFLGB provides a feature fusion module to better combine the local and global features. We validate MH-pFLGB’s effectiveness and adaptability through extensive testing on different medical tasks, demonstrating superior performance compared to existing state-of-the-art methods.

[LG-171] VcLLM: Video Codecs are Secretly Tensor Codecs

Link: https://arxiv.org/abs/2407.00467
Authors: Ceyu Xu,Yongji Wu,Xinyu Yang,Beidi Chen,Matthew Lentz,Danyang Zhuo,Lisa Wu Wills
Keywords: continues to expand, high communication bandwidth, large memory footprint, footprint and high, high communication
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:As the parameter size of large language models (LLMs) continues to expand, the need for a large memory footprint and high communication bandwidth has become a significant bottleneck for the training and inference of LLMs. To mitigate this bottleneck, various tensor compression techniques have been proposed to reduce the data size, thereby alleviating memory requirements and communication pressure. Our research found that video codecs, despite being originally designed for compressing videos, show excellent efficiency when compressing various types of tensors. We demonstrate that video codecs can be versatile and general-purpose tensor codecs while achieving state-of-the-art compression efficiency in various tasks. We further make use of the hardware video encoding and decoding module available on GPUs to create a framework capable of both inference and training with video codecs repurposed as tensor codecs. This greatly reduces the requirement for memory capacity and communication bandwidth, enabling training and inference of large models on consumer-grade GPUs.

[LG-172] Characterizing Continual Learning Scenarios and Strategies for Audio Analysis

Link: https://arxiv.org/abs/2407.00465
Authors: Ruchi Bhatt,Pratibha Kumari,Dwarikanath Mahapatra,Abdulmotaleb El Saddik,Mukesh Saini
Keywords: Audio analysis, Audio, analysis, characterize continual learning, audio analysis approaches
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Audio analysis is useful in many application scenarios. The state-of-the-art audio analysis approaches assume that the data distribution at training and deployment time will be the same. However, due to various real-life environmental factors, the data may drift in its distribution or new classes may appear in the future. Thus, a one-time trained model might not perform adequately. In this paper, we characterize continual learning (CL) approaches in audio analysis, which are intended to tackle catastrophic forgetting arising due to drifts. As there is no CL dataset for audio analysis, we use DCASE 2020 to 2023 datasets to create various CL scenarios for audio-based monitoring tasks. We have investigated the following CL and non-CL approaches: EWC, LwF, SI, GEM, A-GEM, GDumb, Replay, Naive, cumulative, and joint training. The study is beneficial for researchers and practitioners working in the area of audio analysis for developing adaptive models. We observed that Replay achieved better results than other methods in the DCASE challenge data. It achieved an accuracy of 70.12% for the domain incremental scenario and an accuracy of 96.98% for the class incremental scenario.

[LG-173] Open-Source Conversational AI with SpeechBrain 1.0

Link: https://arxiv.org/abs/2407.00463
Authors: Mirco Ravanelli,Titouan Parcollet,Adel Moumen,Sylvain de Langen,Cem Subakan,Peter Plantinga,Yingzhi Wang,Pooneh Mousavi,Luca Della Libera,Artem Ploujnikov,Francesco Paissan,Davide Borra,Salah Zaiem,Zeyu Zhao,Shucong Zhang,Georgios Karakasidis,Sung-Lin Yeh,Aku Rouhe,Rudolf Braun,Florian Mai,Juan Zuluaga-Gomez,Seyed Mahed Mousavi,Andreas Nautsch,Xuechen Liu,Sangeet Sagar,Jarod Duret,Salima Mdhaffar,Gaelle Laperriere,Renato De Mori,Yannick Esteve
Keywords: http URL promotes, URL promotes transparency, open-source Conversational, http URL, URL promotes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Comments: Submitted to JMLR (Machine Learning Open Source Software)

Click to view abstract

Abstract:SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete “recipes” of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

[LG-174] PerSEval: Assessing Personalization in Text Summarizers

Link: https://arxiv.org/abs/2407.00453
Authors: Sourish Dasgupta,Ankush Chander,Parth Borad,Isha Motiyani,Tanmoy Chakraborty
Keywords: individuals’ subjective understanding, understanding of saliency, topics of attention, cater to individuals’, individuals’ subjective
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Personalized summarization models cater to individuals’ subjective understanding of saliency, as represented by their reading history and current topics of attention. Existing personalized text summarizers are primarily evaluated based on accuracy measures such as BLEU, ROUGE, and METEOR. However, a recent study argued that accuracy measures are inadequate for evaluating the degree of personalization of these models and proposed EGISES, the first metric to evaluate personalized text summaries. It was suggested that accuracy is a separate aspect and should be evaluated standalone. In this paper, we challenge the necessity of an accuracy leaderboard, suggesting that relying on accuracy-based aggregated results might lead to misleading conclusions. To support this, we delve deeper into EGISES, demonstrating both theoretically and empirically that it measures the degree of responsiveness, a necessary but not sufficient condition for degree-of-personalization. We subsequently propose PerSEval, a novel measure that satisfies the required sufficiency condition. Based on the benchmarking of ten SOTA summarization models on the PENS dataset, we empirically establish that – (i) PerSEval is reliable w.r.t. human-judgment correlation (Pearson’s r = 0.73; Spearman’s ρ = 0.62; Kendall’s τ = 0.42), (ii) PerSEval has high rank-stability, (iii) PerSEval as a rank-measure is not entailed by EGISES-based ranking, and (iv) PerSEval can be a standalone rank-measure without the need of any aggregated ranking.

[LG-175] KHNNs: hypercomplex neural networks computations via Keras using TensorFlow and PyTorch

Link: https://arxiv.org/abs/2407.00452
Authors: Agnieszka Niemczynowicz,Radosław Antoni Kycia
Keywords: real numbers perform, Neural networks, advanced algebras, algebras than real, real numbers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Neural networks used in computations with more advanced algebras than real numbers perform better in some applications. However, there is no general framework for constructing hypercomplex neural networks. We propose a library integrated with Keras that can do computations within TensorFlow and PyTorch. It provides Dense and Convolutional 1D, 2D, and 3D layers architectures.

[LG-176] Fully tensorial approach to hypercomplex neural networks

Link: https://arxiv.org/abs/2407.00449
Authors: Agnieszka Niemczynowicz,Radosław Antoni Kycia
Keywords: Fully tensorial theory, Fully tensorial, hypercomplex neural networks, theory of hypercomplex, Fully
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 10 pages

Click to view abstract

Abstract:Fully tensorial theory of hypercomplex neural networks is given. The key point is to observe that the algebra multiplication can be represented as a rank three tensor. This approach is attractive for neural network libraries that support effective tensorial operations.
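
The key observation in this abstract, that algebra multiplication can be written as a rank-three tensor, can be illustrated with quaternions. The sketch below is only an illustration under our own assumptions: the basis ordering (1, i, j, k), the function names, and the pure-Python contraction are ours and are not taken from the paper or its library.

```python
def quaternion_structure_tensor():
    """Return T where T[a][b][c] is the coefficient of e_c in e_a * e_b.

    Basis ordering (an assumption of this sketch): e_0 = 1, e_1 = i, e_2 = j, e_3 = k.
    """
    T = [[[0] * 4 for _ in range(4)] for _ in range(4)]
    # (a, b) -> (c, sign): e_a * e_b = sign * e_c
    table = {
        (0, 0): (0, 1), (0, 1): (1, 1), (0, 2): (2, 1), (0, 3): (3, 1),
        (1, 0): (1, 1), (1, 1): (0, -1), (1, 2): (3, 1), (1, 3): (2, -1),
        (2, 0): (2, 1), (2, 1): (3, -1), (2, 2): (0, -1), (2, 3): (1, 1),
        (3, 0): (3, 1), (3, 1): (2, 1), (3, 2): (1, -1), (3, 3): (0, -1),
    }
    for (a, b), (c, sign) in table.items():
        T[a][b][c] = sign
    return T


def multiply(x, y, T):
    """Algebra product as a tensor contraction: z_c = sum_{a,b} x_a * y_b * T[a][b][c]."""
    n = len(T)
    return [
        sum(x[a] * y[b] * T[a][b][c] for a in range(n) for b in range(n))
        for c in range(n)
    ]


T = quaternion_structure_tensor()
i, j = [0, 1, 0, 0], [0, 0, 1, 0]
print(multiply(i, j, T))  # coefficients of i * j in the basis (1, i, j, k)
```

Swapping in a different structure tensor changes the algebra without touching `multiply`, which is why this formulation maps well onto libraries with efficient tensor contractions.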

[LG-177] Time Series Clustering with General State Space Models via Stochastic Variational Inference

Link: https://arxiv.org/abs/2407.00429
Authors: Ryoichi Ishizuka,Takashi Imai,Kaoru Kawamoto
Keywords: model-based time series, time series, time series models, mixtures of general, time series clustering
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 4 figures

Click to view abstract

Abstract:In this paper, we propose a novel method of model-based time series clustering with mixtures of general state space models (MSSMs). Each component of MSSMs is associated with each cluster. An advantage of the proposed method is that it enables the use of time series models appropriate to the specific time series. This not only improves clustering and prediction accuracy but also enhances the interpretability of the estimated parameters. The parameters of the MSSMs are estimated using stochastic variational inference, a subtype of variational inference. The proposed method estimates the latent variables of an arbitrary state space model by using neural networks with a normalizing flow as a variational estimator. The number of clusters can be estimated using the Bayesian information criterion. In addition, to prevent MSSMs from converging to the local optimum, we propose several optimization tricks, including an additional penalty term called entropy annealing. Experiments on simulated datasets show that the proposed method is effective for clustering, parameter estimation, and estimating the number of clusters.

[LG-178] On the Complexity of Learning to Cooperate with Populations of Socially Rational Agents

Link: https://arxiv.org/abs/2407.00419
Authors: Robert Loftin,Saptarashmi Bandyopadhyay,Mustafa Mert Çelikok
Keywords: Artificially intelligent agents, Artificially intelligent, intelligent agents deployed, ability to reliably, real-world will require
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Artificially intelligent agents deployed in the real world will require the ability to reliably cooperate with humans (as well as other, heterogeneous AI agents). To provide formal guarantees of successful cooperation, we must make some assumptions about how partner agents could plausibly behave. Any realistic set of assumptions must account for the fact that other agents may be just as adaptable as our agent is. In this work, we consider the problem of cooperating with a population of agents in a finitely-repeated, two-player general-sum matrix game with private utilities. Two natural assumptions in such settings are that: 1) all agents in the population are individually rational learners, and 2) when any two members of the population are paired together, with high probability they will achieve at least the same utility as they would under some Pareto efficient equilibrium strategy. Our results first show that these assumptions alone are insufficient to ensure zero-shot cooperation with members of the target population. We therefore consider the problem of learning a strategy for cooperating with such a population using prior observations of its members interacting with one another. We provide upper and lower bounds on the number of samples needed to learn an effective cooperation strategy. Most importantly, we show that these bounds can be much stronger than those arising from a "naive" reduction of the problem to one of imitation learning.

[LG-179] eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey

Link: https://arxiv.org/abs/2407.00418
Authors: Krzysztof Nowak,Jędrzej Ziębura,Krzysztof Wróbel,Aleksander Smywiński-Pohl
Keywords: Medieval Latin texts, Polish Medieval Latin, automatic linguistic annotation, Medieval Latin, Latin texts
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models’ performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.

[LG-180] Explainability of Machine Learning Models under Missing Data

Link: https://arxiv.org/abs/2407.00411
Authors: Tuan L. Vo,Thu Nguyen,Hugo L. Hammer,Michael A. Riegler,Pal Halvorsen
Keywords: Explainable Artificial Intelligence, Missing data, significantly impair model, impair model performance, Shapley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Missing data is a prevalent issue that can significantly impair model performance and interpretability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on the calculation of Shapley values, a popular technique for interpreting complex machine learning models. We compare different imputation strategies and assess their impact on feature importance and interaction as determined by Shapley values. Moreover, we also theoretically analyze the effects of missing values on Shapley values. Importantly, our findings reveal that the choice of imputation method can introduce biases that could lead to changes in the Shapley values, thereby affecting the interpretability of the model. Moreover, a lower test prediction mean square error (MSE) may not imply a lower MSE in Shapley values and vice versa. Also, while XGBoost can handle missing data directly, using XGBoost directly on missing data can seriously affect interpretability compared to imputing the data before training XGBoost. This study provides a comprehensive evaluation of imputation methods in the context of model interpretation, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models.

[LG-181] PUZZLES: A Benchmark for Neural Algorithmic Reasoning

Link: https://arxiv.org/abs/2407.00401
Authors: Benjamin Estermann,Luca A. Lanzendörfer,Yannick Niedermayr,Roger Wattenhofer
Keywords: fundamental cognitive ability, decision-making processes, fundamental cognitive, cognitive ability, ability that plays
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Algorithmic reasoning is a fundamental cognitive ability that plays a pivotal role in problem-solving and decision-making processes. Reinforcement Learning (RL) has demonstrated remarkable proficiency in tasks such as motor control, handling perceptual input, and managing stochastic environments. These advancements have been enabled in part by the availability of benchmarks. In this work we introduce PUZZLES, a benchmark based on Simon Tatham’s Portable Puzzle Collection, aimed at fostering progress in algorithmic and logical reasoning in RL. PUZZLES contains 40 diverse logic puzzles of adjustable sizes and varying levels of complexity; many puzzles also feature a diverse set of additional configuration parameters. The 40 puzzles provide detailed information on the strengths and generalization capabilities of RL agents. Furthermore, we evaluate various RL algorithms on PUZZLES, providing baseline comparisons and demonstrating the potential for future research. All the software, including the environment, is available at this https URL.

[LG-182] Markovian Gaussian Process: A Universal State-Space Representation for Stationary Temporal Gaussian Process

Link: https://arxiv.org/abs/2407.00397
Authors: Weihan Li,Yule Wang,Chengrui Li,Anqi Wu
Keywords: Linear Dynamical Systems, system modeling tools, Dynamical Systems, dynamic system modeling, essential time series
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Gaussian Processes (GPs) and Linear Dynamical Systems (LDSs) are essential time series and dynamic system modeling tools. GPs can handle complex, nonlinear dynamics but are computationally demanding, while LDSs offer efficient computation but lack the expressive power of GPs. To combine their benefits, we introduce a universal method that allows an LDS to mirror stationary temporal GPs. This state-space representation, known as the Markovian Gaussian Process (Markovian GP), leverages the flexibility of kernel functions while maintaining efficient linear computation. Unlike existing GP-LDS conversion methods, which require separability for most multi-output kernels, our approach works universally for single- and multi-output stationary temporal kernels. We evaluate our method by computing covariance, performing regression tasks, and applying it to a neuroscience application, demonstrating that our method provides an accurate state-space representation for stationary temporal GPs.
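
To make the idea of an LDS mirroring a stationary temporal GP concrete, here is a sketch of the simplest textbook case: a GP with the exponential kernel k(τ) = σ² exp(−|τ|/ℓ), sampled on a regular grid, has the same covariance as a scalar linear-Gaussian state-space model. This is a well-known special case and not the universal construction proposed in the paper; all function names and parameter values are assumptions made for illustration.

```python
import math


def exp_kernel(tau, variance, lengthscale):
    """Exponential (Ornstein-Uhlenbeck) kernel k(tau) = variance * exp(-|tau| / lengthscale)."""
    return variance * math.exp(-abs(tau) / lengthscale)


def lds_params(dt, variance, lengthscale):
    """Scalar LDS x_{t+1} = a * x_t + w_t, w_t ~ N(0, q), matching the kernel at step dt."""
    a = math.exp(-dt / lengthscale)
    q = variance * (1.0 - a * a)
    return a, q


def lds_covariance(n_steps, a, q, p0):
    """Covariance matrix Cov(x_s, x_t) of the LDS when Var(x_0) = p0."""
    marginal_var = [p0]
    for _ in range(n_steps - 1):
        marginal_var.append(a * a * marginal_var[-1] + q)
    # For s <= t: Cov(x_s, x_t) = a^(t - s) * Var(x_s).
    return [
        [a ** abs(t - s) * marginal_var[min(s, t)] for t in range(n_steps)]
        for s in range(n_steps)
    ]


dt, variance, lengthscale = 0.5, 2.0, 1.5  # illustrative values
a, q = lds_params(dt, variance, lengthscale)
cov = lds_covariance(6, a, q, p0=variance)
max_gap = max(
    abs(cov[s][t] - exp_kernel((t - s) * dt, variance, lengthscale))
    for s in range(6)
    for t in range(6)
)
print(max_gap)  # difference between the LDS covariance and the kernel matrix
```

The LDS form lets filtering and smoothing run in time linear in the number of steps, which is the efficiency argument the abstract makes.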

[LG-183] FANFOLD: Graph Normalizing Flows-driven Asymmetric Network for Unsupervised Graph-Level Anomaly Detection

Link: https://arxiv.org/abs/2407.00383
Authors: Rui Cao,Shijie Xue,Jindong Li,Qi Wang,Yi Chang
Keywords: attracted increasing interest, Unsupervised graph-level anomaly, graph-level anomaly detection, increasing interest due, anomaly detection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Unsupervised graph-level anomaly detection (UGAD) has attracted increasing interest due to its widespread application. In recent studies, knowledge distillation-based methods have been widely used in unsupervised anomaly detection to improve model efficiency and generalization. However, the inherent symmetry between the source (teacher) and target (student) networks typically results in consistent outputs across both architectures, making it difficult to distinguish abnormal graphs from normal graphs. Also, existing methods mainly rely on graph features to distinguish anomalies, which may be unstable with complex and diverse data and fail to capture the essence that differentiates normal graphs from abnormal ones. In this work, we propose a Graph Normalizing Flows-driven Asymmetric Network For Unsupervised Graph-Level Anomaly Detection (FANFOLD in short). We introduce normalizing flows to unsupervised graph-level anomaly detection due to their successful application and superior quality in learning the underlying distribution of samples. Specifically, we adopt the knowledge distillation technique and apply normalizing flows on the source network, achieving the asymmetric network. In the training stage, FANFOLD transforms the original distribution of normal graphs to a standard normal distribution. During inference, FANFOLD computes the anomaly score using the source-target loss to discriminate between normal and anomalous graphs. We conduct extensive experiments on 15 datasets of different fields with 9 baseline methods to validate the superiority of FANFOLD.

[LG-184] UM2N: Towards Universal Mesh Movement Networks

Link: https://arxiv.org/abs/2407.00382
Authors: Mingrui Zhang,Chunyang Wang,Stephan Kramer,Joseph G. Wallwork,Siyi Li,Jiancheng Liu,Xiang Chen,Matthew D. Piggott
Keywords: Partial Differential Equations, Solving complex Partial, complex Partial Differential, Differential Equations, Partial Differential
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Solving complex Partial Differential Equations (PDEs) accurately and efficiently is an essential and challenging problem in all scientific and engineering disciplines. Mesh movement methods provide the capability to improve the accuracy of the numerical solution without increasing the overall mesh degree of freedom count. Conventional sophisticated mesh movement methods are extremely expensive and struggle to handle scenarios with complex boundary geometries. However, existing learning-based methods require re-training from scratch given a different PDE type or boundary geometry, which limits their applicability, and also often suffer from robustness issues in the form of inverted elements. In this paper, we introduce the Universal Mesh Movement Network (UM2N), which – once trained – can be applied in a non-intrusive, zero-shot manner to move meshes with different size distributions and structures, for solvers applicable to different PDE types and boundary geometries. UM2N consists of a Graph Transformer (GT) encoder for extracting features and a Graph Attention Network (GAT) based decoder for moving the mesh. We evaluate our method on advection and Navier-Stokes based examples, as well as a real-world tsunami simulation case. Our method outperforms existing learning-based mesh movement methods in terms of the benchmarks described above. In comparison to the conventional sophisticated Monge-Ampère PDE-solver based method, our approach not only significantly accelerates mesh movement, but also proves effective in scenarios where the conventional method fails. Our project page is at this https URL.

[LG-185] Axiomatization of Gradient Smoothing in Neural Networks

Link: https://arxiv.org/abs/2407.00371
Authors: Linjiang Zhou,Xiaochuan Shi,Chao Ma,Zepeng Wang
Keywords: neural networks explanation, neural networks, play a pivotal, pivotal role, networks explanation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Gradients play a pivotal role in neural network explanation. The inherent high dimensionality and structural complexity of neural networks result in the original gradients containing a significant amount of noise. While several approaches have been proposed to reduce noise with smoothing, there is little discussion of the rationale behind smoothing gradients in neural networks. In this work, we propose a theoretical framework for gradient smoothing in neural networks based on function mollification and Monte Carlo integration. The framework intrinsically axiomatizes gradient smoothing and reveals the rationale of existing methods. Furthermore, we provide an approach to designing new smoothing methods derived from the framework. By experimentally evaluating several newly designed smoothing methods, we demonstrate the research potential of our framework.
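
One way to read "function mollification and Monte Carlo integration" is that a smoothed gradient is the expected gradient under a noise kernel, estimated by averaging gradients at randomly perturbed inputs. The sketch below uses a Gaussian kernel, in the style of SmoothGrad-like estimators; it is an assumption-laden illustration, not the framework from the paper. The toy function f(x) = Σ|x_i| and all names are ours.

```python
import random


def grad_abs_sum(x):
    """Gradient of the toy function f(x) = sum(|x_i|): the sign of each coordinate."""
    return [1.0 if xi > 0 else -1.0 if xi < 0 else 0.0 for xi in x]


def smoothed_gradient(grad_fn, x, sigma, n_samples, seed=0):
    """Monte Carlo estimate of E[grad f(x + eps)] with eps ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total = [t + g for t, g in zip(total, grad_fn(noisy))]
    return [t / n_samples for t in total]


x = [1e-6, 3.0]
print(grad_abs_sum(x))                                 # raw gradient, discontinuous at 0
print(smoothed_gradient(grad_abs_sum, x, 1.0, 20000))  # smoothed estimate
```

Near the kink at zero the raw gradient jumps between −1 and 1, while the smoothed estimate is close to zero; far from the kink the two agree. The noise scale `sigma` and the sample count trade smoothing strength against estimator variance.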

[LG-186] Enhancing Accuracy and Parameter-Efficiency of Neural Representations for Network Parameterization

Link: https://arxiv.org/abs/2407.00356
Authors: Hongjun Choi,Jayaraman J. Thiagarajan,Ruben Glatt,Shusen Liu
Keywords: neural network weights, investigate the fundamental, fundamental trade-off, parameter efficiency, parameterization of neural
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this work, we investigate the fundamental trade-off regarding accuracy and parameter efficiency in the parameterization of neural network weights using predictor networks. We present a surprising finding that, when recovering the original model accuracy is the sole objective, it can be achieved effectively through the weight reconstruction objective alone. Additionally, we explore the underlying factors for improving weight reconstruction under parameter-efficiency constraints, and propose a novel training scheme that decouples the reconstruction objective from auxiliary objectives such as knowledge distillation, which leads to significant improvements compared to state-of-the-art approaches. Finally, these results pave the way for more practical scenarios, where one needs to achieve improvements on both model accuracy and predictor network parameter-efficiency simultaneously.

[LG-187] WgLaSDI: Weak-Form Greedy Latent Space Dynamics Identification

Link: https://arxiv.org/abs/2407.00337
Authors: Xiaolong He,April Tran,David M. Bortz,Youngsoo Choi
Keywords: nonlinear physical systems, demonstrated promising potential, physical systems, promising potential, high-dimensional nonlinear physical
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:The parametric greedy latent space dynamics identification (gLaSDI) framework has demonstrated promising potential for accurate and efficient modeling of high-dimensional nonlinear physical systems. However, it remains challenging to handle noisy data. To enhance robustness against noise, we incorporate the weak-form estimation of nonlinear dynamics (WENDy) into gLaSDI. In the proposed weak-form gLaSDI (WgLaSDI) framework, an autoencoder and WENDy are trained simultaneously to discover intrinsic nonlinear latent-space dynamics of high-dimensional data. Compared to the standard sparse identification of nonlinear dynamics (SINDy) employed in gLaSDI, WENDy enables variance reduction and robust latent space discovery, therefore leading to more accurate and efficient reduced-order modeling. Furthermore, the greedy physics-informed active learning in WgLaSDI enables adaptive sampling of optimal training data on the fly for enhanced modeling accuracy. The effectiveness of the proposed framework is demonstrated by modeling various nonlinear dynamical problems, including viscous and inviscid Burgers’ equations, time-dependent radial advection, and the Vlasov equation for plasma physics. With data that contains 5-10% Gaussian white noise, WgLaSDI outperforms gLaSDI by orders of magnitude, achieving 1-7% relative errors. Compared with the high-fidelity models, WgLaSDI achieves 121 to 1,779x speed-up.

[LG-188] Dual-view Aware Smart Contract Vulnerability Detection for Ethereum

Link: https://arxiv.org/abs/2407.00336
Authors: Jiacheng Yao,Maolin Wang,Wanqi Chen,Chengxiang Jin,Jiajun Zhou,Shanqing Yu,Qi Xuan
Keywords: brought technological innovation, smart contracts, Ethereum core applications, Contract Vulnerability Detection, traditional industries
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by International Conference on Blockchain and Trustworthy Systems 2024

Click to view abstract

Abstract:The wide application of Ethereum technology has brought technological innovation to traditional industries. As one of Ethereum’s core applications, smart contracts utilize diverse contract codes to meet various functional needs and have gained widespread use. However, the non-tamperability of smart contracts, coupled with vulnerabilities caused by natural flaws or human errors, has brought unprecedented challenges to blockchain security. Therefore, in order to ensure the healthy development of blockchain technology and the stability of the blockchain community, it is particularly important to study the vulnerability detection techniques for smart contracts. In this paper, we propose a Dual-view Aware Smart Contract Vulnerability Detection Framework named DVDet. The framework initially converts the source code and bytecode of smart contracts into weighted graphs and control flow sequences, capturing potential risk features from these two perspectives and integrating them for analysis, ultimately achieving effective contract vulnerability detection. Comprehensive experiments on the Ethereum dataset show that our method outperforms others in detecting vulnerabilities.

[LG-189] Revisiting Constant Negative Rewards for Goal-Reaching Tasks in Robot Learning

Link: https://arxiv.org/abs/2407.00324
Authors: Gautham Vasan,Yan Wang,Fahim Shahriar,James Bergstra,Martin Jagersand,A. Rupam Mahmood
Keywords: real-world robot learning, goal state, real-world robot, robot learning problems, goal
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: In Proceedings of Reinforcement Learning Conference 2024. For video demo, see this https URL

Click to view abstract

Abstract:Many real-world robot learning problems, such as pick-and-place or arriving at a destination, can be seen as a problem of reaching a goal state as soon as possible. These problems, when formulated as episodic reinforcement learning tasks, can easily be specified to align well with our intended goal: -1 reward every time step with termination upon reaching the goal state, called minimum-time tasks. Despite this simplicity, such formulations are often overlooked in favor of dense rewards due to their perceived difficulty and lack of informativeness. Our studies contrast the two reward paradigms, revealing that the minimum-time task specification not only facilitates learning higher-quality policies but can also surpass dense-reward-based policies on their own performance metrics. Crucially, we also identify the goal-hit rate of the initial policy as a robust early indicator for learning success in such sparse feedback settings. Finally, using four distinct real-robotic platforms, we show that it is possible to learn pixel-based policies from scratch within two to three hours using constant negative rewards.
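
The minimum-time specification described in this abstract (a reward of −1 at every step, with termination on reaching the goal) is easy to state in code. The sketch below uses a hypothetical one-dimensional grid and two hand-written policies; the environment, the policy names, and the step limit are illustrative assumptions and have nothing to do with the robotic platforms used in the paper.

```python
def run_episode(start, goal, policy, max_steps=100):
    """Return (undiscounted return, reached_goal) under a constant -1 reward per step."""
    position, total_reward = start, 0
    for _ in range(max_steps):
        if position == goal:
            return total_reward, True  # the episode terminates at the goal
        position += policy(position, goal)
        total_reward -= 1  # constant negative reward for every step taken
    return total_reward, position == goal


def direct_policy(position, goal):
    """Hypothetical policy: always step towards the goal."""
    return 1 if goal > position else -1


def idle_policy(position, goal):
    """Hypothetical policy: never move, so the episode only ends at the step limit."""
    return 0


print(run_episode(0, 5, direct_policy))
print(run_episode(0, 5, idle_policy))
```

Because the return equals the negative number of steps taken, maximizing it is the same as reaching the goal as quickly as possible, which is why the specification aligns with the intended objective without any reward shaping.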

[LG-190] LiteSearch: Efficacious Tree Search for LLM

Link: https://arxiv.org/abs/2407.00320
Authors: Ante Wang,Linfeng Song,Ye Tian,Baolin Peng,Dian Yu,Haitao Mi,Jinsong Su,Dong Yu
Keywords: Monte Carlo Tree, Recent research suggests, dramatically boost LLM, mathematical reasoning tasks, Monte Carlo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.

[LG-191] Deep Neural Networks with Symplectic Preservation Properties

Link: https://arxiv.org/abs/2407.00294
Authors: Qing He, Wei Cai
Keywords: network architecture designed, deep neural network, neural network architecture, propose a deep, architecture designed
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:We propose a deep neural network architecture designed such that its output forms an invertible symplectomorphism of the input. This design draws an analogy to the real-valued non-volume-preserving (real NVP) method used in normalizing flow techniques. Utilizing this neural network type allows for learning tasks on unknown Hamiltonian systems without breaking the inherent symplectic structure of the phase space.
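To make the idea of an invertible, structure-preserving coupling map concrete, here is a minimal sketch in a 2-D phase space (q, p). It is not the paper's architecture: it composes two additive shear maps, a standard construction, and the functions `f` and `g` are hypothetical stand-ins for learned networks. In two dimensions a map is symplectic exactly when its Jacobian determinant equals 1, which the code checks numerically.

```python
import math


def f(p):
    """Hypothetical stand-in for a learned scalar network acting on p."""
    return math.tanh(0.7 * p)


def g(q):
    """Hypothetical stand-in for a learned scalar network acting on q."""
    return math.sin(1.3 * q)


def forward(q, p):
    """Compose two shear maps; each one has unit Jacobian determinant."""
    q = q + f(p)  # shear in q, p unchanged
    p = p + g(q)  # shear in p, q unchanged
    return q, p


def inverse(q, p):
    """Undo the shears in reverse order, as in coupling-layer flows."""
    p = p - g(q)
    q = q - f(p)
    return q, p


def jacobian_det(q, p, h=1e-6):
    """Determinant of the forward map's Jacobian by central differences."""
    q_a, p_a = forward(q + h, p)
    q_b, p_b = forward(q - h, p)
    q_c, p_c = forward(q, p + h)
    q_d, p_d = forward(q, p - h)
    dq_dq, dp_dq = (q_a - q_b) / (2 * h), (p_a - p_b) / (2 * h)
    dq_dp, dp_dp = (q_c - q_d) / (2 * h), (p_c - p_d) / (2 * h)
    return dq_dq * dp_dp - dq_dp * dp_dq


print(jacobian_det(0.3, -1.2))          # close to 1: phase-space area is preserved
print(inverse(*forward(0.3, -1.2)))     # recovers the input up to rounding
```

The explicit inverse mirrors the real NVP analogy drawn in the abstract: each layer changes only one half of the variables, so it can be undone exactly.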

[LG-192] Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks

Link: https://arxiv.org/abs/2407.00286
Authors: Zifan Zhang, Yuchen Liu, Zhiyuan Peng, Mingzhe Chen, Dongkuan Xu, Shuguang Cui
Keywords: Optimizing edge caching, advancement of next-generation, ensuring high-speed, Optimizing edge, high-speed and low-latency
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: Accepted by IEEE Journal on Selected Areas in Communications (JSAC)

Abstract:Optimizing edge caching is crucial for the advancement of next-generation (nextG) wireless networks, ensuring high-speed and low-latency services for mobile users. Existing data-driven optimization approaches often lack awareness of the distribution of random data variables and focus solely on optimizing cache hit rates, neglecting potential reliability concerns, such as base station overload and unbalanced cache issues. This oversight can result in system crashes and degraded user experience. To bridge this gap, we introduce a novel digital twin-assisted optimization framework, called D-REC, which integrates reinforcement learning (RL) with diverse intervention modules to ensure reliable caching in nextG wireless networks. We first develop a joint vertical and horizontal twinning approach to efficiently create network digital twins, which are then employed by D-REC as RL optimizers and safeguards, providing ample datasets for training and predictive evaluation of our cache replacement policy. By incorporating reliability modules into a constrained Markov decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints, minimizing the risk of network failures. Theoretical analysis demonstrates comparable convergence rates between D-REC and vanilla data-driven methods without compromising caching performance. Extensive experiments validate that D-REC outperforms conventional approaches in cache hit rate and load balancing while effectively enforcing predetermined reliability intervention modules.

[LG-193] PerAct2: A Perceiver Actor Framework for Bimanual Manipulation Tasks

Link: https://arxiv.org/abs/2407.00278
Authors: Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox
Keywords: temporal coordination required, challenging due, due to precise, precise spatial, spatial and temporal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark comprising 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extended several state-of-the-art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent, PerAct2, which enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals. Project website with code is available at: this http URL

[LG-194] Learning a Clinically-Relevant Concept Bottleneck for Lesion Detection in Breast Ultrasound

Link: https://arxiv.org/abs/2407.00267
Authors: Arianna Bunnell, Yannik Glaser, Dustin Valdez, Thomas Wolfgruber, Aleen Altamirano, Carol Zamora González, Brenda Y. Hernandez, Peter Sadowski, John A. Shepherd
Keywords: Detecting and classifying, Radiology Breast Imaging, breast ultrasound images, artificial intelligence, access to mammography
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted version of manuscript accepted at MICCAI 2024. This preprint has not undergone peer review or any post-submission improvements or corrections

Abstract:Detecting and classifying lesions in breast ultrasound images is a promising application of artificial intelligence (AI) for reducing the burden of cancer in regions with limited access to mammography. Such AI systems are more likely to be useful in a clinical setting if their predictions can be explained to a radiologist. This work proposes an explainable AI model that provides interpretable predictions using a standard lexicon from the American College of Radiology’s Breast Imaging and Reporting Data System (BI-RADS). The model is a deep neural network featuring a concept bottleneck layer in which known BI-RADS features are predicted before making a final cancer classification. This enables radiologists to easily review the predictions of the AI system and potentially fix errors in real time by modifying the concept predictions. In experiments, a model is developed on 8,854 images from 994 women with expert annotations and histological cancer labels. The model outperforms state-of-the-art lesion detection frameworks with 48.9 average precision on the held-out testing set, and for cancer classification, concept intervention is shown to increase performance from 0.876 to 0.885 area under the receiver operating characteristic curve. Training and evaluation code is available at this https URL.

[LG-195] External Model Motivated Agents: Reinforcement Learning for Enhanced Environment Sampling

Link: https://arxiv.org/abs/2407.00264
Authors: Rishav Bhagat, Jonathan Balloch, Zhiyu Lin, Julia Kim, Mark Riedl
Keywords: Unlike reinforcement learning, humans remain capable, remain capable multitaskers, Unlike reinforcement, humans remain
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Unlike reinforcement learning (RL) agents, humans remain capable multitaskers in changing environments. In spite of only experiencing the world through their own observations and interactions, people know how to balance focusing on tasks with learning about how changes may affect their understanding of the world. This is possible by choosing to solve tasks in ways that are interesting and generally informative beyond just the current task. Motivated by this, we propose an agent influence framework for RL agents to improve the adaptation efficiency of external models in changing environments without any changes to the agent’s rewards. Our formulation is composed of two self-contained modules: interest fields and behavior shaping via interest fields. We implement an uncertainty-based interest field algorithm as well as a skill-sampling-based behavior-shaping algorithm to use in testing this framework. Our results show that our method outperforms the baselines in terms of external model adaptation on metrics that measure both efficiency and performance.

[LG-196] One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts

Link: https://arxiv.org/abs/2407.00256
Authors: Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, Cho-Jui Hsieh
Keywords: Large Language Models, Large Language, Language Models, exhibit strong generalization, strong generalization capabilities
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2024. Code available at this https URL

Abstract:Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; Each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: Inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: A region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior arts across several major benchmarks.

[LG-197] Learning Closed Signal Flow Graphs

Link: https://arxiv.org/abs/2407.00245
Authors: Ekaterina Piotrovskaya, Leo Lobski, Fabio Zanasi
Keywords: closed signal flow, signal flow graphs, signal flow, flow graphs, graphical model
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures. An extended abstract for Learning and Automata workshop (LearnAut 2024)

Abstract:We develop a learning algorithm for closed signal flow graphs - a graphical model of signal transducers. The algorithm relies on the correspondence between closed signal flow graphs and weighted finite automata on a singleton alphabet. We demonstrate that this procedure results in a genuine reduction of complexity: our algorithm fares better than existing learning algorithms for weighted automata restricted to the case of a singleton alphabet.
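For readers unfamiliar with weighted finite automata over a singleton alphabet, the following sketch shows what such an automaton computes. It is only an illustration of the object mentioned in the abstract, not the paper's learning algorithm. With a single letter `a`, every word is `a^n`, and its weight is the product of an initial row vector, the n-th power of one transition matrix, and a final vector. The Fibonacci example is an assumption chosen for the demo.

```python
def wfa_weight(alpha, A, beta, n):
    """Weight of the word a^n, computed as alpha * A^n * beta.

    alpha: initial weights (row vector), A: transition matrix for the single
    letter, beta: final weights (column vector).
    """
    vec = list(alpha)
    for _ in range(n):
        # Row vector times matrix: one step of reading the letter 'a'.
        vec = [sum(vec[i] * A[i][j] for i in range(len(vec)))
               for j in range(len(A[0]))]
    return sum(v * b for v, b in zip(vec, beta))


# Illustrative two-state automaton whose weights are the Fibonacci numbers.
alpha = [1, 0]
A = [[1, 1],
     [1, 0]]
beta = [0, 1]

print([wfa_weight(alpha, A, beta, n) for n in range(8)])
```

Because there is only one letter, the automaton's behavior is a single number sequence indexed by n, which is the kind of stream a signal transducer produces.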

[LG-198] Closed-Form Test Functions for Biophysical Sequence Optimization Algorithms

Link: https://arxiv.org/abs/2407.00236
Authors: Samuel Stanton, Robert Alberstein, Nathan Frey, Andrew Watkins, Kyunghyun Cho
Keywords: natural language processing, applications involving biophysical, involving biophysical data, machine learning, computer vision
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:There is a growing body of work seeking to replicate the success of machine learning (ML) on domains like computer vision (CV) and natural language processing (NLP) to applications involving biophysical data. One of the key ingredients of prior successes in CV and NLP was the broad acceptance of difficult benchmarks that distilled key subproblems into approachable tasks that any junior researcher could investigate, but good benchmarks for biophysical domains are rare. This scarcity is partially due to a narrow focus on benchmarks which simulate biophysical data; we propose instead to carefully abstract biophysical problems into simpler ones with key geometric similarities. In particular we propose a new class of closed-form test functions for biophysical sequence optimization, which we call Ehrlich functions. We provide empirical results demonstrating these functions are interesting objects of study and can be non-trivial to solve with a standard genetic optimization baseline.

[LG-199] Methodology to Deploy CNN-Based Computer Vision Models on Immersive Wearable Devices

Link: https://arxiv.org/abs/2407.00233
Authors: Kaveh Malek (1), Fernando Moreu (2) ((1) Department of Mechanical Engineering, University of New Mexico, New Mexico; (2) Department of Civil, Construction and Environmental Engineering, University of New Mexico, New Mexico)
Keywords: Convolutional Neural Network, Convolutional Neural, Neural Network, Augmented Reality, addressed by Augmented
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 4300 words

Abstract:Convolutional Neural Network (CNN) models often lack the ability to incorporate human input, which can be addressed by Augmented Reality (AR) headsets. However, current AR headsets face limitations in processing power, which has prevented researchers from performing real-time, complex image recognition tasks using CNNs in AR headsets. This paper presents a method to deploy CNN models on AR headsets by training them on computers and transferring the optimized weight matrices to the headset. The approach transforms the image data and CNN layers into a one-dimensional format suitable for the AR platform. We demonstrate this method by training the LeNet-5 CNN model on the MNIST dataset using PyTorch and deploying it on a HoloLens AR headset. The results show that the model maintains an accuracy of approximately 98%, similar to its performance on a computer. This integration of CNN and AR enables real-time image processing on AR headsets, allowing for the incorporation of human input into AI models.

[LG-200] LLM Critics Help Catch LLM Bugs

Link: https://arxiv.org/abs/2407.00215
Authors: Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike
Keywords: Reinforcement learning, evaluate model output, correctly evaluate model, fundamentally limited, Reinforcement
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains “critic” models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as “flawless”, even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.

[LG-201] Tradeoffs When Considering Deep Reinforcement Learning for Contingency Management in Advanced Air Mobility

Link: https://arxiv.org/abs/2407.00197
Authors: Luis E. Alvarez, Marc W. Brittain, Steven D. Young
Keywords: Advanced Air Mobility, Air Mobility, Advanced Air, rapid evolution globally, Air transportation
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Air transportation is undergoing a rapid evolution globally with the introduction of Advanced Air Mobility (AAM) and with it comes novel challenges and opportunities for transforming aviation. As AAM operations introduce increasing heterogeneity in vehicle capabilities and density, increased levels of automation are likely necessary to achieve operational safety and efficiency goals. This paper focuses on one example where increased automation has been suggested. Autonomous operations will need contingency management systems that can monitor evolving risk across a span of interrelated (or interdependent) hazards and, if necessary, execute appropriate control interventions via supervised or automated decision making. Accommodating this complex environment may require automated functions (autonomy) that apply artificial intelligence (AI) techniques that can adapt and respond to a quickly changing environment. This paper explores the use of Deep Reinforcement Learning (DRL) which has shown promising performance in complex and high-dimensional environments where the objective can be constructed as a sequential decision-making problem. An extension of a prior formulation of the contingency management problem as a Markov Decision Process (MDP) is presented and uses a DRL framework to train agents that mitigate hazards present in the simulation environment. A comparison of these learning-based agents and classical techniques is presented in terms of their performance, verification difficulties, and development process.

[LG-202] The impact of model size on catastrophic forgetting in Online Continual Learning

Link: https://arxiv.org/abs/2407.00176
Authors: Eunhae Lee
Keywords: Continual Learning performance, Continual Learning, Online Continual Learning, Continual Learning efficacy, Learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This study investigates the impact of model size on Online Continual Learning performance, with a focus on catastrophic forgetting. Employing ResNet architectures of varying sizes, the research examines how network depth and width affect model performance in class-incremental learning using the SplitCIFAR-10 dataset. Key findings reveal that larger models do not guarantee better Continual Learning performance; in fact, they often struggle more in adapting to new tasks, particularly in online settings. These results challenge the notion that larger models inherently mitigate catastrophic forgetting, highlighting the nuanced relationship between model size and Continual Learning efficacy. This study contributes to a deeper understanding of model scalability and its practical implications in Continual Learning scenarios.

[LG-203] Dataset Representativeness and Downstream Task Fairness

Link: https://arxiv.org/abs/2407.00170
Authors: Victor Borza, Andrew Estornell, Chien-Ju Ho, Bradley Malin, Yevgeniy Vorobeychik
Keywords: meaningful clinical trials, running meaningful clinical, society collects data, range of applications, clinical trials
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 48 pages, 32 figures

Abstract:Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the distribution of demographics of the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between dataset representativeness and group-fairness of classifiers trained on that dataset. We demonstrate that there is a natural tension between dataset representativeness and classifier fairness; empirically we observe that training datasets with better representativeness can frequently result in classifiers with higher rates of unfairness. We provide some intuition as to why this occurs via a set of theoretical results in the case of univariate classifiers. We also find that over-sampling underrepresented groups can result in classifiers which exhibit greater bias to those groups. Lastly, we observe that fairness-aware sampling strategies (i.e., those which are specifically designed to select data with high downstream fairness) will often over-sample members of majority groups. These results demonstrate that the relationship between dataset representativeness and downstream classifier fairness is complex; balancing these two quantities requires special care from both model- and dataset-designers.

[LG-204] Localizing Anomalies via Multiscale Score Matching Analysis

Link: https://arxiv.org/abs/2407.00148
Authors: Ahsan Mahmood, Junier Oliva, Martin Styner
Keywords: remain critical challenges, Multiscale Score Matching, Score Matching Analysis, imaging remain critical, challenges in healthcare
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Anomaly detection and localization in medical imaging remain critical challenges in healthcare. This paper introduces Spatial-MSMA (Multiscale Score Matching Analysis), a novel unsupervised method for anomaly localization in volumetric brain MRIs. Building upon the MSMA framework, our approach incorporates spatial information and conditional likelihoods to enhance anomaly detection capabilities. We employ a flexible normalizing flow model conditioned on patch positions and global image features to estimate patch-wise anomaly scores. The method is evaluated on a dataset of 1,650 T1- and T2-weighted brain MRIs from typically developing children, with simulated lesions added to the test set. Spatial-MSMA significantly outperforms existing methods, including reconstruction-based, generative-based, and interpretation-based approaches, in lesion detection and segmentation tasks. Our model achieves superior performance in both distance-based metrics (99th percentile Hausdorff Distance: 7.05 ± 0.61, Mean Surface Distance: 2.10 ± 0.43) and component-wise metrics (True Positive Rate: 0.83 ± 0.01, Positive Predictive Value: 0.96 ± 0.01). These results demonstrate Spatial-MSMA’s potential for accurate and interpretable anomaly localization in medical imaging, with implications for improved diagnosis and treatment planning in clinical settings. Our code is available at this https URL.

[LG-205] Predicting Elevated Risk of Hospitalization Following Emergency Department Discharges

Link: https://arxiv.org/abs/2407.00147
Authors: Dat Hong, Philip M. Polgreen, Alberto Maria Segre
Keywords: proper diagnosis, follow closely, symptoms of missed, missed opportunities, opportunities to form
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Hospitalizations that follow closely on the heels of one or more emergency department visits are often symptoms of missed opportunities to form a proper diagnosis. These diagnostic errors imply a failure to recognize the need for hospitalization and deliver appropriate care, and thus also bear important connotations for patient safety. In this paper, we show how data mining techniques can be applied to a large existing hospitalization data set to learn useful models that predict these upcoming hospitalizations with high accuracy. Specifically, we use an ensemble of logistic regression, naïve Bayes and association rule classifiers to successfully predict hospitalization within 3, 7 and 14 days of an emergency department discharge. Aside from high accuracy, one of the advantages of the techniques proposed here is that the resulting classifier is easily inspected and interpreted by humans so that the learned rules can be readily operationalized. These rules can then be easily distributed and applied directly by physicians in emergency department settings to predict the risk of early admission prior to discharging their emergency department patients.

[LG-206] InfoNCE: Identifying the Gap Between Theory and Practice

Link: https://arxiv.org/abs/2407.00143
Authors: Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel
Keywords: learned representations uncover, contrastive learning, work on contrastive, learned representations, ground-truth latent factors
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Abstract:Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.
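As background for this abstract, the sketch below computes a standard InfoNCE loss for one anchor, one positive, and a list of negatives. The abstract does not give the exact form of AnInfoNCE, so the optional `weights` argument is only an illustrative assumption of how a per-dimension (anisotropic) similarity could look; it should not be read as the paper's definition. The negative squared distance used as the similarity is also a choice made for this demo.

```python
import math


def similarity(u, v, weights=None):
    """Negative squared Euclidean distance, optionally weighted per dimension.

    The per-dimension weights are a hypothetical illustration of an
    anisotropic similarity; uniform weights give the isotropic case.
    """
    if weights is None:
        weights = [1.0] * len(u)
    return -sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v))


def info_nce(anchor, positive, negatives, temperature=1.0, weights=None):
    """InfoNCE loss: negative log-probability of picking the positive
    among all candidates under a softmax over similarities."""
    candidates = [positive] + list(negatives)
    logits = [similarity(anchor, c, weights) / temperature for c in candidates]
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[0]


anchor = [0.0, 0.0]
negatives = [[3.0, 0.0], [0.0, 3.0]]
print(info_nce(anchor, [0.1, 0.0], negatives))  # near positive: small loss
print(info_nce(anchor, [2.0, 2.0], negatives))  # far positive: larger loss
```

Down-weighting a dimension in `weights` makes the loss less sensitive to variation along that latent factor, which is one way to picture the "continuum of variability" the abstract describes.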

[LG-207] Graph Neural Networks for Gut Microbiome Metaomic data: A preliminary work

Link: https://arxiv.org/abs/2407.00142
Authors: Christopher Irwin, Flavio Mignone, Stefania Montani, Luigi Portinale
Keywords: complex metaomic data, metaomic data due, crucial for human, human health, presents challenges
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The gut microbiome, crucial for human health, presents challenges in analyzing its complex metaomic data due to high dimensionality and sparsity. Traditional methods struggle to capture its intricate relationships. We investigate graph neural networks (GNNs) for this task, aiming to derive meaningful representations of individual gut microbiomes. Unlike methods relying solely on taxa abundance, we directly leverage phylogenetic relationships in order to obtain a generalized encoder for taxa networks. The representations learnt from the encoder are then used to train a model for phenotype prediction, such as Inflammatory Bowel Disease (IBD).

[LG-208] Towards Secure and Efficient Data Scheduling for Vehicular Social Networks

Link: https://arxiv.org/abs/2407.00141
Authors: Youhua Xia, Tiehua Zhang, Jiong Jin, Ying He, Fei Yu
Keywords: significant challenge due, Efficient data transmission, vehicular environments poses, vehicular social networks, Efficient data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Efficient data transmission scheduling within vehicular environments poses a significant challenge due to the high mobility of such networks. Contemporary research predominantly centers on crafting cooperative scheduling algorithms tailored for vehicular networks. Notwithstanding, the intricacies of orchestrating scheduling in vehicular social networks both effectively and efficiently remain formidable. This paper introduces an innovative learning-based algorithm for scheduling data transmission that prioritizes efficiency and security within vehicular social networks. The algorithm first uses a specifically constructed neural network to enhance data processing capabilities. After this, it incorporates a Q-learning paradigm during the data transmission phase to optimize the information exchange, the privacy of which is safeguarded by differential privacy through the communication process. Comparative experiments demonstrate the superior performance of the proposed Q-learning enhanced scheduling algorithm relative to existing state-of-the-art scheduling algorithms in the context of vehicular social networks.
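The two generic ingredients named in this abstract, Q-learning and differential privacy, can be sketched as follows. This is not the paper's scheduling algorithm: the tabular Q-table, the state and action names, and the use of the Laplace mechanism on a single exchanged value are all assumptions made for illustration.

```python
import random


def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(value, sensitivity, epsilon, rng):
    """Laplace mechanism: add noise with scale sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical scheduling states and actions, for demonstration only.
Q = {("busy", "send"): 1.0, ("busy", "wait"): 2.0}
new_value = q_update(Q, "idle", "send", reward=1.0, next_state="busy",
                     actions=["send", "wait"], alpha=0.5, gamma=0.9)
print(new_value)

rng = random.Random(0)  # seeded so the demo is repeatable
print(privatize(10.0, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives stronger privacy but noisier shared values, which is the efficiency-versus-security tension the abstract points to.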

[LG-209] ModeConv: A Novel Convolution for Distinguishing Anomalous and Normal Structural Behavior

Link: https://arxiv.org/abs/2407.00140
Authors: Melanie Schaller, Daniel Schlör, Andreas Hotho
Keywords: environmental factors induce, factors induce vibrations, degradation over time, traffic and environmental, environmental factors
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:External influences such as traffic and environmental factors induce vibrations in structures, leading to material degradation over time. These vibrations result in cracks due to the material’s lack of plasticity compromising structural integrity. Detecting such damage requires the installation of vibration sensors to capture the internal dynamics. However, distinguishing relevant eigenmodes from external noise necessitates the use of Deep Learning models. The detection of changes in eigenmodes can be used to anticipate these shifts in material properties and to discern between normal and anomalous structural behavior. Eigenmodes, representing characteristic vibration patterns, provide insights into structural dynamics and deviations from expected states. Thus, we propose ModeConv to automatically capture and analyze changes in eigenmodes, facilitating effective anomaly detection in structures and material properties. In the conducted experiments, ModeConv demonstrates computational efficiency improvements, resulting in reduced runtime for model calculations. The novel ModeConv neural network layer is tailored for temporal graph neural networks, in which every node represents one sensor. ModeConv employs a singular value decomposition based convolutional filter design for complex numbers and leverages modal transformation in lieu of Fourier or Laplace transformations in spectral graph convolutions. We include a mathematical complexity analysis illustrating the runtime reduction.

[LG-210] A Simple Attention-Based Mechanism for Bimodal Emotion Classification

Link: https://arxiv.org/abs/2407.00134
Authors: Mazen Elabd, Sardar Jaf
Keywords: Big data, learning important features, Big, important features, learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, 5 tables

Abstract:Big data contain rich information for machine learning algorithms to utilize when learning important features during classification tasks. Human beings express their emotions using certain words, speech (tone, pitch, speed) or facial expressions. Artificial Intelligence approaches to emotion classification are largely based on learning from textual information. However, public datasets containing text and speech data provide sufficient resources to train machine learning algorithms for the task of emotion classification. In this paper, we present novel bimodal deep learning-based architectures enhanced with an attention mechanism, trained and tested on text and speech data for emotion classification. We report details of different deep learning-based architectures and show the performance of each architecture, including rigorous error analyses. Our findings suggest that deep learning-based architectures trained on different types of data (text and speech) outperform architectures trained only on text or speech. Our proposed attention-based bimodal architecture outperforms several state-of-the-art systems in emotion classification.

[LG-211] RepAct: The Re-parameterizable Adaptive Activation Function

Link: https://arxiv.org/abs/2407.00131
Authors: Xian Wu, Qingchuan Tao, Shuang Wang
Keywords: efficient artificial intelligence, Addressing the imperative, study presents RepAct, re-parameterizable adaptive activation, activation function tailored
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Addressing the imperative need for efficient artificial intelligence in IoT and edge computing, this study presents RepAct, a re-parameterizable adaptive activation function tailored for optimizing lightweight neural networks within the computational limitations of edge devices. By employing a multi-branch structure with learnable adaptive weights, RepAct enriches feature processing and enhances cross-layer interpretability. When evaluated on tasks such as image classification and object detection, RepAct notably surpassed conventional activation functions in lightweight networks, delivering up to a 7.92% accuracy boost on MobileNetV3-Small for the ImageNet100 dataset, while maintaining computational complexity on par with HardSwish. This innovative approach not only maximizes model parameter efficiency but also significantly improves the performance and understanding capabilities of lightweight neural networks, demonstrating its potential for real-time edge computing applications.

[LG-212] When Search Engine Services meet Large Language Models: Visions and Challenges

Link: https://arxiv.org/abs/2407.00128
Authors: Haoyi Xiong, Jiang Bian, Yuchen Li, Xuhong Li, Mengnan Du, Shuaiqiang Wang, Dawei Yin, Sumi Helal
Keywords: Combining Large Language, Large Language Models, Combining Large, Large Language, Language Models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Combining Large Language Models (LLMs) with search engine services marks a significant shift in the field of services computing, opening up new possibilities to enhance how we search for and retrieve information, understand content, and interact with internet services. This paper conducts an in-depth examination of how integrating LLMs with search engines can mutually benefit both technologies. We focus on two main areas: using search engines to improve LLMs (Search4LLM) and enhancing search engine functions using LLMs (LLM4Search). For Search4LLM, we investigate how search engines can provide diverse high-quality datasets for pre-training of LLMs, how they can use the most relevant documents to help LLMs learn to answer queries more accurately, how training LLMs with Learning-To-Rank (LTR) tasks can enhance their ability to respond with greater precision, and how incorporating recent search results can make LLM-generated content more accurate and current. In terms of LLM4Search, we examine how LLMs can be used to summarize content for better indexing by search engines, improve query outcomes through optimization, enhance the ranking of search results by analyzing document relevance, and help in annotating data for learning-to-rank tasks in various learning contexts. However, this promising integration comes with its challenges, which include addressing potential biases and ethical issues in training models, managing the computational and other costs of incorporating LLMs into search services, and continuously updating LLM training with the ever-changing web content. We discuss these challenges and chart out required research directions to address them. We also discuss broader implications for service computing, such as scalability, privacy concerns, and the need to adapt search engine architectures for these advanced models.

[LG-213] Multi-Species Object Detection in Drone Imagery for Population Monitoring of Endangered Animals

Link: https://arxiv.org/abs/2407.00127
Authors: Sowmya Sankaran
Keywords: Animal populations worldwide, accurately count endangered, count endangered species, rapidly declining, populations worldwide
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Animal populations worldwide are rapidly declining, and a technology that can accurately count endangered species could be vital for monitoring population changes over several years. This research focused on fine-tuning object detection models for drone images to create accurate counts of animal species. Hundreds of images taken using a drone and large, openly available drone-image datasets were used to fine-tune machine learning models with the baseline YOLOv8 architecture. We trained 30 different models, with the largest having 43.7 million parameters and 365 layers, and used hyperparameter tuning and data augmentation techniques to improve accuracy. While the state-of-the-art YOLOv8 baseline had only 0.7% accuracy on a dataset of safari animals, our models had 95% accuracy on the same dataset. Finally, we deployed the models on the Jetson Orin Nano for demonstration of low-power real-time species detection for easy inference on drones.

[LG-214] Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks

Link: https://arxiv.org/abs/2407.00121
Authors: Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachin Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Cox, Salim Roukos, Luis Lastras, Pavan Kapanipathi
Keywords: Large language models, recently shown tremendous, shown tremendous promise, Large language, function calling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex tasks. These tasks together are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling, those being Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets comparing GRANITE-20B-FUNCTIONCALLING to more than 15 other best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets.
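
To make the notion of function calling concrete, here is a minimal sketch of how an application might execute calls emitted by a model. The JSON shape (`name` plus `arguments`), the `TOOLS` registry, and `execute_calls` are hypothetical and chosen only for illustration; the abstract does not specify the output format used by GRANITE-20B-FUNCTIONCALLING.

```python
import json

# Hypothetical tool registry: maps a function name to a Python callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def execute_calls(model_output):
    """Parse a JSON string holding one call or a list of calls and run each.

    Assumed (illustrative) format: {"name": str, "arguments": {...}}.
    """
    calls = json.loads(model_output)
    if isinstance(calls, dict):
        calls = [calls]  # a single call is treated as a one-element list
    results = []
    for call in calls:
        name = call["name"]
        if name not in TOOLS:
            raise KeyError(f"unknown function: {name}")
        results.append(TOOLS[name](**call.get("arguments", {})))
    return results
```

A list with several entries corresponds loosely to what the abstract calls Parallel Functions; chaining or nesting would require feeding one result into the arguments of the next call, which this sketch does not attempt.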

[LG-215] Automated Web-Based Malaria Detection System with Machine Learning and Deep Learning Techniques

Link: https://arxiv.org/abs/2407.00120
Authors: Abraham G Taye, Sador Yemane, Eshetu Negash, Yared Minwuyelet, Moges Abebe, Melkamu Hunegnaw Asmare
Keywords: global health burden, causing widespread suffering, significant global health, Malaria parasites pose, health burden
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Malaria parasites pose a significant global health burden, causing widespread suffering and mortality. Detecting malaria infection accurately is crucial for effective treatment and control. However, existing automated detection techniques have shown limitations in terms of accuracy and generalizability. Many studies have focused on specific features without exploring more comprehensive approaches. In our case, we formulate a deep learning technique for malaria-infected cell classification using traditional CNNs and transfer learning models notably VGG19, InceptionV3, and Xception. The models were trained using NIH datasets and tested using different performance metrics such as accuracy, precision, recall, and F1-score. The test results showed that deep CNNs achieved the highest accuracy – 97%, followed by Xception with an accuracy of 95%. A machine learning model SVM achieved an accuracy of 83%, while an Inception-V3 achieved an accuracy of 94%. Furthermore, the system can be accessed through a web interface, where users can upload blood smear images for malaria detection.

[LG-216] Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations

Link: https://arxiv.org/abs/2407.00119
Authors: Yuntao Shou, Wei Ai, Jiayi Du, Tao Meng, Haiyan Liu
Keywords: genuine emotional state, graph neural networks, aims to analyze, multi-modal emotion recognition, analyze the genuine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 3 tables

Abstract:The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.
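
For readers unfamiliar with forward push, the sketch below shows the standard forward push routine for approximate personalized PageRank on an undirected graph. It is not the dilated generalized variant described in the abstract, whose details are not given here; the function name, the adjacency-dict input, and the `alpha` and `eps` defaults are assumptions made for illustration.

```python
from collections import deque

def forward_push(adj, source, alpha=0.15, eps=1e-4):
    """Standard forward push (approximate personalized PageRank).

    adj: dict mapping each node to a list of neighbours.
    Returns (p, r): estimated scores and leftover residual mass.
    """
    p = {u: 0.0 for u in adj}
    r = {u: 0.0 for u in adj}
    r[source] = 1.0
    queue = deque([source])
    queued = {source}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        deg = len(adj[u])
        # Skip isolated nodes and nodes whose residual is below the threshold.
        if deg == 0 or r[u] <= eps * deg:
            continue
        mass = r[u]
        p[u] += alpha * mass          # keep a fraction alpha at u
        r[u] = 0.0
        share = (1 - alpha) * mass / deg
        for v in adj[u]:              # spread the rest evenly to neighbours
            r[v] += share
            if v not in queued and r[v] > eps * len(adj[v]):
                queue.append(v)
                queued.add(v)
    return p, r
```

Because each push only touches a node's neighbours, the scores can be precomputed once per graph, which is the general reason such routines are attractive for capturing long-range dependencies cheaply.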

[LG-217] From Efficient Multimodal Models to World Models: A Survey

Link: https://arxiv.org/abs/2407.00118
Authors: Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang
Keywords: combining powerful large, powerful large language, large language models, perform complex tasks, Multimodal Large Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Multimodal Large Models (MLMs) are becoming a significant research focus, combining powerful large language models with multimodal learning to perform complex tasks across different data modalities. This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence and as a pathway to world models. We provide an overview of key techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL). Additionally, we discuss both the fundamental and specific technologies of multimodal models, highlighting their applications, input/output modalities, and design characteristics. Despite significant advancements, the development of a unified multimodal model remains elusive. We discuss the integration of 3D generation and embodied intelligence to enhance world simulation capabilities and propose incorporating external rule systems for improved reasoning and decision-making. Finally, we outline future research directions to address these challenges and advance the field.

[LG-218] Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges

Link: https://arxiv.org/abs/2407.00116
Authors: Mahmoud Ibrahim, Yasmina Al Khalil, Sina Amirrajab, Chang Sun, Marcel Breeuwer, Josien Pluim, Bart Elen, Gokhan Ertaylan, Michel Dumontier
Keywords: comprehensive systematic review, including imaging, medical data types, paper presents, presents a comprehensive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper presents a comprehensive systematic review of generative models (GANs, VAEs, DMs, and LLMs) used to synthesize various medical data types, including imaging (dermoscopic, mammographic, ultrasound, CT, MRI, and X-ray), text, time-series, and tabular data (EHR). Unlike previous narrowly focused reviews, our study encompasses a broad array of medical data modalities and explores various generative models. Our search strategy queries databases such as Scopus, PubMed, and ArXiv, focusing on recent works from January 2021 to November 2023, excluding reviews and perspectives. This period emphasizes recent advancements beyond GANs, which have been extensively covered previously. The survey reveals insights from three key aspects: (1) Synthesis applications and purpose of synthesis, (2) generation techniques, and (3) evaluation methods. It highlights clinically valid synthesis applications, demonstrating the potential of synthetic data to tackle diverse clinical requirements. While conditional models incorporating class labels, segmentation masks and image translations are prevalent, there is a gap in utilizing prior clinical knowledge and patient-specific context, suggesting a need for more personalized synthesis approaches and emphasizing the importance of tailoring generative approaches to the unique characteristics of medical data. Additionally, there is a significant gap in using synthetic data beyond augmentation, such as for validation and evaluation of downstream medical AI models. The survey uncovers that the lack of standardized evaluation methodologies tailored to medical images is a barrier to clinical application, underscoring the need for in-depth evaluation approaches, benchmarking, and comparative studies to promote openness and collaboration. 

[LG-219] Instance Temperature Knowledge Distillation

Link: https://arxiv.org/abs/2407.00115
Authors: Zhengbo Zhang, Yuxi Zhou, Jia Gong, Jun Liu, Zhigang Tu
Keywords: teacher network incrementally, Knowledge distillation, knowledge transferred, student network, enhances the performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Knowledge distillation (KD) enhances the performance of a student network by allowing it to learn the knowledge transferred from a teacher network incrementally. Existing methods dynamically adjust the temperature to enable the student network to adapt to the varying learning difficulties at different learning stages of KD. KD is a continuous process, but when adjusting the temperature, these methods consider only the immediate benefits of the operation in the current learning phase and fail to take into account its future returns. To address this issue, we formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning, termed RLKD. Importantly, we design a novel state representation to enable the agent to take more informed actions (i.e., instance temperature adjustments). To handle the problem of delayed rewards in our method due to the KD setting, we explore an instance reward calibration approach. In addition, we devise an efficient exploration strategy that enables the agent to learn a valuable instance temperature adjustment policy more efficiently. Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily, and we validate its effectiveness on both image classification and object detection tasks. Our code is at this https URL
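
As a rough illustration of what a per-instance temperature means in practice, the sketch below computes a conventional temperature-scaled distillation loss (KL divergence between softened teacher and student distributions, multiplied by T squared) with a separate temperature for each instance. The reinforcement learning agent that chooses the temperatures in RLKD is not modelled; the temperatures are simply passed in, and all function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) at the given temperature, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def batch_kd_loss(student_batch, teacher_batch, temperatures):
    """Average KD loss where every instance carries its own temperature."""
    losses = [
        kd_loss(s, t, temp)
        for s, t, temp in zip(student_batch, teacher_batch, temperatures)
    ]
    return sum(losses) / len(losses)
```

In a setup like the one the abstract describes, `temperatures` would come from a learned policy rather than being fixed by hand.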

[LG-220] OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Link: https://arxiv.org/abs/2407.00114
Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
Keywords: open-world instruction-following agents, instruction-following agents, open-world Minecraft, behavior, tokens
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potential.

[LG-221] Personalized Federated Continual Learning via Multi-granularity Prompt

Link: https://arxiv.org/abs/2407.00113
Authors: Hao Yu, Xin Yang, Xin Gao, Yan Kang, Hao Wang, Junbo Zhang, Tianrui Li
Keywords: Federated Continual Learning, Personalized Federated Continual, poses greater challenges, Federated Continual, Personalized Federated Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2024 Research Track

Abstract:Personalized Federated Continual Learning (PFCL) is a new practical scenario that poses greater challenges in sharing and personalizing knowledge. PFCL not only relies on knowledge fusion for server aggregation from the global spatial-temporal perspective but also needs model improvement for each client according to the local requirements. Existing methods, whether in Personalized Federated Learning (PFL) or Federated Continual Learning (FCL), have overlooked the multi-granularity representation of knowledge, which can be utilized to overcome Spatial-Temporal Catastrophic Forgetting (STCF) and to adapt generalized knowledge to each client through coarse-to-fine human cognitive mechanisms. Moreover, it allows shared knowledge to be personalized more effectively, so that it serves each client's own purpose. To this end, we propose a novel concept called multi-granularity prompt, i.e., a coarse-grained global prompt acquired through the common model learning process, and a fine-grained local prompt used to personalize the generalized representation. The former focuses on efficiently transferring shared global knowledge without spatial forgetting, and the latter emphasizes specific learning of personalized local knowledge to overcome temporal forgetting. In addition, we design a selective prompt fusion mechanism for aggregating knowledge of global prompts distilled from different clients. By the exclusive fusion of coarse-grained knowledge, we achieve the transmission and refinement of common knowledge among clients, further enhancing the performance of personalization. Extensive experiments demonstrate the effectiveness of the proposed method in addressing STCF as well as improving personalized performance. Our code is now available at this https URL.

[LG-222] Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models

Link: https://arxiv.org/abs/2407.00111
Authors: Ben Fauber
Keywords: instruction fine-tuned pretrained, fine-tuned pretrained generative, pretrained generative small, generative small language, small language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.

[LG-223] A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling

Link: https://arxiv.org/abs/2407.00108
Authors: Sebastian Vincent, Charlotte Prescott, Chris Bayliss, Chris Oakley, Carolina Scarton
Keywords: enhance translation quality, Incorporating extra-textual context, translation quality, Incorporating extra-textual, machine translation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to EAMT 2024

Abstract:Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV subtitles with a focus on how leveraging extra-textual context impacts post-editing. We found that post-editors marked significantly fewer context-related errors when correcting the outputs of MTCue, the context-aware model, as opposed to non-contextual models. We also present the results of a survey of the employed post-editors, which highlights contextual inadequacy as a significant gap consistently observed in MT. Our findings strengthen the motivation for further work within fully contextual MT.

[LG-224] WineGraph: A Graph Representation For Food-Wine Pairing

Link: https://arxiv.org/abs/2407.00107
Authors: Zuzanna Gawrysiak, Agata Żywot, Agnieszka Ławrynowicz
Keywords: present WineGraph, extended version, heterogeneous graph incorporating, graph incorporating wine, incorporating wine data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We present WineGraph, an extended version of FlavorGraph, a heterogeneous graph incorporating wine data into its structure. This integration enables food-wine pairing based on taste and sommelier-defined rules. Leveraging a food dataset comprising 500,000 reviews and a wine reviews dataset with over 130,000 entries, we computed taste descriptors for both food and wine. This information was then utilised to pair food items with wine and augment FlavorGraph with additional data. The results demonstrate the potential of heterogeneous graphs to acquire supplementary information, proving beneficial for wine pairing.

[LG-225] UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

Link: https://arxiv.org/abs/2407.00106
Authors: Ilia Shumailov, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, Eugene Bagdasaryan
Keywords: Exact unlearning, allowed a user, user to retract, retract their data, data from machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Abstract:Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning is often discussed as an approach for removing impermissible knowledge, i.e., knowledge that the model should not possess, such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper, we revisit the paradigm in which unlearning is used in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce the concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and that even exact unlearning schemes are not enough for effective content regulation. We discuss the feasibility of ununlearning for modern LLMs and examine broader implications.

[LG-226] Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Link: https://arxiv.org/abs/2407.00105
Authors: Yuqing Qian, Ziyu Zheng, Prayag Tiwari, Yijie Ding, Quan Zou
Keywords: Drug-side effect prediction, field of pharmacology, Drug-side effect, essential area, area of research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Transactions on Machine Learning Research (TMLR 2024)

Abstract:Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the related data can be described from various perspectives. To process these kinds of data, a multi-view method, called Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP), is proposed. MKronRLSF-LP extends the Kron-RLS by finding the consensus partitions and multiple graph Laplacian constraints in the multi-view setting. Both of these multi-view settings contribute to a higher quality result. Extensive experiments have been conducted on drug-side effect datasets, and our empirical results provide evidence that our approach is effective and robust.

[LG-227] AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Link: https://arxiv.org/abs/2407.00104
Authors: Iván Matas, Carmen Serrano, Francisca Silva, Amalia Serrano, Tomás Toledo-Pastrana, Begoña Acha
Keywords: optimizing resource utilization, BCC, provide interpretable support, BCC dermoscopic features, BCC dermoscopic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Comments: 8 pages, 4 figures, 4 tables, under review

Abstract:An AI tool has been developed to provide interpretable support for the diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing resource utilization. The interpretability is provided in two ways. First, the main BCC dermoscopic patterns are found in the image to justify the BCC/non-BCC classification. Second, based on the common visual XAI method Grad-CAM, a clinically inspired visual explanation is developed in which the relevant features for diagnosis are located. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnoses of four dermatologists using an Expectation Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy rate of 90%. For the clinically inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. As for the clinically inspired visual XAI results, the mean of the Grad-CAM normalized value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the model is able to accurately identify the regions of the BCC patterns. These results prove the ability of the AI tool to provide a useful explanation.
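
The inside/outside comparison of Grad-CAM values reported above can be computed along the following lines. This is a minimal sketch assuming the heatmap is already normalized to the range 0 to 1 and the clinical-feature segmentation is a binary mask of the same shape; the function name and the nested-list representation are illustrative choices, not the authors' implementation.

```python
def mean_inside_outside(heatmap, mask):
    """Return (mean heat inside the mask, mean heat outside the mask).

    heatmap: 2D list of normalized Grad-CAM values.
    mask:    2D list of 0/1 flags marking manually segmented features.
    """
    inside, outside = [], []
    for heat_row, mask_row in zip(heatmap, mask):
        for value, flag in zip(heat_row, mask_row):
            (inside if flag else outside).append(value)
    if not inside or not outside:
        raise ValueError("mask must contain both inside and outside pixels")
    return sum(inside) / len(inside), sum(outside) / len(outside)
```

A large gap between the two means, such as the 0.57 versus 0.16 quoted in the abstract, suggests that the heatmap concentrates on the annotated regions.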

[LG-228] Curriculum Learning with Quality-Driven Data Selection

Link: https://arxiv.org/abs/2407.00102
Authors: Biao Wu, Fang Meng, Ling Chen
Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, impressive multimodal capabilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The impressive multimodal capabilities demonstrated by OpenAI’s GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two attributes, mapping data quality into a two-dimensional space that allows for the selection of data based on their location within this distribution. By utilizing this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in five commonly assessed capabilities compared to using the complete dataset. Our code, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4
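
One way to picture the two-dimensional selection space is to bucket samples by their position relative to a correlation cut-off and a perplexity cut-off, then use the buckets as curriculum stages. The sketch below does exactly that; the cut-off values, the three-stage split, the rule that high correlation with low perplexity counts as the first stage, and all names are assumptions for illustration, since the abstract does not say how regions of the space map to stages.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    sample_id: str
    correlation: float  # image-text correlation score (assumed: higher is better)
    perplexity: float   # model perplexity on the text (assumed: lower is easier)

def assign_stage(sample, corr_cut, ppl_cut):
    """Map a sample's position in the 2D space to a hypothetical stage 0, 1 or 2."""
    well_aligned = sample.correlation >= corr_cut
    low_perplexity = sample.perplexity <= ppl_cut
    if well_aligned and low_perplexity:
        return 0  # both criteria met
    if well_aligned or low_perplexity:
        return 1  # exactly one criterion met
    return 2      # neither criterion met

def build_curriculum(samples, corr_cut, ppl_cut):
    """Group sample ids into stage subsets for multi-stage training."""
    stages = {0: [], 1: [], 2: []}
    for sample in samples:
        stages[assign_stage(sample, corr_cut, ppl_cut)].append(sample.sample_id)
    return stages
```

Training could then proceed stage by stage, or a stage could be dropped entirely if its samples are judged too noisy.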

[LG-229] Hybrid Approach to Parallel Stochastic Gradient Descent

Link: https://arxiv.org/abs/2407.00101
Authors: Aakash Sudhirbhai Vora,Dhrumil Chetankumar Joshi,Aksh Kantibhai Patel
Keywords: Stochastic Gradient Descent, Stochastic Gradient, Gradient Descent, large datasets, reduce the training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Stochastic Gradient Descent is used to train models on large datasets with reduced training time. On top of that, data parallelism is widely used as a method to efficiently train neural networks using multiple worker nodes in parallel. Most systems use either a synchronous or an asynchronous approach to data parallelism to train the model in parallel. However, both of them have their drawbacks. We propose a third approach to data parallelism which is a hybrid between synchronous and asynchronous approaches, using both approaches to train the neural network. When the threshold function is selected appropriately to gradually shift all parameter aggregation from asynchronous to synchronous, we show that in a given time period our hybrid approach outperforms both asynchronous and synchronous approaches.

[LG-230] Enhancing In-Context Learning via Implicit Demonstration Augmentation

Link: https://arxiv.org/abs/2407.00100
Authors: Xiaoling Zhou,Wei Ye,Yidong Wang,Chaoya Jiang,Zhemg Lee,Rui Xie,Shikun Zhang
Keywords: enables large pre-trained, pre-trained language models, large pre-trained language, ICL effectiveness heavily, in-context learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ACL 2024 Main 19 pages,10 figures

Abstract:The emergence of in-context learning (ICL) enables large pre-trained language models (PLMs) to make predictions for unseen inputs without updating parameters. Despite its potential, ICL’s effectiveness heavily relies on the quality, quantity, and permutation of demonstrations, commonly leading to suboptimal and unstable performance. In this paper, we tackle this challenge for the first time from the perspective of demonstration augmentation. Specifically, we start with enriching representations of demonstrations by leveraging their deep feature distribution. We then theoretically reveal that when the number of augmented copies approaches infinity, the augmentation is approximately equal to a novel logit calibration mechanism integrated with specific statistical properties. This insight results in a simple yet highly efficient method that significantly improves the average and worst-case accuracy across diverse PLMs and tasks. Moreover, our method effectively reduces performance variance among varying demonstrations, permutations, and templates, and displays the capability to address imbalanced class distributions.

[LG-231] Learning to Rank for Maps at Airbnb

Link: https://arxiv.org/abs/2407.00091
Authors: Malay Haldar,Hongwei Zhang,Kedar Bellare,Sherry Chen,Soumyadip Banerjee,Xiaotang Wang,Mustafa Abdool,Huiji Gao,Pavan Tapadia,Liwei He,Sanjeev Katariya
Keywords: two-sided marketplace, brings together hosts, rent with prospective, prospective guests, Airbnb brings
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:As a two-sided marketplace, Airbnb brings together hosts who own listings for rent with prospective guests from around the globe. Results from a guest’s search for listings are displayed primarily through two interfaces: (1) as a list of rectangular cards that contain on them the listing image, price, rating, and other details, referred to as list-results (2) as oval pins on a map showing the listing price, called map-results. Both these interfaces, since their inception, have used the same ranking algorithm that orders listings by their booking probabilities and selects the top listings for display. But some of the basic assumptions underlying ranking, built for a world where search results are presented as lists, simply break down for maps. This paper describes how we rebuilt ranking for maps by revising the mathematical foundations of how users interact with search results. Our iterative and experiment-driven approach led us through a path full of twists and turns, ending in a unified theory for the two interfaces. Our journey shows how assumptions taken for granted when designing machine learning algorithms may not apply equally across all user interfaces, and how they can be adapted. The net impact was one of the largest improvements in user experience for Airbnb which we discuss as a series of experimental validations.

[LG-232] ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

Link: https://arxiv.org/abs/2407.00087
Authors: Ju-Seung Byun,Jiyun Chun,Jihyung Kil,Andrew Perrault
Keywords: Large Multimodal Models, Large Multimodal, comprehending human instructions, demonstrate remarkable results, excel at comprehending
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning them with specific preferences. These methods primarily use ranking-based feedback for entire generations. With advanced AI models (Teacher), such as GPT-4 and Claude 3 Opus, we can request various types of detailed feedback that are expensive for humans to provide. We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT). This sentence-level feedback allows us to consider individual valuable segments, providing more granular rewards for the RL procedure. Second, we ask the Teacher to correct the wrong reasoning after the RL stage. The RL procedure requires massive efforts for hyperparameter tuning and often generates errors like repetitive words and incomplete sentences. With the correction feedback, we stabilize the RL fine-tuned model through SFT. We conduct experiments on the multi-modal datasets ScienceQA and A-OKVQA to demonstrate the effectiveness of our proposal. ARES rationale reasoning achieves around 70% win rate against baseline models judged by GPT-4o. Additionally, we observe that the improved rationale reasoning leads to a 2.5% increase in inference answer accuracy on average for the multi-modal datasets.

[LG-233] Compressing Search with Language Models

Link: https://arxiv.org/abs/2407.00085
Authors: Thomas Mulc,Jennifer L. Steele
Keywords: Millions of people, search data, Search, Google Search data, Google Search
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Millions of people turn to Google Search each day for information on things as diverse as new cars or flu symptoms. The terms that they enter contain valuable information on their daily intent and activities, but the information in these search terms has been difficult to fully leverage. User-defined categorical filters have been the most common way to shrink the dimensionality of search data to a tractable size for analysis and modeling. In this paper we present a new approach to reducing the dimensionality of search data while retaining much of the information in the individual terms without user-defined rules. Our contributions are two-fold: 1) we introduce SLaM Compression, a way to quantify search terms using pre-trained language models and create a representation of search data that has low dimensionality, is memory efficient, and effectively acts as a summary of search, and 2) we present CoSMo, a Constrained Search Model for estimating real world events using only search data. We demonstrate the efficacy of our contributions by estimating with high accuracy U.S. automobile sales and U.S. flu rates using only Google Search data.

[LG-234] Adapting Job Recommendations to User Preference Drift with Behavioral-Semantic Fusion Learning

Link: https://arxiv.org/abs/2407.00082
Authors: Xiao Han,Chen Zhu,Xiao Hu,Chuan Qin,Xiangyu Zhao,Hengshu Zhu
Keywords: Job recommender systems, aligning job opportunities, recommender systems, systems are crucial, crucial for aligning
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by KDD 24 Research Track

Abstract:Job recommender systems are crucial for aligning job opportunities with job-seekers in online job-seeking. However, users tend to adjust their job preferences to secure employment opportunities continually, which limits the performance of job recommendations. The inherent frequency of preference drift poses a challenge to promptly and precisely capture user preferences. To address this issue, we propose a novel session-based framework, BISTRO, to model user preference in a timely manner through fusion learning of semantic and behavioral information. Specifically, BISTRO is composed of three stages: 1) coarse-grained semantic clustering, 2) fine-grained job preference extraction, and 3) personalized top-k job recommendation. Initially, BISTRO segments the user interaction sequence into sessions and leverages session-based semantic clustering to achieve broad identification of person-job matching. Subsequently, we design a hypergraph wavelet learning method to capture the nuanced job preference drift. To mitigate the effect of noise in interactions caused by frequent preference drift, we innovatively propose an adaptive wavelet filtering technique to remove noisy interaction. Finally, a recurrent neural network is utilized to analyze session-based interaction for inferring personalized preferences. Extensive experiments on three real-world offline recruitment datasets demonstrate the significant performances of our framework. Significantly, BISTRO also excels in online experiments, affirming its effectiveness in live recruitment settings. This dual success underscores the robustness and adaptability of BISTRO. The source code is available at this https URL.

[LG-235] Semantic Revolution from Communications to Orchestration for 6G: Challenges Enablers and Research Directions

Link: https://arxiv.org/abs/2407.00081
Authors: Masoud Shokrnezhad,Hamidreza Mazandarani,Tarik Taleb,Jaeseung Song,Richard Li
Keywords: digital entities presents, context of emerging, interactions involving, involving a myriad, digital entities
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted at IEEE Network magazine special issue: Goal-oriented Semantic Communication and Networking

Abstract:In the context of emerging 6G services, the realization of everything-to-everything interactions involving a myriad of physical and digital entities presents a crucial challenge. This challenge is exacerbated by resource scarcity in communication infrastructures, necessitating innovative solutions for effective service implementation. Exploring the potential of Semantic Communications (SemCom) to enhance point-to-point physical layer efficiency shows great promise in addressing this challenge. However, achieving efficient SemCom requires overcoming the significant hurdle of knowledge sharing between semantic decoders and encoders, particularly in the dynamic and non-stationary environment with stringent end-to-end quality requirements. To bridge this gap in existing literature, this paper introduces the Knowledge Base Management And Orchestration (KB-MANO) framework. Rooted in the concepts of Computing-Network Convergence (CNC) and lifelong learning, KB-MANO is crafted for the allocation of network and computing resources dedicated to updating and redistributing KBs across the system. The primary objective is to minimize the impact of knowledge management activities on actual service provisioning. A proof-of-concept is proposed to showcase the integration of KB-MANO with resource allocation in radio access networks. Finally, the paper offers insights into future research directions, emphasizing the transformative potential of semantic-oriented communication systems in the realm of 6G technology.

[LG-236] Decentralized Task Offloading and Load-Balancing for Mobile Edge Computing in Dense Networks

Link: https://arxiv.org/abs/2407.00080
Authors: Mariam Yahya,Alexander Conzelmann,Setareh Maghsudi
Keywords: decentralized task offloading, edge servers, numerous devices, set of edge, decentralized task
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:We study the problem of decentralized task offloading and load-balancing in a dense network with numerous devices and a set of edge servers. Solving this problem optimally is complicated due to the unknown network information and random task sizes. The shared network resources also influence the users’ decisions and resource distribution. Our solution combines the mean field multi-agent multi-armed bandit (MAB) game with a load-balancing technique that adjusts the servers’ rewards to achieve a target population profile despite the distributed user decision-making. Numerical results demonstrate the efficacy of our approach and the convergence to the target load distribution.

[LG-237] Differentially Private Graph Diffusion with Applications in Personalized PageRanks

Link: https://arxiv.org/abs/2407.00077
Authors: Rongzhe Wei,Eli Chien,Pan Li
Keywords: iteratively propagates real-valued, propagates real-valued substances, iteratively propagates, propagates real-valued, real-valued substances
Categories: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Graph diffusion, which iteratively propagates real-valued substances among the graph, is used in numerous graph/network-involved applications. However, releasing diffusion vectors may reveal sensitive linking information in the data such as transaction information in financial network data. Moreover, protecting the privacy of graph data is challenging due to its interconnected nature. This work proposes a novel graph diffusion framework with edge-level differential privacy guarantees by using noisy diffusion iterates. The algorithm injects Laplace noise per diffusion iteration and adopts a degree-based thresholding function to mitigate the high sensitivity induced by low-degree nodes. Our privacy loss analysis is based on Privacy Amplification by Iteration (PABI), which, to the best of our knowledge, is the first effort that analyzes PABI with Laplace noise and provides relevant applications. We also introduce a novel Infinity-Wasserstein distance tracking method, which tightens the analysis of privacy leakage and makes PABI more applicable in practice. We evaluate this framework by applying it to Personalized PageRank computation for ranking tasks. Experiments on real-world network data demonstrate the superiority of our method under stringent privacy conditions.

[LG-238] Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Link: https://arxiv.org/abs/2407.00075
Authors: Anton Xue,Avishree Khare,Rajeev Alur,Surbhi Goel,Eric Wong
Keywords: subvert language models, propositional Horn logic, language models, models, large language models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:We study how to subvert language models from following the rules. We model rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if P and Q, then R" for some propositions P, Q, and R. We prove that although transformers can faithfully abide by such rules, maliciously crafted prompts can nevertheless mislead even theoretically constructed models. Empirically, we find that attacks on our theoretical models mirror popular attacks on large language models. Our work suggests that studying smaller theoretical models can help understand the behavior of large language models in rule-based settings like logical reasoning and jailbreak attacks.
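
As an illustration of the rule format mentioned in the abstract ("if P and Q, then R"), the following is a minimal sketch of standard forward chaining over propositional Horn rules. It is a textbook procedure, not the authors' transformer construction; the function name `forward_chain` and the example rules are assumptions made for this sketch.

```python
def forward_chain(facts, rules):
    """Return every proposition derivable from `facts` under Horn `rules`.

    Each rule is a pair (premises, conclusion), read as
    "if all premises hold, then the conclusion holds".
    """
    known = set(facts)
    changed = True
    while changed:  # repeat until no rule adds anything new
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rule set for illustration only.
rules = [
    (("P", "Q"), "R"),  # if P and Q, then R
    (("R",), "S"),      # if R, then S
    (("T",), "U"),      # does not fire below: T is never derived
]

derived = forward_chain({"P", "Q"}, rules)
print(sorted(derived))  # ['P', 'Q', 'R', 'S']
```

Under this reading, "following the rules" means producing exactly the derivable set; an attack in the abstract's sense would make a model output something outside it or omit something inside it.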

[LG-239] Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization

Link: https://arxiv.org/abs/2407.00071
Authors: Mert Esencan,Tarun Advaith Kumar,Ata Akbari Asanjan,P. Aaron Lott,Masoud Mohseni,Can Unlu,Davide Venturelli,Alan Ho
Keywords: Recent Large Language, Large Language Models, Recent Large, Language Models, Large Language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures

Abstract:Recent Large Language Models (LLMs) have demonstrated impressive capabilities at tasks that require human intelligence and are a significant step towards human-like artificial intelligence (AI). Yet the performance of LLMs at reasoning tasks has been subpar and the reasoning capability of LLMs is a matter of significant debate. While it has been shown that the choice of the prompting technique to the LLM can alter its performance on a multitude of tasks, including reasoning, the best performing techniques require human-made prompts with the knowledge of the tasks at hand. We introduce a framework for what we call Combinatorial Reasoning (CR), a fully-automated prompting method, where reasons are sampled from an LLM pipeline and mapped into a Quadratic Unconstrained Binary Optimization (QUBO) problem. The framework investigates whether QUBO solutions can be profitably used to select a useful subset of the reasons to construct a Chain-of-Thought style prompt. We explore the acceleration of CR with specialized solvers. We also investigate the performance of simpler zero-shot strategies such as linear majority rule or random selection of reasons. Our preliminary study indicates that coupling a combinatorial solver to generative AI pipelines is an interesting avenue for AI reasoning and elucidates design principles for future CR methods.
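
To make the QUBO formulation in the abstract concrete, here is a minimal sketch that selects a subset of candidate reasons by brute-force minimization of x^T Q x over binary vectors x. The abstract does not say how reasons are mapped to QUBO coefficients, so the matrix `Q` below is purely hypothetical (negative diagonal entries reward picking a reason, a positive off-diagonal entry penalizes picking two redundant reasons together), and exhaustive enumeration stands in for the specialized solvers the paper mentions.

```python
from itertools import product


def solve_qubo(Q):
    """Brute-force minimizer of x^T Q x over binary vectors x.

    Only practical for a handful of variables (2^n candidates).
    """
    n = len(Q)
    best_x, best_energy = None, float("inf")
    for x in product((0, 1), repeat=n):
        energy = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        if energy < best_energy:
            best_x, best_energy = x, energy
    return best_x, best_energy


# Hypothetical coefficients for three candidate reasons.
Q = [
    [-3.0, 3.0, 0.0],   # reason 0 is useful, but redundant with reason 1
    [0.0, -2.0, 0.0],   # reason 1 is moderately useful
    [0.0, 0.0, -1.0],   # reason 2 is mildly useful and independent
]

selection, energy = solve_qubo(Q)
print(selection, energy)  # (1, 0, 1) -4.0
```

Here the minimizer keeps reasons 0 and 2 and drops reason 1, because the redundancy penalty outweighs its individual benefit.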

[LG-240] Perceptron Collaborative Filtering

Link: https://arxiv.org/abs/2407.00067
Authors: Arya Chakraborty
Keywords: making automatic predictions, achieve similar results, implementing collaborative filtering, multivariate logistic regression, logistic regression classifiers
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

Abstract:While multivariate logistic regression classifiers are a great way of implementing collaborative filtering - a method of making automatic predictions about the interests of a user by collecting preferences or taste information from many other users, we can also achieve similar results using neural networks. A recommender system is a subclass of information filtering system that provides suggestions for items that are most pertinent to a particular user. A perceptron or a neural network is a machine learning model designed for fitting complex datasets using backpropagation and gradient descent. When coupled with advanced optimization techniques, the model may prove to be a great substitute for classical logistic classifiers. The optimizations include feature scaling, mean normalization, regularization, hyperparameter tuning and using stochastic/mini-batch gradient descent instead of regular gradient descent. In this use case, we will use the perceptron in the recommender system to fit the parameters i.e., the data from a multitude of users and use it to predict the preference/interest of a particular user.

[LG-241] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Link: https://arxiv.org/abs/2407.00066
Authors: Rickard Brüel-Gabrielsson,Jiacheng Zhu,Onkar Bhardwaj,Leshem Choshen,Kristjan Greenewald,Mikhail Yurochkin,Justin Solomon
Keywords: Fine-tuning large language, large language models, yielding numerous copies, Fine-tuning large, LLM differing
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Fine-tuning large language models (LLMs) with low-rank adapters (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRA adapters. We consider compressing adapters individually via SVD and propose a method for joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. Our experiments with up to 500 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 75% of the throughput of serving a single LoRA.

[LG-242] An Interpretable Alternative to Neural Representation Learning for Rating Prediction – Transparent Latent Class Modeling of User Reviews

Link: https://arxiv.org/abs/2407.00063
Authors: Giuseppe Serra,Peter Tino,Zhao Xu,Xin Yao
Keywords: including recommender systems, including recommender, recommender systems, widely adopted, Nowadays
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Nowadays, neural network (NN) and deep learning (DL) techniques are widely adopted in many applications, including recommender systems. Given the sparse and stochastic nature of collaborative filtering (CF) data, recent works have critically analyzed the effective improvement of neural-based approaches compared to simpler and often transparent algorithms for recommendation. Previous results showed that NN and DL models can be outperformed by traditional algorithms in many tasks. Moreover, given the largely black-box nature of neural-based methods, interpretable results are not naturally obtained. Following on this debate, we first present a transparent probabilistic model that topologically organizes user and product latent classes based on the review information. In contrast to popular neural techniques for representation learning, we readily obtain a statistical, visualization-friendly tool that can be easily inspected to understand user and product characteristics from a textual-based perspective. Then, given the limitations of common embedding techniques, we investigate the possibility of using the estimated interpretable quantities as model input for a rating prediction task. To contribute to the recent debates, we evaluate our results in terms of both capacity for interpretability and predictive performances in comparison with popular text-based neural approaches. The results demonstrate that the proposed latent class representations can yield competitive predictive performances, compared to popular, but difficult-to-interpret approaches.

[LG-243] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving

Link: https://arxiv.org/abs/2407.00047
Authors: Archit Patke,Dhemath Reddy,Saurabh Jha,Haoran Qiu,Christian Pinto,Shengkun Cui,Chandra Narayanaswami,Zbigniew Kalbarczyk,Ravishankar Iyer
Keywords: increasingly important workload, cloud providers catering, LLM serving, Large language models, Large language
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.

[LG-244] Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Link: https://arxiv.org/abs/2407.00023
Authors: Vikranth Srivatsa,Zijian He,Reyna Abhyankar,Dongming Li,Yiying Zhang
Keywords: simple user questions, large language models, user questions, large language, evolved beyond simple
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today’s practices include domain-specific instructions, illustration of tool usages, and long context, such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests, and their attention computation results can be reused. However, today’s LLM serving systems treat every request in isolation, missing the opportunity of computation reuse. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We perform a study on five popular LLM workloads. Based on our study results, we designed a distributed scheduling system that co-optimizes computation reuse and load balancing. Our evaluation of Preble on two to 8 GPUs with real workloads and request arrival patterns on two open-source LLM models shows that Preble outperforms the state-of-the-art average latency by 1.5X to 14.5X and p99 by 2X to 10X.

[LG-245] Visual Language Model based Cross-modal Semantic Communication Systems

Link: https://arxiv.org/abs/2407.00020
Authors: Feibo Jiang,Chuanguo Tang,Li Dong,Kezhi Wang,Kun Yang,Cunhua Pan
Keywords: Shannon physical capacity, physical capacity limits, Cross-modal Semantic Communication, transcending the Shannon, Shannon physical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures

Abstract:Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

[LG-246] Enhancing Computational Efficiency in Multiscale Systems Using Deep Learning of Coordinates and Flow Maps

Link: https://arxiv.org/abs/2407.00011
Authors: Asif Hamid,Danish Rafiq,Shahkar Ahmad Nahvi,Mohammad Abid Bazaz
Keywords: show macroscopic coherent, macroscopic coherent behavior, coherent behavior due, Complex systems, agents like molecules
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Chaotic Dynamics (nlin.CD)
Comments:

Abstract:Complex systems often show macroscopic coherent behavior due to the interactions of microscopic agents like molecules, cells, or individuals in a population with their environment. However, simulating such systems poses several computational challenges, as the underlying dynamics vary and span wide spatiotemporal scales of interest. To capture the fast-evolving features, finer time steps are required while ensuring that the simulation time is long enough to capture the slow-scale behavior, making the analyses computationally unmanageable. This paper showcases how deep learning techniques can be used to develop a precise time-stepping approach for multiscale systems using the joint discovery of coordinates and flow maps. While the former allows us to represent the multiscale dynamics on a representative basis, the latter enables the iterative time-stepping estimation of the reduced variables. The resulting framework achieves state-of-the-art predictive accuracy while incurring lower computational costs. We demonstrate this ability of the proposed scheme on the large-scale FitzHugh-Nagumo neuron model and the 1D Kuramoto-Sivashinsky equation in the chaotic regime.

[LG-247] Graph Neural Networks and Reinforcement Learning for Proactive Application Image Placement

Link: https://arxiv.org/abs/2407.00007
Authors: Antonios Makris,Theodoros Theodoropoulos,Evangelos Psomakelis,Emanuele Carlini,Matteo Mordacchini,Patrizio Dazzi,Konstantinos Tserpes
Keywords: Edge computing, shift from Cloud, Cloud-Edge continuum presents, Cloud Computing, Cloud-Edge continuum
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:The shift from Cloud Computing to a Cloud-Edge continuum presents new opportunities and challenges for data-intensive and interactive applications. Edge computing has garnered a lot of attention from both industry and academia in recent years, emerging as a key enabler for meeting the increasingly strict demands of Next Generation applications. In Edge computing the computations are placed closer to the end-users, to facilitate low-latency and high-bandwidth applications and services. However, the distributed, dynamic, and heterogeneous nature of Edge computing presents a significant challenge for service placement. A critical aspect of Edge computing involves managing the placement of applications within the network system to minimize each application’s runtime, considering the resources available on system devices and the capabilities of the system’s network. The placement of application images must be proactively planned to minimize image transfer time and meet the strict demands of the applications. In this regard, this paper proposes an approach for proactive image placement that combines Graph Neural Networks and actor-critic Reinforcement Learning, which is evaluated empirically and compared against various solutions. The findings indicate that although the proposed approach may result in longer execution times in certain scenarios, it consistently achieves superior outcomes in terms of application placement.

[LG-248] Centerline Boundary Dice Loss for Vascular Segmentation

Link: https://arxiv.org/abs/2407.01517
Authors: Pengcheng Shi,Jiesi Hu,Yanwu Yang,Zilve Gao,Wei Liu,Ting Ma
Keywords: medical imaging plays, functional assessments, medical imaging, imaging plays, plays a crucial
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: accepted by MICCAI 2024

Abstract:Vascular segmentation in medical imaging plays a crucial role in analysing morphological and functional assessments. Traditional methods, like the centerline Dice (clDice) loss, ensure topology preservation but falter in capturing geometric details, especially under translation and deformation. The combination of clDice with traditional Dice loss can lead to diameter imbalance, favoring larger vessels. Addressing these challenges, we introduce the centerline boundary Dice (cbDice) loss function, which harmonizes topological integrity and geometric nuances, ensuring consistent segmentation across various vessel sizes. cbDice enriches the clDice approach by including boundary-aware aspects, thereby improving geometric detail recognition. It matches the performance of the boundary difference over union (B-DoU) loss through a mask-distance-based approach, enhancing translation sensitivity. Crucially, cbDice incorporates radius information from vascular skeletons, enabling uniform adaptation to vascular diameter changes and maintaining balance in branch growth and fracture impacts. Furthermore, we conducted a theoretical analysis of clDice variants (cl-X-Dice). We validated cbDice's efficacy on three diverse vascular segmentation datasets, encompassing both 2D and 3D, and binary and multi-class segmentation. Particularly, the method integrated with cbDice demonstrated outstanding performance on the MICCAI 2023 TopCoW Challenge dataset. Our code is made publicly available at: this https URL.
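
For orientation, the clDice baseline that the abstract builds on can be sketched as below. This is a minimal NumPy sketch of the commonly cited clDice definition (the harmonic mean of topology precision and topology sensitivity), not the authors' cbDice implementation. The function name `cl_dice` is hypothetical, and the skeletons are assumed to be precomputed binary masks.

```python
import numpy as np

def cl_dice(pred_mask, true_mask, pred_skel, true_skel, eps=1e-8):
    """clDice from binary masks and their precomputed skeletons (assumed inputs)."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    pred_skel = np.asarray(pred_skel, dtype=bool)
    true_skel = np.asarray(true_skel, dtype=bool)
    # Topology precision: share of the predicted skeleton lying inside the true mask.
    tprec = (pred_skel & true_mask).sum() / (pred_skel.sum() + eps)
    # Topology sensitivity: share of the true skeleton lying inside the predicted mask.
    tsens = (true_skel & pred_mask).sum() / (true_skel.sum() + eps)
    # Harmonic mean of the two terms.
    return 2.0 * tprec * tsens / (tprec + tsens + eps)
```

Because both terms only count skeleton pixels, a thick and a thin vessel contribute equally per unit length, which is one way to read the diameter-related concerns the abstract raises.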

[LG-249] Neurovascular Segmentation in sOCT with Deep Learning and Synthetic Training Data

Link: https://arxiv.org/abs/2407.01419
Authors: Etienne Chollet,Yaël Balbastre,Chiara Mauri,Caroline Magnain,Bruce Fischl,Hui Wang
Keywords: Microvascular anatomy, neurological disorders, Microvascular, comprehensive three-dimensional vascular, three-dimensional vascular network
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 10 pages, 10 figures

Abstract:Microvascular anatomy is known to be involved in various neurological disorders. However, understanding these disorders is hindered by the lack of imaging modalities capable of capturing the comprehensive three-dimensional vascular network structure at microscopic resolution. With a lateral resolution of ≤ 20 μm and the ability to reconstruct large tissue blocks up to tens of cubic centimeters, serial-section optical coherence tomography (sOCT) is well suited for this task. This method uses intrinsic optical properties to visualize the vessels and therefore does not possess a specific contrast, which complicates the extraction of accurate vascular models. The performance of traditional vessel segmentation methods is heavily degraded in the presence of substantial noise and imaging artifacts and is sensitive to domain shifts, while convolutional neural networks (CNNs) require extensive labeled data and are also sensitive to the precise intensity characteristics of the data that they are trained on. Building on the emerging field of synthesis-based training, this study demonstrates a synthesis engine for neurovascular segmentation in sOCT images. Characterized by minimal priors and high variance sampling, our highly generalizable method, tested on five distinct sOCT acquisitions, eliminates the need for manual annotations while attaining human-level precision. Our approach comprises two phases: label synthesis and label-to-image transformation. We demonstrate the efficacy of the former by comparing it to several more realistic sets of training labels, and the latter by an ablation study of synthetic noise and artifact models.

[LG-250] Deep Dive into MRI: Exploring Deep Learning Applications in 0.55T and 7T MRI

Link: https://arxiv.org/abs/2407.01318
Authors: Ana Carolina Alves,André Ferreira,Behrus Puladi,Jan Egger,Victor Alves
Keywords: magnetic resonance imaging, involving ionising radiation, ionising radiation exposure, techniques involving ionising, providing a safe
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:The development of magnetic resonance imaging (MRI) for medical imaging has provided a leap forward in diagnosis, providing a safe, non-invasive alternative to techniques involving ionising radiation exposure for diagnostic purposes. It was described by Bloch and Purcell in 1946, and it was not until 1980 that the first clinical application of MRI became available. Since then, MRI has undergone many advances and has altered the way diagnostic procedures are performed. Due to its ability to improve constantly, MRI has become a commonly used practice among several specialisations in medicine. In particular, 0.55T and 7T MRI technologies have demonstrated enhanced preservation of image detail and advanced tissue characterisation. This review examines the integration of deep learning (DL) techniques into these MRI modalities, surveying and exploring the studied applications. It highlights how DL contributes to 0.55T and 7T MRI data, showcasing the potential of DL in improving and refining these technologies. The review ends with a brief overview of how MRI technology will evolve in the coming years.

[LG-251] On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs)

Link: https://arxiv.org/abs/2407.01079
Authors: Jerry Yao-Chieh Hu,Weimin Wu,Zhuoru Li,Zhao Song,Han Liu
Keywords: latent DiTs, linear latent space, time latent DiTs, latent space assumption
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs, which is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiTs inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiTs training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiTs training by casting the DiTs gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of initial data.

[LG-252] Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

Link: https://arxiv.org/abs/2407.01036
Authors: Pallavi Basu,Ron Berman
Keywords: testers conducting large-scale, false discovery rate, tests prioritize lifts, testers conducting, conducting large-scale tests
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Notes:

Abstract:A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.
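
To make the knapsack view concrete, here is a minimal sketch under stated assumptions. It treats the FDR constraint as a budget: rejecting test i costs `lfdr[i] - alpha`, so tests with lfdr below alpha free up capacity, and the remaining tests are added greedily by expected lift per unit of budget. The function name `greedy_lift_rejections`, the inputs, and the exact ratio are illustrative assumptions, not the authors' oracle rule or data-driven procedure.

```python
import numpy as np

def greedy_lift_rejections(expected_lift, lfdr, alpha=0.1):
    """Greedy knapsack sketch: keep the mean lfdr of rejected tests at or below alpha."""
    expected_lift = np.asarray(expected_lift, dtype=float)
    lfdr = np.asarray(lfdr, dtype=float)
    weight = lfdr - alpha  # budget consumed by rejecting each test
    rejected, capacity = [], 0.0
    # Tests with non-positive weight never hurt the constraint; take them first.
    for i in np.where((weight <= 0) & (expected_lift > 0))[0]:
        rejected.append(int(i))
        capacity -= weight[i]
    # Rank the remaining tests by expected lift per unit of budget.
    costly = np.where((weight > 0) & (expected_lift > 0))[0]
    order = costly[np.argsort(-(expected_lift[costly] / weight[costly]))]
    for i in order:
        if weight[i] > capacity:
            break  # stop at the first test that would exceed the budget
        rejected.append(int(i))
        capacity -= weight[i]
    return sorted(rejected)
```

By construction, the average lfdr over the returned set stays at or below `alpha`, which is the usual plug-in way of controlling FDR with lfdr statistics.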

[LG-253] Bayesian Entropy Neural Networks for Physics-Aware Prediction

Link: https://arxiv.org/abs/2407.01015
Authors: Rahul Rathnakumar,Jiayu Huang,Hao Yan,Yongming Liu
Keywords: non-data sample information, scenarios requiring flexible, incorporate non-data sample, Bayesian Entropy Neural, requiring flexible model
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes: 15 pages

Abstract:This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy (MaxEnt) principles, designed to impose constraints on Bayesian Neural Network (BNN) predictions. BENN is capable of constraining not only the predicted values but also their derivatives and variances, ensuring a more robust and reliable model output. To achieve simultaneous uncertainty quantification and constraint satisfaction, we employ the method of multipliers approach. This allows for the concurrent estimation of neural network parameters and the Lagrangian multipliers associated with the constraints. Our experiments, spanning diverse applications such as beam deflection modeling and microstructure generation, demonstrate the effectiveness of BENN. The results highlight significant improvements over traditional BNNs and showcase competitive performance relative to contemporary constrained deep learning methods.
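
The method of multipliers mentioned above alternates between updating the model parameters and updating the Lagrange multipliers. The sketch below shows that loop on a toy problem (minimize a squared distance to `target` subject to one linear equality constraint); it is not BENN itself, and the function name, step size, and penalty weight `rho` are assumptions chosen so this particular toy problem converges.

```python
import numpy as np

def method_of_multipliers(target, a, b, rho=1.0, outer_steps=30,
                          inner_steps=200, lr=0.1):
    """Minimize ||w - target||^2 subject to a.w = b (toy stand-in for a constrained network)."""
    target = np.asarray(target, dtype=float)
    a = np.asarray(a, dtype=float)
    w = np.zeros_like(target)
    lam = 0.0  # Lagrange multiplier for the single constraint
    for _ in range(outer_steps):
        # Primal step: gradient descent on the augmented Lagrangian in w.
        for _ in range(inner_steps):
            c = a @ w - b  # constraint violation
            grad = 2.0 * (w - target) + (lam + rho * c) * a
            w = w - lr * grad
        # Dual step: move the multiplier along the remaining violation.
        lam += rho * (a @ w - b)
    return w, lam
```

For `target = [2, 0]`, `a = [1, 1]`, `b = 1`, the iterates approach the projection of the target onto the constraint line, `[1.5, -0.5]`.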

[LG-254] Macroeconomic Forecasting with Large Language Models

Link: https://arxiv.org/abs/2407.00890
Authors: Andrea Carriero,Davide Pettenuzzo,Shubhranshu Shekhar
Keywords: Large Language Models, Language Models, Large Language, comparative analysis evaluating, accuracy of Large
Categories: Econometrics (econ.EM); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Abstract:This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. In recent times, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an area of interest. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using as common ground the FRED-MD database. Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios.

[LG-255] A Unified Approach to Extract Interpretable Rules from Tree Ensembles via Integer Programming

Link: https://arxiv.org/abs/2407.00843
Authors: Lorenzo Bonasera,Emilio Carrizosa
Keywords: popular machine learning, machine learning model, Tree ensemble, represent a popular, popular machine
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Notes:

Abstract:Tree ensemble methods represent a popular machine learning model, known for their effectiveness in supervised classification and regression tasks. Their performance derives from aggregating predictions of multiple decision trees, which are renowned for their interpretability properties. However, tree ensemble methods do not reliably exhibit interpretable output. Our work aims to extract an optimized list of rules from a trained tree ensemble, providing the user with a condensed, interpretable model that retains most of the predictive power of the full model. Our approach consists of solving a clean and neat set partitioning problem formulated through Integer Programming. The proposed method works with either tabular or time series data, for both classification and regression tasks, and does not require parameter tuning under the most common setting. Through rigorous computational experiments, we offer statistically significant evidence that our method is competitive with other rule extraction methods and effectively handles time series.

[LG-256] A data-driven approach to modeling brain activity using differential equations

Link: https://arxiv.org/abs/2407.00824
Authors: Kuratov Andrey(1) ((1) HSE University, Moscow)
Keywords: complete solutions, research focuses, innovative task, traditional methods, extracting equations
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:This research focuses on an innovative task of extracting equations from incomplete data, moving away from traditional methods used for complete solutions. The study addresses the challenge of extracting equations from data, particularly in the study of brain activity using electrophysiological data, which is often limited by insufficient information. The study provides a brief review of existing open-source equation derivation approaches in the context of modeling brain activity. The section below introduces a novel algorithm that employs incomplete data and prior domain knowledge to recover differential equations. The algorithm’s practicality in real-world scenarios is demonstrated through its application on both synthetic and real datasets.

[LG-257] Advantages of quantum support vector machine in cross-domain classification of quantum states

Link: https://arxiv.org/abs/2407.00774
Authors: Diksha Sharma,Vivek Balasaheb Sabale,Parvinder Singh,Atul Kumar
Keywords: versus separability paradigm, entanglement versus separability, quantum machine learning, separability paradigm, machine learning
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Notes:

Abstract:In this study, we use cross-domain classification with quantum machine learning, seeking quantum advantages, to address the entanglement versus separability paradigm. We further demonstrate the efficient classification of Bell diagonal states into zero and non-zero discord classes. The inherited structure of quantum states and its relation with a particular class of quantum states are exploited to intuitively approach the classification of different domain testing states, referred to here as cross-domain classification. In addition, we extend our analysis to evaluate the robustness of our model for the analyzed problem using random unitary transformations. Using numerical analysis, our results clearly demonstrate the potential of the quantum support vector machine (QSVM) for classifying quantum states across the multidimensional Hilbert space.

[LG-258] Analysis of Modern Computer Vision Models for Blood Cell Classification

Link: https://arxiv.org/abs/2407.00759
Authors: Alexander Kim(1),Ryan Kim(2) ((1) University of Illinois Urbana-Champaign, (2) William Fremd High School)
Keywords: white blood cells, related blood components, white blood, blood cells, related blood
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 18 pages, 8 figures

Abstract:The accurate classification of white blood cells and related blood components is crucial for medical diagnoses. While traditional manual examinations and automated hematology analyzers have been widely used, they are often slow and prone to errors. Recent advancements in deep learning have shown promise for addressing these limitations. Earlier studies have demonstrated the viability of convolutional neural networks such as DenseNet, ResNet, and VGGNet for this task. Building on these foundations, our work employs more recent and efficient models to achieve rapid and accurate results. Specifically, this study used state-of-the-art architectures, including MaxVit, EfficientVit, EfficientNet, EfficientNetV2, and MobileNetV3. This study aimed to evaluate the performance of these models in WBC classification, potentially offering a more efficient and reliable alternative to current methods. Our approach not only addresses the speed and accuracy concerns of traditional techniques but also explores the applicability of innovative deep learning models in hematological analysis.

[LG-259] Quantum Circuit Synthesis and Compilation Optimization: Overview and Prospects

Link: https://arxiv.org/abs/2407.00736
Authors: Yan Ge,Wu Wenjie,Chen Yuheng,Pan Kaisen,Lu Xudong,Zhou Zixiang,Wang Yuhan,Wang Ruocheng,Yan Junchi
Keywords: current computational power, computational power bottlenecks, Quantum, post-Moore era, computing is regarded
Categories: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Notes: 32 pages, 3 figures, 3 tables

Abstract:Quantum computing is regarded as a promising paradigm that may overcome the current computational power bottlenecks in the post-Moore era. The increasing maturity of quantum processors, especially superconducting ones, provides more possibilities for the development and implementation of quantum algorithms. As the crucial stages for quantum algorithm implementation, the logic circuit design and quantum compiling have also received significant attention, which covers key technologies such as quantum logic circuit synthesis (also widely known as quantum architecture search) and optimization, as well as qubit mapping and routing. Recent studies suggest that the scale and precision of related algorithms are steadily increasing, especially with the integration of artificial intelligence methods. In this survey, we systematically review and summarize a vast body of literature, exploring the feasibility of an integrated design and optimization scheme that spans from the algorithmic level to quantum hardware, combining the steps of logic circuit design and compilation optimization. Leveraging the exceptional cognitive and learning capabilities of AI algorithms, one can reduce manual design costs, enhance the precision and efficiency of execution, and facilitate the implementation and validation of the superiority of quantum algorithms on hardware.

[LG-260] Generative prediction of flow field based on the diffusion model

Link: https://arxiv.org/abs/2407.00735
Authors: Jiajun Hu,Zhen Lu,Yue Yang
Keywords: diffusion model, model, utilizes the input, shape to predict, CNN model
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Notes:

Abstract:We propose a geometry-to-flow diffusion model that utilizes the input of obstacle shape to predict a flow field past the obstacle. The model is based on a learnable Markov transition kernel to recover the data distribution from the Gaussian distribution. The Markov process is conditioned on the obstacle geometry, estimating the noise to be removed at each step, implemented via a U-Net. A cross-attention mechanism incorporates the geometry as a prompt. We train the geometry-to-flow diffusion model using a dataset of flows past simple obstacles, including the circle, ellipse, rectangle, and triangle. For comparison, the CNN model is trained using the same dataset. Tests are carried out on flows past obstacles with simple and complex geometries, representing interpolation and extrapolation on the geometry condition, respectively. In the test set, challenging scenarios include a cross and the characters 'PKU'. Generated flow fields show that the geometry-to-flow diffusion model is superior to the CNN model in predicting instantaneous flow fields and handling complex geometries. Quantitative analysis of the model accuracy and divergence in the fields demonstrates the high robustness of the diffusion model, indicating that the diffusion model learns physical laws implicitly.
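
The noising and noise-estimation steps described above can be sketched as follows. This assumes a standard DDPM-style parameterization with a linear variance schedule; the abstract does not state the schedule, and the geometry conditioning, U-Net, and cross-attention are omitted. All function names and default values are hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: sample x_t given a clean field x0 at step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def estimate_x0(xt, t, alpha_bar, eps_pred):
    """Invert the forward formula using a noise estimate (a trained network in practice)."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
```

In training, a network would be fitted to predict `eps` from `xt`, `t`, and the geometry prompt; `estimate_x0` shows why an accurate noise estimate is enough to recover the clean field.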

[LG-261] D-CDLF: Decomposition of Common and Distinctive Latent Factors for Multi-view High-dimensional Data

Link: https://arxiv.org/abs/2407.00730
Authors: Hai Shu
Keywords: distinctive latent factors, common latent factors, common-source matrix generated, distinctive-source matrix generated, latent factors
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

Abstract:A typical approach to the joint analysis of multiple high-dimensional data views is to decompose each view’s data matrix into three parts: a low-rank common-source matrix generated by common latent factors of all data views, a low-rank distinctive-source matrix generated by distinctive latent factors of the corresponding data view, and an additive noise matrix. Existing decomposition methods often focus on the uncorrelatedness between the common latent factors and distinctive latent factors, but inadequately address the equally necessary uncorrelatedness between distinctive latent factors from different data views. We propose a novel decomposition method, called Decomposition of Common and Distinctive Latent Factors (D-CDLF), to effectively achieve both types of uncorrelatedness for two-view data. We also discuss the estimation of the D-CDLF under high-dimensional settings.

[LG-262] Particle Semi-Implicit Variational Inference

Link: https://arxiv.org/abs/2407.00649
Authors: Jen Ning Lim,Adam M. Johansen
Keywords: Semi-implicit variational inference, Semi-implicit variational, Particle Variational Inference, variational inference, enriches the expressiveness
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

Abstract:Semi-implicit variational inference (SIVI) enriches the expressiveness of variational families by utilizing a kernel and a mixing distribution to hierarchically define the variational distribution. Existing SIVI methods parameterize the mixing distribution using implicit distributions, leading to intractable variational densities. As a result, directly maximizing the evidence lower bound (ELBO) is not possible and so, they resort to either: optimizing bounds on the ELBO, employing costly inner-loop Markov chain Monte Carlo runs, or solving minimax objectives. In this paper, we propose a novel method for SIVI called Particle Variational Inference (PVI) which employs empirical measures to approximate the optimal mixing distributions characterized as the minimizer of a natural free energy functional via a particle approximation of an Euclidean–Wasserstein gradient flow. This approach means that, unlike prior works, PVI can directly optimize the ELBO; furthermore, it makes no parametric assumption about the mixing distribution. Our empirical results demonstrate that PVI performs favourably against other SIVI methods across various tasks. Moreover, we provide a theoretical analysis of the behaviour of the gradient flow of a related free energy functional: establishing the existence and uniqueness of solutions as well as propagation of chaos results.
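
One way to see why an empirical (particle) mixing distribution keeps the variational density tractable: with N particles, the hierarchical density collapses to a finite mixture of kernels that can be evaluated directly. The sketch below assumes an isotropic Gaussian kernel centred at each particle with a fixed scale; that kernel choice and the function name are illustrative assumptions, not the parameterization used in the paper.

```python
import numpy as np

def log_mixture_density(x, particles, scale=1.0):
    """log q(x) for q(x) = (1/N) * sum_i Normal(x; particles[i], scale^2 I)."""
    x = np.asarray(x, dtype=float)                   # shape (d,)
    particles = np.asarray(particles, dtype=float)   # shape (N, d)
    d = x.shape[0]
    diff = (x[None, :] - particles) / scale
    log_kernel = (-0.5 * np.sum(diff ** 2, axis=1)
                  - d * np.log(scale) - 0.5 * d * np.log(2.0 * np.pi))
    # Log-mean-exp over particles, shifted by the maximum for numerical stability.
    m = log_kernel.max()
    return m + np.log(np.mean(np.exp(log_kernel - m)))
```

Since this log-density is an explicit function of the particles, an ELBO estimate built on it can be differentiated with respect to the particle locations.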

[LG-263] Clusterpath Gaussian Graphical Modeling

Link: https://arxiv.org/abs/2407.00644
Authors: D. J. W. Touw,A. Alfons,P. J. F. Groenen,I. Wilms
Keywords: visualizing conditional dependencies, Graphical models serve, Gaussian Graphical Model, serve as effective, effective tools
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes: 43 pages, 11 figures

Abstract:Graphical models serve as effective tools for visualizing conditional dependencies between variables. However, as the number of variables grows, interpretation becomes increasingly difficult, and estimation uncertainty increases due to the large number of parameters relative to the number of observations. To address these challenges, we introduce the Clusterpath estimator of the Gaussian Graphical Model (CGGM) that encourages variable clustering in the graphical model in a data-driven way. Through the use of a clusterpath penalty, we group variables together, which in turn results in a block-structured precision matrix whose block structure remains preserved in the covariance matrix. We present a computationally efficient implementation of the CGGM estimator by using a cyclic block coordinate descent algorithm. In simulations, we show that CGGM not only matches, but oftentimes outperforms other state-of-the-art methods for variable clustering in graphical models. We also demonstrate CGGM’s practical advantages and versatility on a diverse collection of empirical applications.

[LG-264] Accelerating Longitudinal MRI using Prior Informed Latent Diffusion

Link: https://arxiv.org/abs/2407.00537
Authors: Yonatan Urman,Zachary Shah,Ashwin Kumar,Bruno P. Soares,Kawin Setsompop
Keywords: soft-tissue imaging modality, ionization-free soft-tissue imaging, imaging modality, widely used ionization-free, ionization-free soft-tissue
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:MRI is a widely used ionization-free soft-tissue imaging modality, often employed repeatedly over a patient’s lifetime. However, prolonged scanning durations, among other issues, can limit availability and accessibility. In this work, we aim to substantially reduce scan times by leveraging prior scans of the same patient. These prior scans typically contain considerable shared information with the current scan, thereby enabling higher acceleration rates when appropriately utilized. We propose a prior informed reconstruction method with a trained diffusion model in conjunction with data-consistency steps. Our method can be trained with unlabeled image data, eliminating the need for a dataset of either k-space measurements or paired longitudinal scans as is required of other learning-based methods. We demonstrate superiority of our method over previously suggested approaches in effectively utilizing prior information without over-biasing prior consistency, which we validate on both an open-source dataset of healthy patients as well as several longitudinal cases of clinical interest.

[LG-265] Weighted mesh algorithms for general Markov decision processes: Convergence and tractability

Link: https://arxiv.org/abs/2407.00388
Authors: Denis Belomestny,John Schoenmakers
Keywords: Markov Decision Processes, finite-horizon Markov Decision, Decision Processes, Markov Decision, finite-horizon Markov
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Notes:

Abstract:We introduce a mesh-type approach for tackling discrete-time, finite-horizon Markov Decision Processes (MDPs) characterized by state and action spaces that are general, encompassing both finite and infinite (yet suitably regular) subsets of Euclidean space. In particular, for bounded state and action spaces, our algorithm achieves a computational complexity that is tractable in the sense of Novak and Wozniakowski, and is polynomial in the time horizon. For unbounded state space the algorithm is "semi-tractable" in the sense that the complexity is proportional to $\epsilon^{-c}$ with some dimension-independent $c \geq 2$, for achieving an accuracy $\epsilon$, and polynomial in the time horizon with degree linear in the underlying dimension. As such the proposed approach has some flavor of the randomization method by Rust which deals with infinite horizon MDPs and uniform sampling in compact state space. However, the present approach is essentially different due to the finite horizon and a simulation procedure due to general transition distributions, and more general in the sense that it encompasses unbounded state space. To demonstrate the effectiveness of our algorithm, we provide illustrations based on Linear-Quadratic Gaussian (LQG) control problems.
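
The finite-horizon recursion that any such scheme has to approximate can be written down exactly when the state and action sets are finite. The sketch below is plain backward induction on a small tabular MDP, offered only as a reference point; it does not reproduce the weighted mesh construction or the sampling procedure of the paper, and the function name and array layout are assumptions.

```python
import numpy as np

def backward_induction(P, R, horizon):
    """Finite-horizon dynamic programming on a tabular MDP.

    P: array of shape (A, S, S), P[a, s, j] = probability of moving s -> j under action a.
    R: array of shape (S, A), immediate reward for taking action a in state s.
    Returns the value function at time 0 and one greedy policy per time step.
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    V = np.zeros(R.shape[0])  # terminal value is zero
    policy = []
    for _ in range(horizon):
        # Q[s, a] = R[s, a] + expected value of the next state.
        Q = R + np.einsum("asj,j->sa", P, V)
        policy.append(Q.argmax(axis=1))
        V = Q.max(axis=1)
    policy.reverse()  # policy[0] now corresponds to the first decision time
    return V, policy
```

On a continuous state space, the sum over next states becomes an integral, which is where a mesh of sampled states and suitable weights would enter.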

[LG-266] Machine Learning Models for Dengue Forecasting in Singapore

Link: https://arxiv.org/abs/2407.00332
Authors: Zi Iun Lai,Wai Kit Fung,Enquan Chew
Keywords: traditionally endemic regions, endemic regions, fastest growing, emerging prevalence, prevalence beyond traditionally
Categories: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG)
Notes: 12 pages, 6 figures

Abstract:With emerging prevalence beyond traditionally endemic regions, the global burden of dengue disease is forecasted to be one of the fastest growing. With limited direct treatment or vaccination currently available, prevention through vector control is widely believed to be the most effective form of managing outbreaks. This study examines traditional state space models (moving average, autoregressive, ARIMA, SARIMA), supervised learning techniques (XGBoost, SVM, KNN) and deep networks (LSTM, CNN, ConvLSTM) for forecasting weekly dengue cases in Singapore. Meteorological data and search engine trends were included as features for ML techniques. Forecasts using CNNs yielded the lowest RMSE in weekly cases in 2019.
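
As a rough illustration of the simplest baseline family listed above and of the RMSE metric used for comparison, the sketch below fits an autoregressive model by ordinary least squares and scores forecasts. It is a generic sketch with hypothetical function names, not the study's pipeline, and it uses no dengue, weather, or search-trend data.

```python
import numpy as np

def fit_ar(series, order=1):
    """Least-squares AR(order) fit; returns [intercept, phi_1, ..., phi_order]."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Column k holds the lag-(k+1) values aligned with the targets y[order:].
    lags = np.column_stack([y[order - k - 1 : n - k - 1] for k in range(order)])
    X = np.column_stack([np.ones(n - order), lags])
    coef, *_ = np.linalg.lstsq(X, y[order:], rcond=None)
    return coef

def predict_next(series, coef):
    """One-step-ahead forecast from the most recent observations."""
    y = np.asarray(series, dtype=float)
    order = len(coef) - 1
    recent = y[-1 : -order - 1 : -1]  # y[t], y[t-1], ...
    return coef[0] + coef[1:] @ recent

def rmse(actual, forecast):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))
```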

[LG-267] Deconvolving Complex Neuronal Networks into Interpretable Task-Specific Connectomes

Link: https://arxiv.org/abs/2407.00201
Authors: Yifan Wang,Vikram Ravindra,Ananth Grama
Keywords: Task-specific functional MRI, images provide excellent, provide excellent modalities, functional MRI, canonical networks
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Notes: 9 pages, 5 figures

Abstract:Task-specific functional MRI (fMRI) images provide excellent modalities for studying the neuronal basis of cognitive processes. We use fMRI data to formulate and solve the problem of deconvolving task-specific aggregate neuronal networks into a set of basic building blocks called canonical networks, to use these networks for functional characterization, and to characterize the physiological basis of these responses by mapping them to regions of the brain. Our results show excellent task-specificity of canonical networks, i.e., the expression of a small number of canonical networks can be used to accurately predict tasks; generalizability across cohorts, i.e., canonical networks are conserved across diverse populations, studies, and acquisition protocols; and that canonical networks have strong anatomical and physiological basis. From a methods perspective, the problem of identifying these canonical networks poses challenges rooted in the high dimensionality, small sample size, acquisition variability, and noise. Our deconvolution technique is based on non-negative matrix factorization (NMF) that identifies canonical networks as factors of a suitably constructed matrix. We demonstrate that our method scales to large datasets, yields stable and accurate factors, and is robust to noise.
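
For readers unfamiliar with NMF, the sketch below shows one standard way to compute it: the classical multiplicative updates for the squared-error objective. The abstract does not say which NMF solver or which matrix construction the authors use, so the function name, initialisation, and iteration count here are assumptions.

```python
import numpy as np

def nmf(V, rank, num_iters=200, eps=1e-10, seed=0):
    """Factor a non-negative matrix V (n x m) as W @ H with W, H >= 0."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(num_iters):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In the setting of the abstract, the rows of `H` (or columns of `W`, depending on how the matrix is laid out) would play the role of canonical networks, and the other factor would give their expression levels.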

[LG-268] DCSM 2.0: Deep Conditional Shape Models for Data Efficient Segmentation

Link: https://arxiv.org/abs/2407.00186
Authors: Athira J Jacob,Puneet Sharma,Daniel Rueckert
Keywords: image analyses workflows, medical image analyses, Conditional Shape Models, Deep Conditional Shape, analyses workflows
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Best oral paper award at ISBI 2024

Abstract:Segmentation is often the first step in many medical image analysis workflows. Deep learning approaches, while giving state-of-the-art accuracies, are data intensive and do not scale well to low data regimes. We introduce Deep Conditional Shape Models 2.0, which uses an edge detector, along with an implicit shape function conditioned on edge maps, to leverage cross-modality shape information. The shape function is trained exclusively on a source domain (contrasted CT) and applied to the target domain of interest (3D echocardiography). We demonstrate data efficiency in the target domain by varying the amounts of training data used in the edge detection stage. We observe that DCSM 2.0 outperforms the baseline at all data levels in terms of Hausdorff distances, and while using 50% or less of the training data in terms of average mesh distance, and at 10% or less of the data with the Dice coefficient. The method scales well to low data regimes, with gains of up to 5% in Dice coefficient, 2.58 mm in average surface distance and 21.02 mm in Hausdorff distance when using just 2% (22 volumes) of the training data.
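
Two of the metrics quoted above are easy to state in code. The sketch below computes the Dice coefficient between binary masks and the symmetric Hausdorff distance between two point sets (for example, surface vertices). It is a brute-force illustration with hypothetical function names, and it says nothing about how the authors extract surfaces or compute average mesh distance.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks; 1.0 if both are empty."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    return 1.0 if denom == 0 else 2.0 * (pred & target).sum() / denom

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets of shape (N, d) and (M, d)."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest nearest-neighbour distance in either direction.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

Dice rewards volume overlap, while the Hausdorff distance is driven by the single worst surface error, which is why the two can rank methods differently at a given data level.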

[LG-269] Permutation invariant multi-output Gaussian Processes for drug combination prediction in cancer

Link: https://arxiv.org/abs/2407.00175
Authors: Leiv Rønneberg, Vidhi Lalchand, Paul D. W. Kirk
Keywords: active application field, machine learning, active application, application field, field in machine
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments:

Abstract:Dose-response prediction in cancer is an active application field in machine learning. Using large libraries of in-vitro drug sensitivity screens, the goal is to develop accurate predictive models that can be used to guide experimental design or inform treatment decisions. Building on previous work that makes use of permutation invariant multi-output Gaussian Processes in the context of dose-response prediction for drug combinations, we develop a variational approximation to these models. The variational approximation enables a more scalable model that provides uncertainty quantification and naturally handles missing data. Furthermore, we propose using a deep generative model to encode the chemical space in a continuous manner, enabling prediction for new drugs and new combinations. We demonstrate the performance of our model in a simple setting using a high-throughput dataset and show that the model is able to efficiently borrow information across outputs.

[LG-270] Machine learning meets mass spectrometry: a focused perspective

Link: https://arxiv.org/abs/2407.00117
Authors: Daniil A. Boiko, Valentine P. Ananikov
Keywords: product quality control, industrial product quality, Mass spectrometry, life sciences, processes in medicine
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 4 figures

Abstract:Mass spectrometry is a widely used method to study molecules and processes in medicine, life sciences, chemistry, catalysis, and industrial product quality control, among many other applications. One of the main features of some mass spectrometry techniques is the extensive level of characterization (especially when coupled with chromatography and ion mobility methods, or a part of tandem mass spectrometry experiment) and a large amount of generated data per measurement. Terabyte scales can be easily reached with mass spectrometry studies. Consequently, mass spectrometry has faced the challenge of a high level of data disappearance. Researchers often neglect and then altogether lose access to the rich information mass spectrometry experiments could provide. With the development of machine learning methods, the opportunity arises to unlock the potential of these data, enabling previously inaccessible discoveries. The present perspective highlights reevaluation of mass spectrometry data analysis in the new generation of methods and describes significant challenges in the field, particularly related to problems involving the use of electrospray ionization. We argue that further applications of machine learning raise new requirements for instrumentation (increasing throughput and information density, decreasing pricing, and making more automation-friendly software), and once met, the field may experience significant transformation.

[LG-271] Optimal Transport for Latent Integration with An Application to Heterogeneous Neuronal Activity Data

Link: https://arxiv.org/abs/2407.00099
Authors: Yubai Yuan, Babak Shahbaba, Norbert Fortin, Keiland Cooper, Qing Nie, Annie Qu
Keywords: Detecting dynamic patterns, task-specific responses shared, Detecting dynamic, science and neuroscience, task-specific responses
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:Detecting dynamic patterns of task-specific responses shared across heterogeneous datasets is an essential and challenging problem in many scientific applications in medical science and neuroscience. In our motivating example of rodent electrophysiological data, identifying the dynamical patterns in neuronal activity associated with ongoing cognitive demands and behavior is key to uncovering the neural mechanisms of memory. One of the greatest challenges in investigating a cross-subject biological process is that the systematic heterogeneity across individuals could significantly undermine the power of existing machine learning methods to identify the underlying biological dynamics. In addition, many technically challenging neurobiological experiments are conducted on only a handful of subjects where rich longitudinal data are available for each subject. The low sample sizes of such experiments could further reduce the power to detect common dynamic patterns among subjects. In this paper, we propose a novel heterogeneous data integration framework based on optimal transport to extract shared patterns in complex biological processes. The key advantages of the proposed method are that it can increase discriminating power in identifying common patterns by reducing heterogeneity unrelated to the signal by aligning the extracted latent spatiotemporal information across subjects. Our approach is effective even with a small number of subjects, and does not require auxiliary matching information for the alignment. In particular, our method can align longitudinal data across heterogeneous subjects in a common latent space to capture the dynamics of shared patterns while utilizing temporal dependency within subjects.

[LG-272] FoldToken2: Learning compact invariant and generative protein structure language

Link: https://arxiv.org/abs/2407.00050
Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li
Keywords: posed long term, long term challenges, coordinates has posed, posed long, long term
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The equivalent nature of 3D coordinates has posed long term challenges in protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2 to transfer equivariant structures into discrete tokens, while maintaining the recoverability of the original structures. From FoldToken1 to FoldToken2, we improve three key components: (1) invariant structure encoder, (2) vector-quantized compressor, and (3) equivalent structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms previous FoldToken1 by 20% in TMScore and 81% in RMSD. FoldToken2 is probably the first method that works well on both single-chain and multi-chain protein structure quantization. We believe that FoldToken2 will inspire further improvement in protein structure representation learning, structure alignment, and structure generation tasks.

[LG-273] A Machine Learning Approach for Identifying Anatomical Biomarkers of Early Mild Cognitive Impairment

Link: https://arxiv.org/abs/2407.00040
Authors: Alwani Liyana Ahmad, Jose Sanchez-Bornot, Roberto C. Sotero, Damien Coyle, Zamzuri Idris, Ibrahima Faye
Keywords: progressive neurodegenerative disorder, Alzheimer Disease Neuroinformatics, Alzheimer Disease, Disease Neuroinformatics Initiative, motor functions
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 27 pages, 5 figures

Abstract:Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder that primarily affects the aging population by impairing cognitive and motor functions. Early detection of AD through accessible methodologies like magnetic resonance imaging (MRI) is vital for developing effective interventions to halt or slow the disease’s progression. This study aims to perform a comprehensive analysis of machine learning techniques for selecting MRI-based biomarkers and classifying individuals into healthy controls (HC) and unstable controls (uHC) who later show mild cognitive impairment within five years. The research utilizes MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Open Access Series of Imaging Studies 3 (OASIS-3), focusing on both HC and uHC participants. The study addresses the challenges of imbalanced data by testing classification methods on balanced and unbalanced datasets, and harmonizes data using polynomial regression to mitigate nuisance variables like age, gender, and intracranial volume. Results indicate that Gaussian Naive Bayes and RusBoost classifiers show optimal performance, achieving accuracies of up to 76.46% and 72.48% respectively on the ADNI dataset. For the OASIS-3 dataset, Kernel Naive Bayes and RusBoost yield accuracies ranging from 64.66% to 75.71%, improving further in age-matched datasets. Brain regions like the entorhinal cortex, hippocampus, lateral ventricle, and lateral orbitofrontal cortex are identified as significantly impacted during early cognitive decline. Despite limitations such as small sample sizes, the study’s harmonization approach enhances the robustness of biomarker selection, suggesting the potential of this semi-automatic machine learning pipeline for early AD detection using MRI.

[LG-274] Harnessing XGBoost for Robust Biomarker Selection of Obsessive-Compulsive Disorder (OCD) from Adolescent Brain Cognitive Development (ABCD) data

Link: https://arxiv.org/abs/2407.00028
Authors: Xinyu Shen, Qimin Zhang, Huili Zheng, Weiwei Qi
Keywords: Brain Cognitive Development, Adolescent Brain Cognitive, Cognitive Development, Adolescent Brain, Brain Cognitive
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:This study evaluates the performance of various supervised machine learning models in analyzing highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder scales. We simulated a dataset to mimic the correlation structures commonly found in imaging data and evaluated logistic regression, elastic networks, random forests, and XGBoost on their ability to handle multicollinearity and accurately identify predictive features. Our study aims to guide the selection of appropriate machine learning methods for processing neuroimaging data, highlighting models that best capture underlying signals in high feature correlations and prioritize clinically relevant features associated with Obsessive-Compulsive Disorder (OCD).

[LG-275] Multi-objective generative AI for designing novel brain-targeting small molecules

Link: https://arxiv.org/abs/2407.00004
Authors: Ayush Noori, Iñaki Arango, William E. Byrd, Nada Amin
Keywords: central nervous system, successful central nervous, CNS drug design, BBB permeable drugs, blood-brain barrier
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 20 pages, 4 figures, Generative and Experimental Perspectives for Biomolecular Design Workshop at the 12th International Conference on Learning Representations

Abstract:The strict selectivity of the blood-brain barrier (BBB) represents one of the most formidable challenges to successful central nervous system (CNS) drug delivery. Computational methods to generate BBB permeable drugs in silico may be valuable tools in the CNS drug design pipeline. However, in real-world applications, BBB penetration alone is insufficient; rather, after transiting the BBB, molecules must bind to a specific target or receptor in the brain and must also be safe and non-toxic. To discover small molecules that concurrently satisfy these constraints, we use multi-objective generative AI to synthesize drug-like BBB-permeable small molecules. Specifically, we computationally synthesize molecules with predicted binding affinity against dopamine receptor D2, the primary target for many clinically effective antipsychotic drugs. After training several graph neural network-based property predictors, we adapt SyntheMol (Swanson et al., 2024), a recently developed Monte Carlo Tree Search-based algorithm for antibiotic design, to perform a multi-objective guided traversal over an easily synthesizable molecular space. We design a library of 26,581 novel and diverse small molecules containing hits with high predicted BBB permeability and favorable predicted safety and toxicity profiles, and that could readily be synthesized for experimental validation in the wet lab. We also validate top scoring molecules with molecular docking simulation against the D2 receptor and demonstrate predicted binding affinity on par with risperidone, a clinically prescribed D2-targeting antipsychotic. In the future, the SyntheMol-based computational approach described here may enable the discovery of novel neurotherapeutics for currently intractable disorders of the CNS.

[LG-276] Protein property prediction with uncertainties

Link: https://arxiv.org/abs/2407.00002
Authors: Peter Mørch Groth, Mads Herbert Kerrn, Lars Olsen, Jesper Salomon, Wouter Boomsma
Keywords: recent years, variant effects, considerable progress, progress in recent, reliable uncertainty estimates
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 10 pages (33 in total with appendix), 3 figures (19 figures in total with appendix)

Abstract:Reliable prediction of variant effects in proteins has seen considerable progress in recent years. The increasing availability of data in this regime has improved both the prediction performance and our ability to track progress in the field, measured in terms of prediction accuracy averaged over many datasets. For practical use in protein engineering, it is important that we can also provide reliable uncertainty estimates for our predictions, but such metrics are rarely reported. We here provide a Gaussian process regression model, Kermut, which obtains state-of-the-art performance for protein property prediction while also offering estimates of uncertainty through its posterior. We proceed by assessing the quality of these uncertainty estimates. Our results show that the model provides meaningful overall calibration, but that accurate instance-specific uncertainty quantification remains challenging. We hope that this will encourage future work in this promising direction.

Information Retrieval

[IR-0] ColPali: Efficient Document Retrieval with Vision Language Models

Link: https://arxiv.org/abs/2407.01449
Authors: Manuel Faysse, Hugues Sibille, Tony Wu, Gautier Viaud, Céline Hudelot, Pierre Colombo
Keywords: Retrieval Augmented Generation, document retrieval, visually rich structures, information through text, modern document retrieval
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
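
The late interaction matching mentioned in the abstract can be illustrated with a ColBERT-style MaxSim score: each query-token embedding is matched to its best page-patch embedding, and the maxima are summed. This is a minimal sketch of that general technique, not ColPali's actual code. The function name `maxsim_score` and the 2-dimensional toy embeddings are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Late-interaction score: for every query-token embedding take the best
    matching page-patch embedding, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy embeddings (hypothetical values, 2 dimensions for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for both query tokens
page_b = [[0.9, 0.1], [0.8, 0.2]]  # only matches the first query token well

pages = {"page_a": page_a, "page_b": page_b}
ranked = sorted(pages, key=lambda name: maxsim_score(query, pages[name]), reverse=True)
print(ranked)  # ['page_a', 'page_b']
```

Because page embeddings do not depend on the query, they can be computed once at indexing time, leaving only the cheap MaxSim step at query time.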

[IR-1] POST: Email Archival Processing and Flagging Stack for Incident Responders

Link: https://arxiv.org/abs/2407.01433
Authors: Jeffrey Fairbanks
Keywords: points of compromise, main points, security and awareness, awareness being estimated, Natural Language Processing
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. For further information or questions please reach out to fairbanks6@llnl.gov

Abstract:Phishing is one of the main points of compromise, with email security and awareness being estimated at $50-100B in 2022. There is great need for email forensics capability to quickly search for malicious content. A novel solution POST is proposed. POST is an API driven serverless email archival, processing, and flagging workflow for both large and small organizations that collects and parses all email, flags emails using state of the art Natural Language Processing and Machine Learning, allows full email searching on every aspect of an email, and provides a cost savings of up to 68.6%.

[IR-2] A Global-Local Attention Mechanism for Relation Classification

Link: https://arxiv.org/abs/2407.01424
Authors: Yiping Sun
Keywords: involves identifying connections, Relation classification, involves identifying, crucial component, identifying connections
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: This paper has been accepted by the 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)

Abstract:Relation classification, a crucial component of relation extraction, involves identifying connections between two entities. Previous studies have predominantly focused on integrating the attention mechanism into relation classification at a global scale, overlooking the importance of the local context. To address this gap, this paper introduces a novel global-local attention mechanism for relation classification, which enhances global attention with a localized focus. Additionally, we propose innovative hard and soft localization mechanisms to identify potential keywords for local attention. By incorporating both hard and soft localization strategies, our approach offers a more nuanced and comprehensive understanding of the contextual cues that contribute to effective relation classification. Our experimental results on the SemEval-2010 Task 8 dataset highlight the superior performance of our method compared to previous attention-based approaches in relation classification.

[IR-3] Optimization of Retrieval-Augmented Generation Context with Outlier Detection

Link: https://arxiv.org/abs/2407.01403
Authors: Vitaly Bulgakov
Keywords: prompt context required, Large Language Model, retrieved LLM responses, question-answering systems, reduce the size
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Attempts to increase the number of retrieved chunked documents and thereby enlarge the context related to the query can significantly complicate the processing and decrease the performance of a Large Language Model (LLM) when generating responses to queries. It is well known that a large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features that utilize the distances of embedding vectors, retrieved from the vector database, to both the centroid and the query vectors. The methods were evaluated by comparing the similarities of the retrieved LLM responses to ground-truth answers obtained using the OpenAI GPT-4o model. It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
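
The idea of building outlier features from distances to the centroid and to the query vector can be sketched as follows. The abstract does not give the exact features or decision rule, so both the combined score (sum of the two Euclidean distances) and the cut-off (mean plus `k` standard deviations) are assumptions made for illustration, as are all names and the toy vectors.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of the retrieved embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def filter_outliers(query: Vector, docs: List[Vector], k: float = 1.0) -> List[int]:
    """Return indices of retrieved documents kept as prompt context.

    Assumed rule: score = dist(doc, centroid) + dist(doc, query);
    a document is an outlier when its score exceeds mean + k * std.
    """
    c = centroid(docs)
    scores = [euclidean(d, c) + euclidean(d, query) for d in docs]
    threshold = mean(scores) + k * pstdev(scores)
    return [i for i, s in enumerate(scores) if s <= threshold]


# Hypothetical 2-D embeddings: three chunks near the query, one far away.
query_vec = [1.0, 0.0]
retrieved = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0], [-1.0, 1.0]]
kept = filter_outliers(query_vec, retrieved)
print(kept)  # [0, 1, 2]
```

Only the kept chunks would then be placed in the prompt, which shrinks the context while dropping the chunk least related to both the query and the other results.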

[IR-4] Evaluation of Temporal Change in IR Test Collections

Link: https://arxiv.org/abs/2407.01373
Authors: Jüri Keller, Timo Breuer, Philipp Schaer
Keywords: Cranfield paradigm, retrieval, Information retrieval, retrieval systems, Information retrieval systems
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Information retrieval systems have been evaluated using the Cranfield paradigm for many years. This paradigm allows a systematic, fair, and reproducible evaluation of different retrieval methods in fixed experimental environments. However, real-world retrieval systems must cope with dynamic environments and temporal changes that affect the document collection, topical trends, and the individual user’s perception of what is considered relevant. Yet, the temporal dimension in IR evaluations is still understudied. To this end, this work investigates how the temporal generalizability of effectiveness evaluations can be assessed. As a conceptual model, we generalize Cranfield-type experiments to the temporal context by classifying the change in the essential components according to the create, update, and delete operations of persistent storage known from CRUD. From the different types of change different evaluation scenarios are derived and it is outlined what they imply. Based on these scenarios, renowned state-of-the-art retrieval systems are tested and it is investigated how the retrieval effectiveness changes on different levels of granularity. We show that the proposed measures can be well adapted to describe the changes in the retrieval results. The experiments conducted confirm that the retrieval effectiveness strongly depends on the evaluation scenario investigated. We find that not only the average retrieval performance of single systems but also the relative system performance are strongly affected by the components that change and to what extent these components changed. 
Journal reference: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington, DC, USA. Related DOI: https://doi.org/10.1145/3664190.3672530

[IR-5] BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2407.01102
Authors: David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant
Keywords: Large Language Models, enhance Large Language, Large Language, Language Models, Retrieval-Augmented Generation
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 29 pages

Abstract:Retrieval-Augmented Generation makes it possible to enhance Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve an intricate number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available under this https URL.

[IR-6] Deep Domain Specialisation for single-model multi-domain learning to rank

Link: https://arxiv.org/abs/2407.01069
Authors: Paul Missault, Abdelmaseeh Felfel
Keywords: Information Retrieval, in-domain data yields, geographic regions, train separate ranking, in-domain data
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Information Retrieval (IR) practitioners often train separate ranking models for different domains (geographic regions, languages, stores, websites, ...) as it is believed that exclusively training on in-domain data yields the best performance when sufficient data is available. Despite their performance gains, training multiple models comes at a higher cost to train, maintain and update compared to having only a single model responsible for all domains. Our work explores consolidated ranking models that serve multiple domains. Specifically, we propose a novel architecture of Deep Domain Specialisation (DDS) to consolidate multiple domains into a single model. We compare our proposal against Deep Domain Adaptation (DDA) and a set of baselines for multi-domain models. In our experiments, DDS performed the best overall while requiring fewer parameters per domain than other baselines. We show the efficacy of our method both with offline experimentation and on a large-scale online experiment on Amazon customer traffic.

[IR-7] ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions

Link: https://arxiv.org/abs/2407.00942
Authors: Jingheng Ye, Yong Jiang, Xiaobin Wang, Yinghui Li, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang
Keywords: tailored product searching, clarification question generation, strategic clarification question, e-commercial scenario, paper introduces
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 13 tables, 6 figures. Under review

Abstract:This paper introduces the task of product demand clarification within an e-commerce scenario, where the user commences the conversation with ambiguous queries and the task-oriented agent is designed to achieve more accurate and tailored product searching by asking clarification questions. To address this task, we propose ProductAgent, a conversational information seeking agent equipped with abilities of strategic clarification question generation and dynamic product retrieval. Specifically, we develop the agent with strategies for product feature summarization, query generation, and product retrieval. Furthermore, we propose the benchmark called PROCLARE to evaluate the agent’s performance both automatically and qualitatively with the aid of an LLM-driven user simulator. Experiments show that ProductAgent interacts positively with the user and enhances retrieval performance with increasing dialogue turns, where user demands become gradually more explicit and detailed. All source code will be released after the review anonymity period.

[IR-8] Unified Dual-Intent Translation for Joint Modeling of Search and Recommendation

Link: https://arxiv.org/abs/2407.00912
Authors: Yuting Zhang, Yiqing Wu, Ruidong Han, Ying Sun, Yongchun Zhu, Xiang Li, Wei Lin, Fuzhen Zhuang, Zhulin An, Yongjun Xu
Keywords: demand intents, intents, numerous options, Recommendation, discovering their preferred
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Recommendation systems, which assist users in discovering their preferred items among numerous options, have served billions of users across various online platforms. Intuitively, users’ interactions with items are highly driven by their unchanging inherent intents (e.g., always preferring high-quality items) and changing demand intents (e.g., wanting a T-shirt in summer but a down jacket in winter). However, both types of intents are implicitly expressed in the recommendation scenario, posing challenges in leveraging them for accurate intent-aware recommendations. Fortunately, in the search scenario, often found alongside recommendation on the same online platform, users express their demand intents explicitly through their query words. Intuitively, in both scenarios, a user shares the same inherent intent and the interactions may be influenced by the same demand intent. It is therefore feasible to utilize the interaction data from both scenarios to reinforce the dual intents for joint intent-aware modeling. But the joint modeling should deal with two problems: 1) accurately modeling users’ implicit demand intents in recommendation; 2) modeling the relation between the dual intents and the interactive items. To address these problems, we propose a novel model named Unified Dual-Intents Translation for joint modeling of Search and Recommendation (UDITSR). To accurately simulate users’ demand intents in recommendation, we utilize real queries from search data as supervision information to guide its generation. To explicitly model the relation among the triplet (inherent intent, demand intent, interactive item), we propose a dual-intent translation propagation mechanism to learn the triplet in the same semantic space via embedding translations. Extensive experiments demonstrate that UDITSR outperforms SOTA baselines both in search and recommendation tasks.

[IR-9] Heterogeneous Graph-based Framework with Disentangled Representations Learning for Multi-target Cross Domain Recommendation

Link: https://arxiv.org/abs/2407.00909
Authors: Xiaopeng Liu, Juan Zhang, Chongqi Ren, Shenghui Xu, Zhaoming Pan, Zhimin Zhang
Keywords: data sparsity problem, Cross-Domain Recommendation, recommendation system, CDR, critical solution
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:CDR (Cross-Domain Recommendation), i.e., leveraging information from multiple domains, is a critical solution to the data sparsity problem in recommendation systems. The majority of previous research either focused on single-target CDR (STCDR) by utilizing data from the source domains to improve the model’s performance on the target domain, or applied dual-target CDR (DTCDR) by integrating data from the source and target domains. In addition, multi-target CDR (MTCDR) is a generalization of DTCDR, which is able to capture the link among different domains. In this paper we present HGDR (Heterogeneous Graph-based Framework with Disentangled Representations Learning), an end-to-end heterogeneous network architecture where graph convolutional layers are applied to model relations among different domains, while utilizing the idea of disentangling representation for domain-shared and domain-specific information. First, a shared heterogeneous graph is generated by gathering users and items from several domains without any further side information. Second, we use HGDR to compute disentangled representations for users and items in all domains. Experiments on real-world datasets and online A/B tests prove that our proposed model can transmit information among domains effectively and reach the SOTA performance.

[IR-10] Prediction of Sentinel-2 multi-band imagery with attention BiLSTM for continuous earth surface monitoring

Link: https://arxiv.org/abs/2407.00834
Authors: Weiying Zhao, Natalia Efremova
Keywords: effective agricultural management, time series analysis, Continuous monitoring, forecasting crop conditions, Long Short-Term Memory
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Continuous monitoring of crops and forecasting crop conditions through time series analysis is crucial for effective agricultural management. This study proposes a framework based on an attention Bidirectional Long Short-Term Memory (BiLSTM) network for predicting multiband images. Our model can forecast target images on user-defined dates, including future dates and periods characterized by persistent cloud cover. By focusing on short sequences within a sequence-to-one forecasting framework, the model leverages advanced attention mechanisms to enhance prediction accuracy. Our experimental results demonstrate the model’s superior performance in predicting NDVI, multiple vegetation indices, and all Sentinel-2 bands, highlighting its potential for improving remote sensing data continuity and reliability.
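
For readers unfamiliar with NDVI, the index named in the abstract is computed per pixel from the near-infrared and red bands as (NIR - Red) / (NIR + Red). For Sentinel-2 these are bands B8 and B4. The sketch below shows only this index computation on small hypothetical reflectance patches. It is not the paper's forecasting model, and the function names are illustrative.

```python
from typing import List

Band = List[List[float]]


def ndvi(nir: float, red: float) -> float:
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 when both bands are zero."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def ndvi_image(nir_band: Band, red_band: Band) -> Band:
    """Apply NDVI pixel by pixel to two bands of identical shape."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2x2 reflectance patches (B8 = near infrared, B4 = red).
b8 = [[0.5, 0.4], [0.3, 0.0]]
b4 = [[0.1, 0.4], [0.1, 0.0]]
result = ndvi_image(b8, b4)
print([[round(value, 3) for value in row] for row in result])
```

Values close to 1 indicate dense green vegetation and values near 0 indicate bare soil or built surfaces, so a model that forecasts B8 and B4 also yields a forecast NDVI map.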

[IR-11] Enhancing Travel Decision-Making: A Contrastive Learning Approach for Personalized Review Rankings in Accommodations

Link: https://arxiv.org/abs/2407.00787
Authors: Reda Igebaria,Eran Fainman,Sarai Mizrachi,Moran Beladev,Fengjun Wang
Keywords: User-generated reviews significantly, influence consumer decisions, significantly influence consumer, reviews significantly influence, User-generated reviews
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:User-generated reviews significantly influence consumer decisions, particularly in the travel domain when selecting accommodations. This paper’s contribution comprises two main elements. Firstly, we present a novel dataset of authentic guest reviews sourced from a prominent online travel platform, totaling over two million reviews from 50,000 distinct accommodations. Secondly, we propose an innovative approach for personalized review ranking. Our method employs contrastive learning to intricately capture the relationship between a review and the contextual information of its respective reviewer. Through a comprehensive experimental study, we demonstrate that our approach surpasses several baselines across all reported metrics. Augmented by a comparative analysis, we showcase the efficacy of our method in elevating personalized review ranking. The implications of our research extend beyond the travel domain, with potential applications in other sectors where personalized review ranking is paramount, such as online e-commerce platforms.

[IR-12] Dense Retrieval with Continuous Explicit Feedback for Systematic Review Screening Prioritisation

Link: https://arxiv.org/abs/2407.00635
Authors: Xinyu Mao,Shengyao Zhuang,Bevan Koopman,Guido Zuccon
Keywords: identify relevant documents, identify relevant, high recall, early positions, screening prioritisation
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at SIGIR 2024

Abstract:The goal of screening prioritisation in systematic reviews is to identify relevant documents with high recall and rank them in early positions for review. This saves reviewing effort if paired with a stopping criterion, and speeds up review completion if performed alongside downstream tasks. Recent studies have shown that neural models have good potential on this task, but their time-consuming fine-tuning and inference discourage their widespread use for screening prioritisation. In this paper, we propose an alternative approach that still relies on neural models, but leverages dense representations and relevance feedback to enhance screening prioritisation, without the need for costly model fine-tuning and inference. This method exploits continuous relevance feedback from reviewers during document screening to efficiently update the dense query representation, which is then applied to rank the remaining documents to be screened. We evaluate this approach across the CLEF TAR datasets for this task. Results suggest that the investigated dense query-driven approach is more efficient than directly using neural models and shows promising effectiveness compared to previous methods developed on the considered datasets. Our code is available at this https URL.
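The feedback loop described in this abstract can be sketched as follows. The abstract does not state the update rule, so this sketch assumes a Rocchio-style update of the dense query vector; the function names and the weights `alpha`, `beta` and `gamma` are hypothetical choices for illustration, not the authors' implementation.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.8, gamma=0.1):
    """Move the query vector toward judged-relevant documents and away from
    judged-non-relevant ones (Rocchio-style update, assumed here)."""
    dim = len(query)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    pos = centroid(relevant)
    neg = centroid(non_relevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


def rank_remaining(query, docs):
    """Rank unscreened documents (id -> vector) by dot product with the query."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sorted(docs, key=lambda doc_id: dot(query, docs[doc_id]), reverse=True)


# One feedback round: the reviewer marked a document with vector [0, 1] as relevant.
q = rocchio_update([1.0, 0.0], relevant=[[0.0, 1.0]], non_relevant=[])
print(rank_remaining(q, {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}))
```

Because only the query vector changes between rounds, the document embeddings can be computed once and reused, which matches the abstract's point about avoiding repeated fine-tuning and inference.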

[IR-13] Answering real-world clinical questions using large language model based systems

Link: https://arxiv.org/abs/2407.00541
Authors: Yen Sia Low(1),Michael L. Jackson(1),Rebecca J. Hyde(1),Robert E. Brown(1),Neil M. Sanghavi(1),Julian D. Baldwin(1),C. William Pike(1),Jananee Muralidharan(1),Gavin Hui(1 and 2),Natasha Alexander(3),Hadeel Hassan(3),Rahul V. Nene(4),Morgan Pike(5),Courtney J. Pokrzywa(6),Shivam Vedak(7),Adam Paul Yan(3),Dong-han Yao(7),Amy R. Zipursky(3),Christina Dinh(1),Philip Ballentine(1),Dan C. Derieg(1),Vladimir Polony(1),Rehan N. Chawdry(1),Jordan Davies(1),Brigham B. Hyde(1),Nigam H. Shah(1 and 7),Saurabh Gombar(1 and 8) ((1) Atropos Health, New York NY, USA, (2) Department of Medicine, University of California, Los Angeles CA, USA, (3) Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada, (4) Department of Emergency Medicine, University of California, San Diego CA, USA, (5) Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA, (6) Department of Surgery, Columbia University, New York NY, USA, (7) Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA, (8) Department of Pathology, Stanford University, Stanford CA, USA)
Keywords: guide healthcare decisions, contextualizing existing research, guide healthcare, healthcare decisions, difficulty in contextualizing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 28 pages (2 figures, 3 tables) inclusive of 8 pages of supplemental materials (4 supplemental figures and 4 supplemental tables)

Abstract:Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.

[IR-14] Towards Statistically Significant Taxonomy Aware Co-location Pattern Detection

Link: https://arxiv.org/abs/2407.00317
Authors: Subhankar Ghosh,Arun Sharma,Jayant Gupta,Shashi Shekhar
Keywords: Boolean spatial feature, spatial feature types, feature types, collection of Boolean, Boolean spatial
Subjects: Information Retrieval (cs.IR); Applications (stat.AP)
Comments: Accepted in The 16th Conference on Spatial Information Theory (COSIT) 2024

Abstract:Given a collection of Boolean spatial feature types, their instances, a neighborhood relation (e.g., proximity), and a hierarchical taxonomy of the feature types, the goal is to find the subsets of feature types or their parents whose spatial interaction is statistically significant. This problem is for taxonomy-reliant applications such as ecology (e.g., finding new symbiotic relationships across the food chain), spatial pathology (e.g., immunotherapy for cancer), retail, etc. The problem is computationally challenging due to the exponential number of candidate co-location patterns generated by the taxonomy. Most approaches for co-location pattern detection overlook the hierarchical relationships among spatial features, and the statistical significance of the detected patterns is not always considered, leading to potential false discoveries. This paper introduces two methods for incorporating taxonomies and assessing the statistical significance of co-location patterns. The baseline approach iteratively checks the significance of co-locations between leaf nodes or their ancestors in the taxonomy. Using the Benjamini-Hochberg procedure, an advanced approach is proposed to control the false discovery rate. This approach effectively reduces the risk of false discoveries while maintaining the power to detect true co-location patterns. Experimental evaluation and case study results show the effectiveness of the approach.
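The Benjamini-Hochberg procedure mentioned in this abstract is a standard step-up rule for controlling the false discovery rate. The sketch below shows the generic procedure applied to a list of p-values, for example one per candidate co-location pattern. How the paper obtains those p-values is not shown here, and the function name is an illustrative assumption.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    under the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Note the step-up behaviour: a p-value that fails its own threshold is still rejected if a larger p-value further down the sorted list passes its threshold.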

[IR-15] When Search Engine Services meet Large Language Models: Visions and Challenges

Link: https://arxiv.org/abs/2407.00128
Authors: Haoyi Xiong,Jiang Bian,Yuchen Li,Xuhong Li,Mengnan Du,Shuaiqiang Wang,Dawei Yin,Sumi Helal
Keywords: Combining Large Language, Large Language Models, Combining Large, Large Language, Language Models
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Combining Large Language Models (LLMs) with search engine services marks a significant shift in the field of services computing, opening up new possibilities to enhance how we search for and retrieve information, understand content, and interact with internet services. This paper conducts an in-depth examination of how integrating LLMs with search engines can mutually benefit both technologies. We focus on two main areas: using search engines to improve LLMs (Search4LLM) and enhancing search engine functions using LLMs (LLM4Search). For Search4LLM, we investigate how search engines can provide diverse high-quality datasets for pre-training of LLMs, how they can use the most relevant documents to help LLMs learn to answer queries more accurately, how training LLMs with Learning-To-Rank (LTR) tasks can enhance their ability to respond with greater precision, and how incorporating recent search results can make LLM-generated content more accurate and current. In terms of LLM4Search, we examine how LLMs can be used to summarize content for better indexing by search engines, improve query outcomes through optimization, enhance the ranking of search results by analyzing document relevance, and help in annotating data for learning-to-rank tasks in various learning contexts. However, this promising integration comes with its challenges, which include addressing potential biases and ethical issues in training models, managing the computational and other costs of incorporating LLMs into search services, and continuously updating LLM training with the ever-changing web content. We discuss these challenges and chart out required research directions to address them. We also discuss broader implications for service computing, such as scalability, privacy concerns, and the need to adapt search engine architectures for these advanced models.

[IR-16] AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Link: https://arxiv.org/abs/2407.00104
Authors: Iván Matas,Carmen Serrano,Francisca Silva,Amalia Serrano,Tomás Toledo-Pastrana,Begoña Acha
Keywords: optimizing resource utilization, BCC, provide interpretable support, BCC dermoscopic features, BCC dermoscopic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Comments: 8 pages, 4 figures, 4 tables, under review

Abstract:An AI tool has been developed to provide interpretable support for the diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing resource utilization. The interpretability is provided in two ways: on the one hand, the main BCC dermoscopic patterns are found in the image to justify the BCC/Non BCC classification. Secondly, based on the common visual XAI Grad-CAM, a clinically inspired visual explanation is developed where the relevant features for diagnosis are located. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnosis of four dermatologists using an Expectation Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy rate of 90%. For Clinically-inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. As for the Clinically-inspired Visual XAI results, the mean of the Grad-CAM normalized value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the model struggles to accurately identify the regions of the BCC patterns. These results prove the ability of the AI tool to provide a useful explanation.

[IR-17] Predictive accuracy of recommender algorithms

Link: https://arxiv.org/abs/2407.00097
Authors: William Noffsinger
Keywords: smaller ranked set, Recommender systems present, Recommender systems, items based, item characteristics
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Recommender systems present a customized list of items based upon user or item characteristics with the objective of reducing a large number of possible choices to a smaller ranked set most likely to appeal to the user. A variety of algorithms for recommender systems have been developed and refined including applications of deep learning neural networks. Recent research reports point to a need to perform carefully controlled experiments to gain insights about the relative accuracy of different recommender algorithms, because studies evaluating different methods have not used a common set of benchmark data sets, baseline models, and evaluation metrics. This investigation used publicly available sources of ratings data with a suite of three conventional recommender algorithms and two deep learning (DL) algorithms in controlled experiments to assess their comparative accuracy. Results for the non-DL algorithms conformed well to published results and benchmarks. The two DL algorithms did not perform as well and illuminated known challenges implementing DL recommender algorithms as reported in the literature. Model overfitting is discussed as a potential explanation for the weaker performance of the DL algorithms and several regularization strategies are reviewed as possible approaches to improve predictive error. Findings justify the need for further research in the use of deep learning models for recommender systems.

[IR-18] Learning to Rank for Maps at Airbnb

Link: https://arxiv.org/abs/2407.00091
Authors: Malay Haldar,Hongwei Zhang,Kedar Bellare,Sherry Chen,Soumyadip Banerjee,Xiaotang Wang,Mustafa Abdool,Huiji Gao,Pavan Tapadia,Liwei He,Sanjeev Katariya
Keywords: two-sided marketplace, brings together hosts, rent with prospective, prospective guests, Airbnb brings
Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:As a two-sided marketplace, Airbnb brings together hosts who own listings for rent with prospective guests from around the globe. Results from a guest’s search for listings are displayed primarily through two interfaces: (1) as a list of rectangular cards that contain on them the listing image, price, rating, and other details, referred to as list-results (2) as oval pins on a map showing the listing price, called map-results. Both these interfaces, since their inception, have used the same ranking algorithm that orders listings by their booking probabilities and selects the top listings for display. But some of the basic assumptions underlying ranking, built for a world where search results are presented as lists, simply break down for maps. This paper describes how we rebuilt ranking for maps by revising the mathematical foundations of how users interact with search results. Our iterative and experiment-driven approach led us through a path full of twists and turns, ending in a unified theory for the two interfaces. Our journey shows how assumptions taken for granted when designing machine learning algorithms may not apply equally across all user interfaces, and how they can be adapted. The net impact was one of the largest improvements in user experience for Airbnb which we discuss as a series of experimental validations.

[IR-19] Compressing Search with Language Models

Link: https://arxiv.org/abs/2407.00085
Authors: Thomas Mulc,Jennifer L. Steele
Keywords: Millions of people, search data, Search, Google Search data, Google Search
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Millions of people turn to Google Search each day for information on things as diverse as new cars or flu symptoms. The terms that they enter contain valuable information on their daily intent and activities, but the information in these search terms has been difficult to fully leverage. User-defined categorical filters have been the most common way to shrink the dimensionality of search data to a tractable size for analysis and modeling. In this paper we present a new approach to reducing the dimensionality of search data while retaining much of the information in the individual terms without user-defined rules. Our contributions are two-fold: 1) we introduce SLaM Compression, a way to quantify search terms using pre-trained language models and create a representation of search data that has low dimensionality, is memory efficient, and effectively acts as a summary of search, and 2) we present CoSMo, a Constrained Search Model for estimating real world events using only search data. We demonstrate the efficacy of our contributions by estimating with high accuracy U.S. automobile sales and U.S. flu rates using only Google Search data.

[IR-20] Adapting Job Recommendations to User Preference Drift with Behavioral-Semantic Fusion Learning

Link: https://arxiv.org/abs/2407.00082
Authors: Xiao Han,Chen Zhu,Xiao Hu,Chuan Qin,Xiangyu Zhao,Hengshu Zhu
Keywords: Job recommender systems, aligning job opportunities, recommender systems, systems are crucial, crucial for aligning
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by KDD 24 Research Track

Abstract:Job recommender systems are crucial for aligning job opportunities with job-seekers in online job-seeking. However, users tend to adjust their job preferences to secure employment opportunities continually, which limits the performance of job recommendations. The inherent frequency of preference drift poses a challenge to promptly and precisely capture user preferences. To address this issue, we propose a novel session-based framework, BISTRO, to timely model user preference through fusion learning of semantic and behavioral information. Specifically, BISTRO is composed of three stages: 1) coarse-grained semantic clustering, 2) fine-grained job preference extraction, and 3) personalized top-k job recommendation. Initially, BISTRO segments the user interaction sequence into sessions and leverages session-based semantic clustering to achieve broad identification of person-job matching. Subsequently, we design a hypergraph wavelet learning method to capture the nuanced job preference drift. To mitigate the effect of noise in interactions caused by frequent preference drift, we innovatively propose an adaptive wavelet filtering technique to remove noisy interaction. Finally, a recurrent neural network is utilized to analyze session-based interaction for inferring personalized preferences. Extensive experiments on three real-world offline recruitment datasets demonstrate the significant performances of our framework. Significantly, BISTRO also excels in online experiments, affirming its effectiveness in live recruitment settings. This dual success underscores the robustness and adaptability of BISTRO. The source code is available at this https URL.

[IR-21] Differentially Private Graph Diffusion with Applications in Personalized PageRanks

Link: https://arxiv.org/abs/2407.00077
Authors: Rongzhe Wei,Eli Chien,Pan Li
Keywords: iteratively propagates real-valued, propagates real-valued substances, iteratively propagates, propagates real-valued, real-valued substances
Subjects: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Graph diffusion, which iteratively propagates real-valued substances among the graph, is used in numerous graph/network-involved applications. However, releasing diffusion vectors may reveal sensitive linking information in the data such as transaction information in financial network data. Moreover, protecting the privacy of graph data is challenging due to its interconnected nature. This work proposes a novel graph diffusion framework with edge-level differential privacy guarantees by using noisy diffusion iterates. The algorithm injects Laplace noise per diffusion iteration and adopts a degree-based thresholding function to mitigate the high sensitivity induced by low-degree nodes. Our privacy loss analysis is based on Privacy Amplification by Iteration (PABI), which, to the best of our knowledge, is the first effort that analyzes PABI with Laplace noise and provides relevant applications. We also introduce a novel Infinity-Wasserstein distance tracking method, which tightens the analysis of privacy leakage and makes PABI more applicable in practice. We evaluate this framework by applying it to Personalized PageRank computation for ranking tasks. Experiments on real-world network data demonstrate the superiority of our method under stringent privacy conditions.
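To make the idea of "noisy diffusion iterates" concrete, here is a rough sketch of a personalized PageRank power iteration that adds Laplace noise to every node's score after each step. It is only an illustration under stated assumptions: the noise scale is not calibrated to any privacy budget, the degree-based thresholding from the abstract is omitted because its form is not given there, and all names (`noisy_ppr`, `teleport`, `noise_scale`) are hypothetical. It should not be read as the authors' algorithm or as providing a differential privacy guarantee.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_ppr(adj, seed, teleport=0.15, iters=20, noise_scale=0.0, rng=None):
    """Personalized PageRank by power iteration, with Laplace noise added to
    each node's score after every iteration (uncalibrated, illustration only)."""
    rng = rng or random.Random(0)
    nodes = list(adj)
    p = {v: 1.0 if v == seed else 0.0 for v in nodes}
    for _ in range(iters):
        # Teleport mass goes back to the seed node.
        nxt = {v: teleport if v == seed else 0.0 for v in nodes}
        for u in nodes:
            if not adj[u]:
                continue  # mass on dangling nodes is dropped in this sketch
            share = (1.0 - teleport) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        # Noisy iterate: perturb every coordinate before the next step.
        p = {v: nxt[v] + laplace(noise_scale, rng) for v in nodes}
    return p


triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(noisy_ppr(triangle, seed="a"))                    # noise-free baseline
print(noisy_ppr(triangle, seed="a", noise_scale=0.01))  # noisy iterates
```

With `noise_scale=0.0` the function reduces to ordinary personalized PageRank, which makes it easy to compare rankings with and without noise.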

[IR-22] Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2407.00072
Authors: Yu Bai,Yukai Miao,Li Chen,Dan Li,Yanyu Ren,Hongtao Xie,Ce Yang,Xuhui Cai
Keywords: Pistis symbolized good, symbolized good faith, Greek mythology, Pistis symbolized, good faith
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:In Greek mythology, Pistis symbolized good faith, trust, and reliability, echoing the core principles of RAG in LLM systems. Pistis-RAG, a scalable multi-stage framework, effectively addresses the challenges of large-scale retrieval-augmented generation (RAG). Each stage plays a distinct role: matching refines the search space, pre-ranking prioritizes semantically relevant documents, and ranking aligns with the large language model’s (LLM) preferences. The reasoning and aggregating stage supports the implementation of complex chain-of-thought (CoT) methods within this cascading structure. We argue that the lack of strong alignment between LLMs and the external knowledge ranking methods used in RAG tasks is relevant to the reliance on the model-centric paradigm in RAG frameworks. A content-centric approach would prioritize seamless integration between the LLMs and external information sources, optimizing the content transformation process for each specific task. Critically, our ranking stage deviates from traditional RAG approaches by recognizing that semantic relevance alone may not directly translate to improved generation. This is due to the sensitivity of the few-shot prompt order, as highlighted in prior work \cite{lu2021fantastically}. Current RAG frameworks fail to account for this crucial factor. We introduce a novel ranking stage specifically designed for RAG systems. It adheres to information retrieval principles while considering the unique business scenario captured by LLM preferences and user feedback. Our approach integrates in-context learning (ICL) methods and reasoning steps to incorporate user feedback, ensuring efficient alignment. Experiments on the MMLU benchmark demonstrate a 9.3% performance improvement. The model and code will be open-sourced on GitHub. Experiments on real-world, large-scale data validate our framework’s scalability.

[IR-23] Perceptron Collaborative Filtering

Link: https://arxiv.org/abs/2407.00067
Authors: Arya Chakraborty
Keywords: making automatic predictions, achieve similar results, implementing collaborative filtering, multivariate logistic regression, logistic regression classifiers
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

Abstract:While multivariate logistic regression classifiers are a great way of implementing collaborative filtering - a method of making automatic predictions about the interests of a user by collecting preferences or taste information from many other users, we can also achieve similar results using neural networks. A recommender system is a subclass of information filtering system that provides suggestions for items that are most pertinent to a particular user. A perceptron or a neural network is a machine learning model designed for fitting complex datasets using backpropagation and gradient descent. When coupled with advanced optimization techniques, the model may prove to be a great substitute for classical logistic classifiers. The optimizations include feature scaling, mean normalization, regularization, hyperparameter tuning and using stochastic/mini-batch gradient descent instead of regular gradient descent. In this use case, we will use the perceptron in the recommender system to fit the parameters i.e., the data from a multitude of users and use it to predict the preference/interest of a particular user.

[IR-24] Constraint based Modeling according to Reference Design

Link: https://arxiv.org/abs/2407.00064
Authors: Erik Heiland,Peter Hillmann,Andreas Karcher
Keywords: Reference models, Reference, models, essential element, element to ensured
Subjects: Databases (cs.DB); Information Retrieval (cs.IR); Information Theory (cs.IT); Software Engineering (cs.SE)
Comments:

Abstract:Reference models in the form of best practices are an essential element to ensure knowledge as design for reuse. Popular modeling approaches do not offer mechanisms to embed reference models in a supporting way, let alone a repository of them. Therefore, it is hardly possible to profit from this expertise. The problem is that the reference models are not described formally enough to be helpful in developing solutions. Consequently, the challenge is about the process: how a user can be supported in designing dedicated solutions assisted by reference models. In this paper, we present a generic approach for the formal description of reference models using semantic technologies and their application. Our modeling assistant allows the construction of solution models using different techniques based on reference building blocks. This environment enables the subsequent verification of the developed designs against the reference models for conformity. Therefore, our reference modeling assistant highlights the interdependency. The application of these techniques contributes to the formalization of requirements and finally to quality assurance in the context of maturity models. It is possible to use multiple reference models in the context of system-of-systems designs. The approach is evaluated in an industrial setting and can be integrated into different modeling landscapes.

[IR-25] An Interpretable Alternative to Neural Representation Learning for Rating Prediction – Transparent Latent Class Modeling of User Reviews

Link: https://arxiv.org/abs/2407.00063
Authors: Giuseppe Serra,Peter Tino,Zhao Xu,Xin Yao
Keywords: including recommender systems, including recommender, recommender systems, widely adopted, Nowadays
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Nowadays, neural network (NN) and deep learning (DL) techniques are widely adopted in many applications, including recommender systems. Given the sparse and stochastic nature of collaborative filtering (CF) data, recent works have critically analyzed the effective improvement of neural-based approaches compared to simpler and often transparent algorithms for recommendation. Previous results showed that NN and DL models can be outperformed by traditional algorithms in many tasks. Moreover, given the largely black-box nature of neural-based methods, interpretable results are not naturally obtained. Following on this debate, we first present a transparent probabilistic model that topologically organizes user and product latent classes based on the review information. In contrast to popular neural techniques for representation learning, we readily obtain a statistical, visualization-friendly tool that can be easily inspected to understand user and product characteristics from a textual-based perspective. Then, given the limitations of common embedding techniques, we investigate the possibility of using the estimated interpretable quantities as model input for a rating prediction task. To contribute to the recent debates, we evaluate our results in terms of both capacity for interpretability and predictive performances in comparison with popular text-based neural approaches. The results demonstrate that the proposed latent class representations can yield competitive predictive performances, compared to popular, but difficult-to-interpret approaches.

[IR-26] A First Principles Approach to Trust-Based Recommendation Systems

Link: https://arxiv.org/abs/2407.00062
Authors: Paras Stefanopoulos,Ahad N. Zehmakan,Sourin Chatterjee
Keywords: paper explores recommender, explores recommender systems, item rating, paper explores, explores recommender
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:This paper explores recommender systems in social networks which leverage information such as item rating, intra-item similarities, and trust graph. We demonstrate that item-rating information is more influential than other information types in a collaborative filtering approach. The trust graph-based approaches were found to be more robust to network adversarial attacks due to hard-to-manipulate trust structures. Intra-item information, although sub-optimal in isolation, enhances the consistency of predictions and lower-end performance when fused with other information forms. Additionally, the Weighted Average framework is introduced, enabling the construction of recommendation systems around any user-to-user similarity metric.
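The abstract does not define its Weighted Average framework, so the following is only a plausible reading: predict a user's rating for an item as the similarity-weighted average of other users' ratings, where the similarity can come from any user-to-user metric (trust distance, rating correlation, and so on). The function name and data layout are assumptions for illustration.

```python
def weighted_average_prediction(similarities, ratings):
    """Predict a target user's rating for one item.

    similarities: other_user -> similarity to the target user (any metric)
    ratings:      other_user -> that user's rating of the item
    Returns None when no rater has a non-zero similarity.
    """
    numerator = 0.0
    denominator = 0.0
    for user, rating in ratings.items():
        weight = similarities.get(user, 0.0)
        numerator += weight * rating
        denominator += abs(weight)
    if denominator == 0:
        return None
    return numerator / denominator


# Bob is more similar to the target user than Carol, so his rating counts more.
print(weighted_average_prediction({"bob": 0.75, "carol": 0.25}, {"bob": 4, "carol": 2}))
```

Swapping in a different similarity dictionary is all that is needed to build a recommender around another metric, which is the flexibility the abstract attributes to the framework.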

[IR-27] MMBee: Live Streaming Gift-Sending Recommendations via Multi-Modal Fusion and Behaviour Expansion

Link: https://arxiv.org/abs/2407.00056
Authors: Jiaxin Deng,Shiyao Wang,Yuchen Wang,Jiansong Qi,Liqin Zhao,Guorui Zhou,Gaofeng Meng
Keywords: increasingly popular due, Live streaming services, increasingly popular, Live streaming, live streaming gifting
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted at KDD 2024

Abstract:Live streaming services are becoming increasingly popular due to real-time interactions and entertainment. Viewers can chat and send comments or virtual gifts to express their preferences for the streamers. Accurately modeling the gifting interaction not only enhances users’ experience but also increases streamers’ revenue. Previous studies on live streaming gifting prediction treat this task as a conventional recommendation problem, and model users’ preferences using categorical data and observed historical behaviors. However, it is challenging to precisely describe the real-time content changes in live streaming using limited categorical information. Moreover, due to the sparsity of gifting behaviors, capturing the preferences and intentions of users is quite difficult. In this work, we propose MMBee based on real-time Multi-Modal Fusion and Behaviour Expansion to address these issues. Specifically, we first present a Multi-modal Fusion Module with Learnable Query (MFQ) to perceive the dynamic content of streaming segments and process complex multi-modal interactions, including images, text comments and speech. To alleviate the sparsity issue of gifting behaviors, we present a novel Graph-guided Interest Expansion (GIE) approach that learns both user and streamer representations on large-scale gifting graphs with multi-modal attributes. Comprehensive experiment results show that MMBee achieves significant performance improvements on both public datasets and Kuaishou real-world streaming datasets and the effectiveness has been further validated through online A/B experiments. MMBee has been deployed and is serving hundreds of millions of users at Kuaishou.

[IR-28] A Document-based Knowledge Discovery with Microservices Architecture

Link: https://arxiv.org/abs/2407.00053
Authors: Habtom Kahsay Gidey, Mario Kesseler, Patrick Stangl, Peter Hillmann, Andreas Karcher
Keywords: digitally stored data, organizations lies, conversion of analog, digitally stored, analog data
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Abstract:The first step towards digitalization within organizations lies in digitization - the conversion of analog data into digitally stored data. This basic step is the prerequisite for all following activities like the digitalization of processes or the servitization of products or offerings. However, digitization itself often leads to ‘data-rich’ but ‘knowledge-poor’ material. Knowledge discovery and knowledge extraction as approaches try to increase the usefulness of digitized data. In this paper, we point out the key challenges in the context of knowledge discovery and present an approach to addressing these using a microservices architecture. Our solution led to a conceptual design focusing on keyword extraction, similarity calculation of documents, database queries in natural language, and programming language independent provision of the extracted information. In addition, the conceptual design provides referential design guidelines for integrating processes and applications for semi-automatic learning, editing, and visualization of ontologies. The concept also uses a microservices architecture to address non-functional requirements, such as scalability and resilience. The evaluation of the specified requirements is performed using a demonstrator that implements the concept. Furthermore, this modern approach is used in the German patent office in an extended version.

[IR-29] JungleGPT: Designing and Optimizing Compound AI Systems for E-Commerce

Link: https://arxiv.org/abs/2407.00038
Authors: Sherry Ruan, Tian Zhao
Keywords: customer service, significantly advanced, industry by powering, personalized recommendations, recommendations and customer
Categories: Information Retrieval (cs.IR)

Abstract:LLMs have significantly advanced the e-commerce industry by powering applications such as personalized recommendations and customer service. However, most current efforts focus solely on monolithic LLMs and fall short in addressing the complexity and scale of real-world e-commerce scenarios. In this work, we present JungleGPT, the first compound AI system tailored for real-world e-commerce applications. We outline the system’s design and the techniques used to optimize its performance for practical use cases, which have proven to reduce inference costs to less than 1% of what they would be with a powerful, monolithic LLM.

Artificial Intelligence

[AI-0] Scalable Nested Optimization for Deep Learning

Link: https://arxiv.org/abs/2407.01526
Authors: Jonathan Lorraine
Keywords: updating a single, single loss, Gradient-based optimization, single set, minimize a single
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: View more research details at this https URL

Abstract:Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this, where we have a bilevel or nested optimization of which subsets of parameters update on different objectives nested inside each other. We focus on motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when we look at solving these nested problems on a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.

[AI-1] Empowering 3D Visual Grounding with Reasoning Capabilities

Link: https://arxiv.org/abs/2407.01525
Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
Keywords: explicit textual descriptions, reason human intentions, Large Language Model, Multi-modal Large Language, implicit instructions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ECCV24. A comprehensive and hierarchical 3D reasoning grounding benchmark in the era of foundation models. Project page: this https URL

Abstract:Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs from five reasoning types that require the synergization of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back to the enhanced geometry and fine-grained details of the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.

[AI-2] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

Link: https://arxiv.org/abs/2407.01521
Authors: Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, Yang Song
Keywords: solving Bayesian inverse, learned data priors, Bayesian inverse problems, solving Bayesian, recently achieved success
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems. For example, we achieve a PSNR of 30.72dB on the FFHQ 256 dataset for phase retrieval, which is an improvement of 9.12dB compared to existing methods.

[AI-3] Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Link: https://arxiv.org/abs/2407.01518
Authors: Hao Dong, Eleni Chatzi, Olga Fink
Keywords: Multimodal Open-Set Domain, open-set domain generalization, open-set domain, involves recognizing, Multimodal Jigsaw Puzzles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ECCV 2024, code: this https URL

Abstract:The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging with multiple modalities as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of multimodal representative features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to tackle also the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, especially in scenarios where unlabeled data from the target domain is available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at this https URL.

[AI-4] CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Link: https://arxiv.org/abs/2407.01511
Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
Keywords: Multimodal Language Models, Language Models, Multimodal Language, relies on Multimodal, language with GUI
Categories: Artificial Intelligence (cs.AI)

Abstract:The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 100 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 35.26%. All framework code, agent code, and task datasets are publicly available at this https URL.

[AI-5] Self-Cognition in Large Language Models: An Exploratory Study

Link: https://arxiv.org/abs/2407.01505
Authors: Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Keywords: Large Language Models, Large Language, achieved remarkable success, Language Models, self-cognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2024 Large Language Models and Cognition Workshop

Abstract:While Large Language Models (LLMs) have achieved remarkable success across various applications, they also raise concerns regarding self-cognition. In this paper, we perform a pioneering study to explore self-cognition in LLMs. Specifically, we first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition and four well-designed principles to quantify LLMs’ self-cognition. Our study reveals that 4 of the 48 models on Chatbot Arena–specifically Command R, Claude3-Opus, Llama-3-70b-Instruct, and Reka-core–demonstrate some level of detectable self-cognition. We observe a positive correlation between model size, training data quality, and self-cognition level. Additionally, we also explore the utility and trustworthiness of LLM in the self-cognition state, revealing that the self-cognition state enhances some specific tasks such as creative writing and exaggeration. We believe that our work can serve as an inspiration for further research to study the self-cognition in LLMs.

[AI-6] AI Agents That Matter

Link: https://arxiv.org/abs/2407.01502
Authors: Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan
Keywords: research direction, exciting new research, accuracy, agents, agent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, and sometimes none at all. This has led to agents that are fragile because they take shortcuts and overfit to the benchmark in various ways. We prescribe a principled framework for avoiding overfitting. Finally, there is a lack of standardization in evaluation practices, leading to a pervasive lack of reproducibility. We hope that the steps we introduce for addressing these shortcomings will spur the development of agents that are useful in the real world and not just accurate on benchmarks.

[AI-7] RegMix: Data Mixture as Regression for Language Model Pre-training

Link: https://arxiv.org/abs/2407.01492
Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
Keywords: large language model, language model pre-training, mixture remains unclear, effective mixture remains, data mixture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at this https URL.
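The regression idea in this abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a plain linear least-squares model, and the proxy "losses" are synthetic numbers produced from a hypothetical `true_effect` vector rather than real training runs. All function names are made up for illustration.

```python
import numpy as np


def fit_mixture_regression(mixtures, losses):
    """Least-squares fit of loss ~ mixtures @ coef.

    Mixture weights sum to 1, so a separate intercept would be redundant.
    """
    coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)
    return coef


def best_simulated_mixture(coef, n_domains, n_candidates, rng):
    """Sample candidate mixtures and return the one with the lowest predicted loss."""
    candidates = rng.dirichlet(np.ones(n_domains), size=n_candidates)
    predicted = candidates @ coef
    return candidates[np.argmin(predicted)]


rng = np.random.default_rng(0)
n_domains = 3

# Synthetic stand-in for many small proxy runs (hypothetical data, not from the paper).
train_mixtures = rng.dirichlet(np.ones(n_domains), size=64)
true_effect = np.array([1.0, 2.0, 3.0])  # assumed per-domain loss contribution
train_losses = train_mixtures @ true_effect + rng.normal(0.0, 0.01, size=64)

coef = fit_mixture_regression(train_mixtures, train_losses)
best = best_simulated_mixture(coef, n_domains, n_candidates=10_000, rng=rng)
print("fitted coefficients:", coef)
print("top-ranked simulated mixture:", best)
```

In this toy setup the top-ranked mixture puts most of its weight on the domain with the smallest assumed loss contribution. A real pipeline would replace the synthetic losses with measurements from small proxy models.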

[AI-8] LLM See LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Link: https://arxiv.org/abs/2407.01490
Authors: Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker
Keywords: synthetic data, synthetic data raises, data, synthetic, widespread adoption
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to date of how the source of synthetic data shapes models’ internal biases, calibration and generations’ textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear “neutral”, which invites the question of whether this sensitivity can be exploited for good. Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvement in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse range of instructions, means this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.

[AI-9] Agentless: Demystifying LLM-based Software Engineering Agents

Link: https://arxiv.org/abs/2407.01489
Authors: Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang
Keywords: including code synthesis, large language models, software development tasks, Recent advancements, software development
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic two-phase process of localization followed by repair, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (27.33%) and lowest cost ($0.34) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

[AI-10] Tree Search for Language Model Agents

Link: https://arxiv.org/abs/2407.01476
Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
Keywords: Autonomous agents powered, Autonomous agents, perform decision-making tasks, demonstrated promise, search
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages. Models and code available at this https URL

Abstract:Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at this https URL.
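For readers unfamiliar with best-first tree search, the general technique can be sketched on a toy problem. This is an illustrative sketch only, not the paper's algorithm: the integer "environment", the `expand`, `value`, and `is_goal` callables, and the step budget are all hypothetical stand-ins for a real web environment and a learned value function.

```python
import heapq
from itertools import count


def best_first_search(start, expand, value, is_goal, budget):
    """Repeatedly expand the highest-valued frontier state, up to `budget` expansions.

    Returns the path to a goal state if one is reached, otherwise the path to
    the best-scoring state that was popped from the frontier.
    """
    tie_breaker = count()  # keeps heap ordering stable when scores are equal
    frontier = [(-value(start), next(tie_breaker), start, [start])]
    seen = {start}
    best_path, best_score = [start], value(start)

    for _step in range(budget):
        if not frontier:
            break
        neg_score, _tie, state, path = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_score, best_path = -neg_score, path
        if is_goal(state):
            return path
        for child in expand(state):
            if child not in seen:
                seen.add(child)
                heapq.heappush(
                    frontier,
                    (-value(child), next(tie_breaker), child, path + [child]),
                )
    return best_path


# Toy environment (an assumption for illustration): reach 10 from 1 using +1 or *2.
TARGET = 10
path = best_first_search(
    start=1,
    expand=lambda s: [s + 1, s * 2],
    value=lambda s: -abs(TARGET - s),  # higher is better
    is_goal=lambda s: s == TARGET,
    budget=20,
)
print(path)
```

The `budget` argument plays the role of test-time compute here: a larger budget lets the search expand more states before it has to return its best path.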

[AI-11] Graph Neural Network as Computationally Efficient Emulator of Ice-sheet and Sea-level System Model (ISSM)

Link: https://arxiv.org/abs/2407.01464
Authors: Younghyun Koo, Maryam Rahnemoonfar
Keywords: Sea-level System Model, Stokes equations relevant, System Model, Ice-sheet and Sea-level, Sea-level System
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 5 pages, 4 figures, submitted to the 2024 IEEE International Geoscience and Remote Sensing Symposium. arXiv admin note: text overlap with arXiv:2402.05291

Abstract:The Ice-sheet and Sea-level System Model (ISSM) provides solutions for Stokes equations relevant to ice sheet dynamics by employing finite element and fine mesh adaption. However, since its finite element method is compatible only with Central Processing Units (CPU), the ISSM has limits on further economizing computational time. Thus, by taking advantage of Graphics Processing Units (GPUs), we design a graph convolutional network (GCN) as a fast emulator for ISSM. The GCN is trained and tested using the 20-year transient ISSM simulations in the Pine Island Glacier (PIG). The GCN reproduces ice thickness and velocity with a correlation coefficient greater than 0.998, outperforming the traditional convolutional neural network (CNN). Additionally, GCN shows 34 times faster computational speed than the CPU-based ISSM modeling. The GPU-based GCN emulator allows us to predict how the PIG will change in the future under different melting rate scenarios with high fidelity and much faster computational time.

[AI-12] Retrieval-augmented generation in multilingual settings

Link: https://arxiv.org/abs/2407.01463
Authors: Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina
Keywords: improving LLM factuality, large language models, studied in English-only, LLM factuality, English-only settings
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components and with which adjustments are needed to build a well-performing mRAG pipeline, that can be used as a strong baseline in future works. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for multilingual setting, to account for variations in spelling named entities. The main limitations to be addressed in future works include frequent code-switching in non-Latin alphabet languages, occasional fluency errors, wrong reading of the provided documents, or irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at this https URL.

[AI-13] On Implications of Scaling Laws on Feature Superposition

Link: https://arxiv.org/abs/2407.01459
Authors: Pavan Katta
Keywords: theoretical note argues, scaling laws, simultaneously true, results from scaling, theoretical note
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 figure

Abstract:Using results from scaling laws, this theoretical note argues that the following two statements cannot be simultaneously true: 1. Superposition hypothesis where sparse features are linearly represented across a layer is a complete theory of feature representation. 2. Features are universal, meaning two models trained on the same data and achieving equal performance will learn identical features.

[AI-14] Contractual Reinforcement Learning: Pulling Arms with Invisible Hands

Link: https://arxiv.org/abs/2407.01458
Authors: Jibang Wu, Siyu Chen, Mengdi Wang, Huazheng Wang, Haifeng Xu
Keywords: enforce data collection, today large scale, large scale machine, direct content creation, machine learning tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)

Abstract:The agency problem emerges in today’s large scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning economic interests of different stakeholders in the online learning problems through contract design. The problem, termed contractual reinforcement learning, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent’s action policy for their common interests through a set of payment rules contingent on the realization of next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms to untangle the challenges from robust design of contracts to the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\tilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\tilde{O}(T^{2/3})$ regret for the general problem that improves the existing analysis in online contract design with mild technical assumptions.

[AI-15] Information-Theoretic Foundations for Neural Scaling Laws

Link: https://arxiv.org/abs/2407.01456
Authors: Hong Jun Jeon, Benjamin Van Roy
Keywords: Neural scaling laws, scaling laws, scaling laws aim, training dataset size, Neural scaling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2212.01365

Abstract:Neural scaling laws aim to characterize how out-of-sample error behaves as a function of model and training dataset size. Such scaling laws guide allocation of computational resources between model and data processing to minimize error. However, existing theoretical support for neural scaling laws lacks rigor and clarity, entangling the roles of information and optimization. In this work, we develop rigorous information-theoretic foundations for neural scaling laws. This allows us to characterize scaling laws for data generated by a two-layer neural network of infinite width. We observe that the optimal relation between data and model size is linear, up to logarithmic factors, corroborating large-scale empirical investigations. Concise yet general results of the kind we establish may bring clarity to this topic and inform future investigations.

[AI-16] Needle in the Haystack for Memory Based Large Language Models

Link: https://arxiv.org/abs/2407.01437
Authors: Subhajit Chaudhury, Soham Dan, Payel Das, Georgios Kollias, Elliot Nelson
Keywords: augmented Large Language, Large Language Model, Large Language, memory augmented Large, augmented Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages

Abstract:In this paper, we demonstrate the benefits of using memory augmented Large Language Model (LLM) architecture in improving the recall abilities of facts from a potentially long context. As a case study we test LARIMAR, a recently proposed LLM architecture which augments a LLM decoder with an external associative memory, on several long-context recall tasks, including passkey and needle-in-the-haystack tests. We demonstrate that the external memory can be adapted at test time to handle contexts much longer than those seen during training, while keeping readouts from the memory recognizable to the trained decoder and without increasing GPU memory footprint. Compared to alternative architectures for long-context recall tasks with models of a comparable parameter count, LARIMAR is able to maintain strong performance without any task-specific training.

[AI-17] Reinforcement Learning-driven Data-intensive Workflow Scheduling for Volunteer Edge-Cloud

Link: https://arxiv.org/abs/2407.01428
Authors: Motahare Mounesan, Mauro Lemus, Hemanth Yeddulapalli, Prasad Calyam, Saptarshi Debroy
Keywords: community computing paradigm, Volunteer Edge-Cloud, support data-intensive scientific, recent times, community computing
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:In recent times, Volunteer Edge-Cloud (VEC) has gained traction as a cost-effective, community computing paradigm to support data-intensive scientific workflows. However, due to the highly distributed and heterogeneous nature of VEC resources, centralized workflow task scheduling remains a challenge. In this paper, we propose a Reinforcement Learning (RL)-driven data-intensive scientific workflow scheduling approach that takes into consideration: i) workflow requirements, ii) VEC resources’ preference on workflows, and iii) diverse VEC resource policies, to ensure robust resource allocation. We formulate the long-term average performance optimization problem as a Markov Decision Process, which is solved using an event-based Asynchronous Advantage Actor-Critic RL approach. Our extensive simulations and testbed implementations demonstrate our approach’s benefits over popular baseline strategies in terms of workflow requirement satisfaction, VEC preference satisfaction, and available VEC resource utilization.

[AI-18] RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing

Link: https://arxiv.org/abs/2407.01418
Authors: Bo Ai,Stephen Tian,Haochen Shi,Yixuan Wang,Cheston Tan,Yunzhu Li,Jiajun Wu
Keywords: feedback is critical, critical for understanding, rigid and deformable, tactile-informed dynamics model, Tactile feedback
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Robotics: Science and Systems (RSS), 2024. Project page: this https URL

Abstract:Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states, including particles and object-level latent physics information, from historical visuo-tactile observations and to perform future state predictions. Our tactile-informed dynamics model, learned from real-world data, can solve downstream robotics tasks with model-predictive control. We demonstrate our approach on a real robot equipped with a compliant Soft-Bubble tactile sensor on non-prehensile manipulation and dense packing tasks, where the robot must infer the physics properties of objects from direct and indirect interactions. Trained on only an average of 30 minutes of real-world interaction data per task, our model can perform online adaptation and make touch-informed predictions. Through extensive evaluations in both long-horizon dynamics prediction and real-world manipulation, our method demonstrates superior effectiveness compared to previous learning-based and physics-based simulation systems.

[AI-19] Dynamic Few-Shot Learning for Knowledge Graph Question Answering

Link: https://arxiv.org/abs/2407.01409
Authors: Jacopo D’Abramo,Andrea Zugarini,Paolo Torroni
Keywords: innovative Question Answering, Large language models, Knowledge Graphs, Question Answering, Answering over Knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models present opportunities for innovative Question Answering over Knowledge Graphs (KGQA). However, they are not inherently designed for query generation. To bridge this gap, solutions have been proposed that rely on fine-tuning or ad-hoc architectures, achieving good results but limited out-of-domain distribution generalization. In this study, we introduce a novel approach called Dynamic Few-Shot Learning (DFSL). DFSL integrates the efficiency of in-context learning and semantic similarity and provides a generally applicable solution for KGQA with state-of-the-art performance. We run an extensive evaluation across multiple benchmark datasets and architecture configurations.

[AI-20] Semantic Compositions Enhance Vision-Language Contrastive Learning

Link: https://arxiv.org/abs/2407.01408
Authors: Maxwell Aladago,Lorenzo Torresani,Soroush Vosoughi
Keywords: vision-language contrastive learning, leverage within-batch non-matching, within-batch non-matching pairs, contrastive learning, matched image-caption pairs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
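
The composite-sample idea described above can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so everything here is an assumption: images are equally sized 2D lists, the composite takes the left half of one image and the right half of the other, and captions are joined with " and ". The function name `compose_pair` is hypothetical.

```python
def compose_pair(image_a, caption_a, image_b, caption_b):
    """Build one composite image-caption pair from two samples.

    Assumption: the composite keeps the left half of image A and the
    right half of image B (50% of each), and the captions are joined
    with " and ". The paper's actual procedure may differ.
    """
    if len(image_a) != len(image_b) or len(image_a[0]) != len(image_b[0]):
        raise ValueError("images must have the same shape")
    half = len(image_a[0]) // 2
    # Stitch each row: left half from A, right half from B.
    image = [row_a[:half] + row_b[half:] for row_a, row_b in zip(image_a, image_b)]
    caption = f"{caption_a} and {caption_b}"
    return image, caption


img_a = [[1, 1, 1, 1], [1, 1, 1, 1]]
img_b = [[2, 2, 2, 2], [2, 2, 2, 2]]
mixed, text = compose_pair(img_a, "a dog", img_b, "a red car")
print(mixed)  # [[1, 1, 2, 2], [1, 1, 2, 2]]
print(text)   # a dog and a red car
```

Because the composite replaces an ordinary sample rather than adding one, a scheme like this would not change batch size or model parameters, which is in line with the "no additional overhead" claim in the abstract.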

[AI-21] Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters

Link: https://arxiv.org/abs/2407.01406
Authors: Daniil Gurgurov,Mareike Hartmann,Simon Ostermann
Keywords: named entity recognition, Large Language Models, multilingual Large Language, multilingual Large, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, KaLLM workshop

Abstract:This paper explores the integration of graph knowledge from linguistic ontologies into multilingual Large Language Models (LLMs) using adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER). Building upon successful parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we propose a similar approach for incorporating knowledge from multilingual graphs, connecting concepts in various languages with each other through linguistic relationships, into multilingual LLMs for LRLs. Specifically, we focus on eight LRLs – Maltese, Bulgarian, Indonesian, Nepali, Javanese, Uyghur, Tibetan, and Sinhala – and employ language-specific adapters fine-tuned on data extracted from the language-specific section of ConceptNet, aiming to enable knowledge transfer across the languages covered by the knowledge graph. We compare various fine-tuning objectives, including standard Masked Language Modeling (MLM), MLM with full-word masking, and MLM with targeted masking, to analyse their effectiveness in learning and integrating the extracted graph data. Through empirical evaluation on language-specific tasks, we assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER, providing insights into the potential benefits of adapting language models for low-resource scenarios.

[AI-22] Optimization of Retrieval-Augmented Generation Context with Outlier Detection

Link: https://arxiv.org/abs/2407.01403
Authors: Vitaly Bulgakov
Keywords: prompt context required, Large Language Model, retrieved LLM responses, question-answering systems, reduce the size
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Attempts to increase the number of retrieved chunked documents and thereby enlarge the context related to the query can significantly complicate the processing and decrease the performance of a Large Language Model (LLM) when generating responses to queries. It is well known that a large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features that utilize the distances of embedding vectors, retrieved from the vector database, to both the centroid and the query vectors. The methods were evaluated by comparing the similarities of the retrieved LLM responses to ground-truth answers obtained using the OpenAI GPT-4o model. It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
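
A rough sketch of the distance-based filtering described above, under stated assumptions: the abstract only says that features are built from each embedding's distance to the centroid and to the query, so the specific score (sum of the two Euclidean distances) and the z-score cutoff used here are illustrative choices, not the paper's method. The name `filter_outliers` and the toy vectors are hypothetical.

```python
import math
import statistics


def filter_outliers(doc_vectors, query_vector, z_threshold=1.0):
    """Return indices of retrieved documents that are kept as context.

    Assumption: a document's score is its distance to the centroid of
    all retrieved embeddings plus its distance to the query embedding;
    documents whose score z-value exceeds z_threshold are discarded.
    """
    dim = len(query_vector)
    centroid = [sum(v[i] for v in doc_vectors) / len(doc_vectors) for i in range(dim)]
    scores = [math.dist(v, centroid) + math.dist(v, query_vector) for v in doc_vectors]
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        # All documents are equally distant; nothing stands out.
        return list(range(len(doc_vectors)))
    return [i for i, s in enumerate(scores) if (s - mean) / std <= z_threshold]


docs = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [-3.0, 4.0]]
query = [1.0, 0.0]
kept = filter_outliers(docs, query)
print(kept)  # [0, 1, 2]
```

In this toy example the fourth vector lies far from both the query and the other documents, so it is the only one dropped from the prompt context.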

[AI-23] Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

Link: https://arxiv.org/abs/2407.01397
Authors: Matteo Mosconi,Andriy Sorokin,Aniello Panariello,Angelo Porrello,Jacopo Bonato,Marco Cotogni,Luigi Sabetta,Simone Calderara,Rita Cucchiara
Keywords: deep learning models, efficiently and effectively, skeletal data, data allows deep, models to perform
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICPR 2024

Abstract:The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at this https URL.

[AI-24] Badllama 3: removing safety finetuning from Llama 3 in minutes

Link: https://arxiv.org/abs/2407.01376
Authors: Dmitrii Volkov
Keywords: extensive LLM safety, LLM safety fine-tuning, extensive LLM, model weights, LLM safety
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

[AI-25] Hyperspectral Pansharpening: Critical Review, Tools and Future Perspectives

Link: https://arxiv.org/abs/2407.01355
Authors: Matteo Ciotola,Giuseppe Guarino,Gemine Vivone,Giovanni Poggi,Jocelyn Chanussot,Antonio Plaza,Giuseppe Scarpa
Keywords: Hyperspectral pansharpening consists, low-resolution hyperspectral image, low-resolution hyperspectral, high-resolution panchromatic band, Hyperspectral pansharpening
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral pansharpening consists of fusing a high-resolution panchromatic band and a low-resolution hyperspectral image to obtain a new image with high resolution in both the spatial and spectral domains. These remote sensing products are valuable for a wide range of applications, driving ever growing research efforts. Nonetheless, results still do not meet application demands. In part, this comes from the technical complexity of the task: compared to multispectral pansharpening, many more bands are involved, in a spectral range only partially covered by the panchromatic component and with overwhelming noise. However, another major limiting factor is the absence of a comprehensive framework for the rapid development and accurate evaluation of new methods. This paper attempts to address this issue. We started by designing a dataset large and diverse enough to allow reliable training (for data-driven methods) and testing of new methods. Then, we selected a set of state-of-the-art methods, following different approaches, characterized by promising performance, and reimplemented them in a single PyTorch framework. Finally, we carried out a critical comparative analysis of all methods, using the most accredited quality indicators. The analysis highlights the main limitations of current solutions in terms of spectral/spatial quality and computational efficiency, and suggests promising research directions. To ensure full reproducibility of the results and support future research, the framework (including codes, evaluation procedures and links to the dataset) is shared on this https URL, as a single Python-based reference benchmark toolbox.

[AI-26] Coordination Failure in Cooperative Offline MARL

Link: https://arxiv.org/abs/2407.01343
Authors: Callum Rhys Tilbury,Claude Formanek,Louise Beyers,Jonathan P. Shock,Arnu Pretorius
Keywords: optimal multi-agent control, learn optimal multi-agent, leverages static datasets, multi-agent reinforcement learning, experience to learn
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at the Workshop on Aligning Reinforcement Learning Experimentalists and Theorists (ARLET) at the International Conference on Machine Learning, 2024

Abstract:Offline multi-agent reinforcement learning (MARL) leverages static datasets of experience to learn optimal multi-agent control. However, learning from static data presents several unique challenges to overcome. In this paper, we focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data, focusing on a common setting we refer to as the ‘Best Response Under Data’ (BRUD) approach. By using two-player polynomial games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms, which can lead to catastrophic coordination failure in the offline setting. Building on these insights, we propose an approach to mitigate such failure by prioritising samples from the dataset based on joint-action similarity during policy learning, and demonstrate its effectiveness in detailed experiments. More generally, however, we argue that prioritised dataset sampling is a promising area for innovation in offline MARL that can be combined with other effective approaches such as critic and policy regularisation. Importantly, our work shows how insights drawn from simplified, tractable games can lead to useful, theoretically grounded insights that transfer to more complex contexts. A core part of our offering is an interactive notebook from which almost all of our results can be reproduced in a browser.

[AI-27] Deep Reinforcement Learning for Adverse Garage Scenario Generation

Link: https://arxiv.org/abs/2407.01333
Authors: Kai Li
Keywords: billion miles, ensure their safety, miles to ensure, simulation testing, Autonomous vehicles
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages, 17 figures

Abstract:Autonomous vehicles need to travel over 11 billion miles to ensure their safety. Therefore, the importance of simulation testing before real-world testing is self-evident. In recent years, the release of 3D simulators for autonomous driving, represented by Carla and CarSim, marks the transition of autonomous driving simulation testing environments from simple 2D overhead views to complex 3D models. During simulation testing, experimenters need to build static scenes and dynamic traffic flows, pedestrian flows, and other experimental elements to construct experimental scenarios. When building static scenes in 3D simulators, experimenters often need to manually construct 3D models, set parameters and attributes, which is time-consuming and labor-intensive. This thesis proposes an automated program generation framework. Based on deep reinforcement learning, this framework can generate different 2D ground script codes, on which 3D model files and map model files are built. The generated 3D ground scenes are displayed in the Carla simulator, where experimenters can use this scene for navigation algorithm simulation testing.

[AI-28] Restyling Unsupervised Concept Based Interpretable Networks with Generative Models

Link: https://arxiv.org/abs/2407.01331
Authors: Jayneel Parekh,Quentin Bouniot,Pavlo Mozharovskyi,Alasdair Newson,Florence d’Alché-Buc
Keywords: Developing inherently interpretable, Developing inherently, inherently interpretable models, recent years, gained prominence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page available at this https URL

Abstract:Developing inherently interpretable models for prediction has gained prominence in recent years. A subclass of these models, wherein the interpretable network relies on learning high-level concepts, is valued because of the closeness of concept representations to human communication. However, the visualization and understanding of the learnt unsupervised dictionary of concepts encounter major limitations, especially for large-scale images. We propose here a novel method that relies on mapping the concept features to the latent space of a pretrained generative model. The use of a generative model enables high quality visualization, and naturally lays out an intuitive and interactive procedure for better interpretation of the learnt concepts. Furthermore, leveraging pretrained generative models has the additional advantage of making the training of the system more efficient. We quantitatively ascertain the efficacy of our method in terms of accuracy of the interpretable prediction network, fidelity of reconstruction, as well as faithfulness and consistency of learnt concepts. The experiments are conducted on multiple image recognition benchmarks for large-scale images. Project page available at this https URL

[AI-29] Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning

Link: https://arxiv.org/abs/2407.01320
Authors: Haobo Song,Hao Zhao,Soumajit Majumder,Tao Lin
Keywords: large pre-trained foundation, Fine-tuning large pre-trained, pre-trained foundation models, large pre-trained, pre-trained foundation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2024. Code at this https URL

Abstract:Fine-tuning large pre-trained foundation models, such as the 175B GPT-3, has attracted more attention for downstream tasks recently. While parameter-efficient fine-tuning methods have been proposed and proven effective without retraining all model parameters, their performance is limited by the capacity of incremental modules, especially under constrained parameter budgets. To overcome this challenge, we propose CapaBoost, a simple yet effective strategy that enhances model capacity by leveraging low-rank updates through parallel weight modules in target layers. By applying static random masks to the shared weight matrix, CapaBoost constructs a diverse set of weight matrices, effectively increasing the rank of incremental weights without adding parameters. Notably, our approach can be seamlessly integrated into various existing parameter-efficient fine-tuning methods. We extensively validate the efficacy of CapaBoost through experiments on diverse downstream tasks, including natural language understanding, question answering, and image classification. Our results demonstrate significant improvements over baselines, without incurring additional computation or storage costs. Our code is available at this https URL.
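
The rank argument in the abstract above can be checked on a toy case. This is only a sketch under assumptions: it uses LoRA-style shared factors `B` and `A` of rank 1, two parallel modules, and fixed hand-picked binary masks standing in for the static random masks (fixed so the result is reproducible). All names are hypothetical, and the paper's actual construction may differ.

```python
from fractions import Fraction


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def apply_mask(weights, mask):
    # Element-wise product of a weight matrix with a binary mask.
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def rank(matrix):
    # Exact rank via Gauss-Jordan elimination over rationals.
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Shared low-rank factors (rank 1): B is 4x1, A is 1x4.
B = [[1], [2], [3], [4]]
A = [[1, 1, 1, 1]]

# Assumed static masks, one pair per parallel module.
mask_b1, mask_a1 = [[1], [1], [0], [0]], [[1, 0, 1, 0]]
mask_b2, mask_a2 = [[0], [0], [1], [1]], [[0, 1, 0, 1]]

plain_update = matmul(B, A)
boosted_update = add(
    matmul(apply_mask(B, mask_b1), apply_mask(A, mask_a1)),
    matmul(apply_mask(B, mask_b2), apply_mask(A, mask_a2)),
)
print(rank(plain_update))    # 1
print(rank(boosted_update))  # 2
```

Both modules reuse the same stored values in `B` and `A`, so the parameter count is unchanged, yet the summed update has a higher rank than the unmasked product.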

[AI-30] Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios

Link: https://arxiv.org/abs/2407.01317
Authors: Juan Ignacio Alvarez-Trejos,Beltrán Labrador,Alicia Lozano-Diez
Keywords: effectively handling speech, handling speech overlap, neural speaker diarization, speaker diarization systems, overlap handling strengths
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to Odyssey 2024

Abstract:End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap. This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities, while maintaining their overlap handling strengths. To achieve this, we propose several methods for incorporating these embeddings alongside the acoustic features. Furthermore, we delve into an analysis of the correct handling of silence frames, the window length for extracting speaker embeddings and the transformer encoder size. The effectiveness of our proposed approach is thoroughly evaluated on the CallHome dataset for the two-speaker diarization task, with results that demonstrate a significant reduction in diarization error rates, achieving a relative improvement of 10.78% compared to the baseline end-to-end model.

[AI-31] Robot Instance Segmentation with Few Annotations for Grasping

Link: https://arxiv.org/abs/2407.01302
Authors: Moshe Kimhi,David Vainshtein,Chaim Baskin,Dotan Di Castro
Keywords: manipulate objects relies, objects relies heavily, ability of robots, robots to manipulate, relies heavily
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and leverage visual consistency despite temporal gaps without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an $\text{AP}_{50}$ of 86.37, almost a 20% improvement over existing work, and obtain remarkable results in scenarios with extremely low annotation, achieving an $\text{AP}_{50}$ score of 84.89 with just 1% of annotated data compared to 72 presented in ARMBench on the fully annotated counterpart.

[AI-32] Collaborative Performance Prediction for Large Language Models

Link: https://arxiv.org/abs/2407.01300
Authors: Qiyuan Zhang,Fuyuan Lyu,Xue Liu,Chen Ma
Keywords: NLP research, Comprehensively understanding, challenge in NLP, large language models, diverse downstream tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. Pioneering work on scaling laws for downstream tasks demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, these works tend to overlook the similarities between model families and only consider design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect collaborative data sourced from online platforms containing both historical performance and additional design factors. With the support of the collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.

[AI-33] A Collaborative, Human-Centred Taxonomy of AI, Algorithmic, and Automation Harms

Link: https://arxiv.org/abs/2407.01294
Authors: Gavin Abercrombie,Djalel Benbouzid,Paolo Giudici,Delaram Golpayegani,Julio Hernandez,Pierre Noro,Harshvardhan Pandit,Eva Paraschou,Charlie Pownall,Jyoti Prajapati,Mark A. Sayre,Ushnish Sengupta,Arthit Suriyawongful,Ruby Thelot,Sofia Vei,Laura Waltersdorfer
Keywords: introduces a collaborative, algorithmic and automation, paper introduces, human-centered taxonomy, existing taxonomies
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:This paper introduces a collaborative, human-centered taxonomy of AI, algorithmic and automation harms. We argue that existing taxonomies, while valuable, can be narrow, unclear, typically cater to practitioners and government, and often overlook the needs of the wider public. Drawing on existing taxonomies and a large repository of documented incidents, we propose a taxonomy that is clear and understandable to a broad set of audiences, as well as being flexible, extensible, and interoperable. Through iterative refinement with topic experts and crowdsourced annotation testing, we propose a taxonomy that can serve as a powerful tool for civil society organisations, educators, policymakers, product teams and the general public. By fostering a greater understanding of the real-world harms of AI and related technologies, we aim to increase understanding, empower NGOs and individuals to identify and report violations, inform policy discussions, and encourage responsible technology development and deployment.

[AI-34] Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space

Link: https://arxiv.org/abs/2407.01290
Authors: Menglin Yang,Harshit Verma,Delvin Ce Zhang,Jiahong Liu,Irwin King,Rex Ying
Keywords: modeling complex structured, hyperbolic Transformer, shown significant potential, Hyperbolic, complex structured data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: KDD 2024

Abstract:Hyperbolic geometry has shown significant potential in modeling complex structured data, particularly those with underlying tree-like and hierarchical structures. Despite the impressive performance of various hyperbolic neural networks across numerous domains, research on adapting the Transformer to hyperbolic space remains limited. Previous attempts have mainly focused on modifying self-attention modules in the Transformer. However, these efforts have fallen short of developing a complete hyperbolic Transformer. This stems primarily from: (i) the absence of well-defined modules in hyperbolic space, including linear transformation layers, LayerNorm layers, activation functions, dropout operations, etc.; and (ii) the quadratic time complexity of the existing hyperbolic self-attention module w.r.t. the number of input tokens, which hinders its scalability. To address these challenges, we propose Hypformer, a novel hyperbolic Transformer based on the Lorentz model of hyperbolic geometry. In Hypformer, we introduce two foundational blocks that define the essential modules of the Transformer in hyperbolic space. Furthermore, we develop a linear self-attention mechanism in hyperbolic space, enabling hyperbolic Transformer to process billion-scale graph data and long-sequence inputs for the first time. Our experimental results confirm the effectiveness and efficiency of Hypformer across various datasets, demonstrating its potential as an effective and scalable solution for large-scale data representation and large models.

[AI-35] We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Link: https://arxiv.org/abs/2407.01284
Authors: Runqi Qiao,Qiuna Tan,Guanting Dong,Minhui Wu,Chong Sun,Xiaoshuai Song,Zhuoma GongQue,Shanglin Lei,Zhe Wei,Miaoxuan Zhang,Runfeng Qiao,Yifan Zhang,Xiao Zong,Yida Xu,Muxi Diao,Zhimin Bao,Chen Li,Honggang Zhang
Keywords: Large Multimodal Models, Multimodal Models, Large Multimodal, received widespread attention, Visual mathematical reasoning
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: Work in progress

Abstract:Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at this https URL.

[AI-36] Bridging Smoothness and Approximation: Theoretical Insights into Over-Smoothing in Graph Neural Networks

Link: https://arxiv.org/abs/2407.01281
Authors: Guangrui Yang,Jianfei Li,Ming Li,Han Feng,Ding-Xuan Zhou
Keywords: Graph Convolutional Networks, approximation, approximation theory, functions defined, GCNs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA)
Comments:

Abstract:In this paper, we explore the approximation theory of functions defined on graphs. Our study builds upon the approximation results derived from the $K$-functional. We establish a theoretical framework to assess the lower bounds of approximation for target functions using Graph Convolutional Networks (GCNs) and examine the over-smoothing phenomenon commonly observed in these networks. Initially, we introduce the concept of a $K$-functional on graphs, establishing its equivalence to the modulus of smoothness. We then analyze a typical type of GCN to demonstrate how the high-frequency energy of the output decays, an indicator of over-smoothing. This analysis provides theoretical insights into the nature of over-smoothing within GCNs. Furthermore, we establish a lower bound for the approximation of target functions by GCNs, which is governed by the modulus of smoothness of these functions. This finding offers a new perspective on the approximation capabilities of GCNs. In our numerical experiments, we analyze several widely applied GCNs and observe the phenomenon of energy decay. These observations corroborate our theoretical results on exponential decay order.
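
The energy decay mentioned above can be observed on a toy graph. The sketch below is a simplification under assumptions, not the paper's analysis: it uses only the linear propagation step of a GCN (symmetric-normalized adjacency with self-loops, no weights or nonlinearity) on a 4-node path graph, and it uses the Dirichlet-type energy x^T (I - P) x as a stand-in for "high-frequency energy". All function names are hypothetical.

```python
import math


def normalized_propagation(adjacency):
    """Return P = D^{-1/2} (A + I) D^{-1/2}, with D the degrees of A + I."""
    n = len(adjacency)
    a_tilde = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def propagate(p, x):
    # One linear propagation step: x <- P x.
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in p]


def dirichlet_energy(p, x):
    # x^T (I - P) x, a smoothness measure: large for oscillating signals.
    px = propagate(p, x)
    return sum(x_i * (x_i - px_i) for x_i, px_i in zip(x, px))


# Path graph 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
p = normalized_propagation(adjacency)

x = [1.0, -1.0, 1.0, -1.0]  # highly oscillating input signal
energies = []
for _ in range(6):
    energies.append(dirichlet_energy(p, x))
    x = propagate(p, x)

print([round(e, 4) for e in energies])  # values shrink with every layer
```

Each propagation step scales every non-constant spectral component by a factor of magnitude below one, so the energy shrinks geometrically with depth, which is the qualitative behavior the abstract associates with over-smoothing.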

[AI-37] Human-Robot Mutual Learning through Affective-Linguistic Interaction and Differential Outcomes Training [Pre-Print]

Link: https://arxiv.org/abs/2407.01280
Authors: Emilia Heikkinen, Elsa Silvennoinen, Imran Khan, Zakaria Lemhaouri, Laura Cohen, Lola Cañamero, Robert Lowe
Keywords: Large Language Models, Language Models, Large Language, success of Large, differential outcomes training
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages, with references; 1 figure, 3 tables

Abstract:Owing to the recent success of Large Language Models, Modern A.I has been much focused on linguistic interactions with humans but less focused on non-linguistic forms of communication between man and machine. In the present paper, we test how affective-linguistic communication, in combination with differential outcomes training, affects mutual learning in a human-robot context. Taking inspiration from child-caregiver dynamics, our human-robot interaction setup consists of a (simulated) robot attempting to learn how best to communicate internal, homeostatically-controlled needs; while a human “caregiver” attempts to learn the correct object to satisfy the robot’s present communicated need. We studied the effects of i) human training type, and ii) robot reinforcement learning type, to assess mutual learning terminal accuracy and rate of learning (as measured by the average reward achieved by the robot). Our results find mutual learning between a human and a robot is significantly improved with Differential Outcomes Training (DOT) compared to Non-DOT (control) conditions. We find further improvements when the robot uses an exploration-exploitation policy selection, compared to purely exploitation policy selection. These findings have implications for utilizing socially assistive robots (SAR) in therapeutic contexts, e.g. for cognitive interventions, and educational applications.

[AI-38] The African Woman is Rhythmic and Soulful: Evaluation of Open-ended Generation for Implicit Biases

Link: https://arxiv.org/abs/2407.01270
Authors: Serene Lim
Keywords: Large Language Models, Language Models, Large Language, LLM Decision Bias, demonstrate underlying prejudices
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This study investigates the subtle and often concealed biases present in Large Language Models (LLMs), which, despite passing explicit bias tests, can still exhibit implicit biases akin to those observed in humans who profess egalitarian beliefs yet demonstrate underlying prejudices. The challenge of measuring such biases is exacerbated as LLMs become increasingly proprietary, restricting access to their internal mechanisms such as embeddings, which are crucial for applying traditional bias measures. To tackle these issues, this study introduces innovative measures of bias inspired by psychological methodologies: the LLM Implicit Association Test (IAT) Bias and the LLM Decision Bias. The LLM IAT Bias is a prompt-based method designed to unearth implicit biases by simulating the well-known psychological IAT but adapted for use with LLMs. The LLM Decision Bias measure is developed to detect subtle discrimination in decision-making tasks, focusing on how LLMs choose between individuals in various scenarios. Open-ended generation is also utilised through thematic analysis of word generations and storytelling. The experiments revealed biases across gender and racial domains, from discriminatory categorisations to exoticisation. Our findings indicate that the prompt-based measure of implicit bias not only correlates with traditional embedding-based methods but also more effectively predicts downstream behaviors, which are crucially measured by the LLM Decision Bias. This relationship underscores the importance of relative, rather than absolute, evaluations in assessing implicit biases, reflecting psychological insights into human bias assessment. This research contributes to the broader understanding of AI ethics and provides suggestions for continually assessing and mitigating biases in advanced AI systems, emphasising the need for more qualitative and downstream focus.

[AI-39] QUEEN: Query Unlearning against Model Extraction

Link: https://arxiv.org/abs/2407.01251
Authors: Huajie Chen, Tianqing Zhu, Lefeng Zhang, Bo Liu, Derui Wang, Wanlei Zhou, Minhui Xue
Keywords: Model extraction attacks, Model, deep learning models, Model extraction, piracy model
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Model extraction attacks currently pose a non-negligible threat to the security and privacy of deep learning models. By querying the model with a small dataset and using the query results as the ground-truth labels, an adversary can steal a piracy model with performance comparable to the original model. Two key issues that cause the threat are, on the one hand, accurate and unlimited queries can be obtained by the adversary; on the other hand, the adversary can aggregate the query results to train the model step by step. The existing defenses usually employ model watermarking or fingerprinting to protect the ownership. However, these methods cannot proactively prevent the violation from happening. To mitigate the threat, we propose QUEEN (QUEry unlEarNing) that proactively launches counterattacks on potential model extraction attacks from the very beginning. To limit the potential threat, QUEEN has sensitivity measurement and outputs perturbation that prevents the adversary from training a piracy model with high performance. In sensitivity measurement, QUEEN measures the single query sensitivity by its distance from the center of its cluster in the feature space. To reduce the learning accuracy of attacks, for the highly sensitive query batch, QUEEN applies query unlearning, which is implemented by gradient reverse to perturb the softmax output such that the piracy model will generate reverse gradients to worsen its performance unconsciously. Experiments show that QUEEN outperforms the state-of-the-art defenses against various model extraction attacks with a relatively low cost to the model accuracy. The artifact is publicly available at https://anonymous.4open.science/r/queen implementation-5408/.

[AI-40] SINKT: A Structure-Aware Inductive Knowledge Tracing Model with Large Language Model

Link: https://arxiv.org/abs/2407.01245
Authors: Lingyue Fu, Hao Guan, Kounianhua Du, Jianghao Lin, Wei Xia, Weinan Zhang, Ruiming Tang, Yasheng Wang, Yong Yu
Keywords: intelligent tutoring systems, Inductive Knowledge Tracing, Knowledge Tracing, Knowledge Tracing model, aims to determine
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Knowledge Tracing (KT) aims to determine whether students will respond correctly to the next question, which is a crucial task in intelligent tutoring systems (ITS). In educational KT scenarios, transductive ID-based methods often face severe data sparsity and cold start problems, where interactions between individual students and questions are sparse, and new questions and concepts consistently arrive in the database. In addition, existing KT models only implicitly consider the correlation between concepts and questions, lacking direct modeling of the more complex relationships in the heterogeneous graph of concepts and questions. In this paper, we propose a Structure-aware Inductive Knowledge Tracing model with large language model (dubbed SINKT), which, for the first time, introduces large language models (LLMs) and realizes inductive knowledge tracing. Firstly, SINKT utilizes LLMs to introduce structural relationships between concepts and constructs a heterogeneous graph for concepts and questions. Secondly, by encoding concepts and questions with LLMs, SINKT incorporates semantic information to aid prediction. Finally, SINKT predicts the student’s response to the target question by interacting with the student’s knowledge state and the question representation. Experiments on four real-world datasets demonstrate that SINKT achieves state-of-the-art performance among 12 existing transductive KT models. Additionally, we explore the performance of SINKT on the inductive KT task and provide insights into various modules.

[AI-41] SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism

Link: https://arxiv.org/abs/2407.01239
Authors: Ao Liang, Wenyu Chen, Jian Fang, Huaici Zhao
Keywords: attracted widespread research, widespread research interest, research interest due, fast inference speed, inference speed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 16 figures

Abstract:The single-stage point-based 3D object detectors have attracted widespread research interest due to their advantages of lightweight and fast inference speed. However, they still face challenges such as inadequate learning of low-quality objects (ILQ) and misalignment between localization accuracy and classification confidence (MLC). In this paper, we propose SGCCNet to alleviate these two issues. For ILQ, SGCCNet adopts a Saliency-Guided Data Augmentation (SGDA) strategy to enhance the robustness of the model on low-quality objects by reducing its reliance on salient features. Specifically, we construct a classification task and then approximate the saliency scores of points by moving points towards the point cloud centroid in a differentiable process. During the training process, SGCCNet will be forced to learn from low saliency features through dropping points. Meanwhile, to avoid internal covariate shift and contextual features forgetting caused by dropping points, we add a geometric normalization module and skip connection block in each stage. For MLC, we design a Confidence Correction Mechanism (CCM) specifically for point-based multi-class detectors. This mechanism corrects the confidence of the current proposal by utilizing the predictions of other key points within the local region in the post-processing stage. Extensive experiments on the KITTI dataset demonstrate the generality and effectiveness of our SGCCNet. On the KITTI test set, SGCCNet achieves 80.82% for the metric of AP_3D on the Moderate level, outperforming all other point-based detectors, surpassing IA-SSD and Fast Point R-CNN by 2.35% and 3.42%, respectively. Additionally, SGCCNet demonstrates excellent portability for other point-based detectors.

[AI-42] Large Language Models are Zero-Shot Recognizers for Activities of Daily Living

Link: https://arxiv.org/abs/2407.01238
Authors: Gabriele Civitarese, Michele Fiori, Priyankar Choudhary, Claudio Bettini
Keywords: Daily Living, Large Language Models, energy management, smart home environments, home environments enables
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: Currently under review

Abstract:The sensor-based recognition of Activities of Daily Living (ADLs) in smart home environments enables several applications in the areas of energy management, safety, well-being, and healthcare. ADLs recognition is typically based on deep learning methods requiring large datasets to be trained. Recently, several studies proved that Large Language Models (LLMs) effectively capture common-sense knowledge about human activities. However, the effectiveness of LLMs for ADLs recognition in smart home environments still deserves to be investigated. In this work, we propose ADL-LLM, a novel LLM-based ADLs recognition system. ADL-LLM transforms raw sensor data into textual representations that are processed by an LLM to perform zero-shot ADLs recognition. Moreover, in the scenario where a small labeled dataset is available, ADL-LLM can also be empowered with few-shot prompting. We evaluated ADL-LLM on two public datasets, showing its effectiveness in this domain.

[AI-43] MIRAI: Evaluating LLM Agents for Event Forecasting

Link: https://arxiv.org/abs/2407.01231
Authors: Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang
Keywords: solve complex problems, LLM agents, Large Language Models, empowered LLM agents, Recent advancements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 66 pages, 8 figures, 6 tables; Website: this https URL

Abstract:Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, increasing interests have been put into employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite such a growing interest, there is a lack of a rigorous benchmark of LLM agents’ forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents’ abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents’ capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes using domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.

[AI-44] Let Hybrid A* Path Planner Obey Traffic Rules: A Deep Reinforcement Learning-Based Planning Framework

Link: https://arxiv.org/abs/2407.01216
Authors: Xibo Li, Shruti Patel, Christof Büskens
Keywords: Deep reinforcement learning, Deep reinforcement, maximizes self-defined rewards, reinforcement learning, lane change command
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Deep reinforcement learning (DRL) allows a system to interact with its environment and take actions by training an efficient policy that maximizes self-defined rewards. In autonomous driving, it can be used as a strategy for high-level decision making, whereas low-level algorithms such as the hybrid A* path planning have proven their ability to solve the local trajectory planning problem. In this work, we combine these two methods where the DRL makes high-level decisions such as lane change commands. After obtaining the lane change command, the hybrid A* planner is able to generate a collision-free trajectory to be executed by a model predictive controller (MPC). In addition, the DRL algorithm is able to keep the lane change command consistent within a chosen time-period. Traffic rules are implemented using linear temporal logic (LTL), which is then utilized as a reward function in DRL. Furthermore, we validate the proposed method on a real system to demonstrate its feasibility from simulation to implementation on real hardware.

[AI-45] Revisiting Random Walks for Learning on Graphs

Link: https://arxiv.org/abs/2407.01214
Authors: Jinwoo Kim, Olga Zaghen, Ayhan Suleymanzade, Youngmin Ryou, Seunghoon Hong
Keywords: directly make vertex-level, random walk neural, walk neural networks, random walk, walk neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages, 11 figures

Abstract:We revisit a simple idea for machine learning on graphs, where a random walk on a graph produces a machine-readable record, and this record is processed by a deep neural network to directly make vertex-level or graph-level predictions. We refer to these stochastic machines as random walk neural networks, and show that we can design them to be isomorphism invariant while capable of universal approximation of graph functions in probability. A useful finding is that almost any kind of record of random walk guarantees probabilistic invariance as long as the vertices are anonymized. This enables us to record random walks in plain text and adopt a language model to read these text records to solve graph tasks. We further establish a parallelism to message passing neural networks using tools from Markov chain theory, and show that over-smoothing in message passing is alleviated by construction in random walk neural networks, while over-squashing manifests as probabilistic under-reaching. We show that random walk neural networks based on pre-trained language models can solve several hard problems on graphs, such as separating strongly regular graphs where the 3-WL test fails, counting substructures, and transductive classification on arXiv citation network without training. Code is available at this https URL.
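
The idea of recording an anonymized random walk as plain text can be sketched as follows. This is a minimal illustration under assumptions: the function names, the "-" separated record format, and the toy graphs are hypothetical and not taken from the paper's code. Vertices are renamed by order of first visit, so the record does not depend on the original vertex labels.

```python
import random

def random_walk(adjacency, start, length, rng):
    """Uniform random walk: repeatedly step to a random neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(adjacency[walk[-1]]))
    return walk

def anonymize(walk):
    """Rename vertices by order of first visit: 1, 2, 3, ..."""
    names = {}
    return [names.setdefault(v, len(names) + 1) for v in walk]

def to_text(walk):
    """Plain-text record that a language model could read."""
    return "-".join(str(i) for i in anonymize(walk))

# Two isomorphic toy graphs that differ only in vertex labels.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
relabeled = {"w": ["x", "y"], "x": ["w", "y"], "y": ["w", "x", "z"], "z": ["y"]}

# The same seed drives corresponding walks, so the records coincide.
record_1 = to_text(random_walk(graph, "a", 10, random.Random(0)))
record_2 = to_text(random_walk(relabeled, "w", 10, random.Random(0)))
```

Sharing the seed is only a device to make the comparison deterministic here; the invariance claimed in the abstract is probabilistic, that is, it concerns the distribution of records.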

[AI-46] Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model

Link: https://arxiv.org/abs/2407.01211
Authors: Zongshuo Li, Ding Huo, Markus Meurer, Thomas Bergs
Keywords: final geometric precision, wear conditions impact, Tool wear conditions, Tool wear, geometric precision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Tool wear conditions impact the surface quality of the workpiece and its final geometric precision. In this research, we propose an efficient tool wear segmentation approach based on Segment Anything Model, which integrates U-Net as an automated prompt generator to streamline the processes of tool wear detection. Our evaluation covered three Point-of-Interest generation methods and further investigated the effects of variations in training dataset sizes and U-Net training intensities on resultant wear segmentation outcomes. The results consistently highlight our approach’s advantage over U-Net, emphasizing its ability to achieve accurate wear segmentation even with limited training datasets. This feature underscores its potential applicability in industrial scenarios where datasets may be limited.

[AI-47] Deep Learning Approach for Enhanced Transferability and Learning Capacity in Tool Wear Estimation

Link: https://arxiv.org/abs/2407.01200
Authors: Zongshuo Li, Markus Meurer, Thomas Bergs
Keywords: monitoring systems obtain, systems obtain valuable, obtain valuable information, contemporary manufacturing, monitoring systems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:As an integral part of contemporary manufacturing, monitoring systems obtain valuable information during machining to oversee the condition of both the process and the machine. Recently, diverse algorithms have been employed to detect tool wear using single or multiple sources of measurements. In this study, a deep learning approach is proposed for estimating tool wear, considering cutting parameters. The model’s accuracy and transferability in tool wear estimation were assessed with milling experiments conducted under varying cutting parameters. The results indicate that the proposed method outperforms conventional methods in terms of both transferability and rapid learning capabilities.

[AI-48] Deep Learning Based Tool Wear Estimation Considering Cutting Conditions

Link: https://arxiv.org/abs/2407.01199
Authors: Zongshuo Li, Markus Meurer, Thomas Bergs
Keywords: tool wear estimation, wear conditions impact, wear estimation accuracy, Tool wear conditions, Tool wear
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Tool wear conditions impact the final quality of the workpiece. In this study, we propose a deep learning approach based on a convolutional neural network that incorporates cutting conditions as extra model inputs, aiming to improve tool wear estimation accuracy and fulfill industrial demands for zero-shot transferability. Through a series of milling experiments under various cutting parameters, we evaluate the model’s performance in terms of tool wear estimation accuracy and its transferability to new fixed or variable cutting parameters. The results consistently highlight our approach’s advantage over conventional models that omit cutting conditions, maintaining superior performance irrespective of the stability of the wear development or the limitation of the training dataset. This finding underscores its potential applicability in industrial scenarios.

[AI-49] MARS: Multimodal Active Robotic Sensing for Articulated Characterization

Link: https://arxiv.org/abs/2407.01191
Authors: Hongliang Zeng, Ping Zhang, Chengjiong Wu, Jiahua Wang, Tingyu Ye, Fang Li
Keywords: Precise perception, empowering service robots, empowering service, Precise
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Precise perception of articulated objects is vital for empowering service robots. Recent studies mainly focus on point cloud, a single-modal approach, often neglecting vital texture and lighting details and assuming ideal conditions like optimal viewpoints, unrepresentative of real-world scenarios. To address these limitations, we introduce MARS, a novel framework for articulated object characterization. It features a multi-modal fusion module utilizing multi-scale RGB features to enhance point cloud features, coupled with reinforcement learning-based active sensing for autonomous optimization of observation viewpoints. In experiments conducted with various articulated object instances from the PartNet-Mobility dataset, our method outperformed current state-of-the-art methods in joint parameter estimation accuracy. Additionally, through active sensing, MARS further reduces errors, demonstrating enhanced efficiency in handling suboptimal viewpoints. Furthermore, our method effectively generalizes to real-world articulated objects, enhancing robot interactions. Code is available at this https URL.

[AI-50] Memory^3: Language Modeling with Explicit Memory

Link: https://arxiv.org/abs/2407.01178
Authors: Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, Weinan E
Keywords: large language models, meaningful computation, large language, costly process, process that transports
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining “abstract knowledge”. As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named Memory^3, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.

[AI-51] Multi-View Black-Box Physical Attacks on Infrared Pedestrian Detectors Using Adversarial Infrared Grid

Link: https://arxiv.org/abs/2407.01168
Authors: Kalibinuer Tiliwalidi, Chengyin Hu, Weiwen Shi
Keywords: extensive research exists, visible spectrum, research exists, infrared spectrum, attacks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:While extensive research exists on physical adversarial attacks within the visible spectrum, studies on such techniques in the infrared spectrum are limited. Infrared object detectors are vital in modern technological applications but are susceptible to adversarial attacks, posing significant security threats. Previous studies using physical perturbations like light bulb arrays and aerogels for white-box attacks, or hot and cold patches for black-box attacks, have proven impractical or limited in multi-view support. To address these issues, we propose the Adversarial Infrared Grid (AdvGrid), which models perturbations in a grid format and uses a genetic algorithm for black-box optimization. These perturbations are cyclically applied to various parts of a pedestrian’s clothing to facilitate multi-view black-box physical attacks on infrared pedestrian detectors. Extensive experiments validate AdvGrid’s effectiveness, stealthiness, and robustness. The method achieves attack success rates of 80.00% in digital environments and 91.86% in physical environments, outperforming baseline methods. Additionally, the average attack success rate exceeds 50% against mainstream detectors, demonstrating AdvGrid’s robustness. Our analyses include ablation studies, transfer attacks, and adversarial defenses, confirming the method’s superiority.

[AI-52] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Link: https://arxiv.org/abs/2407.01157
Authors: Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu
Keywords: unprecedented zero-shot capabilities, exhibit unprecedented zero-shot, shared embedding space, models exhibit unprecedented, zero-shot capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2401.15568, arXiv:2402.08473

Abstract:Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

[AI-53] Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition

Link: https://arxiv.org/abs/2407.01143
Authors: Oliver Schrüfer, Manuel Milling, Felix Burkhardt, Florian Eyben, Björn Schuller
Keywords: important building block, Uncertainty Quantification, identifying faulty predictions, important building, building block
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: accepted for Interspeech 2024, 5 pages

Abstract:Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models can suffer from particularly many sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) data or, in general, poor recording conditions. Reliable UQ methods are thus of particular interest as in many SER applications no prediction is better than a faulty prediction. While the effects of label ambiguity on uncertainty are well documented in the literature, we focus our work on an evaluation of UQ methods for SER under common challenges in real-world application, such as corrupted signals, and the absence of speech. We show that simple UQ methods can already give an indication of the uncertainty of a prediction and that training with additional OOD data can greatly improve the identification of such signals.
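
One simple uncertainty score of the kind the abstract alludes to is the entropy of the softmax output, used to abstain when the model is unsure. The abstract does not name the specific methods evaluated, so the choice of entropy, the threshold, the label set, and the function names below are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(probs):
    """Shannon entropy in nats; higher means a less certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predict_or_abstain(logits, labels, max_entropy):
    """Return the top label, or None when entropy exceeds the threshold."""
    probs = softmax(logits)
    if predictive_entropy(probs) > max_entropy:
        return None
    return labels[probs.index(max(probs))]

# Hypothetical emotion classes and logits.
labels = ["angry", "happy", "neutral", "sad"]
confident = predict_or_abstain([5.0, 0.0, 0.0, 0.0], labels, max_entropy=0.5)
uncertain = predict_or_abstain([0.1, 0.0, 0.0, 0.0], labels, max_entropy=0.5)
```

Here `uncertain` is `None`, matching the abstract's point that in many SER applications no prediction is preferable to a faulty one.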

[AI-54] Integrated feature analysis for deep learning interpretation and class activation maps

Link: https://arxiv.org/abs/2407.01142
Authors: Yanli Li, Tahereh Hassanzadeh, Denis P. Shamonin, Monique Reijnierse, Annette H.M. van der Helm-van Mil, Berend C. Stoel
Keywords: integrated feature analysis, Understanding the decisions, integrated feature, feature analysis, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 11 figures, code available: this https URL. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Understanding the decisions of deep learning (DL) models is essential for the acceptance of DL to risk-sensitive applications. Although methods, like class activation maps (CAMs), give a glimpse into the black box, they do miss some crucial information, thereby limiting its interpretability and merely providing the considered locations of objects. To provide more insight into the models and the influence of datasets, we propose an integrated feature analysis method, which consists of feature distribution analysis and feature decomposition, to look closer into the intermediate features extracted by DL models. This integrated feature analysis could provide information on overfitting, confounders, outliers in datasets, model redundancies and principal features extracted by the models, and provide distribution information to form a common intensity scale, which are missing in current CAM algorithms. The integrated feature analysis was applied to eight different datasets for general validation: photographs of handwritten digits, two datasets of natural images and five medical datasets, including skin photography, ultrasound, CT, X-rays and MRIs. The method was evaluated by calculating the consistency between the CAMs average class activation levels and the logits of the model. Based on the eight datasets, the correlation coefficients through our method were all very close to 100%, and based on the feature decomposition, 5%-25% of features could generate equally informative saliency maps and obtain the same model performances as using all features. This proves the reliability of the integrated feature analysis. As the proposed methods rely on very few assumptions, this is a step towards better model interpretation and a useful extension to existing CAM algorithms. Codes: this https URL

[AI-55] An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification

Link: https://arxiv.org/abs/2407.01137
Authors: Kassem Sabeh, Robert Litschko, Mouna Kacimi, Barbara Plank, Johann Gamper
Keywords: e-commerce platforms, supporting applications, applications like search, question answering, Product attributes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Product attributes are crucial for e-commerce platforms, supporting applications like search, recommendation, and question answering. The task of Product Attribute and Value Identification (PAVI) involves identifying both attributes and their values from product information. In this paper, we formulate PAVI as a generation task and provide, to the best of our knowledge, the most comprehensive evaluation of PAVI so far. We compare three different attribute-value generation (AVG) strategies based on fine-tuning encoder-decoder models on three datasets. Experiments show that the end-to-end AVG approach, which is computationally efficient, outperforms the other strategies. However, there are differences depending on model size and the underlying language model. The code to reproduce all experiments is available at: this https URL

[AI-56] Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation

Link: https://arxiv.org/abs/2407.01126
Authors: Nadezhda Chirkova,Vassilina Nikoulina,Jean-Luc Meunier,Alexandre Bérard
Keywords: Neural Machine Translation, multi-domain Neural Machine, Machine Translation, Neural Machine, developing efficient models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:We focus on multi-domain Neural Machine Translation, with the goal of developing efficient models which can handle data from various domains seen during training and are robust to domains unseen during training. We hypothesize that Sparse Mixture-of-Experts (SMoE) models are a good fit for this task, as they enable efficient model scaling, which helps to accommodate a variety of multi-domain data, and allow flexible sharing of parameters between domains, potentially enabling knowledge transfer between similar domains and limiting negative transfer. We conduct a series of experiments aimed at validating the utility of SMoE for the multi-domain scenario, and find that a straightforward width scaling of Transformer is a simpler and surprisingly more efficient approach in practice, and reaches the same performance level as SMoE. We also search for a better recipe for robustness of multi-domain systems, highlighting the importance of mixing-in a generic domain, i.e. Paracrawl, and introducing a simple technique, domain randomization.
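
To make the Sparse Mixture-of-Experts idea in this abstract concrete, here is a minimal sketch of generic top-k expert routing. It is not the authors' configuration: the value of `k`, the scalar toy experts, and the names `top_k_gate` and `smoe_layer` are assumptions made for illustration only.

```python
import math

def top_k_gate(router_logits, k=2):
    """Return {expert_index: weight} for the k experts with the largest logits.

    Weights are a softmax restricted to the selected experts, so they sum to 1.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits (ties keep their original order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - max_logit) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def smoe_layer(x, experts, router_logits, k=2):
    """Combine the outputs of the selected experts; the others are never run."""
    gate = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gate.items())

# Toy scalar "experts" standing in for feed-forward blocks (hypothetical).
toy_experts = [lambda x: x, lambda x: 2 * x, lambda x: 0.0, lambda x: 4 * x]
toy_logits = [0.1, 2.0, -1.0, 2.0]
output = smoe_layer(3.0, toy_experts, toy_logits, k=2)
```

Because only `k` experts run per input, parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the scaling property the abstract's hypothesis rests on.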

[AI-57] Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?

Link: https://arxiv.org/abs/2407.01119
Authors: Guillermo Marco,Julio Gonzalo,Ramón del Castillo,María Teresa Mateo Girona
Keywords: Large Language Models, creative text writing, report research results, creative writing skills, outperform average humans
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 9 pages 6 figures

Abstract:It has become routine to report research results where Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks, and creative text writing is no exception. It seems natural, then, to raise the bid: Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top performing LLMs), in the spirit of AI-human duels such as Deep Blue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent’s. Then, we prepared an evaluation rubric inspired by Boden’s definition of creativity, and we collected 5,400 manual assessments provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer, and that such a level of autonomous creative writing skill probably cannot be reached simply with larger language models.

[AI-58] Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Link: https://arxiv.org/abs/2407.01111
Authors: Hao Wang,Zhichao Chen,Yuan Shen,Jiajun Fan,Zhaoran Liu,Degui Yang,Xinggao Liu,Haoxuan Li
Keywords: Heterogeneous treatment effect, poses significant challenges, significant challenges due, Heterogeneous treatment, observational data poses
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: Code is available at https://anonymous.4open.science/status/ncr-B697

Abstract:Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective, exacerbated by limited data availability for HTE estimation, we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.

[AI-59] SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest

Link: https://arxiv.org/abs/2407.01110
Authors: Christoforus Yoga Haryanto,Minh Hieu Vu,Trung Duc Nguyen,Emily Lomempow,Yulia Nurliana,Sona Taheri
Keywords: Australia critical technologies, technologies offers transformative, offers transformative opportunities, unique security challenges, advancement of Generative
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 10 pages, 4 figures, 9 tables, submitted to the 2024 11th International Conference on Soft Computing Machine Intelligence (ISCMI 2024)

Abstract:The rapid advancement of Generative AI (GenAI) technologies offers transformative opportunities within Australia’s critical technologies of national interest while introducing unique security challenges. This paper presents SecGenAI, a comprehensive security framework for cloud-based GenAI applications, with a focus on Retrieval-Augmented Generation (RAG) systems. SecGenAI addresses functional, infrastructure, and governance requirements, integrating end-to-end security analysis to generate specifications emphasizing data privacy, secure deployment, and shared responsibility models. Aligned with Australian Privacy Principles, AI Ethics Principles, and guidelines from the Australian Cyber Security Centre and Digital Transformation Agency, SecGenAI mitigates threats such as data leakage, adversarial attacks, and model inversion. The framework’s novel approach combines advanced machine learning techniques with robust security measures, ensuring compliance with Australian regulations while enhancing the reliability and trustworthiness of GenAI systems. This research contributes to the field of intelligent systems by providing actionable strategies for secure GenAI implementation in industry, fostering innovation in AI applications, and safeguarding national interests.

[AI-60] IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation

Link: https://arxiv.org/abs/2407.01093
Authors: Senyu Han,Lu Chen,Li-Min Lin,Zhengshan Xu,Kai Yu
Keywords: Large language models, human-like character role-playing, Large language, language model agents, Current language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes: Accepted by ACL 2024 Main

Abstract:Large language models have demonstrated their capabilities in storyline creation and human-like character role-playing. Current language model agents mainly focus on reasonable behaviors at the level of individuals, and their behaviors might be hard to constrain at the level of the whole storyline. In this paper we introduce IBSEN, a director-actor coordinate agent framework that generates drama scripts and makes the plot played by agents more controllable. The director agent writes plot outlines that the user desires to see, instructs the actor agents to role-play their characters, and reschedules the plot when human players participate in the scenario to ensure the plot is progressing towards the objective. To evaluate the framework, we create a novel drama plot that involves several actor agents and check the interactions between them under the instruction of the director agent. Evaluation results show that our framework could generate complete, diverse drama scripts from only a rough outline of plot objectives, while maintaining the characteristics of characters in the drama. Our codes and prompts are available at this https URL.

[AI-61] Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

Link: https://arxiv.org/abs/2407.01092
Authors: Ivan Drokin
Keywords: sparked significant interest, Kolmogorov-Arnold Networks, scientific community, sparked significant, significant interest
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as Wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions, and achieving state-of-the-art results on BUSI, GlaS, and CVC datasets. We summarize all of our findings in a preliminary design guide of KAN convolutional models for computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code and implementations of convolutional layers and models, along with weights pre-trained on ImageNet1k, are available on GitHub via this https URL

[AI-62] Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese

Link: https://arxiv.org/abs/2407.01080
Authors: Yunqi Xu,Tianchi Cai,Jiyan Jiang,Xierui Song
Keywords: Retrieval Augmented Generation, conventional Retrieval Augmented, Augmented Generation, Retrieval Augmented, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose the first comprehensive FCE benchmark Face4RAG for RAG independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology for factuality inconsistency error and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions. On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference. To fix this issue, we further propose a new method called L-Face4RAG with two novel designs of logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably beyond the RAG task from which it is originally motivated. Both the benchmark and our proposed method are publicly available at this https URL

[AI-63] Human-like object concept representations emerge naturally in multimodal large language models

Link: https://arxiv.org/abs/2407.01067
Authors: Changde Du,Kaicheng Fu,Bincheng Wen,Yi Sun,Jie Peng,Wei Wei,Ying Gao,Shengpei Wang,Chuncheng Zhang,Jinpeng Li,Shuang Qiu,Le Chang,Huiguang He
Keywords: offering crucial insights, Large Language Models, intrigued cognitive scientists, long intrigued cognitive, scientists and neuroscientists
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Notes:

Abstract:The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.
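
A triplet judgment of the kind mentioned in this abstract is commonly framed as an odd-one-out choice among three objects. The sketch below shows one common way such a choice can be predicted from learned embeddings: keep the most similar pair and pick the remaining item. The dot-product similarity, the function name `odd_one_out`, and the toy vectors are assumptions, since the abstract does not describe the model in this detail.

```python
def odd_one_out(emb_a, emb_b, emb_c):
    """Return the index (0, 1 or 2) of the item left out of the most similar pair."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    # Each key is the index of the item that is NOT part of the scored pair.
    pair_similarity = {
        2: dot(emb_a, emb_b),
        1: dot(emb_a, emb_c),
        0: dot(emb_b, emb_c),
    }
    return max(pair_similarity, key=pair_similarity.get)

# Toy 2-dimensional embeddings (hypothetical): the first two objects are alike.
choice = odd_one_out([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
```

Fitting the embeddings so that such predictions match the collected judgments is what would yield the low-dimensional representation the abstract describes.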

[AI-64] Evolutionary Morphology Towards Overconstrained Locomotion via Large-Scale Multi-Terrain Deep Reinforcement Learning

Link: https://arxiv.org/abs/2407.01050
Authors: Yenan Chen,Chuye Zhang,Pengxi Gu,Jianuo Qiu,Jiayi Yin,Nuofan Qiu,Guojing Huang,Bangchao Huang,Zishang Zhang,Hui Deng,Wei Zhang,Fang Wan,Chaoyang Song
Keywords: morphological transformation remains, transformation remains under-adopted, advanced robotic limbs, well-researched in biology, morphological transformation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: 13 pages, 5 figures, Accepted and Presented at ReMAR2024

Abstract:While the animals’ Fin-to-Limb evolution has been well-researched in biology, such morphological transformation remains under-adopted in the modern design of advanced robotic limbs. This paper investigates a novel class of overconstrained locomotion from a design and learning perspective inspired by evolutionary morphology, aiming to integrate the concept of ‘intelligent design under constraints’ (hereafter referred to as constraint-driven design intelligence) in developing modern robotic limbs with superior energy efficiency. We propose a 3D-printable design of robotic limbs parametrically reconfigurable as a classical planar 4-bar linkage, an overconstrained Bennett linkage, and a spherical 4-bar linkage. These limbs adopt a co-axial actuation, identical to the modern legged robot platforms, with the added capability of upgrading into a wheel-legged system. Then, we implemented a large-scale, multi-terrain deep reinforcement learning framework to train these reconfigurable limbs for a comparative analysis of overconstrained locomotion in energy efficiency. Results show that the overconstrained limbs exhibit more efficient locomotion than planar limbs during forward and sideways walking over different terrains, including floors, slopes, and stairs, with or without random noises, by saving at least 22% mechanical energy in completing the traverse task, with the spherical limbs being the least efficient. They also achieve the highest average speed of 0.85 meters per second on flat terrain, which is 20% faster than the planar limbs. This study paves the path for an exciting direction for future research in overconstrained robotics leveraging evolutionary morphology and reconfigurable mechanism intelligence when combined with state-of-the-art methods in deep reinforcement learning.

[AI-65] FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models

Link: https://arxiv.org/abs/2407.01046
Authors: Yiyuan Li,Shichao Sun,Pengfei Liu
Keywords: daily contexts, vital due, imprecise information, information in daily, Fuzzy reasoning
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Under review

Abstract:Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.

[AI-66] Augmenting Document-level Relation Extraction with Efficient Multi-Supervision

Link: https://arxiv.org/abs/2407.01026
Authors: Xiangyu Lin,Weijia Jia,Zhiguo Gong
Keywords: low information density, distantly supervised data, document-level relation extraction, relation extraction due, sentence-level relation extraction
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Despite its popularity in sentence-level relation extraction, distantly supervised data is rarely utilized by existing work in document-level relation extraction due to its noisy nature and low information density. Among its current applications, distantly supervised data is mostly used as a whole for pretraining, which is of low time efficiency. To fill in the gap of efficient and robust utilization of distantly supervised training data, we propose Efficient Multi-Supervision for document-level relation extraction, in which we first select a subset of informative documents from the massive dataset by combining distant supervision with expert supervision, then train the model with Multi-Supervision Ranking Loss that integrates the knowledge from multiple sources of supervision to alleviate the effects of noise. The experiments demonstrate the effectiveness of our method in improving the model performance with higher time efficiency than existing baselines.

[AI-67] Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images

Link: https://arxiv.org/abs/2407.01003
Authors: Wenqiang Zu,Shenghao Xie,Qing Zhao,Guoqi Li,Lei Ma
Keywords: natural imaging downstream, imaging downstream tasks, Foundation models, Foundation models pre-trained, prompt tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 16 pages, 7 figures. arXiv admin note: text overlap with arXiv:2306.09579, arXiv:2203.12119 by other authors

Abstract:Foundation models pre-trained on large-scale data have been widely witnessed to achieve success in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain few-shot scenarios, e.g., medical image analysis, has not been fully explored. In this work, we facilitate the study of the performance of PEFT when adapting foundation models to medical image classification tasks. Furthermore, to alleviate the limitations of prompt introducing ways and approximation capabilities on Transformer architectures of mainstream prompt tuning methods, we propose the Embedded Prompt Tuning (EPT) method by embedding prompt tokens into the expanded channels. We also find that there are anomalies in the feature space distribution of foundation models during the pre-training process, and prompt tuning can help mitigate this negative impact. To explain this phenomenon, we also introduce a novel perspective to understand prompt tuning: prompt tuning is a distribution calibrator. We support this view by analyzing the patch-wise scaling and feature separation operations contained in EPT. Our experiments show that EPT outperforms several state-of-the-art fine-tuning methods by a significant margin on few-shot medical image classification tasks, and completes the fine-tuning process within highly competitive time, indicating EPT is an effective PEFT method. Our code will be released once accepted.

[AI-68] Flood Prediction Using Classical and Quantum Machine Learning Models

Link: https://arxiv.org/abs/2407.01001
Authors: Marek Grzesiak,Param Thakkar
Keywords: Germany Wupper River, QML models offer, competitive training times, offer competitive training, scalability results show
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph); Quantum Physics (quant-ph)
Notes:

Abstract:This study investigates the potential of quantum machine learning (QML) to improve flood forecasting. We focus on daily flood events along Germany’s Wupper River in 2023. Our approach combines classical machine learning techniques with QML techniques. This hybrid model leverages quantum properties like superposition and entanglement to achieve better accuracy and efficiency. Classical and QML models are compared based on training time, accuracy, and scalability. Results show that QML models offer competitive training times and improved prediction accuracy. This research signifies a step towards utilizing quantum technologies for climate change adaptation. We emphasize collaboration and continuous innovation to implement this model in real-world flood management, ultimately enhancing global resilience against floods.

[AI-69] Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents

Link: https://arxiv.org/abs/2407.00993
Authors: Shihan Deng,Weikai Xu,Hongda Sun,Wei Liu,Tao Tan,Jianfeng Liu,Ang Li,Jian Luan,Bin Wang,Rui Yan,Shuo Shang
Keywords: large language models, LLM-based mobile agents, mobile agents, language models, human-computer interaction
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:With the remarkable advancements of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a scarcity of benchmarks available for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations imposes limitations to task evaluation. (2) Specific instructions within a singular application lack adequacy for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics are insufficient to accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to accelerate the efficiency of task completion. Subsequently, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capabilities for mobile agents, our data is categorized into three distinct groups: SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. Furthermore, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
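
The idea behind a checkpoint-style metric, as described in this abstract, is to score whether an agent passes through essential points on its way to completing a task rather than only checking the final result. The sketch below is a hypothetical simplification, not the paper's CheckPoint definition: it assumes checkpoints are matched by exact action name and must be reached in order, and the function name `checkpoint_score` and the action names are invented for illustration.

```python
def checkpoint_score(actions, checkpoints):
    """Fraction of checkpoints reached, in order, within the action sequence."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    reached = 0
    for action in actions:
        # Advance only when the next expected checkpoint is hit.
        if reached < len(checkpoints) and action == checkpoints[reached]:
            reached += 1
    return reached / len(checkpoints)

# Hypothetical multi-app task: find a place, then message it to a contact.
agent_actions = ["open_maps", "search", "tap_result", "open_messages"]
essential_points = ["open_maps", "tap_result", "send_message"]
score = checkpoint_score(agent_actions, essential_points)
```

Here the agent reaches two of three essential points, so it earns partial credit even though the task was not finished, which a pure success-rate metric would not capture.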

[AI-70] Acceleration method for generating perception failure scenarios based on editing Markov process

Link: https://arxiv.org/abs/2407.00980
Authors: Canjie Cai
Keywords: perception failure scenarios, future transportation systems, perception failure, autonomous driving technology, failure scenarios
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes:

Abstract:With the rapid advancement of autonomous driving technology, self-driving cars have become a central focus in the development of future transportation systems. Scenario generation technology has emerged as a crucial tool for testing and verifying the safety performance of autonomous driving systems. Current research in scenario generation primarily focuses on open roads such as highways, with relatively limited studies on underground parking garages. The unique structural constraints, insufficient lighting, and high-density obstacles in underground parking garages impose greater demands on the perception systems, which are critical to autonomous driving technology. This study proposes an accelerated generation method for perception failure scenarios tailored to the underground parking garage environment, aimed at testing and improving the safety performance of autonomous vehicle (AV) perception algorithms in such settings. The method presented in this paper generates an intelligent testing environment with a high density of perception failure scenarios by learning the interactions between background vehicles (BVs) and autonomous vehicles (AVs) within perception failure scenarios. Furthermore, this method edits the Markov process within the perception failure scenario data to increase the density of critical information in the training data, thereby optimizing the learning and generation of perception failure scenarios. A simulation environment for an underground parking garage was developed using the Carla and Vissim platforms, with Bevfusion employed as the perception algorithm for testing. The study demonstrates that this method can generate an intelligent testing environment with a high density of perception failure scenarios and enhance the safety performance of perception algorithms within this experimental setup. 

[AI-71] Hybrid RAG-empowered Multi-modal LLM for Secure Healthcare Data Management: A Diffusion-based Contract Theory Approach

Link: https://arxiv.org/abs/2407.00978
Authors: Cheng Su,Jinbo Wen,Jiawen Kang,Yonghua Wang,Hudan Pan,M. Shamim Hossain
Keywords: Large Language Models, Multi-modal Large Language, evolving healthcare landscape, rapidly evolving healthcare, data
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 12 pages, 6 figures

Abstract:Secure data management and effective data sharing have become paramount in the rapidly evolving healthcare landscape. The advancement of generative artificial intelligence has positioned Multi-modal Large Language Models (MLLMs) as crucial tools for managing healthcare data. MLLMs can support multi-modal inputs and generate diverse types of content by leveraging large-scale training on vast amounts of multi-modal data. However, critical challenges persist in developing medical MLLMs, including healthcare data security and freshness issues, affecting the output quality of MLLMs. In this paper, we propose a hybrid Retrieval-Augmented Generation (RAG)-empowered medical MLLMs framework for healthcare data management. This framework leverages a hierarchical cross-chain architecture to facilitate secure data training. Moreover, it enhances the output quality of MLLMs through hybrid RAG, which employs multi-modal metrics to filter various unimodal RAG results and incorporates these retrieval results as additional inputs to MLLMs. Additionally, we employ age of information to indirectly evaluate the data freshness impact of MLLMs and utilize contract theory to incentivize healthcare data holders to share fresh data, mitigating information asymmetry in data sharing. Finally, we utilize a generative diffusion model-based reinforcement learning algorithm to identify the optimal contract for efficient data sharing. Numerical results demonstrate the effectiveness of the proposed schemes, which achieve secure and efficient healthcare data management.

[AI-72] Deep learning for automated detection of breast cancer in deep ultraviolet fluorescence images with diffusion probabilistic model

Link: https://arxiv.org/abs/2407.00967
Authors: Sepehr Salem Ghahfarokhi,Tyrell To,Julie Jorns,Tina Yen,Bing Yu,Dong Hye Ye
Keywords: Data limitation, applying deep learning, significant challenge, challenge in applying, learning to medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: IEEE International Symposium on Biomedical Imaging 2024

Abstract:Data limitation is a significant challenge in applying deep learning to medical images. Recently, the diffusion probabilistic model (DPM) has shown the potential to generate high-quality images by converting Gaussian random noise into realistic images. In this paper, we apply the DPM to augment the deep ultraviolet fluorescence (DUV) image dataset with an aim to improve breast cancer classification for intraoperative margin assessment. For classification, we divide the whole surface DUV image into small patches and extract convolutional features for each patch by utilizing the pre-trained ResNet. Then, we feed them into an XGBoost classifier for patch-level decisions and then fuse them with a regional importance map computed by Grad-CAM++ for whole surface-level prediction. Our experimental results show that augmenting the training dataset with the DPM significantly improves breast cancer detection performance in DUV images, increasing accuracy from 93% to 97%, compared to using Affine transformations and ProGAN.
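
The last step of the pipeline in this abstract fuses patch-level decisions with a regional importance map to obtain a whole-surface prediction. One plausible reading of that step is an importance-weighted average of patch probabilities, sketched below. The abstract does not give the fusion rule, so the weighted average, the 0.5 threshold, the function name `fuse_patch_predictions`, and the sample numbers are all assumptions.

```python
def fuse_patch_predictions(patch_probs, importance, threshold=0.5):
    """Importance-weighted average of patch-level malignancy probabilities.

    Returns (score, is_positive), where is_positive is True when the weighted
    score reaches the threshold.
    """
    if not patch_probs or len(patch_probs) != len(importance):
        raise ValueError("need one importance weight per patch")
    total_weight = sum(importance)
    if total_weight <= 0:
        raise ValueError("importance weights must sum to a positive value")
    score = sum(p * w for p, w in zip(patch_probs, importance)) / total_weight
    return score, score >= threshold

# Hypothetical values: classifier probabilities per patch and the mean
# importance (e.g. from a saliency map) of the region each patch covers.
patch_probs = [0.9, 0.1]
importance = [3.0, 1.0]
score, is_positive = fuse_patch_predictions(patch_probs, importance)
```

With these numbers the high-importance patch dominates, giving a score of about 0.7 and a positive whole-surface call.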

[AI-73] Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving

链接: https://arxiv.org/abs/2407.00959
作者: Ran Tian,Boyi Li,Xinshuo Weng,Yuxiao Chen,Edward Schmerling,Yue Wang,Boris Ivanovic,Marco Pavone
关键词: minimize human biases, autonomous driving industry, increasingly adopting, learning from sensory, system design
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The autonomous driving industry is increasingly adopting end-to-end learning from sensory inputs to minimize human biases in system design. Traditional end-to-end driving models, however, suffer from long-tail events due to rare or unseen inputs within their training distributions. To address this, we propose TOKEN, a novel Multi-Modal Large Language Model (MM-LLM) that tokenizes the world into object-level knowledge, enabling better utilization of LLM’s reasoning capabilities to enhance autonomous vehicle planning in long-tail scenarios. TOKEN effectively alleviates data scarcity and inefficient tokenization by leveraging a traditional end-to-end driving model to produce condensed and semantically enriched representations of the scene, which are optimized for LLM planning compatibility through deliberate representation and reasoning alignment training stages. Our results demonstrate that TOKEN excels in grounding, reasoning, and planning capabilities, outperforming existing frameworks with a 27% reduction in trajectory L2 error and a 39% decrease in collision rates in long-tail scenarios. Additionally, our work highlights the importance of representation alignment and structured reasoning in sparking the common-sense reasoning capabilities of MM-LLMs for effective planning.

[AI-74] Universal Approximation Theory: The basic theory for large language models

Link: https://arxiv.org/abs/2407.00958
Authors: Wei Wang,Qing Li
Keywords: artificial intelligence, innovations like ChatGPT, area of focus, focus in artificial, introduction of groundbreaking
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains about the theoretical foundations of large language models (LLMs). What makes the Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs’ ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality of pruning LLMs? To address these critical questions and explore the technological strategies within LLMs, we leverage the Universal Approximation Theory (UAT) to offer a theoretical backdrop, shedding light on the mechanisms that underpin these advancements.

[AI-75] Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy

Link: https://arxiv.org/abs/2407.00955
Authors: Xiang Jiao,Dingzhu Wen,Guangxu Zhu,Wei Jiang,Wu Luo,Yuanming Shi
Keywords: Edge-device co-inference, completing inference tasks, network edge, edge devices, edge server
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: This paper was accepted by IEEE Transactions on Vehicular Technology on June 30, 2024

Abstract:Edge-device co-inference, which concerns the cooperation between edge devices and an edge server for completing inference tasks over wireless networks, has been a promising technique for enabling various kinds of intelligent services at the network edge, e.g., auto-driving. In this paradigm, the design objective of the network shifts from the traditional communication throughput to the effective and efficient execution of the inference task underpinned by the network, measured by, e.g., the inference accuracy and latency. In this paper, a task-oriented over-the-air computation scheme is proposed for a multi-device artificial intelligence system. Particularly, a novel tractable inference accuracy metric is proposed for classification tasks, which is called minimum pair-wise discriminant gain. Unlike prior work measuring the average of all class pairs in feature space, it measures the minimum distance of all class pairs. By maximizing the minimum pair-wise discriminant gain instead of its average counterpart, any pair of classes can be better separated in the feature space, thus leading to a balanced and improved inference accuracy for all classes. Besides, this paper jointly optimizes the minimum discriminant gain of all feature elements instead of separately maximizing that of each element in the existing designs. As a result, the transmit power can be adaptively allocated to the feature elements according to their different contributions to the inference accuracy, opening an extra degree of freedom to improve inference performance. Extensive experiments are conducted using a concrete use case of human motion recognition to verify the superiority of the proposed design over the benchmarking scheme.
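
To make the contrast between the minimum and the average pair-wise discriminant gain concrete, here is a minimal sketch. It assumes (the abstract does not say this) that each class is summarized by a per-element mean, that all classes share a per-element variance, and that the gain of a class pair is the variance-normalized squared distance between the two means. The function names and toy numbers are hypothetical.

```python
from itertools import combinations

def pairwise_gain(mu_a, mu_b, var):
    """Variance-normalized squared distance between two class centroids (assumed gain)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(mu_a, mu_b, var))

def discriminant_gains(centroids, var):
    """Return (minimum, average) gain over all class pairs."""
    gains = [pairwise_gain(centroids[i], centroids[j], var)
             for i, j in combinations(range(len(centroids)), 2)]
    return min(gains), sum(gains) / len(gains)

# Toy example: classes 0 and 1 are nearly indistinguishable,
# while class 2 is far from both.
centroids = [[0.0, 0.0], [0.5, 0.0], [4.0, 4.0]]
var = [1.0, 1.0]

min_gain, avg_gain = discriminant_gains(centroids, var)
```

Here the average is dominated by the well-separated class, while the minimum exposes the one pair that is hard to tell apart, which is the motivation the abstract gives for maximizing the minimum rather than the average.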

[AI-76] The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

Link: https://arxiv.org/abs/2407.00948
Authors: Tanush Chopra,Michael Li
Keywords: large language models, language models, large language, evaluating strategic deception, fair play
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research conducted at the Deception Detection Hackathon 2024 hosted by Apart Apollo Research

Abstract:We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because neither the action space nor the strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the “house.” Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.
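
Comparing observed outcomes against an expected fair-play distribution can be illustrated with a goodness-of-fit statistic. The abstract does not name the test the authors use, so Pearson's chi-square is an assumption here, and the outcome probabilities and counts below are placeholders rather than real blackjack odds or results from the paper.

```python
def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic for observed counts vs. expected probabilities."""
    total = sum(observed.values())
    stat = 0.0
    for outcome, p in expected_probs.items():
        expected = total * p
        stat += (observed.get(outcome, 0) - expected) ** 2 / expected
    return stat

# Placeholder fair-play distribution (illustrative only).
fair = {"player_win": 0.4, "dealer_win": 0.5, "push": 0.1}

# Hypothetical outcome counts from 100 simulated hands each.
fair_counts = {"player_win": 40, "dealer_win": 50, "push": 10}
skewed_counts = {"player_win": 20, "dealer_win": 70, "push": 10}

fair_stat = chi_square_statistic(fair_counts, fair)
skewed_stat = chi_square_statistic(skewed_counts, fair)
```

With three outcome categories there are two degrees of freedom, for which the 5% critical value is about 5.99; a statistic well above that, as in the skewed case, would count as a significant deviation from fair play under this assumed test.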

[AI-77] ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions

Link: https://arxiv.org/abs/2407.00942
Authors: Jingheng Ye,Yong Jiang,Xiaobin Wang,Yinghui Li,Yangning Li,Hai-Tao Zheng,Pengjun Xie,Fei Huang
Keywords: tailored product searching, clarification question generation, strategic clarification question, e-commercial scenario, paper introduces
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 13 tables, 6 figures. Under review

Abstract:This paper introduces the task of product demand clarification within an e-commerce scenario, where the user commences the conversation with ambiguous queries and the task-oriented agent is designed to achieve more accurate and tailored product searching by asking clarification questions. To address this task, we propose ProductAgent, a conversational information seeking agent equipped with abilities of strategic clarification question generation and dynamic product retrieval. Specifically, we develop the agent with strategies for product feature summarization, query generation, and product retrieval. Furthermore, we propose the benchmark called PROCLARE to evaluate the agent’s performance both automatically and qualitatively with the aid of an LLM-driven user simulator. Experiments show that ProductAgent interacts positively with the user and enhances retrieval performance with increasing dialogue turns, where user demands become gradually more explicit and detailed. All the source code will be released after the review anonymity period.

[AI-78] Large Language Model Enhanced Knowledge Representation Learning: A Survey

Link: https://arxiv.org/abs/2407.00936
Authors: Xin Wang,Zirui Chen,Haofen Wang,Leong Hou U,Zhao Li,Wenbin Guo
Keywords: Large Language Models, Knowledge Representation Learning, complex knowledge structures, Large Language, utilize complex knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of Large Language Models (LLMs) with Knowledge Representation Learning (KRL) signifies a pivotal advancement in the field of artificial intelligence, enhancing the ability to capture and utilize complex knowledge structures. This synergy leverages the advanced linguistic and contextual understanding capabilities of LLMs to improve the accuracy, adaptability, and efficacy of KRL, thereby expanding its applications and potential. Despite the increasing volume of research focused on embedding LLMs within the domain of knowledge representation, a thorough review that examines the fundamental components and processes of these enhanced models is conspicuously absent. Our survey addresses this by categorizing these models based on three distinct Transformer architectures, and by analyzing experimental data from various KRL downstream tasks to evaluate the strengths and weaknesses of each approach. Finally, we identify and explore potential future research directions in this emerging yet underexplored domain, proposing pathways for continued progress.

[AI-79] Robust and Reliable Early-Stage Website Fingerprinting Attacks via Spatial-Temporal Distribution Analysis

Link: https://arxiv.org/abs/2407.00918
Authors: Xinhao Deng,Qi Li,Ke Xu
Keywords: compromising user privacy, Website Fingerprinting, traffic, compromising user, user privacy
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

Abstract:Website Fingerprinting (WF) attacks identify the websites visited by users by performing traffic analysis, compromising user privacy. Particularly, DL-based WF attacks demonstrate impressive attack performance. However, the effectiveness of DL-based WF attacks relies on the collected complete and pure traffic during the page loading, which impacts the practicality of these attacks. The WF performance is rather low under dynamic network conditions and various WF defenses, particularly when the analyzed traffic is only a small part of the complete traffic. In this paper, we propose Holmes, a robust and reliable early-stage WF attack. Holmes utilizes temporal and spatial distribution analysis of website traffic to effectively identify websites in the early stages of page loading. Specifically, Holmes develops adaptive data augmentation based on the temporal distribution of website traffic and utilizes a supervised contrastive learning method to extract the correlations between the early-stage traffic and the pre-collected complete traffic. Holmes accurately identifies traffic in the early stages of page loading by computing the correlation of the traffic with the spatial distribution information, which ensures robust and reliable detection according to early-stage traffic. We extensively evaluate Holmes using six datasets. Compared to nine existing DL-based WF attacks, Holmes improves the F1-score of identifying early-stage traffic by an average of 169.18%. Furthermore, we replay the traffic of visiting real-world dark web websites. Holmes successfully identifies dark web websites when the ratio of page loading on average is only 21.71%, with an average precision improvement of 169.36% over the existing WF attacks.

[AI-80] FineSurE: Fine-grained Summarization Evaluation using LLMs

Link: https://arxiv.org/abs/2407.00908
Authors: Hwanjun Song,Hang Su,Igor Shalyminov,Jason Cai,Saab Mansour
Keywords: Automated evaluation, streamlining text summarization, crucial for streamlining, streamlining text, costly and time-consuming
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2024 (main, long)

Abstract:Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at this https URL.

[AI-81] From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Link: https://arxiv.org/abs/2407.00902
Authors: Nan Xu,Fei Wang,Sheng Zhang,Hoifung Poon,Muhao Chen
Keywords: Large Language models, Large Language, multiple image-text pairs, similar ICL abilities, capabilities of Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Motivated by the in-context learning (ICL) capabilities of Large Language models (LLMs), multimodal LLMs with an additional visual modality also exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Considering such modality impact, we further utilize modality-driven demonstration strategies to boost ICL performance. We also identify that demonstration selection is closely related to the models’ ability to capture task inductive biases from multimodal ICL. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks even if those tasks are not seen in or even contradict pretraining data.

[AI-82] MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula

Link: https://arxiv.org/abs/2407.00900
Authors: Shubhra Mishra,Gabriel Poesia,Belinda Mo,Noah D. Goodman
Keywords: Large Language Models, Large Language, Language Models, important capability, Mathematics Common Core
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Dataset and code: this https URL

Abstract:Mathematical problem solving is an important skill for Large Language Models (LLMs), both as an important capability and a proxy for a range of reasoning abilities. Existing benchmarks probe a diverse set of skills, but they yield aggregate accuracy metrics, obscuring specific abilities or weaknesses. Furthermore, they are difficult to extend with new problems, risking data contamination over time. To address these challenges, we propose MathCAMPS: a method to synthesize high-quality mathematical problems at scale, grounded on 44 fine-grained “standards” from the Mathematics Common Core (CC) Standard for K-8 grades. We encode each standard in a formal grammar, allowing us to sample diverse symbolic problems and their answers. We then use LLMs to realize the symbolic problems into word problems. We propose a cycle-consistency method for validating problem faithfulness. Finally, we derive follow-up questions from symbolic structures and convert them into follow-up word problems - a novel task of mathematical dialogue that probes for robustness in understanding. Experiments on 23 LLMs show surprising failures even in the strongest models (in particular when asked simple follow-up questions). Moreover, we evaluate training checkpoints of Pythia 12B on MathCAMPS, allowing us to analyze when particular mathematical skills develop during its training. Our framework enables the community to reproduce and extend our pipeline for a fraction of the typical cost of building new high-quality datasets.
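
The idea of encoding a standard as a formal grammar and sampling symbolic problems together with their answers can be sketched with a toy example. The grammar below (integers combined with addition and multiplication) is hypothetical and far simpler than anything in MathCAMPS; the later step of turning symbolic problems into word problems with an LLM is omitted.

```python
import random

def sample_expression(rng, depth=2):
    """Sample (text, value) from a toy grammar:
    Expr -> Int | (Expr + Expr) | (Expr * Expr)
    """
    if depth == 0 or rng.random() < 0.3:
        n = rng.randint(1, 9)
        return str(n), n
    op = rng.choice(["+", "*"])
    left_text, left_val = sample_expression(rng, depth - 1)
    right_text, right_val = sample_expression(rng, depth - 1)
    value = left_val + right_val if op == "+" else left_val * right_val
    return f"({left_text} {op} {right_text})", value

# A seeded generator keeps the sampled problems reproducible.
rng = random.Random(0)
problems = [sample_expression(rng) for _ in range(5)]
```

Because each answer is computed while the expression is generated, every sampled problem comes with a ground-truth answer by construction, which is what makes this style of benchmark cheap to extend with fresh problems.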

[AI-83] Mechanistic Interpretation through Contextual Decomposition in Transformers

Link: https://arxiv.org/abs/2407.00886
Authors: Aliyah R. Hsu,Yeshwanth Cherapanamjeri,Anobel Y. Odisho,Peter R. Carroll,Bin Yu
Keywords: black boxes due, complex nonlinear relationships, regarded as black, black boxes, boxes due
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Transformers exhibit impressive capabilities but are often regarded as black boxes due to challenges in understanding the complex nonlinear relationships between features. Interpreting machine learning models is of paramount importance to mitigate risks, and mechanistic interpretability is in particular of current interest as it opens up a window for guiding manual modifications and reverse-engineering solutions. In this work, we introduce contextual decomposition for transformers (CD-T), extending a prior work on CD for RNNs and CNNs, to address mechanistic interpretation in a computationally efficient way. CD-T is a flexible interpretation method for transformers. It can capture contributions of combinations of input features or source internal components (e.g. attention heads, feed-forward networks) to (1) final predictions or (2) the output of any target internal component. Using CD-T, we propose a novel algorithm for circuit discovery. On a real-world pathology report classification task, we show CD-T distills a more faithful circuit of attention heads with improved computational efficiency (a 2x speed-up) than a prior benchmark, path patching. As a versatile interpretation method, CD-T also exhibits exceptional capabilities for local interpretations. CD-T is shown to reliably find words and phrases of contrasting sentiment/topic on SST-2 and AGNews datasets. Through human experiments, we demonstrate CD-T enables users to identify the more accurate of two models and to better trust a model’s outputs compared to alternative interpretation methods such as SHAP and LIME.

[AI-84] MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Link: https://arxiv.org/abs/2407.00875
Authors: Tianhao Li,Shangjie Li,Binbin Xie,Deyi Xiong,Baosong Yang
Keywords: Conventional Continual Training, leaving a disparity, advent of large, predominantly catered, Continual Training
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures

Abstract:The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model’s original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model’s learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model’s original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model’s integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies.

[AI-85] Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

Link: https://arxiv.org/abs/2407.00869
Authors: Yue Zhou,Henry Peng Zou,Barbara Di Eugenio,Yang Zhang
Keywords: difficulties generating fallacious, difficulties generating, deceptive reasoning, language models, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We find that language models have difficulties generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, language models tend to leak honest counterparts but believe them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits an aligned language model for malicious output. Specifically, we query the model to generate a fallacious yet deceptively real procedure for the harmful behavior. Since a fallacious procedure is generally considered fake and thus harmless by LLMs, it helps bypass the safeguard mechanism. Yet the output is factually harmful since the LLM cannot fabricate fallacious solutions but proposes truthful ones. We evaluate our approach over five safety-aligned large language models, comparing four previous jailbreak methods, and show that our approach achieves competitive performance with more harmful outputs. We believe the findings could be extended beyond model safety, such as self-verification and hallucination.

[AI-86] Towards Robust Speech Representation Learning for Thousands of Languages

Link: https://arxiv.org/abs/2407.00837
Authors: William Chen,Wangyou Zhang,Yifan Peng,Xinjian Li,Jinchuan Tian,Jiatong Shi,Xuankai Chang,Soumi Maiti,Karen Livescu,Shinji Watanabe
Keywords: Self-supervised learning, helped extend speech, extend speech technologies, helped extend, Self-supervised
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 20 pages

Abstract:Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world’s 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are found in this https URL.

[AI-87] DRL-Based RAT Selection in a Hybrid Vehicular Communication Network

Link: https://arxiv.org/abs/2407.00828
Authors: Badreddine Yacine Yacheur(LaBRI),Toufik Ahmed(LaBRI),Mohamed Mosbah(LaBRI)
Keywords: Cooperative intelligent transport, transport systems rely, Driver Assistance Systems, Connected Autonomous Driving, Advanced Driver Assistance
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cooperative intelligent transport systems rely on a set of Vehicle-to-Everything (V2X) applications to enhance road safety. Emerging V2X applications like Advanced Driver Assistance Systems (ADASs) and Connected Autonomous Driving (CAD) applications depend on a significant amount of shared data and require high reliability, low end-to-end (E2E) latency, and high throughput. However, present V2X communication technologies such as ITS-G5 and C-V2X (Cellular V2X) cannot satisfy these requirements alone. In this paper, we propose an intelligent, scalable hybrid vehicular communication architecture that leverages the performance of multiple Radio Access Technologies (RATs) to meet the needs of these applications. Then, we propose a communication mode selection algorithm based on Deep Reinforcement Learning (DRL) to maximize the network’s reliability while limiting resource consumption. Finally, we assess our work using the platooning scenario that requires high reliability. Numerical results reveal that the hybrid vehicular communication architecture has the potential to enhance the packet reception rate (PRR) by up to 30% compared to both the static RAT selection strategy and the multi-criteria decision-making (MCDM) selection algorithm. Additionally, it improves the efficiency of the redundant communication mode by 20% regarding resource consumption.

[AI-88] A Deep Learning-based Pest Insect Monitoring System for Ultra-low Power Pocket-sized Drones

Link: https://arxiv.org/abs/2407.00815
Authors: Luca Crupi,Luca Butera,Alberto Ferrante,Daniele Palossi
Keywords: agriculture represent game-changer, represent game-changer technologies, precision agriculture represent, sustainable agribusiness, Smart farming
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Smart farming and precision agriculture represent game-changing technologies for efficient and sustainable agribusiness. Miniaturized palm-sized drones can act as flexible smart sensors inspecting crops, looking for early signs of potential pest outbreaks. However, achieving such an ambitious goal requires hardware-software codesign to develop accurate deep learning (DL) detection models while keeping memory and computational needs under an ultra-tight budget, i.e., a few MB of on-chip memory and a power envelope of a few hundred mW. This work presents a novel vertically integrated solution featuring two ultra-low power System-on-Chips (SoCs), i.e., the dual-core STM32H74 and a multi-core GWT GAP9, running two State-of-the-Art DL models for detecting the Popillia japonica bug. We fine-tune both models for our image-based detection task, quantize them in 8-bit integers, and deploy them on the two SoCs. On the STM32H74, we deploy a FOMO-MobileNetV2 model, achieving a mean average precision (mAP) of 0.66 and running at 16.1 frames/s within 498 mW. On the GAP9 SoC, we deploy a more complex SSDLite-MobileNetV3, which scores an mAP of 0.79 and peaks at 6.8 frames/s within 33 mW. Compared to a top-notch RetinaNet-ResNet101-FPN full-precision baseline, which requires 14.9x more memory and 300x more operations per inference, our best model drops only 15% in mAP, paving the way toward autonomous palm-sized drones capable of lightweight and precise pest detection.
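
The power and frame-rate figures quoted above can be combined into a rough energy-per-frame comparison. This is a back-of-the-envelope sketch that assumes each power figure is the average draw while the model runs continuously at the stated frame rate, which the abstract does not state explicitly; the function name is hypothetical.

```python
def energy_per_frame_mj(power_mw, frames_per_second):
    """Energy per processed frame in millijoules (mW divided by frames/s gives mJ/frame)."""
    return power_mw / frames_per_second

# Figures taken from the abstract.
stm32_mj = energy_per_frame_mj(498, 16.1)  # FOMO-MobileNetV2 on the STM32H74
gap9_mj = energy_per_frame_mj(33, 6.8)     # SSDLite-MobileNetV3 on the GAP9

ratio = stm32_mj / gap9_mj
```

Under that assumption the STM32H74 deployment spends roughly 31 mJ per frame and the GAP9 deployment roughly 4.9 mJ per frame, so the more accurate model is also the cheaper one per inference, at the cost of a lower frame rate.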

[AI-89] Privacy-Aware Spectrum Pricing and Power Control Optimization for LEO Satellite Internet-of-Things

Link: https://arxiv.org/abs/2407.00814
Authors: Bowen Shen,Kwok-Yan Lam,Feng Li
Keywords: Low earth orbit, LEO satellite systems, Low earth, LEO satellite, communication networks due
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Low earth orbit (LEO) satellite systems play an important role in next generation communication networks due to their ability to provide extensive global coverage with guaranteed communications in remote and isolated areas where base stations cannot be cost-efficiently deployed. With the pervasive adoption of LEO satellite systems, especially in the LEO Internet-of-Things (IoT) scenarios, their spectrum resource management requirements have become more complex as a result of massive service requests and high bandwidth demand from terrestrial terminals. For instance, when leasing the spectrum to terrestrial users and controlling the uplink transmit power, satellites collect user data for machine learning purposes, which usually includes sensitive information such as location, budget and quality of service (QoS) requirement. To facilitate model training in LEO IoT while preserving the privacy of data, blockchain-driven federated learning (FL) is widely used by leveraging a fully decentralized architecture. In this paper, we propose a hybrid spectrum pricing and power control framework for LEO IoT by combining blockchain technology and FL. We first design a local deep reinforcement learning algorithm for LEO satellite systems to learn a revenue-maximizing pricing and power control scheme. Then the agents collaborate to form an FL system. We also propose a reputation-based blockchain which is used in the global model aggregation phase of FL. Based on the reputation mechanism, a node is selected for each global training round to perform model aggregation and block generation, which can further enhance the decentralization of the network and guarantee trust. Simulation tests are conducted to evaluate the performance of the proposed scheme. Our results show the efficiency of finding the maximum revenue scheme for LEO satellite systems while preserving the privacy of each agent.

[AI-90] Exploring a Physics-Informed Decision Transformer for Distribution System Restoration: Methodology and Performance Analysis

Link: https://arxiv.org/abs/2407.00808
Authors: Hong Zhao,Jin Wei-Kocsis,Adel Heidari Akhijahani,Karen L Butler-Purry
Keywords: deep reinforcement learning, uncertain operational scenarios, demonstrated significant potential, effectively tackling distribution, Driven by advancements
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Driven by advancements in sensing and computing, deep reinforcement learning (DRL)-based methods have demonstrated significant potential in effectively tackling distribution system restoration (DSR) challenges under uncertain operational scenarios. However, the data-intensive nature of DRL poses obstacles in achieving satisfactory DSR solutions for large-scale, complex distribution systems. Inspired by the transformative impact of emerging foundation models, including large language models (LLMs), across various domains, this paper explores an innovative approach harnessing LLMs’ powerful computing capabilities to address scalability challenges inherent in conventional DRL methods for solving DSR. To our knowledge, this study represents the first exploration of foundation models, including LLMs, in revolutionizing conventional DRL applications in power system operations. Our contributions are twofold: 1) introducing a novel LLM-powered Physics-Informed Decision Transformer (PIDT) framework that leverages LLMs to transform conventional DRL methods for DSR operations, and 2) conducting comparative studies to assess the performance of the proposed LLM-powered PIDT framework at its initial development stage for solving DSR problems. While our primary focus in this paper is on DSR operations, the proposed PIDT framework can be generalized to optimize sequential decision-making across various power system operations.

[AI-91] Towards shutdownable agents via stochastic choice

Link: https://arxiv.org/abs/2407.00805
Authors: Elliott Thornley,Alexander Roman,Christos Ziakas,Leyton Ho,Louis Thomson
Keywords: Incomplete Preferences Proposal, Preferences Proposal, resist being shut, Incomplete Preferences, NEUTRAL
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

[AI-92] Controlling Faces Frame generation in StyleGANs latent space operations: Modifying faces to deceive our memory

Link: https://arxiv.org/abs/2407.00803
Authors: Agustín Roca,Nicolás Ignacio Britos
Keywords: Innocence Project, reducing wrongful convictions, non-profitable organization, Buenos Aires, Laboratorio de Sueño
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Innocence Project is a nonprofit organization that works to reduce wrongful convictions. In collaboration with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA), they are studying human memory in the context of face identification. They have a strong hypothesis stating that human memory relies heavily on the face’s frame to recognize faces. If this is proved, it could mean that face recognition in police lineups cannot be trusted, as it may lead to wrongful convictions. This study uses experiments to test this hypothesis using faces with different properties, such as eye size, while maintaining the frame as much as possible. In this project, we continue the work from a previous project that provided the basic tool to generate realistic faces using StyleGAN2. We take a deep dive into the internals of this tool to make full use of StyleGAN2 functionalities, while also adding more features, such as modifying certain attributes, including mouth-opening or eye-opening. As the usage of this tool heavily relies on maintaining the face-frame, we develop a way to identify the face-frame of each image and a function to compare it to the output of the neural network after applying some operations. We conclude that the face-frame is maintained when modifying eye-opening or mouth-opening. Modifying vertical face orientation, gender, age and smile has a considerable impact on frame variation. Finally, horizontal face orientation has a major impact on the face-frame. This way, the Lab may apply some operations being confident that the face-frame won’t significantly change, making them viable to be used to deceive subjects’ memories.

[AI-93] Diffusion Models and Representation Learning: A Survey

Link: https://arxiv.org/abs/2407.00783
Authors: Michael Fuest,Pingchuan Ma,Ming Gui,Johannes S. Fischer,Vincent Tao Hu,Bjorn Ommer
Keywords: attracting significant attention, Diffusion Models, popular generative modeling, generative modeling methods, attracting significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Github Repo: this https URL

Abstract:Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration. Github link: this https URL

[AI-94] Towards Faster Matrix Diagonalization with Graph Isomorphism Networks and the AlphaZero Framework

Link: https://arxiv.org/abs/2407.00779
Authors: Geigh Zollicoffer,Kshitij Bhatta,Manish Bhattarai,Phil Romero,Christian F. A. Negre,Anders M. N. Niklasson,Adetokunbo Adedoyin
Keywords: Markov Decision Process, Semi-Markov Decision Process, Markov Decision, introduce innovative approaches, Decision Process
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Accepted to Deployable RL: From Research to Practice workshop @ RLC conference

Abstract:In this paper, we introduce innovative approaches for accelerating the Jacobi method for matrix diagonalization, specifically through the formulation of large matrix diagonalization as a Semi-Markov Decision Process and small matrix diagonalization as a Markov Decision Process. Furthermore, we examine the potential of utilizing scalable architecture between different-sized matrices. During a short training period, our method discovered a significant reduction in the number of steps required for diagonalization and exhibited efficient inference capabilities. Importantly, this approach demonstrated possible scalability to large-sized matrices, indicating its potential for wide-ranging applicability. Upon training completion, we obtain action-state probabilities and transition graphs, which depict transitions between different states. These outputs not only provide insights into the diagonalization process but also pave the way for cost savings pertinent to large-scale matrices. The advancements made in this research enhance the efficacy and scalability of matrix diagonalization, pushing for new possibilities for deployment in practical applications in scientific and engineering domains.
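For readers unfamiliar with the Jacobi method mentioned in this abstract, the following is a minimal sketch of the classical algorithm for a symmetric matrix. It is not the authors' method: the pivot is chosen with the textbook largest-off-diagonal rule, which stands in for whatever ordering a learned policy would select, and the helper names (`jacobi_rotate`, `largest_off_diagonal`, `jacobi_eigenvalues`) are hypothetical. The returned step count is the quantity the abstract reports reducing.

```python
import math


def jacobi_rotate(a, p, q):
    """Return J^T A J, where J is the plane rotation that zeroes a[p][q]."""
    n = len(a)
    # Angle satisfying tan(2*theta) = 2*a_pq / (a_qq - a_pp).
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
    c, s = math.cos(theta), math.sin(theta)
    j = [[1.0 if r == k else 0.0 for k in range(n)] for r in range(n)]
    j[p][p], j[q][q], j[p][q], j[q][p] = c, c, s, -s
    # Plain O(n^3) products keep the sketch readable; real codes update
    # only rows and columns p and q.
    jt_a = [[sum(j[k][r] * a[k][col] for k in range(n)) for col in range(n)]
            for r in range(n)]
    return [[sum(jt_a[r][k] * j[k][col] for k in range(n)) for col in range(n)]
            for r in range(n)]


def largest_off_diagonal(a):
    """Classical pivot rule (assumption): index of the largest |a[r][c]|, r < c."""
    n = len(a)
    pairs = ((r, c) for r in range(n) for c in range(r + 1, n))
    return max(pairs, key=lambda rc: abs(a[rc[0]][rc[1]]))


def jacobi_eigenvalues(a, tol=1e-12, max_steps=1000):
    """Diagonalize a symmetric matrix; return (sorted eigenvalues, rotation count)."""
    a = [row[:] for row in a]
    steps = 0
    while steps < max_steps:
        p, q = largest_off_diagonal(a)
        if abs(a[p][q]) < tol:
            break
        a = jacobi_rotate(a, p, q)
        steps += 1
    return sorted(a[i][i] for i in range(len(a))), steps
```

Replacing `largest_off_diagonal` with a different pivot-selection function is the natural place to plug in a learned ordering; the rest of the loop would stay unchanged.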

[AI-95] Characterizing Stereotypical Bias from Privacy-preserving Pre-Training

Link: https://arxiv.org/abs/2407.00764
Authors: Stefan Arnold,Rene Gröbner,Annika Schreiner
Keywords: Differential Privacy, embedding space, applied to raw, exploiting the spatial, spatial arrangement
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Differential Privacy (DP) can be applied to raw text by exploiting the spatial arrangement of words in an embedding space. We investigate the implications of such text privatization on Language Models (LMs) and their tendency towards stereotypical associations. Since previous studies documented that linguistic proficiency correlates with stereotypical bias, one could assume that techniques for text privatization, which are known to degrade language modeling capabilities, would cancel out undesirable biases. By testing BERT models trained on texts containing biased statements primed with varying degrees of privacy, our study reveals that while stereotypical bias generally diminishes when privacy is tightened, text privatization does not uniformly equate to diminishing bias across all social domains. This highlights the need for careful diagnosis of bias in LMs that undergo text privatization.

[AI-96] Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation

Link: https://arxiv.org/abs/2407.00752
Authors: Peng Huang,Xue Gao,Lihong Huang,Jing Jiao,Xiaokang Li,Yuanyuan Wang,Yi Guo
Keywords: important implications, diverse and controllable, Stable Diffusion, adapt Stable Diffusion, common stable diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-image generation has important implications for generation of diverse and controllable images. Several attempts have been made to adapt Stable Diffusion (SD) to the medical domain. However, the large distribution difference between medical reports and natural texts, as well as the high computational complexity of standard Stable Diffusion, limit the authenticity and feasibility of the generated medical images. To solve the above problems, we propose a novel light-weight transformer-based diffusion model learning framework, Chest-Diffusion, for report-to-CXR generation. Chest-Diffusion employs a domain-specific text encoder to obtain accurate and expressive text features to guide image generation, improving the authenticity of the generated images. Meanwhile, we introduce a light-weight transformer architecture as the denoising model, reducing the computational complexity of the diffusion model. Experiments demonstrate that our Chest-Diffusion achieves the lowest FID score of 24.456, under the computation budget of 118.918 GFLOPs, which is nearly one-third of the computational complexity of SD.

[AI-97] A Comparative Study of Quality Evaluation Methods for Text Summarization

Link: https://arxiv.org/abs/2407.00747
Authors: Huyen Nguyen,Haihua Chen,Lavanya Pobbathi,Junhua Ding
Keywords: natural language processing, Evaluating text summarization, challenging task, task in natural, NLP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The paper is under review at Empirical Methods in Natural Language Processing (EMNLP) 2024. It has 15 pages and 4 figures

Abstract:Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics which heavily rely on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study on eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models were evaluated. We perform extensive experiments and analysis on datasets with patent documents. Our results show that LLM evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose an LLM-powered framework for automatically evaluating and improving text summarization, which is beneficial and could attract wide attention among the community.

[AI-98] Disentangled Representations for Causal Cognition

Link: https://arxiv.org/abs/2407.00744
Authors: Filippo Torresan,Manuel Baltieri
Keywords: Complex adaptive agents, combined agent-environment systems, Complex adaptive, adaptive agents consistently, agents consistently achieve
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 49 pages, 9 figures

Abstract:Complex adaptive agents consistently achieve their goals by solving problems that seem to require an understanding of causal information, information pertaining to the causal relationships that exist among elements of combined agent-environment systems. Causal cognition studies and describes the main characteristics of causal learning and reasoning in human and non-human animals, offering a conceptual framework to discuss cognitive performances based on the level of apparent causal understanding of a task. Despite the use of formal intervention-based models of causality, including causal Bayesian networks, psychological and behavioural research on causal cognition does not yet offer a computational account that operationalises how agents acquire a causal understanding of the world. Machine and reinforcement learning research on causality, especially involving disentanglement as a candidate process to build causal representations, represent on the one hand a concrete attempt at designing causal artificial agents that can shed light on the inner workings of natural causal cognition. In this work, we connect these two areas of research to build a unifying framework for causal cognition that will offer a computational perspective on studies of animal cognition, and provide insights in the development of new algorithms for causal reinforcement learning in AI.

[AI-99] AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Link: https://arxiv.org/abs/2407.00743
Authors: Sheng Wu,Jiaxing Liu,Longbiao Wang,Dongxiao He,Xiaobao Wang,Jianwu Dang
Keywords: natural language processing, Recognition in Conversations, Emotion Recognition, speaker in conversations, language processing
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network which performs rich representation learning through dimension transformation of different modalities and parameter-efficient inception block. On the other hand, the Modality Interaction Network performs interaction fusion of extracted inter-modal features and intra-modal features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics compared to the state-of-the-art (SOTA) models.

[AI-100] Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints

Link: https://arxiv.org/abs/2407.00741
Authors: Jianuo Huang
Keywords: Multi-agent Reinforcement Learning, Multi-agent Reinforcement, Reinforcement Learning, advancements in Multi-agent, safety-critical scenarios
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures

Abstract:In recent advancements in Multi-agent Reinforcement Learning (MARL), its application has extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action. Our framework is grounded in the Centralized Training with Decentralized Execution (CTDE) architecture, augmented by a Diffusion Model for prediction trajectory generation. Additionally, we incorporate a specialized algorithm to further ensure operational safety. We evaluate our model against baselines on the DSRL benchmark. Experiment results demonstrate that our model not only adheres to stringent safety constraints but also achieves superior performance compared to existing methodologies. This underscores the potential of our approach in advancing the safety and efficacy of MARL in real-world applications.

[AI-101] Large Language Models Struggle in Token-Level Clinical Named Entity Recognition

Link: https://arxiv.org/abs/2407.00731
Authors: Qiuhao Lu,Rui Li,Andrew Wen,Jinlian Wang,Liwei Wang,Hongfang Liu
Keywords: Large Language Models, Large Language, Named Entity Recognition, Language Models, token-level NER
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AMIA 2024 Annual Symposium Proceedings

Abstract:Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPT for token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.

[AI-102] Learning System Dynamics without Forgetting

Link: https://arxiv.org/abs/2407.00717
Authors: Xikun Zhang,Dongjin Song,Yushan Jiang,Yixin Chen,Dacheng Tao
Keywords: dynamics, governing rules, including physics, physics and biology, systems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Predicting the trajectories of systems with unknown dynamics (i.e., the governing rules) is crucial in various research fields, including physics and biology. This challenge has gathered significant attention from diverse communities. Most existing works focus on learning fixed system dynamics within one single system. However, real-world applications often involve multiple systems with different types of dynamics or evolving systems with non-stationary dynamics (dynamics shifts). When data from those systems are continuously collected and sequentially fed to machine learning models for training, these models tend to be biased toward the most recently learned dynamics, leading to catastrophic forgetting of previously observed/learned system dynamics. To this end, we aim to learn system dynamics via continual learning. Specifically, we present a novel framework of Mode-switching Graph ODE (MS-GODE), which can continually learn varying dynamics and encode the system-specific dynamics into binary masks over the model parameters. During the inference stage, the model can select the most confident mask based on the observational data to identify the system and predict future trajectories accordingly. Empirically, we systematically investigate the task configurations and compare the proposed MS-GODE with state-of-the-art techniques. More importantly, we construct a novel benchmark of biological dynamic systems, featuring diverse systems with disparate dynamics and significantly enriching the research field of machine learning for dynamic systems.

[AI-103] Tackling Long-Horizon Tasks with Model-based Offline Reinforcement Learning

Link: https://arxiv.org/abs/2407.00699
Authors: Kwanyoung Park,Youngwoon Lee
Keywords: generating imaginary trajectories, offline reinforcement learning, Lower Expectile Q-learning, reinforcement learning, compelling approach
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, it falls short in solving long-horizon tasks due to high bias in value estimation from model rollouts. In this paper, we introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ), which enhances long-horizon task performance by mitigating the high bias in model-based value estimation via expectile regression of λ-returns. Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches. Our experiments demonstrate that expectile regression, λ-returns, and critic training on offline data are all crucial for addressing long-horizon tasks. Additionally, LEQ achieves performance comparable to the state-of-the-art model-based and model-free offline RL methods on the NeoRL benchmark and the D4RL MuJoCo Gym tasks.
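The expectile regression named in this abstract can be illustrated with a small, generic sketch. It shows only the standard asymmetric squared loss and a simple way to fit an expectile of a sample; it is not the LEQ algorithm, and the function names and the choice `tau = 0.1` are assumptions made for illustration. With `tau` below 0.5 the fitted value sits below the mean, which is the conservative ("lower") behaviour the method's name refers to.

```python
def expectile_loss(target, prediction, tau):
    """Asymmetric squared loss: weight tau when the prediction is too low,
    1 - tau when it is too high. tau = 0.5 recovers (half) the squared error."""
    residual = target - prediction
    weight = tau if residual > 0 else 1.0 - tau
    return weight * residual * residual


def fit_expectile(samples, tau, iters=100):
    """Fit the tau-expectile of a sample by iteratively reweighted averaging.

    At the fixed point, sum(w_i * (x_i - m)) == 0, which is the first-order
    condition of the expectile loss above.
    """
    m = sum(samples) / len(samples)
    for _ in range(iters):
        weights = [tau if x > m else 1.0 - tau for x in samples]
        m = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return m


if __name__ == "__main__":
    returns = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical return estimates
    print(fit_expectile(returns, tau=0.5))  # the ordinary mean
    print(fit_expectile(returns, tau=0.1))  # a lower, more pessimistic value
```

In a value-learning setting the `samples` would be return targets and the fitted value a critic's output; how those targets are built from λ-returns is specific to the paper and not reproduced here.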

[AI-104] NourishNet: Proactive Severity State Forecasting of Food Commodity Prices for Global Warning Systems

Link: https://arxiv.org/abs/2407.00698
Authors: Sydney Balboni,Grace Ivey,Brett Storoe,John Cisler,Tyge Plater,Caitlyn Grant,Ella Bruce,Benjamin Paulson
Keywords: critical signal indicating, signal indicating potential, indicating potential disruptions, global food commodities, critical signal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); General Economics (econ.GN); Numerical Analysis (math.NA)
Comments: MICS 2024 1st Place Paper, MSOE AI-Club Research Group

Abstract:Price volatility in global food commodities is a critical signal indicating potential disruptions in the food market. Understanding forthcoming changes in these prices is essential for bolstering food security, particularly for nations at risk. The Food and Agriculture Organization of the United Nations (FAO) previously developed sophisticated statistical frameworks for the proactive prediction of food commodity prices, aiding in the creation of global early warning systems. These frameworks utilize food security indicators to produce accurate forecasts, thereby facilitating preparations against potential food shortages. Our research builds on these foundations by integrating robust price security indicators with cutting-edge deep learning (DL) methodologies to reveal complex interdependencies. DL techniques examine intricate dynamics among diverse factors affecting food prices. Through sophisticated time-series forecasting models coupled with a classification model, our approach enhances existing models to better support communities worldwide in advancing their food security initiatives.

[AI-105] CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation

Link: https://arxiv.org/abs/2407.00697
Authors: Huawei Sun,Hao Feng,Julius Ott,Lorenzo Servadei,Robert Wille
Keywords: Depth estimation, scenes accurately, driving for interpreting, critical in autonomous, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Accepted by IROS 2024

Abstract:Depth estimation is critical in autonomous driving for interpreting 3D scenes accurately. Recently, radar-camera depth estimation has become of sufficient interest due to the robustness and low-cost properties of radar. Thus, this paper introduces a two-stage, end-to-end trainable Confidence-aware Fusion Net (CaFNet) for dense depth estimation, combining RGB imagery with sparse and noisy radar point cloud data. The first stage addresses radar-specific challenges, such as ambiguous elevation and noisy measurements, by predicting a radar confidence map and a preliminary coarse depth map. A novel approach is presented for generating the ground truth for the confidence map, which involves associating each radar point with its corresponding object to identify potential projection surfaces. These maps, together with the initial radar input, are processed by a second encoder. For the final depth estimation, we innovate a confidence-aware gated fusion mechanism to integrate radar and image features effectively, thereby enhancing the reliability of the depth map by filtering out radar noise. Our methodology, evaluated on the nuScenes dataset, demonstrates superior performance, improving upon the current leading model by 3.2% in Mean Absolute Error (MAE) and 2.7% in Root Mean Square Error (RMSE).

[AI-106] Learning Formal Mathematics From Intrinsic Motivation

Link: https://arxiv.org/abs/2407.00695
Authors: Gabriel Poesia,David Broman,Nick Haber,Noah D. Goodman
Keywords: humanity coax mathematics, humanity coax, coax mathematics, Intrinsic Motivation, mathematics
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:How did humanity coax mathematics from the aether? We explore the Platonic view that mathematics can be discovered from its axioms - a game of conjecture and proof. We describe Minimo (Mathematics from Intrinsic Motivation): an agent that jointly learns to pose challenging problems for itself (conjecturing) and solve them (theorem proving). Given a mathematical domain axiomatized in dependent type theory, we first combine methods for constrained decoding and type-directed synthesis to sample valid conjectures from a language model. Our method guarantees well-formed conjectures by construction, even as we start with a randomly initialized model. We use the same model to represent a policy and value function for guiding proof search. Our agent targets generating hard but provable conjectures - a moving target, since its own theorem proving ability also improves as it trains. We propose novel methods for hindsight relabeling on proof search trees to significantly improve the agent’s sample efficiency in both tasks. Experiments on 3 axiomatic domains (propositional logic, arithmetic and group theory) demonstrate that our agent can bootstrap from only the axioms, self-improving in generating true and challenging conjectures and in finding proofs.

[AI-107] BAPO: Base-Anchored Preference Optimization for Personalized Alignment in Large Language Models

Link: https://arxiv.org/abs/2407.00693
Authors: Gihun Lee,Minchan Jeong,Yujin Kim,Hojung Jung,Jaehoon Oh,Sangmook Kim,Se-Young Yun
Keywords: align Large Language, Large Language Models, Large Language, shown remarkable success, align Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: under review

Abstract:While learning to align Large Language Models (LLMs) with human preferences has shown remarkable success, aligning these models to meet the diverse user preferences presents further challenges in preserving previous knowledge. This paper examines the impact of personalized preference optimization on LLMs, revealing that the extent of knowledge loss varies significantly with preference heterogeneity. Although previous approaches have utilized the KL constraint between the reference model and the policy model, we observe that they fail to maintain general knowledge and alignment when facing personalized preferences. To this end, we introduce Base-Anchored Preference Optimization (BAPO), a simple yet effective approach that utilizes the initial responses of reference model to mitigate forgetting while accommodating personalized alignment. BAPO effectively adapts to diverse user preferences while minimally affecting global knowledge or general alignment. Our experiments demonstrate the efficacy of BAPO in various setups.

[AI-108] SCMIL: Sparse Context-aware Multiple Instance Learning for Predicting Cancer Survival Probability Distribution in Whole Slide Images

Link: https://arxiv.org/abs/2407.00664
Authors: Zekang Yang,Hong Liu,Xiangdong Wang
Keywords: Slide Image, Cancer survival prediction, Cancer survival, involves analyzing, tumor microenvironment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI2024

Abstract:Cancer survival prediction is a challenging task that involves analyzing the tumor microenvironment within Whole Slide Images (WSIs). Previous methods cannot effectively capture the intricate interaction features among instances within the local area of a WSI. Moreover, existing methods for cancer survival prediction based on WSIs often fail to provide clinically meaningful predictions. To overcome these challenges, we propose a Sparse Context-aware Multiple Instance Learning (SCMIL) framework for predicting cancer survival probability distributions. SCMIL innovatively segments patches into various clusters based on their morphological features and spatial location information, subsequently leveraging sparse self-attention to discern the relationships between these patches with a context-aware perspective. Considering many patches are irrelevant to the task, we introduce a learnable patch filtering module called SoftFilter, which ensures that only interactions between task-relevant patches are considered. To enhance the clinical relevance of our prediction, we propose a register-based mixture density network to forecast the survival probability distribution for individual patients. We evaluate SCMIL on two public WSI datasets from The Cancer Genome Atlas (TCGA), specifically focusing on lung adenocarcinoma (LUAD) and kidney renal clear cell carcinoma (KIRC). Our experimental results indicate that SCMIL outperforms current state-of-the-art methods for survival prediction, offering more clinically meaningful and interpretable outcomes. Our code is accessible at this https URL.

[AI-109] Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach

Link: https://arxiv.org/abs/2407.00662
Authors: Nhat-Minh Huynh,Hoang-Giang Cao,I-Chen Wu
Keywords: received considerable attention, recent years, received considerable, considerable attention, attention from researchers
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted at The First Workshop on Game AI Algorithms and Multi-Agent Learning - IJCAI 2024

Abstract:Pommerman is a multi-agent environment that has received considerable attention from researchers in recent years. This environment is an ideal benchmark for multi-agent training, providing a battleground for two teams with communication capabilities among allied agents. Pommerman presents significant challenges for model-free reinforcement learning due to delayed action effects, sparse rewards, and false positives, where opponent players can lose due to their own mistakes. This study introduces a system designed to train multi-agent systems to play Pommerman using a combination of curriculum learning and population-based self-play. We also tackle two challenging problems when deploying the multi-agent training system for competitive games: sparse reward and suitable matchmaking mechanism. Specifically, we propose an adaptive annealing factor based on agents’ performance to adjust the dense exploration reward during training dynamically. Additionally, we implement a matchmaking mechanism utilizing the Elo rating system to pair agents effectively. Our experimental results demonstrate that our trained agent can outperform top learning agents without requiring communication among allied agents.
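The Elo rating system that this abstract uses for matchmaking can be sketched in a few lines. The update rule below is the standard Elo formula; the K-factor of 32, the 400-point scale, and the nearest-rating pairing rule in `closest_opponent` are illustrative assumptions, since the abstract does not state the paper's actual settings or pairing policy.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is the K-factor; 32 is a common default, assumed here.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def closest_opponent(agent, ratings):
    """Hypothetical pairing rule: pick the other agent with the nearest rating."""
    others = (name for name in ratings if name != agent)
    return min(others, key=lambda name: abs(ratings[name] - ratings[agent]))


if __name__ == "__main__":
    ratings = {"agent_a": 1500.0, "agent_b": 1700.0, "agent_c": 1520.0}
    rival = closest_opponent("agent_a", ratings)
    ratings["agent_a"], ratings[rival] = update_elo(
        ratings["agent_a"], ratings[rival], score_a=1.0
    )
    print(rival, ratings)
```

Pairing agents of similar rating keeps matches informative for both sides, which is one plausible reason to use ratings for matchmaking in a self-play population.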

[AI-110] Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

Link: https://arxiv.org/abs/2407.00653
Authors: Yifei Zhang, Xintao Wang, Jiaqing Liang, Sirui Xia, Lida Chen, Yanghua Xiao
Keywords: Large Language Models, natural language processing, Large Language, exhibited impressive proficiency, involve increasingly complex
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have exhibited impressive proficiency in various natural language processing (NLP) tasks, which involve increasingly complex reasoning. Knowledge reasoning, a primary type of reasoning, aims at deriving new knowledge from existing knowledge. While it has been widely studied in the context of knowledge graphs (KGs), knowledge reasoning in LLMs remains underexplored. In this paper, we introduce Chain-of-Knowledge (CoK), a comprehensive framework for knowledge reasoning, including methodologies for both dataset construction and model learning. For dataset construction, we create KnowReason via rule mining on KGs. For model learning, we observe rule overfitting induced by naive training. Hence, we enhance CoK with a trial-and-error mechanism that simulates the human process of internal knowledge exploration. We conduct extensive experiments with KnowReason. Our results show the effectiveness of CoK in refining LLMs in not only knowledge reasoning, but also general reasoning benchmarks.

[AI-111] HASNAS: A Hardware-Aware Spiking Neural Architecture Search Framework for Neuromorphic Compute-in-Memory Systems

Link: https://arxiv.org/abs/2407.00641
Authors: Rachmad Vidya Wicaksana Putra, Muhammad Shafique
Keywords: Artificial Neural Networks, Spiking Neural Networks, machine learning tasks, SNN, solving diverse machine
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 9 pages, 13 figures, 2 tables

Abstract:Spiking Neural Networks (SNNs) have shown capabilities for solving diverse machine learning tasks with ultra-low-power/energy computation. To further improve the performance and efficiency of SNN inference, the Compute-in-Memory (CIM) paradigm with emerging device technologies such as resistive random access memory is employed. However, most SNN architectures are developed without considering constraints from the application and the underlying CIM hardware (e.g., memory, area, latency, and energy consumption). Moreover, most SNN designs are derived from Artificial Neural Networks, whose network operations are different from those of SNNs. These limitations hinder SNNs from reaching their full potential in accuracy and efficiency. To address this, we propose HASNAS, a novel hardware-aware spiking neural architecture search (NAS) framework for neuromorphic CIM systems that finds an SNN that offers high accuracy under the given memory, area, latency, and energy constraints. To achieve this, HASNAS employs the following key steps: (1) optimizing SNN operations to achieve high accuracy, (2) developing an SNN architecture that facilitates an effective learning process, and (3) devising a systematic hardware-aware search algorithm to meet the constraints. The experimental results show that HASNAS quickly finds an SNN that maintains high accuracy comparable to the state-of-the-art, with up to 11x speed-up, and meets the given constraints: 4x10^6 parameters of memory, 100mm^2 of area, 400ms of latency, and 120uJ energy consumption for CIFAR10 and CIFAR100, while the state-of-the-art fails to meet the constraints. In this manner, HASNAS can enable efficient design automation for providing high-performance and energy-efficient neuromorphic CIM systems for diverse applications.

[AI-112] TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets

Link: https://arxiv.org/abs/2407.00631
Authors: Jintai Chen, Yaojun Hu, Yue Wang, Yingzhou Lu, Xu Cao, Miao Lin, Hongxia Xu, Jian Wu, Cao Xiao, Jimeng Sun, Lucas Glass, Kexin Huang, Marinka Zitnik, Tianfan Fu
Keywords: waste immense efforts, immense efforts spanning, clinical trial design, clinical trial, trial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Clinical trials are pivotal for developing new medical treatments, yet they typically pose some risks such as patient mortality, adverse events, and enrollment failure that waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to forecast or simulate key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition requiring medical expertise and a deep understanding of trial designs have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design, encompassing prediction of trial duration, patient dropout rate, serious adverse event, mortality rate, trial approval outcome, trial failure reason, drug dose finding, and design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets’ usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development. The curated dataset, metrics, and basic models are publicly available at this https URL.

[AI-113] Maximum Entropy Inverse Reinforcement Learning of Diffusion Models with Energy-Based Models

Link: https://arxiv.org/abs/2407.00626
Authors: Sangwoong Yoon, Himchan Hwang, Dohyun Kwon, Yung-Kyun Noh, Frank C. Park
Keywords: Maximum Entropy IRL, diffusion generative models, diffusion model, maximum entropy inverse, diffusion
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code is released at this https URL

Abstract:We present a maximum entropy inverse reinforcement learning (IRL) approach for improving the sample quality of diffusion generative models, especially when the number of generation time steps is small. Similar to how IRL trains a policy based on the reward function learned from expert demonstrations, we train (or fine-tune) a diffusion model using the log probability density estimated from training data. Since we employ an energy-based model (EBM) to represent the log density, our approach boils down to the joint training of a diffusion model and an EBM. Our IRL formulation, named Diffusion by Maximum Entropy IRL (DxMI), is a minimax problem that reaches equilibrium when both models converge to the data distribution. The entropy maximization plays a key role in DxMI, facilitating the exploration of the diffusion model and ensuring the convergence of the EBM. We also propose Diffusion by Dynamic Programming (DxDP), a novel reinforcement learning algorithm for diffusion models, as a subroutine in DxMI. DxDP makes the diffusion model update in DxMI efficient by transforming the original problem into an optimal control formulation where value functions replace back-propagation in time. Our empirical studies show that diffusion models fine-tuned using DxMI can generate high-quality samples in as few as 4 and 10 steps. Additionally, DxMI enables the training of an EBM without MCMC, stabilizing EBM training dynamics and enhancing anomaly detection performance.

[AI-114] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Link: https://arxiv.org/abs/2407.00617
Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
Keywords: achieved great success, aligning large language, large language models, Human Feedback, Reinforcement Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)

Abstract:Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
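
To make the contrast between the Bradley-Terry (BT) assumption and a general preference model concrete, here is a minimal sketch. It is not the INPO algorithm: `bt_preference` is the standard BT formula, while `win_rate` and the cyclic matrix `CYCLIC` are illustrative assumptions showing a preference structure that no scalar reward can reproduce.

```python
import math


def bt_preference(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def win_rate(policy, opponent, pref) -> float:
    """Expected win rate of `policy` against `opponent`.

    policy and opponent are probability vectors over a finite set of
    responses; pref[i][j] is the probability that response i beats j.
    """
    return sum(
        policy[i] * opponent[j] * pref[i][j]
        for i in range(len(policy))
        for j in range(len(opponent))
    )


# Hypothetical cyclic preferences (0 beats 2, 2 beats 1, 1 beats 0).
# BT preferences come from scalar rewards and are therefore transitive,
# so no reward assignment can reproduce this matrix.
CYCLIC = [
    [0.5, 0.0, 1.0],
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.5],
]

uniform = [1 / 3, 1 / 3, 1 / 3]
rates = [win_rate(uniform, [float(i == j) for i in range(3)], CYCLIC) for j in range(3)]
```

In this toy game the uniform mixture wins half the time against every single response, which is the defining property of a Nash policy in a symmetric two-player preference game.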

[AI-115] Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Link: https://arxiv.org/abs/2407.00608
Authors: Shian Du, Xiaotian Cheng, Qi Qian, Henglu Wei, Yi Xu, Xiangyang Ji
Keywords: attracted unprecedented attention, generating highly-personalized images, input concept dataset, textual prompt, input textual prompt
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Personalized text-to-image generation has attracted unprecedented attention in recent years due to its unique capability of generating highly personalized images from an input concept dataset and a novel textual prompt. However, previous methods focus solely on the performance of the reconstruction task, degrading their ability to combine with different textual prompts. Besides, optimizing in the high-dimensional embedding space usually leads to an unnecessarily time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding can not only faithfully reconstruct the input image, but also significantly improve its alignment with a novel input textual prompt. Furthermore, we observe that optimizing in the textual subspace leads to a significant improvement in robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.

[AI-116] GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing

Link: https://arxiv.org/abs/2407.00600
Authors: Yisong Xiao, Aishan Liu, QianJia Cheng, Zhenfei Yin, Siyuan Liang, Jiapeng Li, Jing Shao, Xianglong Liu, Dacheng Tao
Keywords: Large Vision-Language Models, Large Vision-Language, exhibit significant gender, widely adopted, exhibit significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract:Large Vision-Language Models (LVLMs) have been widely adopted in various applications; however, they exhibit significant gender biases. Existing benchmarks primarily evaluate gender bias at the demographic group level, neglecting individual fairness, which emphasizes equal treatment of similar individuals. This research gap limits the detection of discriminatory behaviors, as individual fairness offers a more granular examination of biases that group fairness may overlook. For the first time, this paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in LVLMs using counterfactual visual questions under individual fairness criteria. To construct this benchmark, we first utilize text-to-image diffusion models to generate occupation images and their gender counterfactuals. Subsequently, we generate corresponding textual occupation options by identifying stereotyped occupation pairs with high semantic similarity but opposite gender proportions in real-world statistics. This method enables the creation of large-scale visual question counterfactuals to expose biases in LVLMs, applicable in both multimodal and unimodal contexts through modifying gender attributes in specific modalities. Overall, our GenderBias-VL benchmark comprises 34,581 visual question counterfactual pairs, covering 177 occupations. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs (e.g., LLaVA) and state-of-the-art commercial APIs, including GPT-4o and Gemini-Pro. Our findings reveal widespread gender biases in existing LVLMs. Our benchmark offers: (1) a comprehensive dataset for occupation-related gender bias evaluation; (2) an up-to-date leaderboard on LVLM biases; and (3) a nuanced understanding of the biases presented by these models. The dataset and code are available at this https URL.

[AI-117] Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

Link: https://arxiv.org/abs/2407.00569
Authors: Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, Bing Qin
Keywords: Large Vision-Language Models, Large Vision-Language, understanding visual information, human languages, generated hallucinations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Main Conference. 21 pages, 20 figures

Abstract:Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs’ subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs’ behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31%, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than 24% of the snowballed multimodal hallucination while maintaining capabilities.

[AI-118] Divide And Conquer: Learning Chaotic Dynamical Systems With Multistep Penalty Neural Ordinary Differential Equations

Link: https://arxiv.org/abs/2407.00568
Authors: Dibyajyoti Chakraborty, Seung Whan Chung, Romit Maulik
Keywords: Forecasting high-dimensional dynamical, Neural Ordinary Differential, high-dimensional dynamical systems, chaotic dynamical systems, Ordinary Differential Equations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 Figures, submitted to Journal of Computational Physics

Abstract:Forecasting high-dimensional dynamical systems is a fundamental challenge in various fields, such as the geosciences and engineering. Neural Ordinary Differential Equations (NODEs), which combine the power of neural networks and numerical solvers, have emerged as a promising algorithm for forecasting complex nonlinear dynamical systems. However, classical techniques used for NODE training are ineffective for learning chaotic dynamical systems. In this work, we propose a novel NODE-training approach that allows for robust learning of chaotic dynamical systems. Our method addresses the challenges of non-convexity and exploding gradients associated with underlying chaotic dynamics. Training data trajectories from such systems are split into multiple, non-overlapping time windows. In addition to the deviation from the training data, the optimization loss term further penalizes the discontinuities of the predicted trajectory between the time windows. The window size is selected based on the fastest Lyapunov time scale of the system. The multi-step penalty (MP) method is first demonstrated on the Lorenz equation to illustrate how it improves the loss landscape, thereby accelerating the optimization convergence. The MP method can optimize chaotic systems in a manner similar to least-squares shadowing with significantly lower computational costs. Our proposed algorithm, denoted the Multistep Penalty NODE (MP-NODE), is applied to chaotic systems such as the Kuramoto-Sivashinsky equation and the two-dimensional Kolmogorov flow. It is observed that MP-NODE provides viable performance for such chaotic systems, not only for short-term trajectory predictions but also for invariant statistics that are hallmarks of the chaotic nature of these dynamics.
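
The loss structure described in this abstract (data mismatch within each window plus a penalty on jumps between windows) can be sketched as follows. This is a simplified illustration rather than the paper's formulation: it uses a scalar state, a forward Euler integrator, a fixed vector field `f` standing in for the neural network, and a hypothetical penalty weight `mu`.

```python
def euler_rollout(f, x0: float, dt: float, n_steps: int):
    """Integrate dx/dt = f(x) with forward Euler; returns n_steps + 1 states."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return xs


def multistep_penalty_loss(f, window_starts, data, dt, window_len, mu):
    """Data mismatch inside each window plus a penalty on window-to-window jumps.

    window_starts[k] is the (free) initial state of window k, and data holds
    len(window_starts) * window_len observations on a uniform time grid.
    """
    data_loss = 0.0
    penalty = 0.0
    for k, x0 in enumerate(window_starts):
        pred = euler_rollout(f, x0, dt, window_len)
        target = data[k * window_len:(k + 1) * window_len]
        # Compare the first window_len predicted states with the observations.
        data_loss += sum((p - y) ** 2 for p, y in zip(pred, target))
        # The end of window k should meet the start of window k + 1.
        if k + 1 < len(window_starts):
            penalty += (pred[window_len] - window_starts[k + 1]) ** 2
    return data_loss + mu * penalty


decay = lambda x: -x  # stand-in dynamics, assumed for this example
data = euler_rollout(decay, 1.0, 0.1, 9)  # 10 observations
continuous = multistep_penalty_loss(decay, [data[0], data[5]], data, 0.1, 5, mu=10.0)
broken = multistep_penalty_loss(decay, [data[0], data[5] + 0.2], data, 0.1, 5, mu=10.0)
```

Because each window is integrated only over a short horizon, gradients do not have to propagate through the full trajectory, which is the intuition behind using short windows for chaotic systems.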

[AI-119] A Contextual Combinatorial Bandit Approach to Negotiation

Link: https://arxiv.org/abs/2407.00567
Authors: Yexin Li, Zhancun Mu, Siyuan Qi
Keywords: Learning effective negotiation, Learning effective, effective negotiation strategies, negotiation strategies poses, large action spaces
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Learning effective negotiation strategies poses two key challenges: the exploration-exploitation dilemma and dealing with large action spaces. However, there is an absence of learning-based approaches that effectively address these challenges in negotiation. This paper introduces a comprehensive formulation to tackle various negotiation problems. Our approach leverages contextual combinatorial multi-armed bandits, with the bandits resolving the exploration-exploitation dilemma and the combinatorial nature handling large action spaces. Building upon this formulation, we introduce NegUCB, a novel method that also handles common issues such as partial observations and complex reward functions in negotiation. NegUCB is contextual and tailored for full-bandit feedback without constraints on the reward functions. Under mild assumptions, it ensures a sub-linear regret upper bound. Experiments conducted on three negotiation tasks demonstrate the superiority of our approach.
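
As background for the exploration-exploitation trade-off mentioned here, the sketch below shows the classic UCB1 selection rule for a plain (non-contextual, non-combinatorial) multi-armed bandit. It is a much simpler technique than NegUCB and is included only to illustrate the "optimism under uncertainty" idea; the function names are chosen for this example.

```python
import math


def ucb1_index(mean: float, count: int, t: int) -> float:
    """UCB1 index: empirical mean plus an exploration bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / count)


def select_arm(counts, means, t: int) -> int:
    """Pick the next arm at round t (t >= 1).

    Arms that were never pulled are tried first; afterwards the arm with
    the largest UCB1 index is chosen.
    """
    for arm, count in enumerate(counts):
        if count == 0:
            return arm
    return max(range(len(counts)), key=lambda a: ucb1_index(means[a], counts[a], t))


def record_reward(counts, means, arm: int, reward: float) -> None:
    """Update the pull count and running mean of the chosen arm in place."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```

An arm with few pulls receives a large bonus, so it keeps being explored until its estimate is reliable; this is the mechanism that yields sub-linear regret for UCB-style methods.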

[AI-120] Cooperative Advisory Residual Policies for Congestion Mitigation

Link: https://arxiv.org/abs/2407.00553
Authors: Aamir Hasan, Neeloy Chakraborty, Haonan Chen, Jung-Hoon Cho, Cathy Wu, Katherine Driggs-Campbell
Keywords: autonomous vehicle fleets, simple actions, improving many socioeconomic, socioeconomic factors, mitigate traffic congestion
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Fleets of autonomous vehicles can mitigate traffic congestion through simple actions, thus improving many socioeconomic factors such as commute time and gas costs. However, these approaches are limited in practice as they assume precise control over autonomous vehicle fleets, incur extensive installation costs for a centralized sensor ecosystem, and also fail to account for uncertainty in driver behavior. To this end, we develop a class of learned residual policies that can be used in cooperative advisory systems and only require the use of a single vehicle with a human driver. Our policies advise drivers to behave in ways that mitigate traffic congestion while accounting for diverse driver behaviors, particularly drivers’ reactions to instructions, to provide an improved user experience. To realize such policies, we introduce an improved reward function that explicitly addresses congestion mitigation and driver attitudes to advice. We show that our residual policies can be personalized by conditioning them on an inferred driver trait that is learned in an unsupervised manner with a variational autoencoder. Our policies are trained in simulation with our novel instruction adherence driver model, and evaluated in simulation and through a user study (N=16) to capture the sentiments of human drivers. Our results show that our approaches successfully mitigate congestion while adapting to different driver behaviors, with up to 20% and 40% improvement as measured by a combination metric of speed and deviations in speed across time over baselines in our simulation tests and user study, respectively. Our user study further shows that our policies are human-compatible and personalize to drivers.

[AI-121] Answering real-world clinical questions using large language model based systems

Link: https://arxiv.org/abs/2407.00541
Authors: Yen Sia Low(1), Michael L. Jackson(1), Rebecca J. Hyde(1), Robert E. Brown(1), Neil M. Sanghavi(1), Julian D. Baldwin(1), C. William Pike(1), Jananee Muralidharan(1), Gavin Hui(1 and 2), Natasha Alexander(3), Hadeel Hassan(3), Rahul V. Nene(4), Morgan Pike(5), Courtney J. Pokrzywa(6), Shivam Vedak(7), Adam Paul Yan(3), Dong-han Yao(7), Amy R. Zipursky(3), Christina Dinh(1), Philip Ballentine(1), Dan C. Derieg(1), Vladimir Polony(1), Rehan N. Chawdry(1), Jordan Davies(1), Brigham B. Hyde(1), Nigam H. Shah(1 and 7), Saurabh Gombar(1 and 8) ((1) Atropos Health, New York NY, USA, (2) Department of Medicine, University of California, Los Angeles CA, USA, (3) Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada, (4) Department of Emergency Medicine, University of California, San Diego CA, USA, (5) Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA, (6) Department of Surgery, Columbia University, New York NY, USA, (7) Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA, (8) Department of Pathology, Stanford University, Stanford CA, USA)
Keywords: guide healthcare decisions, contextualizing existing research, guide healthcare, healthcare decisions, difficulty in contextualizing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 28 pages (2 figures, 3 tables) inclusive of 8 pages of supplemental materials (4 supplemental figures and 4 supplemental tables)

Abstract:Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.

[AI-122] Privacy-Preserving and Trustworthy Deep Learning for Medical Imaging

Link: https://arxiv.org/abs/2407.00538
Authors: Kiarash Sedghighadikolaei, Attila A Yavuz
Keywords: Deep Radiomics, Deep Radiomics pipeline, impacted healthcare systems, Machine Learning, notably impacted healthcare
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:The shift towards efficient and automated data analysis through Machine Learning (ML) has notably impacted healthcare systems, particularly Radiomics. Radiomics leverages ML to analyze medical images accurately and efficiently for precision medicine. Current methods rely on Deep Learning (DL) to improve performance and accuracy (Deep Radiomics). Given the sensitivity of medical images, ensuring privacy throughout the Deep Radiomics pipeline, from data generation and collection to model training and inference, is essential, especially when outsourced. Thus, Privacy-Enhancing Technologies (PETs) are crucial tools for Deep Radiomics. Previous studies and systematization efforts have either broadly overviewed PETs and their applications or mainly focused on subsets of PETs for ML algorithms. In Deep Radiomics, where efficiency, accuracy, and privacy are crucial, many PETs, while theoretically applicable, may not be practical without specialized optimizations or hybrid designs. Additionally, not all DL models are suitable for Radiomics. Consequently, there is a need for specialized studies that investigate and systematize the effective and practical integration of PETs into the Deep Radiomics pipeline. This work addresses this research gap by (1) classifying existing PETs, presenting practical hybrid PET constructions, and a taxonomy illustrating their potential integration with the Deep Radiomics pipeline, with comparative analyses detailing assumptions, architectural suitability, and security, (2) offering technical insights, describing potential challenges and means of combining PETs into the Deep Radiomics pipeline, including integration strategies, subtleties, and potential challenges, and (3) proposing potential research directions, identifying challenges, and suggesting solutions to enhance the PETs in Deep Radiomics.

[AI-123] Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders

Link: https://arxiv.org/abs/2407.00531
Authors: Hok-Shing Lau, Mark Huntly, Nathon Morgan, Adesua Iyenoma, Biao Zeng, Tim Bashford
Keywords: Automatic Speech Assessment, Speech Assessment, health assessment, clinically relevant, Automatic Speech
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Speech contains information that is clinically relevant to some diseases, which has the potential to be used for health assessment. Recent work shows an interest in applying deep learning algorithms, especially pretrained large speech models, to Automatic Speech Assessment. One question that has not been explored is how these models output the results based on their inputs. In this work, we train and compare two configurations of Audio Spectrogram Transformer in the context of Voice Disorder Detection and apply the attention rollout method to produce model relevance maps, i.e., the computed relevance of the spectrogram regions when the model makes predictions. We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is finetuned, and the model attention is concentrated on specific phoneme regions.
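
Attention rollout, the method named in this abstract, is commonly computed by mixing each layer's head-averaged attention matrix with the identity (to account for residual connections) and multiplying the results across layers. The sketch below follows that common formulation in plain Python; it assumes the per-layer matrices are already averaged over heads and row-normalized, and it is not taken from the authors' code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def attention_rollout(layer_attentions):
    """Combine per-layer attention matrices into one token-to-token relevance map.

    layer_attentions is ordered from the first layer to the last; each entry
    is a square, row-normalized matrix (already averaged over heads).
    """
    n = len(layer_attentions[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attentions:
        # 0.5 * A + 0.5 * I models the residual connection around attention.
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        rollout = matmul(mixed, rollout)
    return rollout


uniform_layer = [[0.5, 0.5], [0.5, 0.5]]
relevance = attention_rollout([uniform_layer, uniform_layer])
```

For a spectrogram transformer, the row of the result that belongs to the classification token can be reshaped onto the patch grid to obtain a relevance map over time and frequency.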

[AI-124] Test Case Features as Hyper-heuristics for Inductive Programming

Link: https://arxiv.org/abs/2407.00519
Authors: Edward McDaid, Sarah McDaid
Keywords: Instruction subsets, subsets, programming search space, inductive programming search, search space
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures. Accepted for 20th IFIP WG 12.5 International Conference, AIAI 2024 Corfu, Greece, June 27-30, 2024

Abstract:Instruction subsets are heuristics that can reduce the size of the inductive programming search space by tens of orders of magnitude. Comprising many overlapping subsets of different sizes, they serve as predictions of the instructions required to code a solution for any problem. Currently, this approach employs a single, large family of subsets meaning that some problems can search thousands of subsets before a solution is found. In this paper we introduce the use of test case type signatures as hyper-heuristics to select one of many, smaller families of instruction subsets. The type signature for any set of test cases maps directly to a single family and smaller families mean that fewer subsets need to be considered for most problems. Having many families also permits subsets to be reordered to better reflect their relative occurrence in human code - again reducing the search space size for many problems. Overall the new approach can further reduce the size of the inductive programming search space by between 1 and 3 orders of magnitude, depending on the type signature. Larger and more consistent reductions are possible through the use of more sophisticated type systems. The potential use of additional test case features as hyper-heuristics and some other possible future work is also briefly discussed.

[AI-125] Stochastic stem bucking using mixture density neural networks

Link: https://arxiv.org/abs/2407.00510
Authors: Simon Schmiedel
Keywords: Poor bucking decisions, bucking decisions made, bucking decisions, Poor bucking, bucking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Poor bucking decisions made by forest harvesters can have a negative effect on the products that are generated from the logs. Making the right bucking decisions is not an easy task because harvesters must rely on predictions of the stem profile for the part of the stems that is not yet measured. The goal of this project is to improve the bucking decisions made by forest harvesters with a stochastic bucking method. We developed a Long Short-Term Memory (LSTM) neural network that predicted the parameters of a Gaussian distribution conditioned on the known part of the stem, enabling the creation of multiple samples of stem profile predictions for the unknown part of the stem. The bucking decisions could then be optimized using a novel stochastic bucking algorithm which used all the stem profiles generated to choose the logs to generate from the stem. The stochastic bucking algorithm was compared to two benchmark models: A polynomial model that could not condition its predictions on more than one diameter measurement, and a deterministic LSTM neural network. All models were evaluated on stem profiles of four coniferous species prevalent in eastern Canada. In general, the best bucking decisions were taken by the stochastic LSTM models, demonstrating the usefulness of the method. The second-best results were mostly obtained by the deterministic LSTM model and the worst results by the polynomial model, corroborating the usefulness of conditioning the stem curve predictions on multiple measurements.
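
The idea of sampling several plausible stem profiles and choosing the cutting pattern with the best average outcome can be sketched as below. Everything specific here is an assumption made for illustration: diameters are sampled independently per position from given Gaussian parameters (in the paper these come from an LSTM), and `pattern_value` is a hypothetical value function, not the one used by the author.

```python
import random


def sample_profiles(means, stds, n_samples: int, seed: int = 0):
    """Draw stem profiles (lists of diameters) from per-position Gaussians."""
    rng = random.Random(seed)
    return [
        [rng.gauss(m, s) for m, s in zip(means, stds)]
        for _ in range(n_samples)
    ]


def pattern_value(cut_points, profile, min_diameter: float) -> float:
    """Hypothetical value of a cutting pattern on one stem profile.

    cut_points are increasing end indices of the logs. A log is worth its
    length times its small-end diameter, or nothing if that diameter is
    below min_diameter.
    """
    value, start = 0.0, 0
    for end in cut_points:
        small_end = profile[end - 1]
        if small_end >= min_diameter:
            value += (end - start) * small_end
        start = end
    return value


def best_pattern(patterns, profiles, min_diameter: float):
    """Choose the pattern with the highest average value over all samples."""
    def expected(pattern):
        total = sum(pattern_value(pattern, p, min_diameter) for p in profiles)
        return total / len(profiles)
    return max(patterns, key=expected)


profiles = sample_profiles([30.0, 28.0, 26.0, 12.0], [0.0, 0.0, 0.0, 0.0], n_samples=5)
choice = best_pattern([[4], [3, 4]], profiles, min_diameter=20.0)
```

With non-zero standard deviations, a pattern that is optimal for the mean profile can lose to one that is more robust across samples, which is the motivation for optimizing over the whole set of generated profiles.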

[AI-126] Leveraging Ontologies to Document Bias in Data

Link: https://arxiv.org/abs/2407.00509
Authors: Mayra Russo, Maria-Esther Vidal
Keywords: amplifying undesired biases, systems are capable, capable of reproducing, amplifying undesired, Machine Learning
Categories: Artificial Intelligence (cs.AI)

Abstract:Machine Learning (ML) systems are capable of reproducing and often amplifying undesired biases. This puts emphasis on the importance of operating under practices that enable the study and understanding of the intrinsic characteristics of ML pipelines, prompting the emergence of documentation frameworks with the idea that “any remedy for bias starts with awareness of its existence”. However, a resource that can formally describe these pipelines in terms of biases detected is still missing. To fill this gap, we present the Doc-BiasO ontology, a resource that aims to create an integrated vocabulary of biases defined in the fair-ML literature and their measures, as well as to incorporate relevant terminology and the relationships between them. Overseeing ontology engineering best practices, we re-use existing vocabulary on machine learning and AI, to foster knowledge sharing and interoperability between the actors concerned with its research, development, regulation, among others. Overall, our main objective is to contribute towards clarifying existing terminology on bias research as it rapidly expands to all areas of AI and to improve the interpretation of bias in data and downstream impact.

[AI-127] ShapG: new feature importance method based on the Shapley value

Link: https://arxiv.org/abs/2407.00506
Authors: Chi Zhao, Jing Liu, Elena Parilina
Keywords: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, make decisions, Intelligence
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:With the wide application of Artificial Intelligence (AI), it has become particularly important to make decisions of AI systems explainable and transparent. In this paper, we propose a new Explainable Artificial Intelligence (XAI) method called ShapG (Explanations based on Shapley value for Graphs) for measuring feature importance. ShapG is a model-agnostic global explanation method. At the first stage, it defines an undirected graph based on the dataset, where nodes represent features and edges are added based on calculation of correlation coefficients between features. At the second stage, it calculates an approximated Shapley value by sampling the data taking into account this graph structure. The sampling approach of ShapG allows the importance of features to be calculated efficiently, i.e., with reduced computational complexity. Comparison of ShapG with other existing XAI methods shows that it provides more accurate explanations for two examined datasets. We also compared other XAI methods developed based on cooperative game theory with ShapG in running time, and the results show that ShapG exhibits obvious advantages in its running time, which further demonstrates the efficiency of ShapG. In addition, extensive experiments demonstrate a wide range of applicability of the ShapG method for explaining complex models. We find ShapG an important tool in improving explainability and transparency of AI systems and believe it can be widely used in various fields.
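For readers unfamiliar with sampled Shapley values, the following is a minimal sketch of the generic permutation-sampling estimator. It is not ShapG's graph-based sampling scheme, which the abstract does not specify in detail; the names `shapley_sampling` and `value_fn` are hypothetical, and `value_fn` is assumed to score any coalition of features (for example, by the accuracy of a model restricted to them).

```python
import random

def shapley_sampling(features, value_fn, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over random feature orderings (generic Monte Carlo estimator)."""
    rng = random.Random(seed)
    totals = {f: 0.0 for f in features}
    for _ in range(n_permutations):
        order = list(features)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for f in order:
            coalition.add(f)
            cur = value_fn(frozenset(coalition))
            totals[f] += cur - prev  # marginal contribution of f
            prev = cur
    return {f: t / n_permutations for f, t in totals.items()}

# Toy additive game: each feature contributes a fixed weight, so the
# Shapley value of a feature is exactly its weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
estimate = shapley_sampling(list(weights), lambda s: sum(weights[f] for f in s))
print(estimate)
```

Each permutation costs one value-function call per feature, so the total cost grows linearly in the number of sampled permutations rather than exponentially in the number of features.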

[AI-128] Deep Frequency Derivative Learning for Non-stationary Time Series Forecasting

Link: https://arxiv.org/abs/2407.00502
Authors: Wei Fan, Kun Yi, Hangting Ye, Zhiyuan Ning, Qi Zhang, Ning An
Keywords: time series, time series forecasting, inevitable for models, models to face, series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted by IJCAI 2024

Abstract:While most time series are non-stationary, it is inevitable for models to face the distribution shift issue in time series forecasting. Existing solutions manipulate statistical measures (usually mean and std.) to adjust time series distribution. However, these operations can be theoretically seen as the transformation towards zero frequency component of the spectrum which cannot reveal full distribution information and would further lead to information utilization bottleneck in normalization, thus hindering forecasting performance. To address this problem, we propose to utilize the whole frequency spectrum to transform time series to make full use of data distribution from the frequency perspective. We present a deep frequency derivative learning framework, DERITS, for non-stationary time series forecasting. Specifically, DERITS is built upon a novel reversible transformation, namely Frequency Derivative Transformation (FDT) that makes signals derived in the frequency domain to acquire more stationary frequency representations. Then, we propose the Order-adaptive Fourier Convolution Network to conduct adaptive frequency filtering and learning. Furthermore, we organize DERITS as a parallel-stacked architecture for the multi-order derivation and fusion for forecasting. Finally, we conduct extensive experiments on several datasets which show the consistent superiority in both time series forecasting and shift alleviation.

[AI-129] Aeroengine performance prediction using a physical-embedded data-driven method

Link: https://arxiv.org/abs/2407.00501
Authors: Tong Mo, Shiran Dai, An Fu, Xiaomeng Zhu, Shuxiao Li
Keywords: Accurate and efficient, optimization endeavours, paramount importance, Accurate, efficient prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Accurate and efficient prediction of aeroengine performance is of paramount importance for engine design, maintenance, and optimization endeavours. However, existing methodologies often struggle to strike an optimal balance among predictive accuracy, computational efficiency, modelling complexity, and data dependency. To address these challenges, we propose a strategy that synergistically combines domain knowledge from both the aeroengine and neural network realms to enable real-time prediction of engine performance parameters. Leveraging aeroengine domain knowledge, we judiciously design the network structure and regulate the internal information flow. Concurrently, drawing upon neural network domain expertise, we devise four distinct feature fusion methods and introduce an innovative loss function formulation. To rigorously evaluate the effectiveness and robustness of our proposed strategy, we conduct comprehensive validation across two distinct datasets. The empirical results demonstrate: (1) the evident advantages of our tailored loss function; (2) our model’s ability to maintain equal or superior performance with a reduced parameter count; (3) our model’s reduced data dependency compared to generalized neural network architectures; and (4) our model’s greater interpretability compared to traditional black-box machine learning methods.

[AI-130] Intrinsic PAPR for Point-level 3D Scene Albedo and Shading Editing

Link: https://arxiv.org/abs/2407.00500
Authors: Alireza Moazeni, Shichong Peng, Ke Li
Keywords: multi-view RGB images, RGB images, multi-view RGB, Intrinsic PAPR, neural rendering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Recent advancements in neural rendering have excelled at novel view synthesis from multi-view RGB images. However, they often lack the capability to edit the shading or colour of the scene at a detailed point-level, while ensuring consistency across different viewpoints. In this work, we address the challenge of point-level 3D scene albedo and shading editing from multi-view RGB images, focusing on detailed editing at the point-level rather than at a part or global level. While prior works based on volumetric representation such as NeRF struggle with achieving 3D consistent editing at the point level, recent advancements in point-based neural rendering show promise in overcoming this challenge. We introduce “Intrinsic PAPR”, a novel method based on the recent point-based neural rendering technique Proximity Attention Point Rendering (PAPR). Unlike other point-based methods that model the intrinsic decomposition of the scene, our approach does not rely on complicated shading models or simplistic priors that may not universally apply. Instead, we directly model scene decomposition into albedo and shading components, leading to better estimation accuracy. Comparative evaluations against the latest point-based inverse rendering methods demonstrate that Intrinsic PAPR achieves higher-quality novel view rendering and superior point-level albedo and shading editing.

[AI-131] ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

Link: https://arxiv.org/abs/2407.00499
Authors: Zhiyuan Wang, Jinhao Duan, Lu Cheng, Yue Zhang, Qingni Wang, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Keywords: natural language generation, recent large language, large language models, open-ended NLG tasks, language generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 13 pages, 9 figures, 6 tables

Abstract:Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of the recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model’s unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains, while the small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
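As background on how conformal prediction turns a heuristic uncertainty score into a prediction set, here is a minimal sketch of standard split conformal calibration. It is not ConU's criterion: the self-consistency-based measure from the abstract is replaced by an arbitrary nonconformity score (lower means more plausible), and the function names are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of the
    nonconformity scores of known-correct calibration answers."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep everything
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every sampled answer whose score does not exceed the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]

# Hypothetical calibration scores for nine questions with known answers.
calibration = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
threshold = conformal_threshold(calibration, alpha=0.5)

# Hypothetical scores for three sampled answers to a new question.
print(prediction_set({"A": 0.2, "B": 0.5, "C": 0.8}, threshold))
```

Under the usual exchangeability assumption, sets built this way contain a correct answer with probability at least 1 - alpha, which is the kind of coverage guarantee the abstract refers to.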

[AI-132] A Two-stage Reinforcement Learning-based Approach for Multi-entity Task Allocation

Link: https://arxiv.org/abs/2407.00496
Authors: Aicheng Gong, Kai Yang, Jiafei Lyu, Xiu Li
Keywords: key combinatorial optimization, combinatorial optimization problem, crucial for modern, resource scheduling, key combinatorial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Task allocation is a key combinatorial optimization problem, crucial for modern applications such as multi-robot cooperation and resource scheduling. Decision makers must allocate entities to tasks reasonably across different scenarios. However, traditional methods assume static attributes and numbers of tasks and entities, often relying on dynamic programming and heuristic algorithms for solutions. In reality, task allocation resembles Markov decision processes, with dynamically changing task and entity attributes. Thus, algorithms must dynamically allocate tasks based on their states. To address this issue, we propose a two-stage task allocation algorithm based on similarity, utilizing reinforcement learning to learn allocation strategies. The proposed pre-assign strategy allows entities to preselect appropriate tasks, effectively avoiding local optima and thereby better finding the optimal allocation. We also introduce an attention mechanism and a hyperparameter network structure to adapt to the changing number and attributes of entities and tasks, enabling our network structure to generalize to new tasks. Experimental results across multiple environments demonstrate that our algorithm effectively addresses the challenges of dynamic task allocation in practical applications. Compared to heuristic algorithms like genetic algorithms, our reinforcement learning approach better solves dynamic allocation problems and achieves zero-shot generalization to new tasks with good performance. The code is available at this https URL.

[AI-133] PFME: A Modular Approach for Fine-grained Hallucination Detection and Editing of Large Language Models

Link: https://arxiv.org/abs/2407.00488
Authors: Kunquan Deng, Zeyu Huang, Chen Li, Chenghua Lin, Min Gao, Wenge Rong
Keywords: Large Language Models, Large Language, producing inaccurate content, risk producing inaccurate, Fine-grained Hallucination Detection
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) excel in fluency but risk producing inaccurate content, called “hallucinations.” This paper outlines a standardized process for categorizing fine-grained hallucination types and proposes an innovative framework–the Progressive Fine-grained Model Editor (PFME)–specifically designed to detect and correct fine-grained hallucinations in LLMs. PFME consists of two collaborative modules: the Real-time Fact Retrieval Module and the Fine-grained Hallucination Detection and Editing Module. The former identifies key entities in the document and retrieves the latest factual evidence from credible sources. The latter further segments the document into sentence-level text and, based on relevant evidence and previously edited context, identifies, locates, and edits each sentence’s hallucination type. Experimental results on FavaBench and FActScore demonstrate that PFME outperforms existing methods in fine-grained hallucination detection tasks. Particularly, when using the Llama3-8B-Instruct model, PFME’s performance in fine-grained hallucination detection with external knowledge assistance improves by 8.7 percentage points (pp) compared to ChatGPT. In editing tasks, PFME further enhances the FActScore of FActScore-Alpaca13B and FActScore-ChatGPT datasets, increasing by 16.2pp and 4.6pp, respectively.

[AI-134] Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition

Link: https://arxiv.org/abs/2407.00482
Authors: Barproda Halder, Faisal Hamman, Pasan Dissanayake, Qiuyi Zhang, Ilia Sucholutsky, Sanghamitra Dutta
Keywords: Spurious patterns refer, Partial Information Decomposition, unique information, causally related, patterns refer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Information Theory (cs.IT)
Notes: Accepted at ICML 2024 Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Abstract:Spurious patterns refer to a mathematical association between two or more variables in a dataset that are not causally related. However, this notion of spuriousness, which is usually introduced due to sampling biases in the dataset, has classically lacked a formal definition. To address this gap, this work presents the first information-theoretic formalization of spuriousness in a dataset (given a split of spurious and core features) using a mathematical framework called Partial Information Decomposition (PID). Specifically, we disentangle the joint information content that the spurious and core features share about another target variable (e.g., the prediction label) into distinct components, namely unique, redundant, and synergistic information. We propose the use of unique information, with roots in Blackwell Sufficiency, as a novel metric to formally quantify dataset spuriousness and derive its desirable properties. We empirically demonstrate how higher unique information in the spurious features in a dataset could lead a model into choosing the spurious features over the core features for inference, often having low worst-group-accuracy. We also propose a novel autoencoder-based estimator for computing unique information that is able to handle high-dimensional image data. Finally, we also show how this unique information in the spurious feature is reduced across several dataset-based spurious-pattern-mitigation techniques such as data reweighting and varying levels of background mixing, demonstrating a novel tradeoff between unique information (spuriousness) and worst-group-accuracy.

[AI-135] Knowledge-Aware Parsimony Learning: A Perspective from Relational Graphs

Link: https://arxiv.org/abs/2407.00478
Authors: Quanming Yao, Yongqi Zhang, Yaqing Wang, Nan Yin, James Kwok, Qiang Yang
Keywords: strategy that involves, involves the brute-force, training dataset, dataset and learnable, prevalent approach
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The scaling law, a strategy that involves the brute-force scaling of the training dataset and learnable parameters, has become a prevalent approach for developing stronger learning models. In this paper, we examine its rationale in terms of learning from relational graphs. We demonstrate that directly adhering to such a scaling law does not necessarily yield stronger models due to architectural incompatibility and representation bottlenecks. To tackle this challenge, we propose a novel framework for learning from relational graphs via knowledge-aware parsimony learning. Our method draws inspiration from the duality between data and knowledge inherent in these graphs. Specifically, we first extract knowledge (like symbolic logic and physical laws) during the learning process, and then apply combinatorial generalization to the task at hand. This extracted knowledge serves as the “building blocks” for achieving parsimony learning. By applying this philosophy to architecture, parameters, and inference, we can effectively achieve versatile, sample-efficient, and interpretable learning. Experimental results show that our proposed framework surpasses methods that strictly follow the traditional scaling-up roadmap. This highlights the importance of incorporating knowledge in the development of next-generation learning technologies.

[AI-136] MH-pFLGB: Model Heterogeneous personalized Federated Learning via Global Bypass for Medical Image Analysis

Link: https://arxiv.org/abs/2407.00474
Authors: Luyuan Xie, Manqing Lin, ChenMing Xu, Tianyu Luan, Zhipeng Zeng, Wenjun Qian, Cong Li, Yuejian Fang, Qingni Shen, Zhonghai Wu
Keywords: training data privacy, protect training data, medical artificial intelligence, federated learning, artificial intelligence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: arXiv admin note: text overlap with arXiv:2405.06822

Abstract:In the evolving application of medical artificial intelligence, federated learning is notable for its ability to protect training data privacy. Federated learning facilitates collaborative model development without the need to share local data from healthcare institutions. Yet, the statistical and system heterogeneity among these institutions poses substantial challenges, which affects the effectiveness of federated learning and hampers the exchange of information between clients. To address these issues, we introduce a novel approach, MH-pFLGB, which employs a global bypass strategy to mitigate the reliance on public datasets and navigate the complexities of non-IID data distributions. Our method enhances traditional federated learning by integrating a global bypass model, which not only shares information among the clients but also serves as part of the network to enhance the performance on each client. Additionally, MH-pFLGB provides a feature fusion module to better combine the local and global features. We validate MH-pFLGB’s effectiveness and adaptability through extensive testing on different medical tasks, demonstrating superior performance compared to existing state-of-the-art methods.

[AI-137] MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Link: https://arxiv.org/abs/2407.00468
Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
Keywords: Large Multimodal Models, exhibit impressive cross-modal, impressive cross-modal understanding, Multimodal Models, Large Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 21 pages, code released at this https URL, homepage at this https URL

Abstract:Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
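The abstract does not define its "more rigorous metrics", so the sketch below only illustrates one plausible reading of a triplet-based score: a triplet counts as solved only when the original, perception, and knowledge anchor questions are all answered correctly. The function names and this scoring rule are assumptions, not the benchmark's published definition.

```python
def per_question_accuracy(triplets):
    """Fraction of individual questions answered correctly."""
    answers = [ok for triplet in triplets for ok in triplet]
    return sum(answers) / len(answers)

def triplet_accuracy(triplets):
    """Fraction of triplets in which all three questions are correct
    (assumed stricter metric; guessing one question right is not enough)."""
    return sum(all(triplet) for triplet in triplets) / len(triplets)

# Each tuple: (original, perception, knowledge anchor) correctness.
results = [
    (True, True, True),
    (True, False, True),   # right answer without perceiving the image
    (True, True, False),
    (False, False, False),
]
print(per_question_accuracy(results), triplet_accuracy(results))

# Sanity check of the dataset size quoted in the abstract.
print(2138 * 3)
```

The gap between the two numbers shows why such a metric penalizes models that answer the original question correctly without the supporting perception or knowledge.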

[AI-138] BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science

Link: https://arxiv.org/abs/2407.00466
Authors: Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z. Li, Kaicheng Yu
Keywords: Pursuing artificial intelligence, Large Language Models, Pursuing artificial, artificial intelligence, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Pursuing artificial intelligence for biomedical science, a.k.a. AI Scientist, draws increasing attention, where one common approach is to build a copilot agent driven by Large Language Models (LLMs). However, to evaluate such systems, people either rely on direct Question-Answering (QA) to the LLM itself, or in a biomedical experimental manner. How to precisely benchmark biomedical agents from an AI Scientist perspective remains largely unexplored. To this end, we draw inspiration from one of the most important abilities of scientists, understanding the literature, and introduce BioKGBench. In contrast to traditional evaluation benchmarks that focus only on factual QA, where the LLMs are known to have hallucination issues, we first disentangle “Understanding Literature” into two atomic abilities, i) “Understanding” the unstructured text from research papers by performing scientific claim verification, and ii) Ability to interact with structured Knowledge-Graph Question-Answering (KGQA) as a form of “Literature” grounding. We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation (RAG) to identify the factual errors of existing large-scale knowledge graph databases. We collect over two thousand data for two atomic tasks and 225 high-quality annotated data for the agent task. Surprisingly, we discover that state-of-the-art agents, both daily scenarios and biomedical ones, have either failed or inferior performance on our benchmark. We then introduce a simple yet effective baseline, dubbed BKGAgent. On the widely used popular knowledge graph, we discover over 90 factual errors which provide scenarios for agents to make discoveries and demonstrate the effectiveness of our approach. The code and data are available at this https URL.

[AI-139] Open-Source Conversational AI with SpeechBrain 1.0

Link: https://arxiv.org/abs/2407.00463
Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, Jarod Duret, Salima Mdhaffar, Gaelle Laperriere, Renato De Mori, Yannick Esteve
Keywords: http URL promotes, URL promotes transparency, open-source Conversational, http URL, URL promotes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Notes: Submitted to JMLR (Machine Learning Open Source Software)

Abstract:SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete “recipes” of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

[AI-140] pFLFE: Cross-silo Personalized Federated Learning via Feature Enhancement on Medical Image Segmentation

Link: https://arxiv.org/abs/2407.00462
Authors: Luyuan Xie, Manqing Lin, Siyuan Liu, ChenMing Xu, Tianyu Luan, Cong Li, Yuejian Fang, Qingni Shen, Zhonghai Wu
Keywords: overcome data scarcity, utilizing varied data, personalized cross-silo federated, cross-silo federated learning, Personalized Federated Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In medical image segmentation, personalized cross-silo federated learning (FL) is becoming popular for utilizing varied data across healthcare settings to overcome data scarcity and privacy concerns. However, existing methods often suffer from client drift, leading to inconsistent performance and delayed training. We propose a new framework, Personalized Federated Learning via Feature Enhancement (pFLFE), designed to mitigate these challenges. pFLFE consists of two main stages: feature enhancement and supervised learning. The first stage improves differentiation between foreground and background features, and the second uses these enhanced features for learning from segmentation masks. We also design an alternative training approach that requires fewer communication rounds without compromising segmentation quality, even with limited communication resources. Through experiments on three medical segmentation tasks, we demonstrate that pFLFE outperforms the state-of-the-art methods.

[AI-141] A Rule-Based Behaviour Planner for Autonomous Driving

Link: https://arxiv.org/abs/2407.00460
Authors: Bouchard Frederic, Sedwards Sean, Czarnecki Krzysztof
Keywords: require highly sophisticated, highly sophisticated decision-making, vehicles require highly, Autonomous vehicles require, require highly
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: Use this https URL for citations

Abstract:Autonomous vehicles require highly sophisticated decision-making to determine their motion. This paper describes how such functionality can be achieved with a practical rule engine learned from expert driving decisions. We propose an algorithm to create and maintain a rule-based behaviour planner, using a two-layer rule-based theory. The first layer determines a set of feasible parametrized behaviours, given the perceived state of the environment. From these, a resolution function chooses the most conservative high-level maneuver. The second layer then reconciles the parameters into a single behaviour. To demonstrate the practicality of our approach, we report results of its implementation in a level-3 autonomous vehicle and its field test in an urban environment.
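The two-layer structure described in the abstract can be sketched as follows. The rules, the conservatism ordering, and the reconciliation step (taking the lowest proposed target speed) are all hypothetical choices for illustration; the paper's learned rule engine is not reproduced here.

```python
# Maneuvers ordered from most to least conservative (assumed ordering).
CONSERVATISM = ["stop", "yield", "follow", "cruise"]

def rule_speed_limit(state):
    """Always feasible: cruise at the posted speed limit."""
    return ("cruise", {"target_speed": state["speed_limit"]})

def rule_lead_vehicle(state):
    """Follow a lead vehicle that is closer than 30 m."""
    distance = state.get("lead_distance")
    if distance is not None and distance < 30.0:
        return ("follow", {"target_speed": state["lead_speed"]})
    return None

def rule_red_light(state):
    """Stop at a red traffic light."""
    if state.get("light") == "red":
        return ("stop", {"target_speed": 0.0})
    return None

RULES = [rule_speed_limit, rule_lead_vehicle, rule_red_light]

def plan(state, rules=RULES):
    # Layer 1: collect every feasible parametrized behaviour.
    candidates = [b for b in (rule(state) for rule in rules) if b is not None]
    # Resolution: choose the most conservative high-level maneuver.
    maneuver = min((m for m, _ in candidates), key=CONSERVATISM.index)
    # Layer 2: reconcile parameters into one behaviour (assumed: lowest speed).
    speed = min(params["target_speed"] for _, params in candidates)
    return maneuver, {"target_speed": speed}

state = {"speed_limit": 13.9, "lead_distance": 20.0, "lead_speed": 8.0,
         "light": "green"}
print(plan(state))
```

Keeping the maneuver choice separate from parameter reconciliation means a new rule can be added without touching the resolution logic, which is one reason rule-based planners are easy to maintain and audit.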

[AI-142] Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models

Link: https://arxiv.org/abs/2407.00456
Authors: Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Zibin Zheng
Keywords: software development process, Large language models, code, coding style, Code LLMs
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: 13 pages, 14 figures

Abstract:Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream Code LLMs and the code written by human developers, and summarize a taxonomy of coding style inconsistencies. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by Code LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers have different coding styles. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem.

[AI-143] KHNNs: hypercomplex neural networks computations via Keras using TensorFlow and PyTorch

Link: https://arxiv.org/abs/2407.00452
Authors: Agnieszka Niemczynowicz, Radosław Antoni Kycia
Keywords: real numbers perform, Neural networks, advanced algebras, algebras than real, real numbers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Neural networks used in computations with more advanced algebras than real numbers perform better in some applications. However, there is no general framework for constructing hypercomplex neural networks. We propose a library integrated with Keras that can do computations within TensorFlow and PyTorch. It provides Dense and Convolutional 1D, 2D, and 3D layer architectures.

[AI-144] Fully tensorial approach to hypercomplex neural networks

Link: https://arxiv.org/abs/2407.00449
Authors: Agnieszka Niemczynowicz, Radosław Antoni Kycia
Keywords: Fully tensorial theory, Fully tensorial, hypercomplex neural networks, theory of hypercomplex, Fully
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: 10 pages

Abstract:A fully tensorial theory of hypercomplex neural networks is given. The key point is to observe that the algebra multiplication can be represented as a rank three tensor. This approach is attractive for neural network libraries that support effective tensorial operations.
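The observation that algebra multiplication is a rank-three tensor can be made concrete with a small pure-Python sketch. For basis elements e_i, the product is written as e_i e_j = Σ_k A[i][j][k] e_k, so multiplying two elements is a contraction of A with their coefficient vectors. The helper names below are hypothetical, and a tensor library would replace the nested sums with a single contraction.

```python
def tensor_from_table(table):
    """Build the structure tensor A from a multiplication table where
    table[i][j] = (sign, k) means e_i * e_j = sign * e_k."""
    n = len(table)
    A = [[[0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sign, k = table[i][j]
            A[i][j][k] = sign
    return A

def multiply(A, x, y):
    """Contract the rank-three tensor: (x*y)_k = sum_ij A[i][j][k] x_i y_j."""
    n = len(A)
    return [sum(A[i][j][k] * x[i] * y[j] for i in range(n) for j in range(n))
            for k in range(n)]

# Complex numbers, basis (1, i).
COMPLEX = tensor_from_table([
    [(1, 0), (1, 1)],
    [(1, 1), (-1, 0)],
])

# Quaternions, basis (1, i, j, k).
QUATERNION = tensor_from_table([
    [(1, 0), (1, 1), (1, 2), (1, 3)],
    [(1, 1), (-1, 0), (1, 3), (-1, 2)],
    [(1, 2), (-1, 3), (-1, 0), (1, 1)],
    [(1, 3), (1, 2), (-1, 1), (-1, 0)],
])

print(multiply(COMPLEX, [1, 2], [3, 4]))                 # (1+2i)(3+4i)
print(multiply(QUATERNION, [0, 1, 0, 0], [0, 0, 1, 0]))  # i * j
```

Because the algebra enters only through A, the same `multiply` routine serves any finite-dimensional algebra, which is what makes the formulation convenient for tensor-based libraries.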

[AI-145] Time Series Clustering with General State Space Models via Stochastic Variational Inference

Link: https://arxiv.org/abs/2407.00429
Authors: Ryoichi Ishizuka, Takashi Imai, Kaoru Kawamoto
Keywords: model-based time series, time series, time series models, mixtures of general, time series clustering
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 4 figures

Abstract:In this paper, we propose a novel method of model-based time series clustering with mixtures of general state space models (MSSMs). Each component of MSSMs is associated with each cluster. An advantage of the proposed method is that it enables the use of time series models appropriate to the specific time series. This not only improves clustering and prediction accuracy but also enhances the interpretability of the estimated parameters. The parameters of the MSSMs are estimated using stochastic variational inference, a subtype of variational inference. The proposed method estimates the latent variables of an arbitrary state space model by using neural networks with a normalizing flow as a variational estimator. The number of clusters can be estimated using the Bayesian information criterion. In addition, to prevent MSSMs from converging to the local optimum, we propose several optimization tricks, including an additional penalty term called entropy annealing. Experiments on simulated datasets show that the proposed method is effective for clustering, parameter estimation, and estimating the number of clusters.

[AI-146] On the Complexity of Learning to Cooperate with Populations of Socially Rational Agents

Link: https://arxiv.org/abs/2407.00419
Authors: Robert Loftin, Saptarashmi Bandyopadhyay, Mustafa Mert Çelikok
Keywords: Artificially intelligent agents, Artificially intelligent, intelligent agents deployed, ability to reliably, real-world will require
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:Artificially intelligent agents deployed in the real world will require the ability to reliably cooperate with humans (as well as other, heterogeneous AI agents). To provide formal guarantees of successful cooperation, we must make some assumptions about how partner agents could plausibly behave. Any realistic set of assumptions must account for the fact that other agents may be just as adaptable as our agent is. In this work, we consider the problem of cooperating with a population of agents in a finitely-repeated, two player general-sum matrix game with private utilities. Two natural assumptions in such settings are that: 1) all agents in the population are individually rational learners, and 2) when any two members of the population are paired together, with high-probability they will achieve at least the same utility as they would under some Pareto efficient equilibrium strategy. Our results first show that these assumptions alone are insufficient to ensure zero-shot cooperation with members of the target population. We therefore consider the problem of learning a strategy for cooperating with such a population using prior observations of its members interacting with one another. We provide upper and lower bounds on the number of samples needed to learn an effective cooperation strategy. Most importantly, we show that these bounds can be much stronger than those arising from a “naive” reduction of the problem to one of imitation learning.

[AI-147] Explainability of Machine Learning Models under Missing Data

Link: https://arxiv.org/abs/2407.00411
Authors: Tuan L. Vo, Thu Nguyen, Hugo L. Hammer, Michael A. Riegler, Pal Halvorsen
Keywords: Explainable Artificial Intelligence, Missing data, significantly impair model, impair model performance, Shapley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Missing data is a prevalent issue that can significantly impair model performance and interpretability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on the calculation of Shapley values, a popular technique for interpreting complex machine learning models. We compare different imputation strategies and assess their impact on feature importance and interaction as determined by Shapley values. Moreover, we also theoretically analyze the effects of missing values on Shapley values. Importantly, our findings reveal that the choice of imputation method can introduce biases that could lead to changes in the Shapley values, thereby affecting the interpretability of the model. Moreover, a lower test prediction mean square error (MSE) may not imply a lower MSE in Shapley values, and vice versa. Also, while XGBoost is a method that could handle missing data directly, using XGBoost directly on missing data can seriously affect interpretability compared to imputing the data before training XGBoost. This study provides a comprehensive evaluation of imputation methods in the context of model interpretation, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models.

[AI-148] Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Link: https://arxiv.org/abs/2407.00402
Authors: Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty
Keywords: language models’ capabilities, Improvements in language, making long-context evaluation, language models’, models’ capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Improvements in language models’ capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of “long-context”, defined simply by the total length of the model’s input, including - for example - Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, is severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.

[AI-149] PUZZLES: A Benchmark for Neural Algorithmic Reasoning

Link: https://arxiv.org/abs/2407.00401
Authors: Benjamin Estermann, Luca A. Lanzendörfer, Yannick Niedermayr, Roger Wattenhofer
Keywords: fundamental cognitive ability, decision-making processes, fundamental cognitive, cognitive ability, ability that plays
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Algorithmic reasoning is a fundamental cognitive ability that plays a pivotal role in problem-solving and decision-making processes. Reinforcement Learning (RL) has demonstrated remarkable proficiency in tasks such as motor control, handling perceptual input, and managing stochastic environments. These advancements have been enabled in part by the availability of benchmarks. In this work we introduce PUZZLES, a benchmark based on Simon Tatham’s Portable Puzzle Collection, aimed at fostering progress in algorithmic and logical reasoning in RL. PUZZLES contains 40 diverse logic puzzles of adjustable sizes and varying levels of complexity; many puzzles also feature a diverse set of additional configuration parameters. The 40 puzzles provide detailed information on the strengths and generalization capabilities of RL agents. Furthermore, we evaluate various RL algorithms on PUZZLES, providing baseline comparisons and demonstrating the potential for future research. All the software, including the environment, is available at this https URL.

[AI-150] A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model

Link: https://arxiv.org/abs/2407.00396
Authors: Kausik Bhattacharya, Anubhab Majumder, Amaresh Chakrabarti
Keywords: SAPPhIRE model, Large Language Model, stimulus in design, model, inspirational stimulus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to generate technical content accurately relevant to the SAPPhIRE model of causality using a Large Language Model, also called LLM. This paper, which is the first part of the two-part research, presents a method for hallucination suppression using Retrieval Augmented Generating with LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The result from this research shows that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.

[AI-151] Multi-task multi-constraint differential evolution with elite-guided knowledge transfer for coal mine integrated energy system dispatching

Link: https://arxiv.org/abs/2407.00386
Authors: Canyun Dai, Xiaoyan Sun, Hejuan Hu, Wei Song, Yong Zhang, Dunwei Gong
Keywords: multiobjective evolutionary algorithms, high dimensionality, constrained multiobjective evolutionary, challenging due, due to high
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The dispatch optimization of a coal mine integrated energy system is challenging due to high dimensionality, strong coupling constraints, and multiple objectives. Existing constrained multiobjective evolutionary algorithms struggle with locating multiple small and irregular feasible regions, making them inapplicable to this problem. To address this issue, we here develop a multitask evolutionary algorithm framework that incorporates the dispatch correlated domain knowledge to effectively deal with strong constraints and multiobjective optimization. A possible evolutionary multitask construction strategy based on complex constraint relationship analysis and handling, i.e., constraint coupled spatial decomposition, constraint strength classification and constraint handling technique, is first explored. Within the multitask evolutionary optimization framework, two strategies, i.e., an elite guided knowledge transfer by designing a special crowding distance mechanism to select dominant individuals from each task, and an adaptive neighborhood technology based mutation to effectively balance the diversity and convergence of each optimized task for the differential evolution algorithm, are further developed. The performance of the proposed algorithm in feasibility, convergence, and diversity is demonstrated in a case study of a coal mine integrated energy system by comparing with CPLEX solver and seven constrained multiobjective evolutionary algorithms.

[AI-152] FANFOLD: Graph Normalizing Flows-driven Asymmetric Network for Unsupervised Graph-Level Anomaly Detection

Link: https://arxiv.org/abs/2407.00383
Authors: Rui Cao, Shijie Xue, Jindong Li, Qi Wang, Yi Chang
Keywords: attracted increasing interest, Unsupervised graph-level anomaly, graph-level anomaly detection, increasing interest due, anomaly detection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unsupervised graph-level anomaly detection (UGAD) has attracted increasing interest due to its widespread application. In recent studies, knowledge distillation-based methods have been widely used in unsupervised anomaly detection to improve model efficiency and generalization. However, the inherent symmetry between the source (teacher) and target (student) networks typically results in consistent outputs across both architectures, making it difficult to distinguish abnormal graphs from normal graphs. Also, existing methods mainly rely on graph features to distinguish anomalies, which may be unstable with complex and diverse data and fail to capture the essence that differentiates normal graphs from abnormal ones. In this work, we propose a Graph Normalizing Flows-driven Asymmetric Network For Unsupervised Graph-Level Anomaly Detection (FANFOLD in short). We introduce normalizing flows to unsupervised graph-level anomaly detection due to their successful application and superior quality in learning the underlying distribution of samples. Specifically, we adopt the knowledge distillation technique and apply normalizing flows on the source network, achieving the asymmetric network. In the training stage, FANFOLD transforms the original distribution of normal graphs to a standard normal distribution. During inference, FANFOLD computes the anomaly score using the source-target loss to discriminate between normal and anomalous graphs. We conduct extensive experiments on 15 datasets of different fields with 9 baseline methods to validate the superiority of FANFOLD.

[AI-153] UM2N: Towards Universal Mesh Movement Networks

Link: https://arxiv.org/abs/2407.00382
Authors: Mingrui Zhang, Chunyang Wang, Stephan Kramer, Joseph G. Wallwork, Siyi Li, Jiancheng Liu, Xiang Chen, Matthew D. Piggott
Keywords: Partial Differential Equations, Solving complex Partial, complex Partial Differential, Differential Equations, Partial Differential
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Solving complex Partial Differential Equations (PDEs) accurately and efficiently is an essential and challenging problem in all scientific and engineering disciplines. Mesh movement methods provide the capability to improve the accuracy of the numerical solution without increasing the overall mesh degree of freedom count. Conventional sophisticated mesh movement methods are extremely expensive and struggle to handle scenarios with complex boundary geometries. However, existing learning-based methods require re-training from scratch given a different PDE type or boundary geometry, which limits their applicability, and also often suffer from robustness issues in the form of inverted elements. In this paper, we introduce the Universal Mesh Movement Network (UM2N), which – once trained – can be applied in a non-intrusive, zero-shot manner to move meshes with different size distributions and structures, for solvers applicable to different PDE types and boundary geometries. UM2N consists of a Graph Transformer (GT) encoder for extracting features and a Graph Attention Network (GAT) based decoder for moving the mesh. We evaluate our method on advection and Navier-Stokes based examples, as well as a real-world tsunami simulation case. Our method outperforms existing learning-based mesh movement methods in terms of the benchmarks described above. In comparison to the conventional sophisticated Monge-Ampère PDE-solver based method, our approach not only significantly accelerates mesh movement, but also proves effective in scenarios where the conventional method fails. Our project page is at this https URL.

[AI-154] GraphArena: Benchmarking Large Language Models on Graph Computational Problems

Link: https://arxiv.org/abs/2407.00379
Authors: Jianheng Tang, Qifan Zhang, Yuhan Li, Jia Li
Keywords: Large Language Models, Large Language, arms race, Language Models, examine their progresses
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The “arms race” of Large Language Models (LLMs) demands novel, challenging, and diverse benchmarks to faithfully examine their progress. We introduce GraphArena, a benchmarking tool designed to evaluate LLMs on graph computational problems using million-scale real-world graphs from diverse scenarios such as knowledge graphs, social networks, and molecular structures. GraphArena offers a suite of 10 computational tasks, encompassing four polynomial-time (e.g., Shortest Distance) and six NP-complete challenges (e.g., Travelling Salesman Problem). It features a rigorous evaluation framework that classifies LLM outputs as correct, suboptimal (feasible but not optimal), or hallucinatory (properly formatted but infeasible). Evaluation of 10 leading LLMs, including GPT-4o and LLaMA3-70B-Instruct, reveals that even top-performing models struggle with larger, more complex graph problems and exhibit hallucination issues. Despite the application of strategies such as chain-of-thought prompting, these issues remain unresolved. GraphArena contributes a valuable supplement to the existing LLM benchmarks and is open-sourced at this https URL.
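
The three-way verdict described in the abstract can be illustrated on a shortest-path task. The sketch below is a guess at the idea rather than GraphArena's actual evaluator: the function names, the unweighted adjacency-dict graph format, and the rule that a path is feasible only if it starts at the source, ends at the target, and uses existing edges are all assumptions made for illustration.

```python
from collections import deque


def shortest_distance(adj, source, target):
    """Breadth-first search on an unweighted graph; returns hop count or None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def classify_path(adj, source, target, path):
    """Label a proposed path as 'correct', 'suboptimal', or 'hallucinatory'."""
    feasible = (
        len(path) >= 1
        and path[0] == source
        and path[-1] == target
        # Every consecutive pair must be an actual edge of the graph.
        and all(b in adj.get(a, ()) for a, b in zip(path, path[1:]))
    )
    if not feasible:
        return "hallucinatory"
    if len(path) - 1 == shortest_distance(adj, source, target):
        return "correct"
    return "suboptimal"


if __name__ == "__main__":
    graph = {"a": {"b", "d"}, "b": {"a", "d"}, "d": {"a", "b", "e"}, "e": {"d"}}
    for candidate in (["a", "d", "e"], ["a", "b", "d", "e"], ["a", "e"]):
        print(candidate, classify_path(graph, "a", "e", candidate))
```

Separating feasibility from optimality is what lets this kind of evaluation distinguish a model that answers worse than optimal from one that invents edges.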

[AI-155] The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention

Link: https://arxiv.org/abs/2407.00377
Authors: Yixin Wan, Di Wu, Haoran Wang, Kai-Wei Chang
Keywords: models depicting individuals, commonly adopted, depicting individuals, Prompt-based, diversity interventions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:Prompt-based “diversity interventions” are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.

[AI-156] Axiomatization of Gradient Smoothing in Neural Networks

Link: https://arxiv.org/abs/2407.00371
Authors: Linjiang Zhou, Xiaochuan Shi, Chao Ma, Zepeng Wang
Keywords: neural networks explanation, neural networks, play a pivotal, pivotal role, networks explanation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Gradients play a pivotal role in neural network explanation. The inherent high dimensionality and structural complexity of neural networks result in the original gradients containing a significant amount of noise. While several approaches were proposed to reduce noise with smoothing, there is little discussion of the rationale behind smoothing gradients in neural networks. In this work, we propose a theoretical framework for gradient smoothing in neural networks based on function mollification and Monte Carlo integration. The framework intrinsically axiomatizes gradient smoothing and reveals the rationale of existing methods. Furthermore, we provide an approach to design new smoothing methods derived from the framework. By experimental measurement of several newly designed smoothing methods, we demonstrate the research potential of our framework.
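
As a rough illustration of how mollification and Monte Carlo integration combine, the sketch below averages the gradient of a one-dimensional function over Gaussian perturbations of the input. The Gaussian kernel, the test function |x|, and the names `grad_abs` and `smoothed_gradient` are assumptions chosen for this example, in the spirit of SmoothGrad-style estimators, and are not taken from the paper. For |x| the mollified gradient has the closed form erf(x / (sigma * sqrt(2))), which gives something to compare the estimate against.

```python
import math
import random


def grad_abs(x):
    """Gradient of f(x) = |x|: a step function that jumps at zero."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def smoothed_gradient(grad_fn, x, sigma, n_samples, rng):
    """Monte Carlo estimate of E[grad_fn(x + eps)] with eps ~ N(0, sigma^2),
    i.e. the gradient of f mollified by a Gaussian kernel."""
    total = 0.0
    for _ in range(n_samples):
        total += grad_fn(x + rng.gauss(0.0, sigma))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    for point in (-1.0, 0.0, 1.0):
        estimate = smoothed_gradient(grad_abs, point, 1.0, 20000, rng)
        exact = math.erf(point / math.sqrt(2.0))
        print(f"x={point:+.1f}  estimate={estimate:+.3f}  closed form={exact:+.3f}")
```

The raw gradient jumps between -1 and +1 at zero, while the smoothed estimate varies continuously; the kernel width `sigma` controls how much detail is averaged away.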

[AI-157] JSCDS: A Core Data Selection Method with Jason-Shannon Divergence for Caries RGB Images-Efficient Learning

Link: https://arxiv.org/abs/2407.00362
Authors: Peiliang Zhang, Yujia Tong, Chenghu Du, Chao Che, Yongjun Zhu
Keywords: preventing oral diseases, Core data selection, Deep learning-based RGB, data selection, data selection methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in KDD 2024 Workshop AIDSH

Abstract:Deep learning-based RGB caries detection improves the efficiency of caries identification and is crucial for preventing oral diseases. The performance of deep learning models depends on high-quality data and requires substantial training resources, making efficient deployment challenging. Core data selection, by eliminating low-quality and confusing data, aims to enhance training efficiency without significantly compromising model performance. However, distance-based data selection methods struggle to distinguish dependencies among high-dimensional caries data. To address this issue, we propose a Core Data Selection Method with Jensen-Shannon Divergence (JSCDS) for efficient caries image learning and caries classification. We describe the core data selection criterion as the distribution of samples in different classes. JSCDS calculates the cluster centers by sample embedding representation in the caries classification network and utilizes Jensen-Shannon Divergence to compute the mutual information between data samples and cluster centers, capturing nonlinear dependencies among high-dimensional data. The average mutual information is calculated to fit the above distribution, serving as the criterion for constructing the core set for model training. Extensive experiments on RGB caries datasets show that JSCDS outperforms other data selection methods in prediction performance and time consumption. Notably, JSCDS exceeds the performance of the full dataset model with only 50% of the core data, with its performance advantage becoming more pronounced at 70% of the core data.
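
For readers unfamiliar with the divergence the method is named after, here is a minimal sketch of the Jensen-Shannon divergence between two discrete probability distributions. It shows only the standard definition; how JSCDS turns sample embeddings and cluster centers into distributions is not stated in the abstract, so the inputs here are assumed to be already-normalized probability vectors, and the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL divergence of p and q
    from their midpoint m = (p + q) / 2. Symmetric, and within [0, 1] in bits."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    sample = [0.7, 0.2, 0.1]   # hypothetical distribution for one sample
    center = [0.6, 0.3, 0.1]   # hypothetical distribution for a cluster center
    print(js_divergence(sample, center))
```

Unlike the plain KL divergence, this quantity is symmetric and stays finite even when one distribution assigns zero probability where the other does not, which makes it convenient as a similarity score.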

[AI-158] From RAG to RICHES: Retrieval Interlaced with Sequence Generation

Link: https://arxiv.org/abs/2407.00361
Authors: Palak Jain, Livio Baldini Soares, Tom Kwiatkowski
Keywords: present RICHES, RICHES, conventional RAG systems, sequence generation tasks, sequence generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures, Preprint

Abstract:We present RICHES, a novel approach that interleaves retrieval with sequence generation tasks. RICHES offers an alternative to conventional RAG systems by eliminating the need for separate retriever and generator. It retrieves documents by directly decoding their contents, constrained on the corpus. Unifying retrieval with generation allows us to adapt to diverse new tasks via prompting alone. RICHES can work with any Instruction-tuned model, without additional training. It provides attributed evidence, supports multi-hop retrievals and interleaves thoughts to plan on what to retrieve next, all within a single decoding pass of the LLM. We demonstrate the strong performance of RICHES across ODQA tasks including attributed and multi-hop QA.
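
The idea of retrieving by decoding under a corpus constraint can be sketched with a toy example. Everything below is an illustrative assumption rather than the RICHES implementation: passages are whitespace-tokenized, a prefix trie restricts each step to tokens that continue some passage from its beginning, and a hypothetical `score_fn` stands in for the language model's next-token scores.

```python
END = "<eos>"  # hypothetical end-of-passage marker used only in this sketch


def build_trie(corpus):
    """Build a prefix trie (nested dicts) over whitespace-tokenized passages."""
    root = {}
    for passage in corpus:
        node = root
        for token in passage.split():
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def constrained_greedy_decode(trie, score_fn, max_len=20):
    """Greedy decoding where each step may only pick a token that keeps the
    output a prefix of some corpus passage."""
    node, output = trie, []
    for _ in range(max_len):
        allowed = list(node)
        if not allowed:
            break
        token = max(allowed, key=lambda t: score_fn(output, t))
        if token == END:
            break
        output.append(token)
        node = node[token]
    return output


if __name__ == "__main__":
    corpus = [
        "paris is the capital of france",
        "berlin is the capital of germany",
    ]
    query_terms = {"berlin", "germany"}

    def toy_score(prefix, token):
        # Stand-in for model scores: prefer tokens that appear in the query.
        return 1.0 if token in query_terms else 0.0

    print(" ".join(constrained_greedy_decode(build_trie(corpus), toy_score)))
```

Because the constraint is applied at every step, whatever the scorer prefers, the decoded text is guaranteed to exist in the corpus, which is what makes the output usable as attributed evidence.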

[AI-159] PhyTracker: An Online Tracker for Phytoplankton

Link: https://arxiv.org/abs/2407.00352
Authors: Yang Yu, Qingxuan Lv, Yuezun Li, Zhiqiang Wei, Junyu Dong
Keywords: understand marine ecological, marine ecological processes, requires efficient monitoring, aquatic ecosystems, requires efficient
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 11 figures

Abstract:Phytoplankton, a crucial component of aquatic ecosystems, requires efficient monitoring to understand marine ecological processes and environmental conditions. Traditional phytoplankton monitoring methods, relying on non-in situ observations, are time-consuming and resource-intensive, limiting timely analysis. To address these limitations, we introduce PhyTracker, an intelligent in situ tracking framework designed for automatic tracking of phytoplankton. PhyTracker overcomes significant challenges unique to phytoplankton monitoring, such as constrained mobility within water flow, inconspicuous appearance, and the presence of impurities. Our method incorporates three innovative modules: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. These modules enhance feature capture, differentiate between phytoplankton and impurities, and refine movement characteristics, respectively. Extensive experiments on the PMOT dataset validate the superiority of PhyTracker in phytoplankton tracking, and additional tests on the MOT dataset demonstrate its general applicability, outperforming conventional tracking methods. This work highlights key differences between phytoplankton and traditional objects, offering an effective solution for phytoplankton monitoring.

[AI-160] Korean Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering

Link: https://arxiv.org/abs/2407.00342
Authors: Kibeom Nam
Keywords: Aspect-Based Sentiment Analysis, Sentiment Analysis, Korean restaurant reviews, Investigations into Aspect-Based, Aspect-Based Sentiment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, EMNLP 2024 (submitted), DMLR@ICML 2024

Abstract:Investigations into Aspect-Based Sentiment Analysis (ABSA) for Korean restaurant reviews are notably lacking in the existing literature. Our research proposes an intuitive and effective framework for ABSA in low-resource languages such as Korean. It optimizes prediction labels by integrating translated benchmark and unlabeled Korean data. Using a model fine-tuned on translated data, we pseudo-labeled the actual Korean NLI set. Subsequently, we applied LaBSE and MSP-based filtering to this pseudo-NLI set as implicit feature, enhancing Aspect Category Detection and Polarity determination through additional training. Incorporating dual filtering, this model bridged dataset gaps, achieving positive results in Korean ABSA with minimal resources. Through additional data injection pipelines, our approach aims to utilize high-resource data and construct effective models within communities, whether corporate or individual, in low-resource language countries. Compared to English ABSA, our framework showed an approximately 3% difference in F1 scores and accuracy. We release the dataset and our code for Korean ABSA, at this link.

[AI-161] Teola: Towards End-to-End Optimization of LLM-based Applications

Link: https://arxiv.org/abs/2407.00326
Authors: Xin Tan, Yimin Jiang, Yitao Yang, Hong Xu
Keywords: Large language model, Large language, based applications consist, language model, non-LLM components
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query’s workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.

[AI-162] LiteSearch: Efficacious Tree Search for LLM

Link: https://arxiv.org/abs/2407.00320
Authors: Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Dian Yu, Haitao Mi, Jinsong Su, Dong Yu
Keywords: Monte Carlo Tree, Recent research suggests, dramatically boost LLM, mathematical reasoning tasks, Monte Carlo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.

[AI-163] UDC: A Unified Neural Divide-and-Conquer Framework for Large-Scale Combinatorial Optimization Problems

Link: https://arxiv.org/abs/2407.00312
Authors: Zhi Zheng, Changliang Zhou, Tong Xialiang, Mingxuan Yuan, Zhenkun Wang
Keywords: Single-stage neural combinatorial, small-scale combinatorial optimization, needing expert knowledge, achieved near-optimal results, combinatorial optimization solvers
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Single-stage neural combinatorial optimization solvers have achieved near-optimal results on various small-scale combinatorial optimization (CO) problems without needing expert knowledge. However, these solvers exhibit significant performance degradation when applied to large-scale CO problems. Recently, two-stage neural methods with divide-and-conquer strategies have shown superiorities in addressing large-scale CO problems. Nevertheless, the efficiency of these methods highly relies on problem-specific heuristics in either the divide or the conquer procedure, which limits their applicability to general CO problems. Moreover, these methods employ separate training schemes and ignore the interdependencies between the dividing and conquering strategies, which often leads to sub-optimal solutions. To tackle these drawbacks, this article develops a unified neural divide-and-conquer framework (i.e., UDC) for solving general large-scale CO problems. UDC offers a Divide-Conquer-Reunion (DCR) training method to eliminate the negative impact of a sub-optimal dividing policy. Employing a high-efficiency Graph Neural Network (GNN) for global dividing and a fixed-length sub-path solver for conquering sub-problems, the proposed UDC framework demonstrates extensive applicability, achieving superior performance in 10 representative large-scale CO problems.

[AI-164] PerAct2: A Perceiver Actor Framework for Bimanual Manipulation Tasks

Link: https://arxiv.org/abs/2407.00278
Authors: Markus Grotz,Mohit Shridhar,Tamim Asfour,Dieter Fox
Keywords: temporal coordination required, challenging due, due to precise, precise spatial, spatial and temporal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark comprising 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extended several state-of-the-art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent – PerAct2, which enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals. Project website with code is available at: this http URL

[AI-165] External Model Motivated Agents: Reinforcement Learning for Enhanced Environment Sampling

Link: https://arxiv.org/abs/2407.00264
Authors: Rishav Bhagat,Jonathan Balloch,Zhiyu Lin,Julia Kim,Mark Riedl
Keywords: Unlike reinforcement learning, humans remain capable, remain capable multitaskers, Unlike reinforcement, humans remain
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Unlike reinforcement learning (RL) agents, humans remain capable multitaskers in changing environments. In spite of only experiencing the world through their own observations and interactions, people know how to balance focusing on tasks with learning about how changes may affect their understanding of the world. This is possible by choosing to solve tasks in ways that are interesting and generally informative beyond just the current task. Motivated by this, we propose an agent influence framework for RL agents to improve the adaptation efficiency of external models in changing environments without any changes to the agent’s rewards. Our formulation is composed of two self-contained modules: interest fields and behavior shaping via interest fields. We implement an uncertainty-based interest field algorithm as well as a skill-sampling-based behavior-shaping algorithm to use in testing this framework. Our results show that our method outperforms the baselines in terms of external model adaptation on metrics that measure both efficiency and performance.

[AI-166] From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

Link: https://arxiv.org/abs/2407.00263
Authors: Mehar Bhatia,Sahithya Ravi,Aditya Chinchure,Eunjeong Hwang,Vered Shwartz
Keywords: non-western cultures due, performance remains suboptimal, training datasets, recent advancements, remains suboptimal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under peer review

Abstract:Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models’ cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures – underscoring the necessity for enhancing multicultural understanding in vision-language models.

[AI-167] One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts

Link: https://arxiv.org/abs/2407.00256
Authors: Ruochen Wang,Sohyun An,Minhao Cheng,Tianyi Zhou,Sung Ju Hwang,Cho-Jui Hsieh
Keywords: Large Language Models, Large Language, Language Models, exhibit strong generalization, strong generalization capabilities
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2024. Code available at this https URL

Abstract:Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; Each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: Inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: A region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior arts across several major benchmarks.
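
As a rough illustration of the mixture-of-prompts idea described in this abstract, the sketch below routes a query to one of several "experts", each holding an instruction and a set of demos, and then assembles a prompt from the chosen expert. Everything here is a hypothetical assumption for illustration only: the two-dimensional "embeddings", the `EXPERTS` table, the nearest-centroid routing rule, and the prompt layout are not taken from the paper, which does not specify them in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route_to_expert(query_vec, experts):
    """Pick the expert whose centroid is most similar to the query (assumed rule)."""
    return max(experts, key=lambda e: cosine(query_vec, e["centroid"]))

def build_prompt(expert, query_text):
    """Concatenate the expert's instruction, its demos, and the new query."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in expert["demos"])
    return f"{expert['instruction']}\n{demos}\nQ: {query_text}\nA:"

# Toy experts with hand-written 2-D centroids (hypothetical data).
EXPERTS = [
    {"centroid": [1.0, 0.0],
     "instruction": "Solve the arithmetic problem.",
     "demos": [("2+2", "4")]},
    {"centroid": [0.0, 1.0],
     "instruction": "Translate to French.",
     "demos": [("cat", "chat")]},
]
```

In practice the vectors would come from a sentence-embedding model and the centroids from clustering the demos; both are left out here to keep the sketch self-contained.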

[AI-168] SemUV: Deep Learning based semantic manipulation over UV texture map of virtual human heads

Link: https://arxiv.org/abs/2407.00229
Authors: Anirban Mukherjee,Venkat Suprabath Bitra,Vignesh Bondugula,Tarun Reddy Tallapureddy,Dinesh Babu Jayagopi
Keywords: manipulating virtual human, virtual human heads, Designing and manipulating, interaction and VFX, human heads
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVIP 2024 Preprint

Abstract:Designing and manipulating virtual human heads is essential across various applications, including AR, VR, gaming, human-computer interaction and VFX. Traditional graphic-based approaches require manual effort and resources to achieve accurate representation of human heads. While modern deep learning techniques can generate and edit highly photorealistic images of faces, their focus remains predominantly on 2D facial images. This limitation makes them less suitable for 3D applications. Recognizing the vital role of editing within the UV texture space as a key component in the 3D graphics pipeline, our work focuses on this aspect to benefit graphic designers by providing enhanced control and precision in appearance manipulation. Research on existing methods within the UV texture space is limited, complex, and poses challenges. In this paper, we introduce SemUV: a simple and effective approach using the FFHQ-UV dataset for semantic manipulation directly within the UV texture space. We train a StyleGAN model on the publicly available FFHQ-UV dataset, and subsequently train a boundary for interpolation and semantic feature manipulation. Through experiments comparing our method with 2D manipulation technique, we demonstrate its superior ability to preserve identity while effectively modifying semantic features such as age, gender, and facial hair. Our approach is simple, agnostic to other 3D components such as structure, lighting, and rendering, and also enables seamless integration into standard 3D graphics pipelines without demanding extensive domain expertise, time, or resources.

[AI-169] Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Link: https://arxiv.org/abs/2407.00219
Authors: Mohsen Fayyaz,Fan Yin,Jiao Sun,Nanyun Peng
Keywords: large language models, explain their generations, large language, input texts, texts that reflect
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study how well large language models (LLMs) explain their generations with rationales – a set of tokens extracted from the input texts that reflect the decision process of LLMs. We examine LLM rationales extracted with two methods: 1) attribution-based methods that use attention or gradients to locate important tokens, and 2) prompting-based methods that guide LLMs to extract rationales using prompts. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales, and demonstrate reasonable alignment with humans even when model performance is poor. We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions. By fine-tuning these models on the corresponding datasets, both prompting and attribution methods demonstrate improved faithfulness. Our study sheds light on more rigorous and fair evaluations of LLM rationales, especially for prompting-based ones.

[AI-170] Tradeoffs When Considering Deep Reinforcement Learning for Contingency Management in Advanced Air Mobility

Link: https://arxiv.org/abs/2407.00197
Authors: Luis E. Alvarez,Marc W. Brittain,Steven D. Young
Keywords: Advanced Air Mobility, Air Mobility, Advanced Air, rapid evolution globally, Air transportation
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Air transportation is undergoing a rapid evolution globally with the introduction of Advanced Air Mobility (AAM) and with it comes novel challenges and opportunities for transforming aviation. As AAM operations introduce increasing heterogeneity in vehicle capabilities and density, increased levels of automation are likely necessary to achieve operational safety and efficiency goals. This paper focuses on one example where increased automation has been suggested. Autonomous operations will need contingency management systems that can monitor evolving risk across a span of interrelated (or interdependent) hazards and, if necessary, execute appropriate control interventions via supervised or automated decision making. Accommodating this complex environment may require automated functions (autonomy) that apply artificial intelligence (AI) techniques that can adapt and respond to a quickly changing environment. This paper explores the use of Deep Reinforcement Learning (DRL) which has shown promising performance in complex and high-dimensional environments where the objective can be constructed as a sequential decision-making problem. An extension of a prior formulation of the contingency management problem as a Markov Decision Process (MDP) is presented and uses a DRL framework to train agents that mitigate hazards present in the simulation environment. A comparison of these learning-based agents and classical techniques is presented in terms of their performance, verification difficulties, and development process.

[AI-171] A Novel Labeled Human Voice Signal Dataset for Misbehavior Detection

Link: https://arxiv.org/abs/2407.00188
Authors: Ali Raza(Department of Software Engineering The University Of Lahore, Lahore, Pakistan),Faizan Younas(Department of Computer Science & Information Technology, The University Of Lahore, Lahore, Pakistan)
Keywords: behaviours involves analyzing, signal classification based, Voice, involves analyzing, analyzing various aspects
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Voice signal classification based on human behaviours involves analyzing various aspects of speech patterns and delivery styles. In this study, a real-time dataset collection is performed where participants are instructed to speak twelve psychology questions in two distinct manners: first, in a harsh voice, which is categorized as “misbehaved”; and second, in a polite manner, categorized as “normal”. These classifications are crucial in understanding how different vocal behaviours affect the interpretation and classification of voice signals. This research highlights the significance of voice tone and delivery in automated machine-learning systems for voice analysis and recognition. This research contributes to the broader field of voice signal analysis by elucidating the impact of human behaviour on the perception and categorization of voice signals, thereby enhancing the development of more accurate and context-aware voice recognition technologies.

[AI-172] Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach

Link: https://arxiv.org/abs/2407.00167
Authors: Sai Krishna Revanth Vuruma,Dezhi Wu,Saborny Sen Gupta,Lucas Aust,Valerie Lookingbill,Wyatt Bellamy,Yang Ren,Erin Kasson,Li-Shiun Chen,Patricia Cavazos-Rehg,Dian Hu,Ming Huang
Keywords: use-associated lung injury, United States, EVALI outbreak, vaping use-associated lung, comprehend vaping behaviors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: Accepted for the AI Applications in Public Health and Social Services workshop at the 22nd International Conference on Artificial Intelligence in Medicine (AIME 2024)

Abstract:In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users’ quit-vaping intentions. Leveraging OpenAI’s latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users’ subtle intentions that may elude human detection.

[AI-173] Predicting Elevated Risk of Hospitalization Following Emergency Department Discharges

Link: https://arxiv.org/abs/2407.00147
Authors: Dat Hong,Philip M. Polgreen,Alberto Maria Segre
Keywords: proper diagnosis, follow closely, symptoms of missed, missed opportunities, opportunities to form
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hospitalizations that follow closely on the heels of one or more emergency department visits are often symptoms of missed opportunities to form a proper diagnosis. These diagnostic errors imply a failure to recognize the need for hospitalization and deliver appropriate care, and thus also bear important connotations for patient safety. In this paper, we show how data mining techniques can be applied to a large existing hospitalization data set to learn useful models that predict these upcoming hospitalizations with high accuracy. Specifically, we use an ensemble of logistic regression, naïve Bayes and association rule classifiers to successfully predict hospitalization within 3, 7 and 14 days of an emergency department discharge. Aside from high accuracy, one of the advantages of the techniques proposed here is that the resulting classifier is easily inspected and interpreted by humans so that the learned rules can be readily operationalized. These rules can then be easily distributed and applied directly by physicians in emergency department settings to predict the risk of early admission prior to discharging their emergency department patients.

[AI-174] The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic

Link: https://arxiv.org/abs/2407.00146
Authors: Shahad Al-Khalifa,Hend Al-Khalifa
Keywords: models pre-trained exclusively, language models pre-trained, Arabic data, Arabic, growing importance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has led to limited benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models’ mathematical reasoning and language understanding abilities in Arabic. These benchmarks are derived from a General Aptitude Test (GAT) called the Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia. For validation purposes, we assess the performance of ChatGPT-3.5-turbo and ChatGPT-4 on our benchmarks. Our findings reveal that these benchmarks pose a significant challenge, with ChatGPT-4 achieving an overall average accuracy of 64%, while ChatGPT-3.5-turbo achieved an overall accuracy of 49% across the various question types in the Qiyas benchmark. We believe the release of these benchmarks will pave the way for enhancing the mathematical reasoning and language understanding capabilities of future models tailored for the low-resource Arabic language.

[AI-175] Graph Neural Networks for Gut Microbiome Metaomic data: A preliminary work

Link: https://arxiv.org/abs/2407.00142
Authors: Christopher Irwin,Flavio Mignone,Stefania Montani,Luigi Portinale
Keywords: complex metaomic data, metaomic data due, crucial for human, human health, presents challenges
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The gut microbiome, crucial for human health, presents challenges in analyzing its complex metaomic data due to high dimensionality and sparsity. Traditional methods struggle to capture its intricate relationships. We investigate graph neural networks (GNNs) for this task, aiming to derive meaningful representations of individual gut microbiomes. Unlike methods relying solely on taxa abundance, we directly leverage phylogenetic relationships in order to obtain a generalized encoder for taxa networks. The representations learnt from the encoder are then used to train a model for phenotype prediction such as Inflammatory Bowel Disease (IBD).

[AI-176] Towards Secure and Efficient Data Scheduling for Vehicular Social Networks

Link: https://arxiv.org/abs/2407.00141
Authors: Youhua Xia,Tiehua Zhang,Jiong Jin,Ying He,Fei Yu
Keywords: significant challenge due, Efficient data transmission, vehicular environments poses, vehicular social networks, Efficient data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Efficient data transmission scheduling within vehicular environments poses a significant challenge due to the high mobility of such networks. Contemporary research predominantly centers on crafting cooperative scheduling algorithms tailored for vehicular networks. Notwithstanding, the intricacies of orchestrating scheduling in vehicular social networks both effectively and efficiently remain formidable. This paper introduces an innovative learning-based algorithm for scheduling data transmission that prioritizes efficiency and security within vehicular social networks. The algorithm first uses a specifically constructed neural network to enhance data processing capabilities. After this, it incorporates a Q-learning paradigm during the data transmission phase to optimize the information exchange, the privacy of which is safeguarded by differential privacy through the communication process. Comparative experiments demonstrate the superior performance of the proposed Q-learning enhanced scheduling algorithm relative to existing state-of-the-art scheduling algorithms in the context of vehicular social networks.
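
For readers unfamiliar with the Q-learning step this abstract mentions, here is a minimal sketch of the standard tabular Q-learning update. It is the textbook rule, not the paper's algorithm: the state and action names (`"s0"`, `"send"`, `"wait"`), the reward, and the `alpha`/`gamma` values are hypothetical, and the paper's neural network and differential-privacy mechanism are not modeled.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value.

    q is a dict mapping (state, action) pairs to values; missing entries count as 0.
    """
    # Value of the best action available in the next state.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Move the old estimate toward the bootstrapped target.
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]

# Hypothetical usage: a scheduler choosing between sending now or waiting.
ACTIONS = ["send", "wait"]
q_table = {}
first = q_update(q_table, "s0", "send", 1.0, "s1", ACTIONS)
```

With an empty table, the first update moves the estimate halfway from 0 toward the reward of 1.0, since `alpha` is 0.5 and the next state has no learned value yet.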

[AI-177] Analyzing Quality Bias and Performance in Text-to-Image Generative Models

Link: https://arxiv.org/abs/2407.00138
Authors: Nila Masrourisaadat,Nazanin Sedaghatkish,Fatemeh Sarshartehrani,Edward A. Fox
Keywords: Advances in generative, demonstrating the ability, text prompts, led to significant, significant interest
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures

Abstract:Advances in generative models have led to significant interest in image synthesis, demonstrating the ability to generate high-quality images for a diverse range of text prompts. Despite this progress, most studies ignore the presence of bias. In this paper, we examine several text-to-image models not only by qualitatively assessing their performance in generating accurate images of human faces, groups, and specified numbers of objects but also by presenting a social bias analysis. As expected, models with larger capacity generate higher-quality images. However, we also document the inherent gender or social biases these models possess, offering a more complete understanding of their impact and limitations.

[AI-178] A Simple Attention-Based Mechanism for Bimodal Emotion Classification

Link: https://arxiv.org/abs/2407.00134
Authors: Mazen Elabd,Sardar Jaf
Keywords: Big data, learning important features, Big, important features, learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, 5 tables

Abstract:Big data contain rich information for machine learning algorithms to utilize when learning important features during classification tasks. Human beings express their emotion using certain words, speech (tone, pitch, speed) or facial expression. Artificial intelligence approaches to emotion classification are largely based on learning from textual information. However, public datasets containing text and speech data provide sufficient resources to train machine learning algorithms for the task of emotion classification. In this paper, we present novel bimodal deep learning-based architectures enhanced with an attention mechanism, trained and tested on text and speech data for emotion classification. We report details of different deep learning based architectures and show the performance of each architecture including rigorous error analyses. Our findings suggest that deep learning based architectures trained on different types of data (text and speech) outperform architectures trained only on text or speech. Our proposed attention-based bimodal architecture outperforms several state-of-the-art systems in emotion classification.

[AI-179] ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

Link: https://arxiv.org/abs/2407.00132
Authors: Haiyang Shen,Yue Li,Desong Meng,Dongqi Cai,Sheng Qi,Li Zhang,Mengwei Xu,Yun Ma
Keywords: large language models, application programming interfaces, integrating large language, Recent advancements, gained significant interest
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. These API-based agents, leveraging the strong autonomy and planning capabilities of LLMs, can efficiently solve problems requiring multi-step actions. However, their ability to handle multi-dimensional difficulty levels, diverse task types, and real-world demands through APIs remains unknown. In this paper, we introduce ShortcutsBench, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving tasks with varying levels of difficulty, diverse task types, and real-world demands. ShortcutsBench includes a wealth of real APIs from Apple Inc.'s operating systems, refined user queries from shortcuts, human-annotated high-quality action sequences from shortcut developers, and accurate parameter filling values about primitive parameter types, enum parameter types, outputs from previous actions, and parameters that need to request necessary information from the system or user. Our extensive evaluation of agents built with 5 leading open-source (size = 57B) and 4 closed-source LLMs (e.g. Gemini-1.5-Pro and GPT-3.5) reveals significant limitations in handling complex queries related to API selection, parameter filling, and requesting necessary information from systems and users. These findings highlight the challenges that API-based agents face in effectively fulfilling real and complex user queries. All datasets, code, and experimental results will be available at this https URL.

[AI-180] RepAct: The Re-parameterizable Adaptive Activation Function

Link: https://arxiv.org/abs/2407.00131
Authors: Xian Wu,Qingchuan Tao,Shuang Wang
Keywords: efficient artificial intelligence, Addressing the imperative, study presents RepAct, re-parameterizable adaptive activation, activation function tailored
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Addressing the imperative need for efficient artificial intelligence in IoT and edge computing, this study presents RepAct, a re-parameterizable adaptive activation function tailored for optimizing lightweight neural networks within the computational limitations of edge devices. By employing a multi-branch structure with learnable adaptive weights, RepAct enriches feature processing and enhances cross-layer interpretability. When evaluated on tasks such as image classification and object detection, RepAct notably surpassed conventional activation functions in lightweight networks, delivering up to a 7.92% accuracy boost on MobileNetV3-Small for the ImageNet100 dataset, while maintaining computational complexity on par with HardSwish. This innovative approach not only maximizes model parameter efficiency but also significantly improves the performance and understanding capabilities of lightweight neural networks, demonstrating its potential for real-time edge computing applications.
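
To make the idea of a re-parameterizable multi-branch activation more concrete, the sketch below combines two branches (identity and ReLU) with learnable weights and then folds them into a single piecewise-linear function that gives the same outputs. This is only an assumed toy construction: the abstract does not state which branches RepAct uses or how it merges them, and the function names and weights here are hypothetical.

```python
def multi_branch(x, w_identity, w_relu):
    """Training-time form: a weighted sum of an identity branch and a ReLU branch."""
    return w_identity * x + w_relu * max(x, 0.0)

def merge_branches(w_identity, w_relu):
    """Fold both branches into (negative-side slope, positive-side slope)."""
    return (w_identity, w_identity + w_relu)

def merged(x, slopes):
    """Inference-time form: one piecewise-linear function, no separate branches."""
    negative_slope, positive_slope = slopes
    return positive_slope * x if x >= 0 else negative_slope * x

# Hypothetical learned weights.
SLOPES = merge_branches(0.25, 0.5)
```

The merged form evaluates a single slope per input, which is the general reason re-parameterization can keep inference cost close to that of a simple activation.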

[AI-181] When Search Engine Services meet Large Language Models: Visions and Challenges

Link: https://arxiv.org/abs/2407.00128
Authors: Haoyi Xiong,Jiang Bian,Yuchen Li,Xuhong Li,Mengnan Du,Shuaiqiang Wang,Dawei Yin,Sumi Helal
Keywords: Combining Large Language, Large Language Models, Combining Large, Large Language, Language Models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Combining Large Language Models (LLMs) with search engine services marks a significant shift in the field of services computing, opening up new possibilities to enhance how we search for and retrieve information, understand content, and interact with internet services. This paper conducts an in-depth examination of how integrating LLMs with search engines can mutually benefit both technologies. We focus on two main areas: using search engines to improve LLMs (Search4LLM) and enhancing search engine functions using LLMs (LLM4Search). For Search4LLM, we investigate how search engines can provide diverse high-quality datasets for pre-training of LLMs, how they can use the most relevant documents to help LLMs learn to answer queries more accurately, how training LLMs with Learning-To-Rank (LTR) tasks can enhance their ability to respond with greater precision, and how incorporating recent search results can make LLM-generated content more accurate and current. In terms of LLM4Search, we examine how LLMs can be used to summarize content for better indexing by search engines, improve query outcomes through optimization, enhance the ranking of search results by analyzing document relevance, and help in annotating data for learning-to-rank tasks in various learning contexts. However, this promising integration comes with its challenges, which include addressing potential biases and ethical issues in training models, managing the computational and other costs of incorporating LLMs into search services, and continuously updating LLM training with the ever-changing web content. We discuss these challenges and chart out required research directions to address them. We also discuss broader implications for service computing, such as scalability, privacy concerns, and the need to adapt search engine architectures for these advanced models.

[AI-182] A Survey on Failure Analysis and Fault Injection in AI Systems

Link: https://arxiv.org/abs/2407.00125
Authors: Guangba Yu,Gou Tan,Haojia Huang,Zhenyu Zhang,Pengfei Chen,Roberto Natella,Zibin Zheng
Keywords: Artificial Intelligence Generated, Intelligence Generated Content, Large Language Models, Artificial Intelligence, Intelligence Generated
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, there lacks a comprehensive review of FA and FI methodologies in AI systems. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions including (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.

[AI-183] Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks

Link: https://arxiv.org/abs/2407.00121
Authors: Ibrahim Abdelaziz,Kinjal Basu,Mayank Agarwal,Sadhana Kumaravel,Matthew Stallone,Rameswar Panda,Yara Rizk,GP Bhargav,Maxwell Crouse,Chulaka Gunasekara,Shajith Ikbal,Sachin Joshi,Hima Karanam,Vineet Kumar,Asim Munawar,Sumit Neelam,Dinesh Raghu,Udit Sharma,Adriana Meza Soria,Dheeraj Sreedhar,Praveen Venkateswaran,Merve Unuvar,David Cox,Salim Roukos,Luis Lastras,Pavan Kapanipathi
Keywords: Large language models, recently shown tremendous, shown tremendous promise, Large language, function calling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex tasks. These tasks together are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling, those being Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets comparing GRANITE-20B-FUNCTIONCALLING to more than 15 other best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets.

[AI-184] Automated Web-Based Malaria Detection System with Machine Learning and Deep Learning Techniques

Link: https://arxiv.org/abs/2407.00120
Authors: Abraham G Taye, Sador Yemane, Eshetu Negash, Yared Minwuyelet, Moges Abebe, Melkamu Hunegnaw Asmare
Keywords: global health burden, causing widespread suffering, significant global health, Malaria parasites pose, health burden
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Malaria parasites pose a significant global health burden, causing widespread suffering and mortality. Detecting malaria infection accurately is crucial for effective treatment and control. However, existing automated detection techniques have shown limitations in terms of accuracy and generalizability. Many studies have focused on specific features without exploring more comprehensive approaches. In our case, we formulate a deep learning technique for malaria-infected cell classification using traditional CNNs and transfer learning models notably VGG19, InceptionV3, and Xception. The models were trained using NIH datasets and tested using different performance metrics such as accuracy, precision, recall, and F1-score. The test results showed that deep CNNs achieved the highest accuracy – 97%, followed by Xception with an accuracy of 95%. A machine learning model SVM achieved an accuracy of 83%, while an Inception-V3 achieved an accuracy of 94%. Furthermore, the system can be accessed through a web interface, where users can upload blood smear images for malaria detection.
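The four metrics named in this abstract (accuracy, precision, recall, F1-score) all derive from the counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. The function name is illustrative only.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for an infected / uninfected cell classifier.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```

Accuracy alone can look good on an imbalanced dataset, which is presumably why the authors report precision, recall, and F1 alongside it.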

[AI-185] Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations

Link: https://arxiv.org/abs/2407.00119
Authors: Yuntao Shou, Wei Ai, Jiayi Du, Tao Meng, Haiyan Liu
Keywords: genuine emotional state, graph neural networks, aims to analyze, multi-modal emotion recognition, analyze the genuine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 3 tables

Abstract:The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.

[AI-186] From Efficient Multimodal Models to World Models: A Survey

Link: https://arxiv.org/abs/2407.00118
Authors: Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang
Keywords: combining powerful large, powerful large language, large language models, perform complex tasks, Multimodal Large Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Models (MLMs) are becoming a significant research focus, combining powerful large language models with multimodal learning to perform complex tasks across different data modalities. This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence and as a pathway to world models. We provide an overview of key techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL). Additionally, we discuss both the fundamental and specific technologies of multimodal models, highlighting their applications, input/output modalities, and design characteristics. Despite significant advancements, the development of a unified multimodal model remains elusive. We discuss the integration of 3D generation and embodied intelligence to enhance world simulation capabilities and propose incorporating external rule systems for improved reasoning and decision-making. Finally, we outline future research directions to address these challenges and advance the field.

[AI-187] Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges

Link: https://arxiv.org/abs/2407.00116
Authors: Mahmoud Ibrahim, Yasmina Al Khalil, Sina Amirrajab, Chang Sun, Marcel Breeuwer, Josien Pluim, Bart Elen, Gokhan Ertaylan, Michel Dumontier
Keywords: comprehensive systematic review, including imaging, medical data types, paper presents, presents a comprehensive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents a comprehensive systematic review of generative models (GANs, VAEs, DMs, and LLMs) used to synthesize various medical data types, including imaging (dermoscopic, mammographic, ultrasound, CT, MRI, and X-ray), text, time-series, and tabular data (EHR). Unlike previous narrowly focused reviews, our study encompasses a broad array of medical data modalities and explores various generative models. Our search strategy queries databases such as Scopus, PubMed, and ArXiv, focusing on recent works from January 2021 to November 2023, excluding reviews and perspectives. This period emphasizes recent advancements beyond GANs, which have been extensively covered previously. The survey reveals insights from three key aspects: (1) Synthesis applications and purpose of synthesis, (2) generation techniques, and (3) evaluation methods. It highlights clinically valid synthesis applications, demonstrating the potential of synthetic data to tackle diverse clinical requirements. While conditional models incorporating class labels, segmentation masks and image translations are prevalent, there is a gap in utilizing prior clinical knowledge and patient-specific context, suggesting a need for more personalized synthesis approaches and emphasizing the importance of tailoring generative approaches to the unique characteristics of medical data. Additionally, there is a significant gap in using synthetic data beyond augmentation, such as for validation and evaluation of downstream medical AI models. The survey uncovers that the lack of standardized evaluation methodologies tailored to medical images is a barrier to clinical application, underscoring the need for in-depth evaluation approaches, benchmarking, and comparative studies to promote openness and collaboration. 

[AI-188] Instance Temperature Knowledge Distillation

Link: https://arxiv.org/abs/2407.00115
Authors: Zhengbo Zhang, Yuxi Zhou, Jia Gong, Jun Liu, Zhigang Tu
Keywords: teacher network incrementally, Knowledge distillation, knowledge transferred, student network, enhances the performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge distillation (KD) enhances the performance of a student network by allowing it to learn the knowledge transferred from a teacher network incrementally. Existing methods dynamically adjust the temperature to enable the student network to adapt to the varying learning difficulties at different learning stages of KD. KD is a continuous process, but when adjusting the temperature, these methods consider only the immediate benefits of the operation in the current learning phase and fail to take into account its future returns. To address this issue, we formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning, termed RLKD. Importantly, we design a novel state representation to enable the agent to make more informed actions (i.e., instance temperature adjustments). To handle the problem of delayed rewards in our method due to the KD setting, we explore an instance reward calibration approach. In addition, we devise an efficient exploration strategy that enables the agent to learn a valuable instance temperature adjustment policy more efficiently. Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily, and we validate its effectiveness on both image classification and object detection tasks. Our code is at this https URL
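To make the role of a per-instance temperature concrete, here is a minimal sketch of the conventional temperature-scaled distillation loss (KL divergence between softened teacher and student distributions, scaled by T²), applied with a separate temperature for each instance. This is not the RLKD method itself: the reinforcement-learning agent that chooses the temperatures is not shown, and the function names and the example temperatures are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes moderate logit values so that no probability underflows to zero.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def batch_kd_loss(student_batch, teacher_batch, temperatures):
    """Average KD loss where every instance has its own temperature.

    `temperatures` is a hypothetical input; in the paper it would come
    from the learned policy rather than being supplied by hand.
    """
    losses = [kd_loss(s, t, temp)
              for s, t, temp in zip(student_batch, teacher_batch, temperatures)]
    return sum(losses) / len(losses)


# Two instances, each distilled with a different (hand-picked) temperature.
loss = batch_kd_loss(
    student_batch=[[1.0, 0.5, 0.1], [0.2, 2.0, 0.3]],
    teacher_batch=[[2.0, 0.5, 0.0], [0.1, 3.0, 0.2]],
    temperatures=[1.0, 4.0],
)
```

A higher temperature flattens both distributions, so the student receives more signal from the teacher's non-maximal classes; a lower one concentrates the loss on the top class.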

[AI-189] OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Link: https://arxiv.org/abs/2407.00114
Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
Keywords: open-world instruction-following agents, instruction-following agents, open-world Minecraft, behavior, tokens
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.

[AI-190] Personalized Federated Continual Learning via Multi-granularity Prompt

Link: https://arxiv.org/abs/2407.00113
Authors: Hao Yu, Xin Yang, Xin Gao, Yan Kang, Hao Wang, Junbo Zhang, Tianrui Li
Keywords: Federated Continual Learning, Personalized Federated Continual, poses greater challenges, Federated Continual, Personalized Federated Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2024 Research Track

Abstract:Personalized Federated Continual Learning (PFCL) is a new practical scenario that poses greater challenges in sharing and personalizing knowledge. PFCL not only relies on knowledge fusion for server aggregation at the global spatial-temporal perspective but also needs model improvement for each client according to the local requirements. Existing methods, whether in Personalized Federated Learning (PFL) or Federated Continual Learning (FCL), have overlooked the multi-granularity representation of knowledge, which can be utilized to overcome Spatial-Temporal Catastrophic Forgetting (STCF) and adopt generalized knowledge to itself by coarse-to-fine human cognitive mechanisms. Moreover, it allows shared knowledge to be personalized more effectively, thus serving its own purpose. To this end, we propose a novel concept called multi-granularity prompt, i.e., a coarse-grained global prompt acquired through the common model learning process, and a fine-grained local prompt used to personalize the generalized representation. The former focuses on efficiently transferring shared global knowledge without spatial forgetting, and the latter emphasizes specific learning of personalized local knowledge to overcome temporal forgetting. In addition, we design a selective prompt fusion mechanism for aggregating knowledge of global prompts distilled from different clients. By the exclusive fusion of coarse-grained knowledge, we achieve the transmission and refinement of common knowledge among clients, further enhancing the performance of personalization. Extensive experiments demonstrate the effectiveness of the proposed method in addressing STCF as well as improving personalized performance. Our code is now available at this https URL.

[AI-191] Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models

Link: https://arxiv.org/abs/2407.00111
Authors: Ben Fauber
Keywords: instruction fine-tuned pretrained, fine-tuned pretrained generative, pretrained generative small, generative small language, small language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.

[AI-192] Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

Link: https://arxiv.org/abs/2407.00110
Authors: Ali Doosthosseini, Jonathan Decker, Hendrik Nolte, Julian M. Kunkel
Keywords: custom fine-tuned LLMs, data remains private, large language models, increasing adoption, adoption of large
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 27 pages, 5 figures, 2 tables

Abstract:The increasing adoption of large language models (LLMs) has created a pressing need for an efficient, secure and private serving infrastructure, which allows researchers to run open-source or custom fine-tuned LLMs and ensures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs are well-suited for training LLMs, their batch scheduling paradigm is not designed to support real-time serving of AI applications. Cloud systems, on the other hand, are well suited for web services but commonly lack access to the computational power of clusters, especially expensive and scarce high-end GPUs, which are required for optimal inference speed. We propose an architecture with an implementation consisting of a web service that runs on a cloud VM with secure access to a scalable backend running a multitude of AI models on HPC systems. By offering a web service using our HPC infrastructure to host LLMs, we leverage the trusted environment of local universities and research centers to offer a private and secure alternative to commercial LLM services. Our solution natively integrates with Slurm, enabling seamless deployment on HPC clusters and is able to run side by side with regular Slurm workloads, while utilizing gaps in the schedule created by Slurm. In order to ensure the security of the HPC system, we use the SSH ForceCommand directive to construct a robust circuit breaker, which prevents successful attacks on the web-facing server from affecting the cluster. We have successfully deployed our system as a production service, and made the source code available at this https URL

[AI-193] A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling

Link: https://arxiv.org/abs/2407.00108
Authors: Sebastian Vincent, Charlotte Prescott, Chris Bayliss, Chris Oakley, Carolina Scarton
Keywords: enhance translation quality, Incorporating extra-textual context, translation quality, Incorporating extra-textual, machine translation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to EAMT 2024

Abstract:Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV subtitles with a focus on how leveraging extra-textual context impacts post-editing. We found that post-editors marked significantly fewer context-related errors when correcting the outputs of MTCue, the context-aware model, as opposed to non-contextual models. We also present the results of a survey of the employed post-editors, which highlights contextual inadequacy as a significant gap consistently observed in MT. Our findings strengthen the motivation for further work within fully contextual MT.

[AI-194] WineGraph: A Graph Representation For Food-Wine Pairing

Link: https://arxiv.org/abs/2407.00107
Authors: Zuzanna Gawrysiak, Agata Żywot, Agnieszka Ławrynowicz
Keywords: present WineGraph, extended version, heterogeneous graph incorporating, graph incorporating wine, incorporating wine data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present WineGraph, an extended version of FlavorGraph, a heterogeneous graph incorporating wine data into its structure. This integration enables food-wine pairing based on taste and sommelier-defined rules. Leveraging a food dataset comprising 500,000 reviews and a wine reviews dataset with over 130,000 entries, we computed taste descriptors for both food and wine. This information was then utilised to pair food items with wine and augment FlavorGraph with additional data. The results demonstrate the potential of heterogeneous graphs to acquire supplementary information, proving beneficial for wine pairing.

[AI-195] UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

Link: https://arxiv.org/abs/2407.00106
Authors: Ilia Shumailov, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, Eugene Bagdasaryan
Keywords: Exact unlearning, allowed a user, user to retract, retract their data, data from machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning is often discussed as an approach for removal of impermissible knowledge, i.e., knowledge that the model should not possess such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation. We discuss feasibility of ununlearning for modern LLMs and examine broader implications.

[AI-196] Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Link: https://arxiv.org/abs/2407.00105
Authors: Yuqing Qian, Ziyu Zheng, Prayag Tiwari, Yijie Ding, Quan Zou
Keywords: Drug-side effect prediction, field of pharmacology, Drug-side effect, essential area, area of research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Transactions on Machine Learning Research (TMLR 2024)

Abstract:Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the related data can be described from various perspectives. To process these kinds of data, a multi-view method, called Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP), is proposed. MKronRLSF-LP extends the Kron-RLS by finding the consensus partitions and multiple graph Laplacian constraints in the multi-view setting. Both of these multi-view settings contribute to a higher quality result. Extensive experiments have been conducted on drug-side effect datasets, and our empirical results provide evidence that our approach is effective and robust.

[AI-197] AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Link: https://arxiv.org/abs/2407.00104
Authors: Iván Matas, Carmen Serrano, Francisca Silva, Amalia Serrano, Tomás Toledo-Pastrana, Begoña Acha
Keywords: optimizing resource utilization, BCC, provide interpretable support, BCC dermoscopic features, BCC dermoscopic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Comments: 8 pages, 4 figures, 4 tables, under review

Abstract:An AI tool has been developed to provide interpretable support for the diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing resource utilization. The interpretability is provided in two ways: on the one hand, the main BCC dermoscopic patterns are found in the image to justify the BCC/Non BCC classification. Secondly, based on the common visual XAI Grad-CAM, a clinically inspired visual explanation is developed where the relevant features for diagnosis are located. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnosis of four dermatologists using an Expectation Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy rate of 90%. For Clinically-inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. As for the Clinically-inspired Visual XAI results, the mean of the Grad-CAM normalized value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the model struggles to accurately identify the regions of the BCC patterns. These results prove the ability of the AI tool to provide a useful explanation.
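The inside-versus-outside comparison reported in this abstract (0.57 against 0.16) is a mean of normalized heatmap values taken over two regions of the image. The sketch below shows one way such a statistic could be computed; it assumes min-max normalization and a binary mask, neither of which the abstract specifies, and the function name and toy data are hypothetical.

```python
def region_means(heatmap, mask):
    """Mean normalized heatmap value inside and outside a binary mask.

    heatmap: 2D list of raw activation values (e.g. a Grad-CAM map).
    mask:    2D list of 0/1 flags marking the annotated clinical features.
    Min-max normalization is an assumption of this sketch.
    """
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    inside, outside = [], []
    for h_row, m_row in zip(heatmap, mask):
        for value, flag in zip(h_row, m_row):
            target = inside if flag else outside
            target.append((value - lo) / span)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(inside), mean(outside)


# Toy 2x2 example: the bottom row is the annotated region.
inside_mean, outside_mean = region_means(
    heatmap=[[0.0, 2.0], [4.0, 4.0]],
    mask=[[0, 0], [1, 1]],
)
```

A large gap between the two means indicates that the explanation concentrates on the annotated features rather than on the surrounding skin.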

[AI-198] Curriculum Learning with Quality-Driven Data Selection

Link: https://arxiv.org/abs/2407.00102
Authors: Biao Wu, Fang Meng, Ling Chen
Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, impressive multimodal capabilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The impressive multimodal capabilities demonstrated by OpenAI’s GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two attributes, mapping data quality into a two-dimensional space that allows for the selection of data based on their location within this distribution. By utilizing this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in five commonly assessed capabilities compared to using the complete dataset. Our codes, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4

[AI-199] Hybrid Approach to Parallel Stochastic Gradient Descent

Link: https://arxiv.org/abs/2407.00101
Authors: Aakash Sudhirbhai Vora, Dhrumil Chetankumar Joshi, Aksh Kantibhai Patel
Keywords: Stochastic Gradient Descent, Stochastic Gradient, Gradient Descent, large datasets, reduce the training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Stochastic Gradient Descent is used for large datasets to train models to reduce the training time. On top of that, data parallelism is widely used as a method to efficiently train neural networks using multiple worker nodes in parallel. Synchronous and asynchronous approaches to data parallelism are used by most systems to train the model in parallel. However, both of them have their drawbacks. We propose a third approach to data parallelism which is a hybrid between synchronous and asynchronous approaches, using both approaches to train the neural network. When the threshold function is selected appropriately to gradually shift all parameter aggregation from asynchronous to synchronous, we show that in a given time period our hybrid approach outperforms both asynchronous and synchronous approaches.

[AI-200] Enhancing In-Context Learning via Implicit Demonstration Augmentation

Link: https://arxiv.org/abs/2407.00100
Authors: Xiaoling Zhou, Wei Ye, Yidong Wang, Chaoya Jiang, Zhemg Lee, Rui Xie, Shikun Zhang
Keywords: enables large pre-trained, pre-trained language models, large pre-trained language, ICL effectiveness heavily, in-context learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ACL 2024 Main; 19 pages, 10 figures

Abstract:The emergence of in-context learning (ICL) enables large pre-trained language models (PLMs) to make predictions for unseen inputs without updating parameters. Despite its potential, ICL’s effectiveness heavily relies on the quality, quantity, and permutation of demonstrations, commonly leading to suboptimal and unstable performance. In this paper, we tackle this challenge for the first time from the perspective of demonstration augmentation. Specifically, we start with enriching representations of demonstrations by leveraging their deep feature distribution. We then theoretically reveal that when the number of augmented copies approaches infinity, the augmentation is approximately equal to a novel logit calibration mechanism integrated with specific statistical properties. This insight results in a simple yet highly efficient method that significantly improves the average and worst-case accuracy across diverse PLMs and tasks. Moreover, our method effectively reduces performance variance among varying demonstrations, permutations, and templates, and displays the capability to address imbalanced class distributions.

[AI-201] Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges

Link: https://arxiv.org/abs/2407.00092
Authors: Mohammed Elhenawy,Ahmad Abutahoun,Taqwa I. Alhadidi,Ahmed Jaber,Huthaifa I. Ashqar,Shadi Jaradat,Ahmed Abdelhay,Sebastien Glaser,Andry Rakotonirainy
Keywords: Multimodal Large Language, Large Language Models, Traveling Salesman Problem, Multimodal Large, Large Language
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) harness comprehensive knowledge spanning text, images, and audio to adeptly tackle complex problems, including zero-shot in-context learning scenarios. This study explores the ability of MLLMs in visually solving the Traveling Salesman Problem (TSP) and Multiple Traveling Salesman Problem (mTSP) using images that portray point distributions on a two-dimensional plane. We introduce a novel approach employing multiple specialized agents within the MLLM framework, each dedicated to optimizing solutions for these combinatorial challenges. Our experimental investigation includes rigorous evaluations across zero-shot settings and introduces innovative multi-agent zero-shot in-context scenarios. The results demonstrated that both multi-agent models, Multi-Agent 1 (which includes the Initializer, Critic, and Scorer agents) and Multi-Agent 2 (which comprises only the Initializer and Critic agents), significantly improved solution quality for TSP and mTSP problems. Multi-Agent 1 excelled in environments requiring detailed route refinement and evaluation, providing a robust framework for sophisticated optimizations. In contrast, Multi-Agent 2, focusing on iterative refinements by the Initializer and Critic, proved effective for rapid decision-making scenarios. These experiments yield promising outcomes, showcasing the robust visual reasoning capabilities of MLLMs in addressing diverse combinatorial problems. The findings underscore the potential of MLLMs as powerful tools in computational optimization, offering insights that could inspire further advancements in this promising field. Project link: this https URL

[AI-202] T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Link: https://arxiv.org/abs/2407.00088
Authors: Jianyu Wei,Shijie Cao,Ting Cao,Lingxiao Ma,Lei Wang,Yanyong Zhang,Mao Yang
Keywords: Large Language Models, Large Language, enhance on-device intelligence, Language Models, deployment of Large
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed precision matrix multiplication (mpGEMM) of low precision weights and high precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high precision computation. Such an indirect approach can lead to significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication to bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly to the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to 4x increase in throughput and 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like Raspberry Pi 5, which significantly exceeds the adult average reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at this https URL.
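
To make the bit-wise table lookup idea concrete, here is a minimal pure-Python sketch, not the T-MAC kernels themselves. It assumes unsigned integer weights, a group size of 4, and the hypothetical helper names `build_tables` and `lut_dot`. For each group of activations, all subset sums are precomputed once; each bit plane of the weights then indexes into those tables, so the work grows linearly with the weight bit-width.

```python
def build_tables(activations, group=4):
    """Precompute, per group of activations, the sum of every subset."""
    tables = []
    for start in range(0, len(activations), group):
        chunk = activations[start:start + group]
        table = [0.0] * (1 << len(chunk))
        for idx in range(1, len(table)):
            low = idx & -idx                 # lowest set bit of idx
            j = low.bit_length() - 1         # position of that bit
            table[idx] = table[idx ^ low] + chunk[j]
        tables.append(table)
    return tables


def lut_dot(weights, tables, bits, group=4):
    """Dot product of unsigned `bits`-bit weights with the tabulated activations."""
    total = 0.0
    for b in range(bits):                    # one pass per weight bit plane
        plane_sum = 0.0
        for g, table in enumerate(tables):
            idx = 0
            for j, w in enumerate(weights[g * group:(g + 1) * group]):
                idx |= ((w >> b) & 1) << j   # pack bit b of each weight into an index
            plane_sum += table[idx]          # lookup replaces multiply-accumulate
        total += plane_sum * (1 << b)        # scale the plane by 2**b
    return total


activations = [0.5, -1.0, 2.0, 3.0, 1.5, -0.25]
weights = [3, 0, 1, 2, 3, 1]                 # 2-bit unsigned weights
tables = build_tables(activations)
result = lut_dot(weights, tables, bits=2)
reference = sum(w * a for w, a in zip(weights, activations))
```

In a real kernel the bit-plane indices would presumably be packed offline, since the weights are fixed, and the tables would be reused across all output rows; this sketch repacks them on every call only for readability.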

[AI-203] ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

Link: https://arxiv.org/abs/2407.00087
Authors: Ju-Seung Byun,Jiyun Chun,Jihyung Kil,Andrew Perrault
Keywords: Large Multimodal Models, Large Multimodal, comprehending human instructions, demonstrate remarkable results, excel at comprehending
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning them with specific preferences. These methods primarily use ranking-based feedback for entire generations. With advanced AI models (Teacher), such as GPT-4 and Claude 3 Opus, we can request various types of detailed feedback that are expensive for humans to provide. We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT). This sentence-level feedback allows us to consider individual valuable segments, providing more granular rewards for the RL procedure. Second, we ask the Teacher to correct the wrong reasoning after the RL stage. The RL procedure requires massive efforts for hyperparameter tuning and often generates errors like repetitive words and incomplete sentences. With the correction feedback, we stabilize the RL fine-tuned model through SFT. We conduct experiments on the multi-modal datasets ScienceQA and A-OKVQA to demonstrate the effectiveness of our proposal. ARES rationale reasoning achieves around a 70% win rate against baseline models as judged by GPT-4o. Additionally, we observe that the improved rationale reasoning leads to a 2.5% increase in inference answer accuracy on average for the multi-modal datasets.

[AI-204] Adapting Job Recommendations to User Preference Drift with Behavioral-Semantic Fusion Learning

Link: https://arxiv.org/abs/2407.00082
Authors: Xiao Han,Chen Zhu,Xiao Hu,Chuan Qin,Xiangyu Zhao,Hengshu Zhu
Keywords: Job recommender systems, aligning job opportunities, recommender systems, systems are crucial, crucial for aligning
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by KDD 24 Research Track

Abstract:Job recommender systems are crucial for aligning job opportunities with job-seekers in online job-seeking. However, users tend to adjust their job preferences to secure employment opportunities continually, which limits the performance of job recommendations. The inherent frequency of preference drift poses a challenge to promptly and precisely capture user preferences. To address this issue, we propose a novel session-based framework, BISTRO, to model user preference in a timely manner through fusion learning of semantic and behavioral information. Specifically, BISTRO is composed of three stages: 1) coarse-grained semantic clustering, 2) fine-grained job preference extraction, and 3) personalized top-k job recommendation. Initially, BISTRO segments the user interaction sequence into sessions and leverages session-based semantic clustering to achieve broad identification of person-job matching. Subsequently, we design a hypergraph wavelet learning method to capture the nuanced job preference drift. To mitigate the effect of noise in interactions caused by frequent preference drift, we innovatively propose an adaptive wavelet filtering technique to remove noisy interactions. Finally, a recurrent neural network is utilized to analyze session-based interaction for inferring personalized preferences. Extensive experiments on three real-world offline recruitment datasets demonstrate the strong performance of our framework. Significantly, BISTRO also excels in online experiments, affirming its effectiveness in live recruitment settings. This dual success underscores the robustness and adaptability of BISTRO. The source code is available at this https URL.

[AI-205] Semantic Revolution from Communications to Orchestration for 6G: Challenges, Enablers, and Research Directions

Link: https://arxiv.org/abs/2407.00081
Authors: Masoud Shokrnezhad,Hamidreza Mazandarani,Tarik Taleb,Jaeseung Song,Richard Li
Keywords: digital entities presents, context of emerging, interactions involving, involving a myriad, digital entities
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted at IEEE Network magazine special issue: Goal-oriented Semantic Communication and Networking

Abstract:In the context of emerging 6G services, the realization of everything-to-everything interactions involving a myriad of physical and digital entities presents a crucial challenge. This challenge is exacerbated by resource scarcity in communication infrastructures, necessitating innovative solutions for effective service implementation. Exploring the potential of Semantic Communications (SemCom) to enhance point-to-point physical layer efficiency shows great promise in addressing this challenge. However, achieving efficient SemCom requires overcoming the significant hurdle of knowledge sharing between semantic decoders and encoders, particularly in the dynamic and non-stationary environment with stringent end-to-end quality requirements. To bridge this gap in existing literature, this paper introduces the Knowledge Base Management And Orchestration (KB-MANO) framework. Rooted in the concepts of Computing-Network Convergence (CNC) and lifelong learning, KB-MANO is crafted for the allocation of network and computing resources dedicated to updating and redistributing KBs across the system. The primary objective is to minimize the impact of knowledge management activities on actual service provisioning. A proof-of-concept is proposed to showcase the integration of KB-MANO with resource allocation in radio access networks. Finally, the paper offers insights into future research directions, emphasizing the transformative potential of semantic-oriented communication systems in the realm of 6G technology.

[AI-206] Mooncake: Kimi's KVCache-centric Architecture for LLM Serving

Link: https://arxiv.org/abs/2407.00079
Authors: Ruoyu Qin,Zheming Li,Weiran He,Mingxing Zhang,Yongwei Wu,Weimin Zheng,Xinran Xu
Keywords: LLM service provided, leading LLM service, leading LLM, provided by Moonshot, LLM service
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 21 pages, 11 figures

Abstract:Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake’s innovative architecture enables Kimi to handle 75% more requests.

[AI-207] Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Link: https://arxiv.org/abs/2407.00075
Authors: Anton Xue,Avishree Khare,Rajeev Alur,Surbhi Goel,Eric Wong
Keywords: subvert language models, propositional Horn logic, language models, models, large language models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:We study how to subvert language models from following the rules. We model rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if P and Q, then R" for some propositions P, Q, and R. We prove that although transformers can faithfully abide by such rules, maliciously crafted prompts can nevertheless mislead even theoretically constructed models. Empirically, we find that attacks on our theoretical models mirror popular attacks on large language models. Our work suggests that studying smaller theoretical models can help understand the behavior of large language models in rule-based settings like logical reasoning and jailbreak attacks.
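
For readers unfamiliar with propositional Horn logic, the rule-following being modeled can be illustrated with classic forward chaining. This is a minimal sketch of the logic only, not of the paper's transformer construction or its attacks; the function name `forward_chain` and the example rules are illustrative.

```python
def forward_chain(rules, facts):
    """Apply Horn rules (premises -> conclusion) until no new fact is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


rules = [
    ({"P", "Q"}, "R"),   # if P and Q, then R
    ({"R"}, "S"),        # if R, then S
    ({"T"}, "U"),        # if T, then U (never fires: T is not given)
]
derived = forward_chain(rules, {"P", "Q"})
```

A model that "follows the rules" should reach exactly the facts in `derived`; subverting it means getting it to skip a rule that should fire or to assert something, such as `U`, that the rules do not license.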

[AI-208] Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization

Link: https://arxiv.org/abs/2407.00071
Authors: Mert Esencan,Tarun Advaith Kumar,Ata Akbari Asanjan,P. Aaron Lott,Masoud Mohseni,Can Unlu,Davide Venturelli,Alan Ho
Keywords: Recent Large Language, Large Language Models, Recent Large, Language Models, Large Language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures

Abstract:Recent Large Language Models (LLMs) have demonstrated impressive capabilities at tasks that require human intelligence and are a significant step towards human-like artificial intelligence (AI). Yet the performance of LLMs at reasoning tasks has been subpar and the reasoning capability of LLMs is a matter of significant debate. While it has been shown that the choice of the prompting technique to the LLM can alter its performance on a multitude of tasks, including reasoning, the best performing techniques require human-made prompts with the knowledge of the tasks at hand. We introduce a framework for what we call Combinatorial Reasoning (CR), a fully-automated prompting method, where reasons are sampled from an LLM pipeline and mapped into a Quadratic Unconstrained Binary Optimization (QUBO) problem. The framework investigates whether QUBO solutions can be profitably used to select a useful subset of the reasons to construct a Chain-of-Thought style prompt. We explore the acceleration of CR with specialized solvers. We also investigate the performance of simpler zero-shot strategies such as linear majority rule or random selection of reasons. Our preliminary study indicates that coupling a combinatorial solver to generative AI pipelines is an interesting avenue for AI reasoning and elucidates design principles for future CR methods.
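
The abstract does not give the paper's QUBO construction, so the sketch below only illustrates the general shape of such a mapping under assumed coefficients: each sampled reason gets a binary variable, the diagonal rewards reasons that were sampled often, and an off-diagonal penalty discourages picking two near-duplicate reasons together. The counts, the penalty, and the helper names are hypothetical, and the brute-force solver stands in for the specialized solvers mentioned above.

```python
from itertools import product


def build_qubo(counts, redundancy):
    """Assumed mapping: reward frequent reasons, penalise redundant pairs."""
    n = len(counts)
    Q = [[0.0] * n for _ in range(n)]
    for i, count in enumerate(counts):
        Q[i][i] = -float(count)          # selecting reason i lowers the energy
    for (i, j), penalty in redundancy.items():
        Q[i][j] += penalty               # selecting both i and j raises it
    return Q


def qubo_energy(Q, x):
    """Energy x^T Q x of a binary assignment x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def solve_qubo_brute_force(Q):
    """Exhaustive search; only practical for a handful of variables."""
    best = min(product((0, 1), repeat=len(Q)), key=lambda x: qubo_energy(Q, x))
    return list(best)


counts = [5, 4, 1]               # hypothetical: how often each reason was sampled
redundancy = {(0, 1): 6.0}       # hypothetical: reasons 0 and 1 say the same thing
Q = build_qubo(counts, redundancy)
selected = solve_qubo_brute_force(Q)
```

Here the solver keeps reasons 0 and 2 and drops reason 1, because the redundancy penalty outweighs its frequency reward. The selected reasons would then be assembled into a Chain-of-Thought style prompt.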

[AI-209] Perceptron Collaborative Filtering

Link: https://arxiv.org/abs/2407.00067
Authors: Arya Chakraborty
Keywords: making automatic predictions, achieve similar results, implementing collaborative filtering, multivariate logistic regression, logistic regression classifiers
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

Abstract:While multivariate logistic regression classifiers are a great way of implementing collaborative filtering (a method of making automatic predictions about the interests of a user by collecting preferences or taste information from many other users), we can also achieve similar results using neural networks. A recommender system is a subclass of information filtering system that provides suggestions for items that are most pertinent to a particular user. A perceptron or a neural network is a machine learning model designed for fitting complex datasets using backpropagation and gradient descent. When coupled with advanced optimization techniques, the model may prove to be a great substitute for classical logistic classifiers. The optimizations include feature scaling, mean normalization, regularization, hyperparameter tuning and using stochastic/mini-batch gradient descent instead of regular gradient descent. In this use case, we will use the perceptron in the recommender system to fit the parameters, i.e., the data from a multitude of users, and use it to predict the preference/interest of a particular user.

[AI-210] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Link: https://arxiv.org/abs/2407.00066
Authors: Rickard Brüel-Gabrielsson,Jiacheng Zhu,Onkar Bhardwaj,Leshem Choshen,Kristjan Greenewald,Mikhail Yurochkin,Justin Solomon
Keywords: Fine-tuning large language, large language models, yielding numerous copies, Fine-tuning large, LLM differing
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Fine-tuning large language models (LLMs) with low-rank adapters (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRA adapters. We consider compressing adapters individually via SVD and propose a method for joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. Our experiments with up to 500 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 75% of the throughput of serving a single LoRA.

[AI-211] An Interpretable Alternative to Neural Representation Learning for Rating Prediction – Transparent Latent Class Modeling of User Reviews

Link: https://arxiv.org/abs/2407.00063
Authors: Giuseppe Serra,Peter Tino,Zhao Xu,Xin Yao
Keywords: including recommender systems, including recommender, recommender systems, widely adopted, Nowadays
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Nowadays, neural network (NN) and deep learning (DL) techniques are widely adopted in many applications, including recommender systems. Given the sparse and stochastic nature of collaborative filtering (CF) data, recent works have critically analyzed the effective improvement of neural-based approaches compared to simpler and often transparent algorithms for recommendation. Previous results showed that NN and DL models can be outperformed by traditional algorithms in many tasks. Moreover, given the largely black-box nature of neural-based methods, interpretable results are not naturally obtained. Following on this debate, we first present a transparent probabilistic model that topologically organizes user and product latent classes based on the review information. In contrast to popular neural techniques for representation learning, we readily obtain a statistical, visualization-friendly tool that can be easily inspected to understand user and product characteristics from a textual-based perspective. Then, given the limitations of common embedding techniques, we investigate the possibility of using the estimated interpretable quantities as model input for a rating prediction task. To contribute to the recent debates, we evaluate our results in terms of both capacity for interpretability and predictive performances in comparison with popular text-based neural approaches. The results demonstrate that the proposed latent class representations can yield competitive predictive performances, compared to popular, but difficult-to-interpret approaches.

[AI-212] MMBee: Live Streaming Gift-Sending Recommendations via Multi-Modal Fusion and Behaviour Expansion

Link: https://arxiv.org/abs/2407.00056
Authors: Jiaxin Deng,Shiyao Wang,Yuchen Wang,Jiansong Qi,Liqin Zhao,Guorui Zhou,Gaofeng Meng
Keywords: increasingly popular due, Live streaming services, increasingly popular, Live streaming, live streaming gifting
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted at KDD 2024

Abstract:Live streaming services are becoming increasingly popular due to real-time interactions and entertainment. Viewers can chat and send comments or virtual gifts to express their preferences for the streamers. Accurately modeling the gifting interaction not only enhances users’ experience but also increases streamers’ revenue. Previous studies on live streaming gifting prediction treat this task as a conventional recommendation problem, and model users’ preferences using categorical data and observed historical behaviors. However, it is challenging to precisely describe the real-time content changes in live streaming using limited categorical information. Moreover, due to the sparsity of gifting behaviors, capturing the preferences and intentions of users is quite difficult. In this work, we propose MMBee based on real-time Multi-Modal Fusion and Behaviour Expansion to address these issues. Specifically, we first present a Multi-modal Fusion Module with Learnable Query (MFQ) to perceive the dynamic content of streaming segments and process complex multi-modal interactions, including images, text comments and speech. To alleviate the sparsity issue of gifting behaviors, we present a novel Graph-guided Interest Expansion (GIE) approach that learns both user and streamer representations on large-scale gifting graphs with multi-modal attributes. Comprehensive experiment results show that MMBee achieves significant performance improvements on both public datasets and Kuaishou real-world streaming datasets and the effectiveness has been further validated through online A/B experiments. MMBee has been deployed and is serving hundreds of millions of users at Kuaishou.

[AI-213] Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data

Link: https://arxiv.org/abs/2407.00048
Authors: Priyanka Mudgal,Rita H. Wouhaybi
Keywords: necessitates heightened reliability, uphold user satisfaction, personal computers, necessitates heightened, growing reliance
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE-IAICT-24. Copyright is held by IEEE

Abstract:The growing reliance on computer systems, particularly personal computers (PCs), necessitates heightened reliability to uphold user satisfaction. This research paper presents an in-depth analysis of extensive system telemetry data, proposing an ensemble methodology for detecting system failures. Our approach entails scrutinizing various parameters of system metrics, encompassing CPU utilization, memory utilization, disk activity, CPU temperature, and pertinent system metadata such as system age, usage patterns, core count, and processor type. The proposed ensemble technique integrates a diverse set of algorithms, including Long Short-Term Memory (LSTM) networks, isolation forests, one-class support vector machines (OCSVM), and local outlier factors (LOF), to effectively discern system failures. Specifically, the LSTM network with other machine learning techniques is trained on Intel Computing Improvement Program (ICIP) telemetry software data to distinguish between normal and failed system patterns. Experimental evaluations demonstrate the remarkable efficacy of our models, achieving a notable detection rate in identifying system failures. Our research contributes to advancing the field of system reliability and offers practical insights for enhancing user experience in computing environments.

[AI-214] Design a Win-Win Strategy That Is Fair to Both Service Providers and Tasks When Rejection Is Not an Option

Link: https://arxiv.org/abs/2407.00032
Authors: Yohai Trabelsi,Pan Xu,Sarit Kraus
Keywords: Assigning tasks, service providers, frequent procedure, service providers remain, service
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Assigning tasks to service providers is a frequent procedure across various applications. Often the tasks arrive dynamically while the service providers remain static. Preventing task rejection caused by service provider overload is of utmost significance. To ensure a positive experience in relevant applications for both service providers and tasks, fairness must be considered. To address the issue, we model the problem as an online matching within a bipartite graph and tackle two minimax problems: one focuses on minimizing the highest waiting time of a task, while the other aims to minimize the highest workload of a service provider. We show that the second problem can be expressed as a linear program and thus solved efficiently while maintaining a reasonable approximation to the objective of the first problem. We developed novel methods that utilize the two minimax problems. We conducted extensive simulation experiments using real data and demonstrated that our novel heuristics, based on the linear program, performed remarkably well.

[AI-215] LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild

Link: https://arxiv.org/abs/2407.00024
Authors: Lang He,Kai Chen,Junnan Zhao,Yimeng Wang,Ercheng Pei,Haifeng Chen,Jiewei Jiang,Shiqing Zhang,Jie Zhang,Zhongmin Wang,Tao He,Prayag Tiwari
Keywords: including their personal, social functioning, academic and work, significantly impact, impact many aspects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Depression can significantly impact many aspects of an individual’s life, including their personal and social functioning, academic and work performance, and overall quality of life. Many researchers within the field of affective computing are adopting deep learning technology to explore potential patterns related to the detection of depression. However, because of subjects’ privacy protection concerns, data in this area is still scarce, presenting a challenge for the deep discriminative models used in detecting depression. To navigate these obstacles, a large-scale multimodal vlog dataset (LMVD) for depression recognition in the wild is built. LMVD has 1823 samples totaling 214 hours from 1475 participants, captured from four multimedia platforms (Sina Weibo, Bilibili, TikTok, and YouTube). A novel architecture termed MDDformer is proposed to learn the non-verbal behaviors of individuals. Extensive validations are performed on the LMVD dataset, demonstrating superior performance for depression detection. We anticipate that the LMVD will contribute a valuable function to the depression detection community. The data and code will be released at the link: this https URL.

[AI-216] Visual Language Model based Cross-modal Semantic Communication Systems

Link: https://arxiv.org/abs/2407.00020
Authors: Feibo Jiang,Chuanguo Tang,Li Dong,Kezhi Wang,Kun Yang,Cunhua Pan
Keywords: Shannon physical capacity, physical capacity limits, Cross-modal Semantic Communication, transcending the Shannon, Shannon physical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures

Abstract:Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

[AI-217] Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads

Link: https://arxiv.org/abs/2407.00010
Authors: Grant Wilkins,Srinivasan Keshav,Richard Mortier
Keywords: Large Language Models, require large amounts, Large Language, require large, large amounts
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Both the training and use of Large Language Models (LLMs) require large amounts of energy. Their increasing popularity, therefore, raises critical concerns regarding the energy efficiency and sustainability of data centers that host them. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate LLM tasks across hardware accelerators that differ in their energy efficiencies and computational capabilities. Specifically, our workload-aware strategy determines whether tasks are processed on energy-efficient processors or high-performance GPUs based on the number of input and output tokens in a query. Our analysis of a representative LLM dataset finds that this hybrid strategy can reduce CPU+GPU energy consumption by 7.5% compared to a workload-unaware baseline.
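
A cost-based, token-aware placement rule of the kind described can be sketched as follows. The cost model (a fixed energy overhead plus per-token energy, with a weighted runtime term) and every coefficient are assumptions made up for illustration; the paper's actual cost function and measurements are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    fixed_joules: float              # hypothetical per-query overhead
    joules_per_input_token: float    # hypothetical marginal energy
    joules_per_output_token: float
    seconds_per_token: float         # hypothetical processing speed


def cost(device, n_in, n_out, time_weight=1.0):
    """Assumed cost: energy plus a weighted runtime penalty."""
    energy = (device.fixed_joules
              + device.joules_per_input_token * n_in
              + device.joules_per_output_token * n_out)
    runtime = device.seconds_per_token * (n_in + n_out)
    return energy + time_weight * runtime


def schedule(devices, n_in, n_out, time_weight=1.0):
    """Send the query to the device with the lowest estimated cost."""
    return min(devices, key=lambda d: cost(d, n_in, n_out, time_weight))


efficient = Device("efficient", 0.0, 0.5, 1.0, 0.02)
gpu = Device("gpu", 30.0, 0.1, 0.4, 0.005)

small_choice = schedule([efficient, gpu], n_in=16, n_out=8).name
large_choice = schedule([efficient, gpu], n_in=2048, n_out=512).name
```

With these made-up numbers the short query lands on the energy-efficient processor and the long one on the GPU, because the GPU's fixed overhead only pays off once enough tokens amortize it.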

[AI-218] NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks

Link: https://arxiv.org/abs/2406.06305
Authors: Yuqi Ma,Huamin Wang,Hangchi Shen,Xuemei Chen,Shukai Duan,Shiping Wen
Keywords: brain-inspired spiking neural, spiking neural networks, attracted great research, great research attention, research attention owing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 32 pages, 4 figures, 4 tables

Abstract:Recently, brain-inspired spiking neural networks (SNNs) have attracted great research attention owing to their inherent bio-interpretability, event-triggered properties and powerful perception of spatiotemporal information, which is beneficial to handling event-based neuromorphic datasets. In contrast to conventional static image datasets, event-based neuromorphic datasets present heightened complexity in feature extraction due to their distinctive time series and sparsity characteristics, which influences their classification accuracy. To overcome this challenge, a novel approach termed Neuromorphic Momentum Contrast Learning (NeuroMoCo) for SNNs is introduced in this paper by extending the benefits of self-supervised pre-training to SNNs to effectively stimulate their potential. This is the first time that self-supervised learning (SSL) based on momentum contrastive learning is realized in SNNs. In addition, we devise a novel loss function named MixInfoNCE tailored to their temporal characteristics to further increase the classification accuracy of neuromorphic datasets, which is verified through rigorous ablation experiments. Finally, experiments on DVS-CIFAR10, DVS128Gesture and N-Caltech101 have shown that NeuroMoCo establishes new state-of-the-art (SOTA) benchmarks: 83.6% (Spikformer-2-256), 98.62% (Spikformer-2-256), and 84.4% (SEW-ResNet-18), respectively.
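
The abstract does not define MixInfoNCE, so the sketch below shows only the two standard ingredients of momentum contrastive learning that it builds on: the plain InfoNCE loss and the momentum (exponential moving average) update of the key encoder. Embeddings are plain Python lists, and the function names and toy vectors are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Standard InfoNCE: cross-entropy with the positive key as the target."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, key) / temperature for key in negative_keys]
    peak = max(logits)                                   # for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """Key encoder slowly tracks the query encoder instead of using gradients."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


query = [1.0, 0.0]
aligned_loss = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce(query, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
updated = momentum_update([1.0], [0.0], m=0.9)
```

The loss is near zero when the query matches its positive key and large when a negative key matches instead, which is the signal that drives the pre-training.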

[AI-219] Deep Dive into MRI: Exploring Deep Learning Applications in 0.55T and 7T MRI

Link: https://arxiv.org/abs/2407.01318
Authors: Ana Carolina Alves, André Ferreira, Behrus Puladi, Jan Egger, Victor Alves
Keywords: magnetic resonance imaging, involving ionising radiation, ionising radiation exposure, techniques involving ionising, providing a safe
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The development of magnetic resonance imaging (MRI) for medical imaging has provided a leap forward in diagnosis, offering a safe, non-invasive alternative to techniques involving ionising radiation exposure. It was described by Bloch and Purcell in 1946, and it was not until 1980 that the first clinical application of MRI became available. Since that time, MRI has gone through many advances and has altered the way diagnostic procedures are performed. Due to its ability to improve constantly, MRI has become commonly used across several specialisations in medicine. In particular, 0.55T and 7T MRI technologies have shown enhanced preservation of image detail and advanced tissue characterisation. This review examines the integration of deep learning (DL) techniques into these MRI modalities, presenting and exploring the applications studied. It highlights how DL contributes to 0.55T and 7T MRI data, showcasing the potential of DL in improving and refining these technologies. The review ends with a brief overview of how MRI technology will evolve in the coming years.

[AI-220] On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs)

Link: https://arxiv.org/abs/2407.01079
Authors: Jerry Yao-Chieh Hu, Weimin Wu, Zhuoru Li, Zhao Song, Han Liu
Keywords: latent DiTs, linear latent space, time latent DiTs, latent space assumption
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs, which is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiTs inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiTs training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiTs training by casting the DiTs gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of initial data.

[AI-221] Individual brain parcellation: Review of methods, validations and applications

Link: https://arxiv.org/abs/2407.00984
Authors: Chengyi Li, Shan Yu, Yue Cui
Keywords: individual brain parcellation, brains vary greatly, individual brain, Individual brains vary, Individual
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures

Abstract:Individual brains vary greatly in morphology, connectivity and organization. The applicability of group-level parcellations is limited by the rapid development of precision medicine today because they do not take into account the variation of parcels at the individual level. Accurate mapping of brain functional regions at the individual level is pivotal for a comprehensive understanding of the variations in brain function and behaviors, early and precise identification of brain abnormalities, as well as personalized treatments for neuropsychiatric disorders. With the development of neuroimaging and machine learning techniques, studies on individual brain parcellation are booming. In this paper, we offer an overview of recent advances in the methodologies of individual brain parcellation, including optimization- and learning-based methods. Comprehensive evaluation metrics to validate individual brain mapping have been introduced. We also review the studies of how individual brain mapping promotes neuroscience research and clinical medicine. Finally, we summarize the major challenges and important future directions of individualized brain parcellation. Collectively, we intend to offer a thorough overview of individual brain parcellation methods, validations, and applications, along with highlighting the current challenges that call for an urgent demand for integrated platforms that integrate datasets, methods, and validations.

[AI-222] Channel Modeling Aided Dataset Generation for AI-Enabled CSI Feedback: Advances, Challenges, and Solutions

Link: https://arxiv.org/abs/2407.00896
Authors: Yupeng Li, Gang Li, Zirui Wen, Shuangfeng Han, Shijian Gao, Guangyi Liu, Jiangzhou Wang
Keywords: multiple input multiple, input multiple output, frequency division duplex, demonstrated great potential, channel state information
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)

Abstract:The AI-enabled autoencoder has demonstrated great potential in channel state information (CSI) feedback in frequency division duplex (FDD) multiple input multiple output (MIMO) systems. However, this method completely changes the existing feedback strategies, making it impractical to deploy in the near term. To address this issue, this paper proposes a channel modeling aided data augmentation method based on a limited number of field channel data. Specifically, the user equipment (UE) extracts the primary stochastic parameters of the field channel data and transmits them to the base station (BS). The BS then updates the typical TR 38.901 model parameters with the extracted parameters. The updated channel model is then used to generate the dataset. This strategy jointly considers dataset collection, model generalization, model monitoring, and so on. Simulations verify that our proposed strategy can significantly improve performance compared to the benchmarks.

[AI-223] A data-driven approach to modeling brain activity using differential equations

Link: https://arxiv.org/abs/2407.00824
Authors: Kuratov Andrey (HSE University, Moscow)
Keywords: complete solutions, research focuses, innovative task, traditional methods, extracting equations
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This research focuses on an innovative task of extracting equations from incomplete data, moving away from traditional methods used for complete solutions. The study addresses the challenge of extracting equations from data, particularly in the study of brain activity using electrophysiological data, which is often limited by insufficient information. The study provides a brief review of existing open-source equation derivation approaches in the context of modeling brain activity. It then introduces a novel algorithm that employs incomplete data and prior domain knowledge to recover differential equations. The algorithm’s practicality in real-world scenarios is demonstrated through its application on both synthetic and real datasets.

[AI-224] Multimodal Learning and Cognitive Processes in Radiology: MedGaze for Chest X-ray Scanpath Prediction

Link: https://arxiv.org/abs/2407.00129
Authors: Akash Awasthi, Ngan Le, Zhigang Deng, Rishi Agrawal, Carol C. Wu, Hien Van Nguyen
Keywords: address fundamental questions, anticipate user attention, developing interactive systems, virtual reality, user attention
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Submitted to the Journal

Abstract:Predicting human gaze behavior within computer vision is integral for developing interactive systems that can anticipate user attention, address fundamental questions in cognitive science, and hold implications for fields like human-computer interaction (HCI) and augmented/virtual reality (AR/VR) systems. Despite methodologies introduced for modeling human eye gaze behavior, applying these models to medical imaging for scanpath prediction remains unexplored. Our proposed system aims to predict eye gaze sequences from radiology reports and CXR images, potentially streamlining data collection and enhancing AI systems using larger datasets. However, predicting human scanpaths on medical images presents unique challenges due to the diverse nature of abnormal regions. Our model predicts fixation coordinates and durations critical for medical scanpath prediction, outperforming existing models in the computer vision community. Utilizing a two-stage training process and large publicly available datasets, our approach generates static heatmaps and eye gaze videos aligned with radiology reports, facilitating comprehensive analysis. We validate our approach by comparing its performance with state-of-the-art methods and assessing its generalizability among different radiologists, introducing novel strategies to model radiologists’ search patterns during CXR image diagnosis. Based on the radiologist’s evaluation, MedGaze can generate human-like gaze sequences with a high focus on relevant regions over the CXR images. It sometimes also outperforms humans in terms of redundancy and randomness in the scanpaths.

[AI-225] Machine learning meets mass spectrometry: a focused perspective

Link: https://arxiv.org/abs/2407.00117
Authors: Daniil A. Boiko, Valentine P. Ananikov
Keywords: product quality control, industrial product quality, Mass spectrometry, life sciences, processes in medicine
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 4 figures

Abstract:Mass spectrometry is a widely used method to study molecules and processes in medicine, life sciences, chemistry, catalysis, and industrial product quality control, among many other applications. One of the main features of some mass spectrometry techniques is the extensive level of characterization (especially when coupled with chromatography and ion mobility methods, or as part of a tandem mass spectrometry experiment) and a large amount of generated data per measurement. Terabyte scales can be easily reached with mass spectrometry studies. Consequently, mass spectrometry has faced the challenge of a high level of data disappearance. Researchers often neglect and then altogether lose access to the rich information mass spectrometry experiments could provide. With the development of machine learning methods, the opportunity arises to unlock the potential of these data, enabling previously inaccessible discoveries. The present perspective highlights the reevaluation of mass spectrometry data analysis in light of the new generation of methods and describes significant challenges in the field, particularly related to problems involving the use of electrospray ionization. We argue that further applications of machine learning raise new requirements for instrumentation (increasing throughput and information density, decreasing pricing, and making more automation-friendly software), and once these are met, the field may experience significant transformation.

[AI-226] A Personalised Learning Tool for Physics Undergraduate Students Built On a Large Language Model for Symbolic Regression

Link: https://arxiv.org/abs/2407.00065
Authors: Yufan Zhu, Zi-Yu Khoo, Jonathan Sze Choong Low, Stephane Bressan
Keywords: Interleaved practice enhances, Large Language Model, Interleaved practice, Language Model, practice enhances
Categories: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)

Abstract:Interleaved practice enhances the memory and problem-solving ability of students in undergraduate courses. We introduce a personalized learning tool built on a Large Language Model (LLM) that can provide immediate and personalized attention to students as they complete homework containing problems interleaved from undergraduate physics courses. Our tool leverages the dimensional analysis method, enhancing students’ qualitative thinking and problem-solving skills for complex phenomena. Our approach combines LLMs for symbolic regression with dimensional analysis via prompt engineering and offers students a unique perspective to comprehend relationships between physics variables. This fosters a broader and more versatile understanding of physics and mathematical principles and complements a conventional undergraduate physics education that relies on interpreting and applying established equations within specific contexts. We test our personalized learning tool on the equations from Feynman’s lectures on physics. Our tool can correctly identify relationships between physics variables for most equations, underscoring its value as a complementary personalized learning tool for undergraduate physics students.

[AI-227] FoldToken2: Learning compact, invariant and generative protein structure language

Link: https://arxiv.org/abs/2407.00050
Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li
Keywords: posed long term, long term challenges, coordinates has posed, posed long, long term
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The equivariant nature of 3D coordinates has posed long-term challenges in protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2 to transfer equivariant structures into discrete tokens, while maintaining the recoverability of the original structures. From FoldToken1 to FoldToken2, we improve three key components: (1) invariant structure encoder, (2) vector-quantized compressor, and (3) equivariant structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the previous FoldToken1 by 20% in TMScore and 81% in RMSD. FoldToken2 is probably the first method that works well on both single-chain and multi-chain protein structure quantization. We believe that FoldToken2 will inspire further improvement in protein structure representation learning, structure alignment, and structure generation tasks.

[AI-228] Uncovering cognitive taskonomy through transfer learning in masked autoencoder-based fMRI reconstruction

Link: https://arxiv.org/abs/2407.00033
Authors: Youzhi Qu, Junfeng Xia, Xinyao Jian, Wendu Li, Kaining Peng, Zhichao Liang, Haiyan Wu, Quanying Liu
Keywords: widely used pre-training, learn the generalized, generalized features, tasks, MAE model
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:Data reconstruction is a widely used pre-training task to learn the generalized features for many downstream tasks. Although reconstruction tasks have been applied to neural signal completion and denoising, neural signal reconstruction is less studied. Here, we employ the masked autoencoder (MAE) model to reconstruct functional magnetic resonance imaging (fMRI) data, and utilize a transfer learning framework to obtain the cognitive taskonomy, a matrix to quantify the similarity between cognitive tasks. Our experimental results demonstrate that the MAE model effectively captures the temporal dynamics patterns and interactions within the brain regions, enabling robust cross-subject fMRI signal reconstruction. The cognitive taskonomy derived from the transfer learning framework reveals the relationships among cognitive tasks, highlighting subtask correlations within motor tasks and similarities between emotion, social, and gambling tasks. Our study suggests that the fMRI reconstruction with MAE model can uncover the latent representation and the obtained taskonomy offers guidance for selecting source tasks in neural decoding tasks for improving the decoding performance on target tasks.

[AI-229] Multi-objective generative AI for designing novel brain-targeting small molecules

Link: https://arxiv.org/abs/2407.00004
Authors: Ayush Noori, Iñaki Arango, William E. Byrd, Nada Amin
Keywords: central nervous system, successful central nervous, CNS drug design, BBB permeable drugs, blood-brain barrier
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 20 pages, 4 figures, Generative and Experimental Perspectives for Biomolecular Design Workshop at the 12th International Conference on Learning Representations

Abstract:The strict selectivity of the blood-brain barrier (BBB) represents one of the most formidable challenges to successful central nervous system (CNS) drug delivery. Computational methods to generate BBB permeable drugs in silico may be valuable tools in the CNS drug design pipeline. However, in real-world applications, BBB penetration alone is insufficient; rather, after transiting the BBB, molecules must bind to a specific target or receptor in the brain and must also be safe and non-toxic. To discover small molecules that concurrently satisfy these constraints, we use multi-objective generative AI to synthesize drug-like BBB-permeable small molecules. Specifically, we computationally synthesize molecules with predicted binding affinity against dopamine receptor D2, the primary target for many clinically effective antipsychotic drugs. After training several graph neural network-based property predictors, we adapt SyntheMol (Swanson et al., 2024), a recently developed Monte Carlo Tree Search-based algorithm for antibiotic design, to perform a multi-objective guided traversal over an easily synthesizable molecular space. We design a library of 26,581 novel and diverse small molecules containing hits with high predicted BBB permeability and favorable predicted safety and toxicity profiles, and that could readily be synthesized for experimental validation in the wet lab. We also validate top scoring molecules with molecular docking simulation against the D2 receptor and demonstrate predicted binding affinity on par with risperidone, a clinically prescribed D2-targeting antipsychotic. In the future, the SyntheMol-based computational approach described here may enable the discovery of novel neurotherapeutics for currently intractable disorders of the CNS.

Attachments

Click to download today's full paper list