本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,每天早上11:30点定时自动更新,主要按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从arxiv网站获取,每天早上11:30左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天11:30左右邮件定时自动发送。

目录

概览 (2024-06-13)

今日共更新500篇论文,其中:

  • 自然语言处理83篇(Computation and Language (cs.CL))
  • 计算机视觉131篇(Computer Vision and Pattern Recognition (cs.CV))
  • 人工智能140篇(Artificial Intelligence (cs.AI))
  • 机器学习161篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation
[NLP-0] 胜过千张图片的文字:测量和理解文本到图像生成中的感知变异性

链接: https://arxiv.org/abs/2406.08482
作者: Raphael Tang,Xinyu Zhang,Lixinyu Xu,Yao Lu,Wenyan Li,Pontus Stenetorp,Jimmy Lin,Ferhan Ture
关键词: variability remains understudied, remains understudied, perceptual variability remains, Diffusion models, variability
中文关键词: 变异性仍然研究不足,仍然研究不足,知觉变异性仍然存在,扩散模型,变异性
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 13 pages, 11 figures

点击查看摘要

Abstract:Diffusion models are the state of the art in text-to-image generation, but their perceptual variability remains understudied. In this paper, we examine how prompts affect image variability in black-box diffusion-based models. We propose W1KP, a human-calibrated measure of variability in a set of images, bootstrapped from existing image-pair perceptual distances. Current datasets do not cover recent diffusion models, thus we curate three test sets for evaluation. Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and our calibration matches graded human judgements 78% of the time. Using W1KP, we study prompt reusability and show that Imagen prompts can be reused for 10-50 random seeds before new images become too similar to already generated images, while Stable Diffusion XL and DALL-E 3 can be reused 50-200 times. Lastly, we analyze 56 linguistic features of real prompts, finding that the prompt’s length, CLIP embedding norm, concreteness, and word senses influence variability most. As far as we are aware, we are the first to analyze diffusion variability from a visuolinguistic perspective. Our project page is at this http URL
摘要:扩散模型是文本到图像生成的最新技术,但它们的知觉变异性仍未得到充分研究。在本文中,我们研究了在基于黑盒扩散的模型中,提示如何影响图像的可变性。我们提出了W1KP,这是一种人类校准的测量一组图像中可变性的方法,从现有的图像对感知距离引导而来。目前的数据集不包括最近的扩散模型,因此我们挑选了三个测试集进行评估。我们最好的感知距离比九条基线的精确度高出18个点,我们的校准在78%的时间内符合人类分级的判断。利用W1KP,我们研究了提示的可重用性,证明了Imagen提示可以在新图像变得与已生成的图像过于相似之前被重用10-50个随机种子,而稳定扩散XL和Dall-E 3可以被重用50-200次。最后,我们分析了56个真实提示的语言特征,发现提示的长度、片段嵌入规范、具体性和词义对变异性的影响最大。据我们所知,我们是第一个从视觉语言学的角度分析扩散变异性的人。我们的项目页面位于此http URL

[NLP-1] What If We Recaption Billions of Web Images with LLaMA-3?
[NLP-1] 如果我们用LLaMA-3回收数十亿张网络图像会怎样?

链接: https://arxiv.org/abs/2406.08478
作者: Xianhang Li,Haoqin Tu,Mude Hui,Zeyu Wang,Bingchen Zhao,Junfei Xiao,Sucheng Ren,Jieru Mei,Qing Liu,Huangjie Zheng,Yuyin Zhou,Cihang Xie
关键词: Web-crawled image-text pairs, Web-crawled image-text, inherently noisy, Web-crawled, image-text pairs
中文关键词: 网络抓取图像-文本对、网络抓取图像-文本、固有噪音、网络抓取、图像-文本对
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: * denotes equal contributions

点击查看摘要

Abstract:Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textitopen-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users’ text instructions, especially in following complex queries. Our project page is this https URL
摘要:网络爬行的图文对本身就存在噪声。先前的研究表明,对这些对的文本描述进行语义对齐和丰富可以显著提高各种视觉语言任务的模型训练,特别是文本到图像的生成。然而,这一领域的大规模调查仍然主要是封闭来源的。我们的论文旨在利用强大的开源骆驼-3,一个GPT-4级别的LLM,在社区的努力中架起桥梁。我们的重新捕获流程很简单:首先,我们微调一个由Llama-3-8B驱动的LLaVA-1.5,然后使用它从DataComp-1B数据集中重新捕获13亿张图像。我们的实验结果证实,这个增强的数据集Recap-DataComp-1B在训练高级视觉语言模型方面提供了实质性的好处。对于像CLIP这样的辨别性模型,我们在跨模式提取任务中观察到了增强的零射击性能。对于像文本到图像扩散转换器这样的生成性模型,生成的图像在与用户的文本指令的一致性方面表现出显著的改进,特别是在跟踪复杂的查询方面。我们的项目页面是这个HTTPS URL

[NLP-2] Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
[NLP-2] 喜鹊:通过零部件分割对齐的LLM,从头开始进行对齐数据合成

链接: https://arxiv.org/abs/2406.08464
作者: Zhangchen Xu,Fengqing Jiang,Luyao Niu,Yuntian Deng,Radha Poovendran,Yejin Choi,Bill Yuchen Lin
关键词: aligning large language, large language models, critical for aligning, aligning large, large language
中文关键词: 对齐大型语言、大型语言模型,对于对齐、对齐大型语言至关重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Link: this https URL

点击查看摘要

Abstract:High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
摘要:高质量的教学数据是调整大型语言模型的关键。尽管一些模型,如骆驼-3-指令,有开放的权重,但它们的比对数据仍然是私有的,这阻碍了人工智能的民主化。高昂的人力成本和有限的、预先定义的提示范围阻碍了现有的开源数据创建方法的有效扩展,潜在地限制了公共比对数据集的多样性和质量。是否有可能通过直接从对齐的LLM中提取指令数据来大规模合成高质量的指令数据?我们提出了一种自合成方法来生成大规模的比对数据,该方法被称为Magbie。我们的主要观察是,当我们只输入左侧模板直到为用户消息保留的位置时,像Llama-3-Indict这样的对齐LLM可以生成用户查询,这要归功于它们的自动回归性质。我们使用这种方法来提示LAMA-3-指令,并生成400万条指令及其相应的响应。我们对提取的数据进行了全面的分析,选出了300K个高质量的实例。为了与其他公共指令数据集进行比较,我们对每个数据集对Llama-3-8B-Base进行了微调,并评估了微调后的模型的性能。我们的结果表明,在某些任务中,使用喜鹊微调的模型与官方的Llama-3-8B-Indict的性能相当,尽管后者通过监督微调(SFT)和随后的反馈学习得到了1000万个数据点的增强。我们还表明,仅将Magbie用于SFT可以超过以前用于SFT和偏好优化的公共数据集的性能,例如使用UltraFeedback进行直接偏好优化。这一优势在AlpacaEval、ArenaHard和WildB边等对齐基准上表现得很明显。

[NLP-3] he Impact of Initialization on LoRA Finetuning Dynamics
[NLP-3] 收件箱对LoRA微调动力学的影响

链接: https://arxiv.org/abs/2406.08447
作者: Soufiane Hayou,Nikhil Ghosh,Bin Yu
关键词: Low Rank Adaptation, Rank Adaptation, Low Rank, study the role, originally introduced
中文关键词: 低等级适应,等级适应,低等级,研究角色,最初介绍
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: TDLR: Different Initializations lead to completely different finetuning dynamics. One initialization (set A random and B zero) is generally better than the natural opposite initialization. arXiv admin note: text overlap with arXiv:2402.12354

点击查看摘要

Abstract:In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in-principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance compared to the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs.
摘要:在本文中,我们研究了初始化在低阶适应(LORA)中的作用。(2021年)。从本质上讲,要从预先训练的模型开始作为精调的初始化,可以将B初始化为零,将A初始化为随机(PEFT包中的默认初始化),反之亦然。在这两种情况下,乘积BA在初始化时都等于零,这使得精调从预先训练的模型开始。这两个初始化方案看起来很相似。原则上,它们应该产生相同的性能和共享相同的最佳学习速率。我们证明了这是一个不正确的直觉,并且第一种方案(将B初始化为零,将A初始化为随机)平均比另一种方案产生更好的性能。我们的理论分析表明,这背后的原因可能是第一次初始化允许使用比第二次初始化更大的学习率(而不会导致输出不稳定),从而导致对第一种方案更有效的学习。我们在LLM上进行了大量的实验,验证了我们的结果。

[NLP-4] OLMES: A Standard for Language Model Evaluations
[NLP-4] OLMES:语言模型评估标准

链接: https://arxiv.org/abs/2406.08446
作者: Yuling Gu,Oyvind Tafjord,Bailey Kuehl,Dany Haddad,Jesse Dodge,Hannaneh Hajishirzi
关键词: claiming improved performance, measuring model capabilities, models claiming improved, claiming improved, tasks measuring model
中文关键词: 声称改进的性能、测量模型能力、声称改进的模型、声称改进的模型、测量模型的任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models in particular is challenging, as small changes to how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to claims about which models perform best not being reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community - such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require the unnatural “cloze” formulation of multiple-choice questions against larger models that can utilize the original formulation. OLMES includes well-considered recommendations guided by results from existing literature as well as new experiments investigating open questions.
摘要:人工智能的进步通常通过新模型来展示,新模型声称在衡量模型能力的任务上提高了性能。评估语言模型尤其具有挑战性,因为在任务中评估模型的方式的微小变化可能会导致测量的性能发生重大变化。没有通用的标准设置,因此不同的模型以不同的方式在相同的任务上进行评估,导致关于哪些模型执行得最好的声明不可重现。我们提出了OLMES,这是一个完整记录的、实用的、开放的、可重复性的LLM评估标准。在制定这一标准时,我们确定并审查了社区采用的评估实践中的各种因素–例如,快速格式的细节、背景示例的选择、概率归一化和任务制定。特别是,Olmes支持在需要多项选择题的非自然“完形填空”公式的较小基础模型与可以利用原始公式的较大模型之间进行有意义的比较。奥尔姆斯包括经过深思熟虑的建议,以现有文献的结果为指导,以及调查开放问题的新实验。

[NLP-5] asTe: Teaching Large Language Models to Translate through Self-Reflection
[NLP-5] asTE:通过自我反思教授大型语言模型进行翻译

链接: https://arxiv.org/abs/2406.08434
作者: Yutong Wang,Jiali Zeng,Xuebo Liu,Fandong Meng,Jie Zhou,Min Zhang
关键词: Large language models, exhibited remarkable performance, natural language processing, language processing tasks, Large language
中文关键词: 大型语言模型,表现出出色的性能,自然语言处理,语言处理任务,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to the ACL 2024 main conference

点击查看摘要

Abstract:Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks. Techniques like instruction tuning have effectively enhanced the proficiency of LLMs in the downstream task of machine translation. However, the existing approaches fail to yield satisfactory translation outputs that match the quality of supervised neural machine translation (NMT) systems. One plausible explanation for this discrepancy is that the straightforward prompts employed in these methodologies are unable to fully exploit the acquired instruction-following capabilities. To this end, we propose the TasTe framework, which stands for translating through self-reflection. The self-reflection process includes two stages of inference. In the first stage, LLMs are instructed to generate preliminary translations and conduct self-assessments on these translations simultaneously. In the second stage, LLMs are tasked to refine these preliminary translations according to the evaluation results. The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods. Our work presents a promising approach to unleash the potential of LLMs and enhance their capabilities in MT. The codes and datasets are open-sourced at this https URL.
摘要:大语言模型在各种自然语言处理任务中表现出了显著的性能。指令调优等技术有效地提高了LLMS在机器翻译下游任务中的熟练程度。然而,现有的方法不能产生与有监督神经机器翻译(NMT)系统质量相匹配的令人满意的翻译输出。对这种差异的一个可信的解释是,这些方法中采用的直接提示无法充分利用获得的指令跟随能力。为此,我们提出了品味框架,它代表通过自我反省进行翻译。自我反省的过程包括两个阶段的推理。在第一阶段,LLM被指示生成初步翻译并同时对这些翻译进行自我评估。在第二阶段,LLM的任务是根据评估结果对这些初步翻译进行提炼。在WMT22基准测试的四个语言方向上的评估结果表明,与现有方法相比,该方法是有效的。我们的工作提供了一种很有希望的方法来释放低密度脂蛋白的潜力,提高它们在机器翻译中的能力。代码和数据集在这个HTTPS URL上是开源的。

[NLP-6] Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
[NLP-6] 下一代数据库接口:基于LLM的文本转SQL概览

链接: https://arxiv.org/abs/2406.08426
作者: Zijin Hong,Zheng Yuan,Qinggang Zhang,Hao Chen,Junnan Dong,Feiran Huang,Xiao Huang
关键词: Generating accurate SQL, Generating accurate, SQL generation, accurate SQL, long-standing problem
中文关键词: 生成准确的SQL,生成准确的,SQL生成,准确的SQL,长期存在的问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Generating accurate SQL according to natural language questions (text-to-SQL) is a long-standing problem since it is challenging in user question understanding, database schema comprehension, and SQL generation. Conventional text-to-SQL systems include human engineering and deep neural networks. Subsequently, pre-trained language models (PLMs) have been developed and utilized for text-to-SQL tasks, achieving promising performance. As modern databases become more complex and corresponding user questions more challenging, PLMs with limited comprehension capabilities can lead to incorrect SQL generation. This necessitates more sophisticated and tailored optimization methods, which, in turn, restricts the applications of PLM-based systems. Most recently, large language models (LLMs) have demonstrated significant abilities in natural language understanding as the model scale remains increasing. Therefore, integrating the LLM-based implementation can bring unique opportunities, challenges, and solutions to text-to-SQL research. In this survey, we present a comprehensive review of LLM-based text-to-SQL. Specifically, we propose a brief overview of the current challenges and the evolutionary process of text-to-SQL. Then, we provide a detailed introduction to the datasets and metrics designed to evaluate text-to-SQL systems. After that, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we discuss the remaining challenges in this field and propose expectations for future directions.
摘要:根据自然语言问题(Text-to-SQL)生成准确的SQL是一个长期存在的问题,因为它在用户问题理解、数据库模式理解和SQL生成方面具有挑战性。传统的文本到SQL系统包括人类工程学和深度神经网络。随后,开发了预先训练的语言模型(PLM),并将其用于文本到SQL的任务,取得了良好的性能。随着现代数据库变得更加复杂,相应的用户问题也越来越具有挑战性,理解能力有限的PLM可能会导致错误的SQL生成。这就需要更复杂和量身定制的优化方法,这反过来又限制了基于PLM的系统的应用。最近,随着模型规模的不断扩大,大型语言模型在自然语言理解方面显示出了巨大的能力。因此,集成基于LLM的实现可以为文本到SQL的研究带来独特的机遇、挑战和解决方案。在本次调查中,我们对基于LLM的Text-to-SQL进行了全面回顾。具体地说,我们对文本到SQL的当前挑战和演变过程进行了简要的概述。然后,我们将详细介绍为评估Text-to-SQL系统而设计的数据集和指标。然后,我们系统地分析了基于LLM的Text-to-SQL的最新进展。最后,我们讨论了该领域仍然存在的挑战,并对未来的方向提出了展望。

[NLP-7] ailoring Generative AI Chatbots for Multiethnic Communities in Disaster Preparedness Communication: Extending the CASA Paradigm
[NLP-7] 为多种族社区配备生成人工智能聊天机器人备灾通信:扩展CASA范式

链接: https://arxiv.org/abs/2406.08411
作者: Xinyan Zhao,Yuan Sun,Wenlin Liu,Chau-Wai Wong
关键词: powered by GPT, develop different prototypes, prototypes of generative, Social Actors, communicate hurricane preparedness
中文关键词: 由GPT提供支持,开发不同的原型、生成性原型、社交演员,沟通飓风准备工作
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 21 pages

点击查看摘要

Abstract:This study is among the first to develop different prototypes of generative AI (GenAI) chatbots powered by GPT 4 to communicate hurricane preparedness information to diverse residents. Drawing from the Computers Are Social Actors (CASA) paradigm and the literature on disaster vulnerability and cultural tailoring, this study conducted a between-subjects experiment with 441 Black, Hispanic, and Caucasian residents of Florida. A computational analysis of chat logs (N = 7,848) shows that anthropomorphism and personalization are key communication topics in GenAI chatbot-user interactions. SEM results (N = 441) suggest that GenAI chatbots varying in tone formality and cultural tailoring significantly predict bot perceptions and, subsequently, hurricane preparedness outcomes. These results highlight the potential of using GenAI chatbots to improve diverse communities’ disaster preparedness.
摘要:这项研究是最早开发由GPT 4驱动的生成性人工智能(GenAI)聊天机器人不同原型的研究之一,旨在向不同的居民传达飓风准备信息。本研究借鉴了计算机是社会行动者(CASA)范式以及有关灾难脆弱性和文化剪裁的文献,对佛罗里达州的441名黑人、西班牙裔和白人居民进行了一项受试者间实验。聊天日志(N = 7,848)的计算分析表明,拟人化和个性化是GenAI聊天机器人与用户交互中的关键沟通主题。扫描电子显微镜结果(N = 441)表明,GenAI聊天机器人在正式语气和文化定制方面有所不同,可以显着预测机器人的感知,并随后预测飓风准备结果。这些结果凸显了使用GenAI聊天机器人改善不同社区灾难准备能力的潜力。

[NLP-8] MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
[NLP-8] MMWorld:迈向多学科多方面的视频世界模型评估

链接: https://arxiv.org/abs/2406.08407
作者: Xuehai He,Weixi Feng,Kaizhi Zheng,Yujie Lu,Wanrong Zhu,Jiachen Li,Yue Fan,Jianfeng Wang,Linjie Li,Zhengyuan Yang,Kevin Lin,William Yang Wang,Lijuan Wang,Xin Eric Wang
关键词: Multimodal Language Language, Language Language Models, Language Language, Multimodal Language, complex real-world dynamics
中文关键词: 多模式语言语言,语言模型,语言语言,多模式语言,复杂的现实世界动态
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Language Language Models (MLLMs) demonstrate the emerging abilities of “world models” – interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models’ different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.
摘要:多通道语言模型(MLLMS)展示了“世界模型”的新兴能力–解释和推理复杂的真实世界动态。为了评估这些能力,我们假设视频是理想的媒介,因为它们封装了真实世界动态和因果关系的丰富表现。为此,我们引入了MMWorld,这是一个新的多学科、多方面、多模式视频理解的基准。MMWorld与以往的视频理解基准有两个独特的优势:(1)多学科,涵盖各种学科,通常需要领域专业知识才能全面理解;(2)多方面推理,包括解释、反事实思维、未来预测等。MMWorld由一个人工标注的数据集和一个合成数据集组成,前者用于评估MLLMS,包括关于整个视频的问题;后者用于在单一感知通道内分析MLLMS。MMWorld共收录了7大学科和69个子学科的1,910个视频,包括6627个问答对和相关字幕。评估包括2个专有的和10个开源的MLLM,它们在MMWorld上举步维艰(例如,GPT-4V的性能最好,准确率只有52.3%),显示出很大的改进空间。进一步的消融研究揭示了其他有趣的发现,比如模型的技能集与人类的不同。我们希望MMWorld可以作为在视频中评估世界模型的重要一步。

[NLP-9] cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers
[NLP-9] cPAPPERS:科学论文中情景和多模式互动对话的数据集

链接: https://arxiv.org/abs/2406.08398
作者: Anirudh Sundar,Jin Xu,William Gay,Christopher Richardson,Larry Heck
关键词: multimodal interactive conversations, interactive conversations, emerging area, situated and multimodal, multimodal interactive
中文关键词: 多模式互动对话,互动对话,新兴领域,定位和多模式,多模式互动
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 1 figure

点击查看摘要

Abstract:An emerging area of research in situated and multimodal interactive conversations (SIMMC) includes interactions in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interactions required by research scientists. This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers grounded in these paper components and their associated references from scientific documents available on arXiv. We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from LaTeX source files. Additionally, we present a series of baseline approaches utilizing Large Language Models (LLMs) in both zero-shot and fine-tuned configurations to address the cPAPERS dataset.
摘要:情景和多模式交互式对话(SSMIC)的一个新兴研究领域包括科学论文中的交互。由于科学论文主要由文本、方程、图形和表格组成,因此必须专门针对每个组成部分开发SSMIC方法,以支持研究科学家所需的深入探究和互动。这项工作介绍了对话论文(cPAPPERS),这是一个对话问答对数据集,该数据集来自基于这些论文成分的学术论文及其来自arXiv上可用的科学文件的相关参考文献的评论。我们提出了一种数据收集策略,从OpenReview收集这些问答对,并将它们与LaTeX源文件中的上下文信息关联起来。此外,我们还提出了一系列利用零触发和微调配置的大型语言模型(LLM)来解决cPAPPERS数据集的基线方法。

[NLP-10] Large Language Models Must Be Taught to Know What They Dont Know
[NLP-10] 必须教大型语言模型知道它们不知道的东西

链接: https://arxiv.org/abs/2406.08391
作者: Sanyam Kapoor,Nate Gruver,Manley Roberts,Katherine Collins,Arka Pal,Umang Bhatt,Adrian Weller,Samuel Dooley,Micah Goldblum,Andrew Gordon Wilson
关键词: high-stakes applications, trust their predictions, large language models, argue that prompting, large language
中文关键词: 高风险的应用程序,相信他们的预测,大型语言模型,认为促使大型语言
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Code available at: this https URL

点击查看摘要

Abstract:When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study.
摘要:在高风险应用程序中使用大型语言模型(LLM)时,我们需要知道何时可以信任它们的预测。一些研究认为,推动高性能的LLM足以产生校准的不确定度,而另一些研究则引入了昂贵得令人望而却步的采样方法。在这项工作中,我们首先论证了提示本身不足以实现良好的校准,然后证明了在正确和错误答案的小数据集上进行微调可以创建具有良好泛化和较小计算开销的不确定性估计。我们表明,1000个分级的示例足以超过基线方法,并且通过模型的功能进行训练对于使用LORA的大型开源模型的良好性能和易处理是必要的。我们还研究了实现可靠的LLM不确定性估计的机制,发现许多模型可以用作通用的不确定性估计器,不仅适用于它们自己的不确定性,而且还适用于其他模型的不确定性。最后,我们通过一项用户研究表明,不确定性估计为人类在人-AI协作环境中使用LLMS提供了信息。

[NLP-11] owards Unsupervised Speech Recognition Without Pronunciation Models
[NLP-11] owards没有发音模型的无监督语音识别

链接: https://arxiv.org/abs/2406.08380
作者: Junrui Ni,Liming Wang,Yang Zhang,Kaizhi Qian,Heting Gao,Mark Hasegawa-Johnson,Chang D. Yoo
关键词: Recent advancements, automatic speech recognition, supervised automatic speech, large transcribed speech, achieved remarkable performance
中文关键词: 最近的进步,自动语音识别、监督自动语音、大转录语音,取得了显着的性能
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR. Using a curated speech corpus containing only high-frequency English words, our system achieves a word error rate of nearly 20% without parallel transcripts or oracle word boundaries. Furthermore, we experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models trained with direct distribution matching.
摘要:监督式自动语音识别(ASB)的最近进展取得了显着的性能,这主要是由于大型转录语音库的可用性不断增加。然而,大多数语言缺乏足够的配对语音和文本数据来有效训练这些系统。在本文中,我们通过提议消除对音素词典的依赖来应对开发没有配对语音和文本库的ASB系统的挑战。我们探索了一个新的研究方向:词级无监督的ASB。我们的系统使用仅包含高频英语单词的精心策划的语音库,在没有平行文字记录或Oracle单词边界的情况下实现了近20%的单词错误率。此外,我们通过实验证明,无监督语音识别器可以从语音到语音和文本到文本的联合掩蔽令牌填充中产生。这种创新模型超越了之前使用直接分布匹配训练的无监督的ASB模型的性能。

[NLP-12] Is Programming by Example solved by LLMs?
[NLP-12] LLM可以解决示例编程吗?

链接: https://arxiv.org/abs/2406.08316
作者: Wen-Ding Li,Kevin Ellis
关键词: aims to generate, generate an algorithm, algorithm from input-output, PBE, Large Language Models
中文关键词: 旨在生成,生成算法,从输入-输出、PBE、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of Large Language Models (LLMs) in code-generation tasks, we investigate here the extent to which LLMs can be said to have `solved’ PBE. We experiment on classic domains such as lists and strings, and an uncommon graphics programming domain not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.
摘要:按实例编程(PBE)的目的是从输入输出实例生成算法。这样的系统在实践上和理论上都很重要:从最终用户的角度来看,它们被部署到数百万人身上,而从人工智能的角度来看,PBE对应于一种非常一般的形式的极少发生的归纳推理。鉴于大型语言模型(LLM)在代码生成任务中的成功,我们在这里调查LLM可以说已经在多大程度上“解决”了PBE。我们在列表和字符串等经典领域以及一个不常见的图形编程领域上进行了实验,这些领域在典型的预训练数据中没有很好地表示。我们发现,预先训练的模型在PBE中并不有效,但如果测试问题是不分布的,它们可以进行微调,以获得更高的性能。我们对导致这些模型成功和失败的原因进行了实证分析,并采取措施了解如何实现更好的分布外泛化。总而言之,这些结果表明,LLM在解决典型的PBE任务套件方面取得了巨大进展,潜在地增加了PBE系统的灵活性和适用性,同时也发现了LLM仍然存在的不足之处。

[NLP-13] M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
[NLP-13] M3 T:多模式文档级机器翻译的新基准数据集

链接: https://arxiv.org/abs/2406.08255
作者: Benjamin Hsu,Xiaoyu Liu,Huayang Li,Yoshinari Fujinuma,Maria Nadejde,Xing Niu,Yair Kittenplon,Ron Litman,Raghavendra Pappagari
关键词: Neural Machine Translation, Neural Machine, Machine Translation, Document translation poses, translation poses
中文关键词: 神经机器翻译,神经机器,机器翻译,文档翻译姿势,翻译姿势
类目: Computation and Language (cs.CL)
备注: NAACL 2024, dataset at this https URL

点击查看摘要

Abstract:Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications.
摘要:文档翻译对神经机器翻译(NMT)系统提出了挑战。大多数文档级NMT系统依赖于精心挑选的句子级并行数据,假设从文档中提取文本及其精确的阅读顺序是完美的。这些系统还倾向于忽略额外的视觉提示,如文档布局,认为它无关紧要。然而,现实世界的文档通常具有复杂的文本布局,这与这些假设背道而驰。从光学字符识别(OCR)或启发式规则中提取信息可能导致错误,并且布局(例如,段落、标题)可能传达文本的远距离部分之间的关系。这种复杂性在广泛使用的以视觉方式表示信息的PDF文档中尤为明显。本文通过引入M3T来弥补这一差距,M3T是一个新的基准数据集,专门用于评估NMT系统在翻译半结构化文档的综合任务上的性能。该数据集旨在弥补文档级NMT系统中的评估差距,承认现实世界应用程序中富文本布局带来的挑战。

[NLP-14] Leveraging Large Language Models for Web Scraping
[NLP-14] 利用大型语言模型进行网页抓取

链接: https://arxiv.org/abs/2406.08246
作者: Aman Ahluwalia,Suhrud Wani
关键词: demonstrate remarkable capabilities, replicating human tasks, demonstrate remarkable, boosting productivity, Large Language Models
中文关键词: 展示非凡的能力,复制人类任务,展示非凡的,提高生产力,大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities in replicating human tasks and boosting productivity. However, their direct application for data extraction presents limitations due to a prioritisation of fluency over factual accuracy and a restricted ability to manipulate specific information. Therefore to overcome these limitations, this research leverages the knowledge representation power of pre-trained LLMs and the targeted information access enabled by RAG models, this research investigates a general-purpose accurate data scraping recipe for RAG models designed for language generation. To capture knowledge in a more modular and interpretable way, we use pre trained language models with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus. We utilised RAG model architecture and did an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Chunking HTML text for effective understanding, and (iii) comparing results from different LLMs and ranking algorithms. While previous work has developed dedicated architectures and training procedures for HTML understanding and extraction, we show that LLMs pre-trained on standard natural language with an addition of effective chunking, searching and ranking algorithms, can prove to be efficient data scraping tool to extract complex data from unstructured text. Future research directions include addressing the challenges of provenance tracking and dynamic knowledge updates within the proposed RAG-based data extraction framework. By overcoming these limitations, this approach holds the potential to revolutionise data extraction from vast repositories of textual information.
摘要:大型语言模型(LLM)在复制人工任务和提高生产率方面表现出显著的能力。然而,由于流畅性优先于事实准确性,以及操纵特定信息的能力有限,它们在数据提取方面的直接应用存在局限性。因此,为了克服这些局限性,本研究利用预先训练的LLMS的知识表示能力和RAG模型实现的定向信息访问,研究了一种用于语言生成的RAG模型的通用准确数据抓取方法。为了以更模块化和更可解释的方式捕获知识,我们使用了预训练的语言模型和潜在的知识检索器,这允许模型从大型语料库检索和处理文档。我们使用RAG模型架构,并在三个任务下对它们的性能进行了深入的分析:(I)HTML元素的语义分类,(Ii)为有效理解而对HTML文本进行分块,以及(Iii)比较不同LLM和排序算法的结果。虽然以前的工作已经开发了专门的体系结构和训练过程来理解和提取HTML,但我们表明,在标准自然语言上预训练的LLMS,加上有效的组块、搜索和排序算法,可以被证明是从非结构化文本中提取复杂数据的有效数据抓取工具。未来的研究方向包括在提出的基于RAG的数据提取框架内解决来源跟踪和动态知识更新的挑战。通过克服这些限制,这种方法有可能彻底改变从庞大的文本信息库中提取数据的方式。

[NLP-15] Research Trends for the Interplay between Large Language Models and Knowledge Graphs
[NLP-15] 大型语言模型与知识图之间相互作用的研究趋势

链接: https://arxiv.org/abs/2406.08223
作者: Hanieh Khorashadizadeh,Fatima Zahra Amara,Morteza Ezzabady,Frédéric Ieng,Sanju Tiwari,Nandana Mihindukulasooriya,Jinghua Groppe,Soror Sahri,Farah Benamara,Sven Groppe
关键词: Large Language Models, Knowledge Graphs, relationship between Large, Language Models, Large Language
中文关键词: 大型语言模型、知识图、大型之间的关系、语言模型、大型语言
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This survey investigates the synergistic relationship between Large Language Models (LLMs) and Knowledge Graphs (KGs), which is crucial for advancing AI’s capabilities in understanding, reasoning, and language processing. It aims to address gaps in current research by exploring areas such as KG Question Answering, ontology generation, KG validation, and the enhancement of KG accuracy and consistency through LLMs. The paper further examines the roles of LLMs in generating descriptive texts and natural language queries for KGs. Through a structured analysis that includes categorizing LLM-KG interactions, examining methodologies, and investigating collaborative uses and potential biases, this study seeks to provide new insights into the combined potential of LLMs and KGs. It highlights the importance of their interaction for improving AI applications and outlines future research directions.
摘要:本调查调查了大型语言模型(LLM)和知识图(KG)之间的协同关系,这对于提高人工智能的理解、推理和语言处理能力至关重要。它旨在通过探索KG问题解答、主体生成、KG验证以及通过LLM增强KG准确性和一致性等领域来解决当前研究中的差距。本文进一步探讨了LLM在为KG生成描述性文本和自然语言查询方面的作用。通过结构化分析,包括对LLM-KG互动进行分类、检查方法以及调查协作使用和潜在偏见,本研究旨在为LLM和KG的综合潜力提供新的见解。它强调了他们的互动对于改进人工智能应用的重要性,并概述了未来的研究方向。

[NLP-16] Figuratively Speaking: Authorship Attribution via Multi-Task Figurative Language Modeling
[NLP-16] 形象地说:通过多任务形象语言建模的作者归因

链接: https://arxiv.org/abs/2406.08218
作者: Gregorios A Katsios,Ning Sa,Tomek Strzalkowski
关键词: Natural Language Processing, author intended meaning, Language Processing, Natural Language, Multi-task Figurative Language
中文关键词: 自然语言处理、作者意图含义、语言处理、自然语言、多任务具象语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The identification of Figurative Language (FL) features in text is crucial for various Natural Language Processing (NLP) tasks, where understanding of the author’s intended meaning and its nuances is key for successful communication. At the same time, the use of a specific blend of various FL forms most accurately reflects a writer’s style, rather than the use of any single construct, such as just metaphors or irony. Thus, we postulate that FL features could play an important role in Authorship Attribution (AA) tasks. We believe that our is the first computational study of AA based on FL use. Accordingly, we propose a Multi-task Figurative Language Model (MFLM) that learns to detect multiple FL features in text at once. We demonstrate, through detailed evaluation across multiple test sets, that the our model tends to perform equally or outperform specialized binary models in FL detection. Subsequently, we evaluate the predictive capability of joint FL features towards the AA task on three datasets, observing improved AA performance through the integration of MFLM embeddings.
摘要:识别文本中的比喻语言特征对于各种自然语言处理任务至关重要,而理解作者的意图及其细微差别是成功交际的关键。同时,使用各种外语形式的特定混合体最准确地反映了作家的风格,而不是使用任何单一的结构,如仅仅是隐喻或反讽。因此,我们推测外语特征在作者归因任务中可能起着重要作用。我们相信,我们的研究是第一个基于外语使用的计算研究。因此,我们提出了一个多任务比喻语言模型(MFLM),该模型学习一次检测文本中的多个外语特征。通过对多个测试集的详细评估,我们证明了我们的模型在FL检测中的性能趋于相同或优于专门的二进制模型。随后,我们在三个数据集上评估了联合FL特征对AA任务的预测能力,观察到通过整合MFLM嵌入提高了AA性能。

[NLP-17] SumHiS: Extractive Summarization Exploiting Hidden Structure
[NLP-17] SumHiS:利用隐藏结构的提取性摘要

链接: https://arxiv.org/abs/2406.08215
作者: Tikhonov Pavel,Anastasiya Ianina,Valentin Malykh
关键词: extractive summarization task, important parts, Extractive summarization, summarization task, Extractive
中文关键词: 提取摘要任务,重要部分,提取摘要,摘要任务,提取
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Extractive summarization is a task of highlighting the most important parts of the text. We introduce a new approach to extractive summarization task using hidden clustering structure of the text. Experimental results on CNN/DailyMail demonstrate that our approach generates more accurate summaries than both extractive and abstractive methods, achieving state-of-the-art results in terms of ROUGE-2 metric exceeding the previous approaches by 10%. Additionally, we show that hidden structure of the text could be interpreted as aspects.
摘要:提取性摘要是一项突出文本中最重要部分的任务。我们引入了一种使用文本的隐藏集群结构来提取摘要任务的新方法。CNN/DailyMail上的实验结果表明,我们的方法比提取和抽象方法生成更准确的摘要,在ROUGE-2指标方面实现了最先进的结果,比之前的方法高出10%。此外,我们还表明文本的隐藏结构可以被解释为方面。

[NLP-18] A Dialogue Game for Eliciting Balanced Collaboration
[NLP-18] 激发平衡合作的对话游戏

链接: https://arxiv.org/abs/2406.08202
作者: Isidora Jeknić,David Schlangen,Alexander Koller
关键词: integral part, Collaboration, Abstract, Typical task-oriented dialogue, balanced collaboration
中文关键词: 组成部分,协作,抽象,典型的任务导向对话,平衡的协作
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Collaboration is an integral part of human dialogue. Typical task-oriented dialogue games assign asymmetric roles to the participants, which limits their ability to elicit naturalistic role-taking in collaboration and its negotiation. We present a novel and simple online setup that favors balanced collaboration: a two-player 2D object placement game in which the players must negotiate the goal state themselves. We show empirically that human players exhibit a variety of role distributions, and that balanced collaboration improves task performance. We also present an LLM-based baseline agent which demonstrates that automatic playing of our game is an interesting challenge for artificial systems.
摘要:协作是人类对话不可或缺的一部分。典型的任务导向对话游戏为参与者分配了不对称的角色,这限制了他们在协作和谈判中引发自然主义角色扮演的能力。我们提出了一种新颖而简单的在线设置,有利于平衡协作:一款双人2D对象放置游戏,其中玩家必须自己协商目标状态。我们经验表明,人类参与者表现出各种角色分布,平衡的协作可以提高任务绩效。我们还提出了一个基于LLM的基线代理,它证明自动玩游戏对人工系统来说是一个有趣的挑战。

[NLP-19] Underneath the Numbers: Quantitative and Qualitative Gender Fairness in LLMs for Depression Prediction
[NLP-19] 数字背后:抑郁症预测的LLM中的定量和定性性别公平

链接: https://arxiv.org/abs/2406.08183
作者: Micol Spitale,Jiaee Cheong,Hatice Gunes
关键词: Recent studies show, machine learning models, Recent studies, studies show bias, studies show
中文关键词: 最近的研究表明,机器学习模型,最近的研究表明,研究表明存在偏见,研究表明
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies show bias in many machine learning models for depression detection, but bias in LLMs for this task remains unexplored. This work presents the first attempt to investigate the degree of gender bias present in existing LLMs (ChatGPT, LLaMA 2, and Bard) using both quantitative and qualitative approaches. From our quantitative evaluation, we found that ChatGPT performs the best across various performance metrics and LLaMA 2 outperforms other LLMs in terms of group fairness metrics. As qualitative fairness evaluation remains an open research question we propose several strategies (e.g., word count, thematic analysis) to investigate whether and how a qualitative evaluation can provide valuable insights for bias analysis beyond what is possible with quantitative evaluation. We found that ChatGPT consistently provides a more comprehensive, well-reasoned explanation for its prediction compared to LLaMA 2. We have also identified several themes adopted by LLMs to qualitatively evaluate gender fairness. We hope our results can be used as a stepping stone towards future attempts at improving qualitative evaluation of fairness for LLMs especially for high-stakes tasks such as depression detection.
摘要:最近的研究表明,在许多用于抑郁症检测的机器学习模型中存在偏差,但对于这一任务,LLMS中的偏差仍未被探索。这项工作首次尝试使用定量和定性的方法来调查现有的LLM(ChatGPT、Llama 2和Bard)中存在的性别偏见的程度。从我们的定量评估中,我们发现ChatGPT在各种性能指标中表现最好,而Llama 2在组公平性指标方面优于其他LLM。由于定性公平评价仍然是一个开放的研究问题,我们提出了几种策略(例如,字数统计、主题分析)来调查定性评价是否以及如何为偏见分析提供有价值的见解,而不是定量评价所能提供的。我们发现,与骆驼2相比,ChatGPT对其预测提供了更全面、更合理的解释。我们还确定了LLMS用来定性评估性别公平的几个主题。我们希望我们的结果可以作为一个垫脚石,为未来改善LLMS公平性的定性评估提供一个垫脚石,特别是对于高风险的任务,如抑郁症检测。

[NLP-20] Semi-Supervised Spoken Language Glossification
[NLP-20] 半监督口语粉饰

链接: https://arxiv.org/abs/2406.08173
作者: Huijie Yao,Wengang Zhou,Hao Zhou,Houqiang Li
关键词: spoken language text, Spoken language glossification, sign language gloss, Spoken language, aims to translate
中文关键词: 口语文本,口语粉饰,手语光泽,口语,旨在翻译
类目: Computation and Language (cs.CL)
备注: Accepted to ACL2024 main

点击查看摘要

Abstract:Spoken language glossification (SLG) aims to translate the spoken language text into the sign language gloss, i.e., a written record of sign language. In this work, we present a framework named S emi- S upervised S poken L anguage G lossification ( S^3 LG) for SLG. To tackle the bottleneck of limited parallel data in SLG, our S^3 LG incorporates large-scale monolingual spoken language text into SLG training. The proposed framework follows the self-training structure that iteratively annotates and learns from pseudo labels. Considering the lexical similarity and syntactic difference between sign language and spoken language, our S^3 LG adopts both the rule-based heuristic and model-based approach for auto-annotation. During training, we randomly mix these complementary synthetic datasets and mark their differences with a special token. As the synthetic data may be less quality, the S^3 LG further leverages consistency regularization to reduce the negative impact of noise in the synthetic data. Extensive experiments are conducted on public benchmarks to demonstrate the effectiveness of the S^3 LG. Our code is available at \urlthis https URL.
摘要:口语语义化(SLG)旨在将口语文本翻译成手语语篇,即手语的书面记录。在这项工作中,我们提出了一个框架,称为S百代-S监督S Poken L语言G词化(S^3 LG)。为了解决SLG中并行数据有限的瓶颈,我们的S^3 LG将大规模的单语口语文本融入到SLG训练中。该框架遵循自训练结构,迭代地对伪标签进行标注和学习。考虑到手语和口语在词汇上的相似性和句法上的差异,S^3 LG采用了基于规则的启发式和基于模型的方法进行自动标注。在训练期间,我们随机混合这些互补的合成数据集,并用特殊的标记标记它们的差异。由于合成数据的质量可能较差,S^3 LG进一步利用一致性正则化来减少合成数据中噪声的负面影响。在公共基准上进行了广泛的实验,以证明S^3 LG的有效性。我们的代码位于此HTTPS URL。

[NLP-21] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark
[NLP-21] 审查混合专家的培训后量化:基准

链接: https://arxiv.org/abs/2406.08155
作者: Pingzhi Li,Xiaolong Jin,Yu Cheng,Tianlong Chen
关键词: Large Language Models, natural language processing, Large Language, demonstrating performance improvements, language processing
中文关键词: 大型语言模型、自然语言处理、大型语言、演示性能改进、语言处理
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Our code for reproducing all our experiments is provided at this https URL

点击查看摘要

Abstract:Large Language Models~(LLMs) have become foundational in the realm of natural language processing, demonstrating performance improvements as model sizes increase. The Mixture-of-Experts~(MoE) approach offers a promising way to scale LLMs more efficiently by using fewer computational FLOPs through sparse activation. However, it suffers from significant memory overheads, necessitating model compression techniques. Post-training quantization, a popular method for model compression, proves less effective when directly applied to MoE models due to MoE’s overlooked inherent sparsity. This paper explores several MoE structure-aware quantization heuristics, ranging from coarse to fine granularity, from MoE block to individual linear weight. Our investigations reveal critical principles: different MoE structures (i.e., blocks, experts, linear layers) require varying numbers of weight bits for effective and efficient quantization. Conclusions are supported by extensive benchmarking across two representative MoE models and six tasks. We further introduce novel enhancements to more accurately identify the most critical weights in MoE quantization that necessitate higher bit allocations, including the linear weight outlier scorer and MoE block scorer. Additionally, subsequent experiments validate our findings in the context of both weight and activation quantization.
摘要:大型语言模型已经成为自然语言处理领域的基础,随着模型规模的增加,性能也在不断提高。专家混合方法通过稀疏激活使用较少的计算失败,提供了一种更有效地扩展LLMS的方法。然而,它的内存开销很大,需要使用模型压缩技术。训练后量化是一种流行的模型压缩方法,但由于MOE忽略了固有的稀疏性,直接应用于MoE模型时效率较低。本文研究了几种MOE结构感知的量化启发式算法,从粗粒度到细粒度,从MOE块到个体线性权重。我们的研究揭示了关键原理:不同的MOE结构(即块、专家、线性层)需要不同数量的加权比特来进行有效和高效的量化。对两个具有代表性的MOE模型和六项任务进行了广泛的基准测试,这支持了结论。我们进一步引入了新的增强来更准确地识别MOE量化中需要更高比特分配的最关键的权重,包括线性权重离群值记分器和MOE块记分器。此外,随后的实验在权重和激活量化的背景下验证了我们的发现。

[NLP-22] Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
[NLP-22] 图注:利用表示工程来注释偏好数据集的安全裕度

链接: https://arxiv.org/abs/2406.08124
作者: Duanyu Feng,Bowen Qin,Chen Huang,Youcheng Huang,Zheng Zhang,Wenqiang Lei
关键词: differences depends critically, high-quality preference dataset, subtle safety differences, safety differences depends, model in distinguishing
中文关键词: 差异取决于关键,高质量偏好数据集,微妙的安全差异,安全差异取决于,区分模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Our code is available at this https URL

点击查看摘要

Abstract:The success of the reward model in distinguishing between responses with subtle safety differences depends critically on the high-quality preference dataset, which should capture the fine-grained nuances of harmful and harmless responses. This motivates the need to develop a dataset involving preference margins, which accurately quantify how harmless one response is compared to another. In this paper, we take the first step to propose an effective and cost-efficient framework to promote the margin-enhanced preference dataset development. Our framework, Legend, Leverages representation engineering to annotate preference datasets. It constructs the specific direction within the LLM’s embedding space that represents safety. By leveraging this safety direction, Legend can then leverage the semantic distances of paired responses along this direction to annotate margins automatically. We experimentally demonstrate our effectiveness in both reward modeling and harmless alignment for LLMs. Legend also stands out for its efficiency, requiring only the inference time rather than additional training. This efficiency allows for easier implementation and scalability, making Legend particularly valuable for practical applications in aligning LLMs with safe conversations.
摘要:奖励模型能否成功区分具有细微安全差异的反应,关键取决于高质量的偏好数据集,该数据集应能捕捉有害和无害反应的细微差别。这激发了开发包含偏好边际的数据集的需要,该数据集准确地量化了一种反应与另一种反应的无害程度。在本文中,我们首先提出了一个有效的、低成本的框架来促进边际增强型偏好数据集的开发。我们的框架Legend利用表示工程来标注首选项数据集。它在LLM的嵌入空间内构建了代表安全的具体方向。通过利用该安全方向,Legend随后可以利用沿该方向的配对响应的语义距离来自动标注页边距。我们通过实验证明了我们在奖赏建模和无伤害对齐两个方面的有效性。Legend也因其效率而脱颖而出,只需要推理时间,而不需要额外的训练。这种效率允许更容易的实施和可扩展性,这使得联想在使LLM与安全对话保持一致的实际应用中特别有价值。

[NLP-23] Supportiveness-based Knowledge Rewriting for Retrieval-augmented Language Modeling
[NLP-23] 基于支持性的检索增强语言建模知识重写

链接: https://arxiv.org/abs/2406.08116
作者: Zile Qiao,Wei Ye,Yong Jiang,Tong Mo,Pengjun Xie,Weiping Li,Fei Huang,Shikun Zhang
关键词: Retrieval-augmented language models, Retrieval-augmented language, recently shown great, shown great potential, LLM generation
中文关键词: 检索增强语言模型,检索增强语言,最近表现出很好的、显示出巨大的潜力,LLM生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented language models (RALMs) have recently shown great potential in mitigating the limitations of implicit knowledge in LLMs, such as untimely updating of the latest expertise and unreliable retention of long-tail knowledge. However, since the external knowledge base, as well as the retriever, can not guarantee reliability, potentially leading to the knowledge retrieved not being helpful or even misleading for LLM generation. In this paper, we introduce Supportiveness-based Knowledge Rewriting (SKR), a robust and pluggable knowledge rewriter inherently optimized for LLM generation. Specifically, we introduce the novel concept of “supportiveness”–which represents how effectively a knowledge piece facilitates downstream tasks–by considering the perplexity impact of augmented knowledge on the response text of a white-box LLM. Based on knowledge supportiveness, we first design a training data curation strategy for our rewriter model, effectively identifying and filtering out poor or irrelevant rewrites (e.g., with low supportiveness scores) to improve data efficacy. We then introduce the direct preference optimization (DPO) algorithm to align the generated rewrites to optimal supportiveness, guiding the rewriter model to summarize augmented content that better improves the final response. Comprehensive evaluations across six popular knowledge-intensive tasks and four LLMs have demonstrated the effectiveness and superiority of SKR. With only 7B parameters, SKR has shown better knowledge rewriting capability over GPT-4, the current state-of-the-art general-purpose LLM.
摘要:检索增强语言模型(RALMS)最近显示出了巨大的潜力,可以缓解LLMS中隐含知识的局限性,如最新专业知识更新不及时和长尾知识不可靠的保留。然而,由于外部知识库以及检索器不能保证可靠性,潜在地导致检索到的知识对LLM生成没有帮助,甚至具有误导性。在本文中,我们介绍了基于支持度的知识重写(SKR),这是一个健壮的、可插拔的知识重写程序,本质上是针对LLM生成进行了优化。具体地说,我们引入了“支持性”的新概念–它表示知识片如何有效地促进下游任务–通过考虑扩充知识对白盒LLM响应文本的困惑影响。基于知识支持度,我们首先为我们的重写器模型设计了一种训练数据管理策略,有效地识别并过滤掉糟糕或不相关的重写(例如,支持度得分较低的重写),以提高数据效率。然后,我们引入直接偏好优化(DPO)算法来将生成的重写与最优支持度对齐,指导重写器模型总结增强的内容,以更好地改善最终响应。对六项受欢迎的知识密集型任务和四项低成本管理的综合评价证明了SKR的有效性和优越性。在仅有7B参数的情况下,SKR显示出比GPT-4更好的知识重写能力,GPT-4是目前最先进的通用LLM。

[NLP-24] CoXQL: A Dataset for Parsing Explanation Requests in Conversational XAI Systems
[NLP-24] CoXQL:用于解析对话式XAI系统中解释请求的数据集

链接: https://arxiv.org/abs/2406.08101
作者: Qianli Wang,Tatiana Anikina,Nils Feldhus,Simon Ostermann,Sebastian Möller
关键词: Conversational explainable artificial, large language models, natural language processing, explainable artificial intelligence, garnered significant interest
中文关键词: 对话可解释人工、大型语言模型、自然语言处理、可解释人工智能引起了人们的浓厚兴趣
类目: Computation and Language (cs.CL)
备注: 4 pages, short paper

点击查看摘要

Abstract:Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered significant interest from the research community in natural language processing (NLP) and human-computer interaction (HCI). Such systems can provide answers to user questions about explanations, have the potential to enhance users’ comprehension and offer more information about the decision-making and generation processes of LLMs. Currently available ConvXAI systems are based on intent recognition rather than free chat. Thus, reliably grasping users’ intentions in ConvXAI systems still presents a challenge, because there is a broad range of XAI methods to map requests onto and each of them can have multiple slots to take care of. In order to bridge this gap, we present CoXQL, the first dataset for user intent recognition in ConvXAI, covering 31 intents, seven of which require filling additional slots. Subsequently, we enhance an existing parsing approach by incorporating template validations, and conduct an evaluation of several LLMs on CoXQL using different parsing strategies. We conclude that the improved parsing approach (MP+) surpasses the performance of previous approaches. We also discover that intents with multiple slots remain highly challenging for LLMs.
摘要:基于大语言模型的对话式可解释人工智能(ConvXAI)系统在自然语言处理(NLP)和人机交互(HCI)领域得到了广泛的关注。这种系统可以为用户提出的关于解释的问题提供答案,有可能提高用户的理解,并提供更多关于低成本管理的决策和生成过程的信息。目前可用的ConvXAI系统基于意图识别,而不是自由聊天。因此,在ConvXAI系统中可靠地掌握用户的意图仍然是一个挑战,因为有广泛的XAI方法可以映射到上面,并且每个方法都有多个槽需要处理。为了弥补这一差距,我们提出了CoXQL,这是ConvXAI中第一个用于用户意图识别的数据集,涵盖了31个意图,其中7个需要填充额外的空位。随后,我们通过结合模板验证来增强现有的解析方法,并使用不同的解析策略对CoXQL上的几个LLM进行了评估。我们得出结论,改进的句法分析方法(MP+)的性能优于以前的方法。我们还发现,具有多个插槽的意图对LLM来说仍然具有高度的挑战性。

[NLP-25] Multimodal Table Understanding
[NLP-25] 多模式表理解

链接: https://arxiv.org/abs/2406.08100
作者: Mingyu Zheng,Xinwei Feng,Qingyi Si,Qiaoqiao She,Zheng Lin,Wenbin Jiang,Weiping Wang
关键词: Markdown or HTML, understanding methods including, methods including recent, including recent approaches, text sequence
中文关键词: Markdown或HTML,理解方法,包括最近的方法,包括最近的方法,文本序列
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 16 figures, ACL 2024 main conference, camera-ready version

点击查看摘要

Abstract:Although great progress has been made by previous table understanding methods including recent approaches based on large language models (LLMs), they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to access such high-quality textual table representations in some real-world scenarios, and table images are much more accessible. Therefore, how to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications. In this paper, we propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests based on the given table image. To facilitate both the model training and evaluation, we construct a large-scale dataset named MMTab, which covers a wide spectrum of table images, instructions and tasks. On this basis, we develop Table-LLaVA, a generalist tabular multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks under held-in and held-out settings. The code and data is available at this this https URL
摘要:尽管已有的表格理解方法取得了很大的进展,包括最近的基于大语言模型(LLMS)的方法,但它们在很大程度上依赖于这样一个前提,即给定的表格必须转换为特定的文本序列(如Markdown或HTML)作为模型输入。然而,在一些真实场景中很难访问这种高质量的文本表表示,而表图像要容易得多。因此,如何利用直观的视觉信息直接理解表格,是开发更多实际应用的关键和紧迫的挑战。在本文中,我们提出了一个新的问题,多模表理解,其中该模型需要根据给定的表映像来生成对各种与表相关的请求的正确响应。为了方便模型训练和评估,我们构建了一个名为MMTab的大规模数据集,它涵盖了广泛的表格图像、指令和任务。在此基础上,我们开发了一个通用型表格多模式大型语言模型(MLLM)Table-LLaVA,该模型在坚持和坚持两种情况下的23个基准测试上都显著优于最近的开源MLLM基线。代码和数据可在此HTTPS URL获得

[NLP-26] Languages Transferred Within the Encoder: On Representation Transfer in Zero-Shot Multilingual Translation
[NLP-26] 编码器内的语言转换:零镜头多语言翻译中的表示转换

链接: https://arxiv.org/abs/2406.08092
作者: Zhi Qu,Chenchen Ding,Taro Watanabe
关键词: Understanding representation transfer, multilingual neural machine, representational issue causing, neural machine translation, Understanding representation
中文关键词: 理解表示转移,多语言神经机器,引起表示问题,神经机器翻译,理解表示
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding representation transfer in multilingual neural machine translation can reveal the representational issue causing the zero-shot translation deficiency. In this work, we introduce the identity pair, a sentence translated into itself, to address the lack of the base measure in multilingual investigations, as the identity pair represents the optimal state of representation among any language transfers. In our analysis, we demonstrate that the encoder transfers the source language to the representational subspace of the target language instead of the language-agnostic state. Thus, the zero-shot translation deficiency arises because representations are entangled with other languages and are not transferred effectively to the target language. Based on our findings, we propose two methods: 1) low-rank language-specific embedding at the encoder, and 2) language-specific contrastive learning of the representation at the decoder. The experimental results on Europarl-15, TED-19, and OPUS-100 datasets show that our methods substantially enhance the performance of zero-shot translations by improving language transfer capacity, thereby providing practical evidence to support our conclusions.
摘要:理解多语种神经机器翻译中的表征迁移,可以揭示导致零命中翻译缺陷的表征问题。在这项工作中,我们引入了身份对,这是一个被翻译成句子本身的句子,以解决多语言研究中缺乏基本衡量标准的问题,因为身份对代表了任何语言迁移中的最佳表征状态。在我们的分析中,我们证明了编码者将源语言转移到目标语言的表征子空间,而不是语言不可知的状态。因此,零镜头翻译缺陷的产生是因为表征与其他语言纠缠在一起,没有有效地转移到目的语中。基于我们的发现,我们提出了两种方法:1)在编码端嵌入低阶特定语言,2)在解码端对表征进行特定语言的对比学习。在Europarl-15、TED-19和OPUS-100数据集上的实验结果表明,我们的方法通过提高语言迁移能力大大提高了零镜头翻译的性能,从而为我们的结论提供了实践证据。

[NLP-27] AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection
[NLP-27] AustroTox:基于目标的奥地利德语攻击性语言检测数据集

链接: https://arxiv.org/abs/2406.08080
作者: Pia Pachinger,Janis Goldzycher,Anna Maria Planitzer,Wojciech Kusa,Allan Hanbury,Julia Neidhardt
关键词: toxicity detection greatly, detection greatly profits, interpretability in toxicity, greatly profits, profits from token-level
中文关键词: 毒性检测很大,检测很大,毒性的可解释性,很大,从代币层面获利
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2024

点击查看摘要

Abstract:Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we identify spans within each comment constituting vulgar language or representing targets of offensive statements. We evaluate fine-tuned language models as well as large language models in a zero- and few-shot fashion. The results indicate that while fine-tuned models excel in detecting linguistic peculiarities such as vulgar dialect, large language models demonstrate superior performance in detecting offensiveness in AustroTox. We publish the data and code.
摘要:毒性检测中的模型可解释性极大地受益于标记级注释。然而,目前此类注释只有英语版本。我们引入了一个用于攻击性语言检测的注释数据集,该数据集来源于新闻论坛,该论坛以融入奥地利德语方言而闻名,包含4,562条用户评论。除了二元冒犯性分类之外,我们还识别构成粗俗语言或代表冒犯性言论目标的每条评论中的跨度。我们以零攻击和少量攻击的方式评估微调的语言模型以及大型语言模型。结果表明,虽然微调模型在检测粗俗方言等语言特征方面表现出色,但大型语言模型在检测AustroTox中的冒犯性方面表现出色。我们发布数据和代码。

[NLP-28] A Concept-Based Explainability Framework for Large Multimodal Models
[NLP-28] 大型多峰模型的基于概念的解释性框架

链接: https://arxiv.org/abs/2406.08074
作者: Jayneel Parekh,Pegah Khayatan,Mustafa Shukor,Alasdair Newson,Matthieu Cord
关键词: large language models, combine unimodal encoders, Large multimodal models, large language, perform multimodal tasks
中文关键词: 大型语言模型,结合单模式编码器,大型多模式模型,大型语言,执行多模式任务
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as “multi-modal concepts”. We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. We will publicly release our code.
摘要:大型多通道模型(LMM)结合了单通道编码器和大语言模型(LLM)来执行多通道任务。尽管最近在这些模型的可解释性方面取得了进展,但理解LMM的内部表示在很大程度上仍然是一个谜。在本文中,我们提出了一种新的解释LMM的框架。我们提出了一种基于词典学习的方法,应用于单词的表示。学习词典的元素与我们提出的概念相对应。我们表明,这些概念在视觉和文本上都有很好的语义基础。因此,我们将这些称为“多模式概念”。我们对所学概念的结果进行定性和定量的评估。我们表明,所提取的多峰概念对于解释测试样本的表示是有用的。最后,我们对不同概念之间的解缠和基础概念的质量进行了直观和文本的评估。我们将公开发布我们的代码。

[NLP-29] Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey
[NLP-29] 大型语言模型满足以文本为中心的多模式情感分析:一项调查

链接: https://arxiv.org/abs/2406.08068
作者: Hao Yang,Yanyan Zhao,Yang Wu,Shilong Wang,Tian Zheng,Hongbo Zhang,Wanxiang Che,Bing Qin
关键词: multimodal sentiment analysis, sentiment analysis, Compared to traditional, multimodal sentiment, humans process sentiment
中文关键词: 多模式情感分析,情感分析,与传统的多模式情感相比,人类处理情感
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Compared to traditional sentiment analysis, which only considers text, multimodal sentiment analysis needs to consider emotional signals from multimodal sources simultaneously and is therefore more consistent with the way how humans process sentiment in real-world scenarios. It involves processing emotional information from various sources such as natural language, images, videos, audio, physiological signals, etc. However, although other modalities also contain diverse emotional cues, natural language usually contains richer contextual information and therefore always occupies a crucial position in multimodal sentiment analysis. The emergence of ChatGPT has opened up immense potential for applying large language models (LLMs) to text-centric multimodal tasks. However, it is still unclear how existing LLMs can adapt better to text-centric multimodal sentiment analysis tasks. This survey aims to (1) present a comprehensive review of recent research in text-centric multimodal sentiment analysis tasks, (2) examine the potential of LLMs for text-centric multimodal sentiment analysis, outlining their approaches, advantages, and limitations, (3) summarize the application scenarios of LLM-based multimodal sentiment analysis technology, and (4) explore the challenges and potential research directions for multimodal sentiment analysis in the future.
摘要:与传统的只考虑文本的情感分析相比,多通道情感分析需要同时考虑来自多个通道的情感信号,因此更符合人类在真实场景中处理情感的方式。它涉及对来自自然语言、图像、视频、音频、生理信号等各种来源的情感信息进行处理。然而,尽管其他通道也包含各种情感线索,但自然语言通常包含更丰富的上下文信息,因此在多通道情感分析中始终占据关键地位。ChatGPT的出现为将大型语言模型(LLM)应用于以文本为中心的多模式任务开辟了巨大的潜力。然而,目前尚不清楚现有的LLMS如何能够更好地适应以文本为中心的多通道情感分析任务。本文旨在(1)对以文本为中心的多通道情感分析任务的研究现状进行综述;(2)考察最小二乘支持向量机在以文本为中心的多通道情感分析中的潜力,概述其方法、优势和局限性;(3)总结基于最小二乘模型的多通道情感分析技术的应用场景;(4)探讨未来多通道情感分析面临的挑战和可能的研究方向。

[NLP-30] Learning Job Title Representation from Job Description Aggregation Network
[NLP-30] 从职位描述聚合网络学习职位名称表示

链接: https://arxiv.org/abs/2406.08055
作者: Napat Laosaengpha,Thanit Tativannarat,Chawan Piansaddhayanon,Attapol Rutherford,Ekapol Chuangsuwanich
关键词: human resource tools, developing automatic human, automatic human resource, Learning job title, job title representation
中文关键词: 人力资源工具,开发自动人力,自动人力资源,学习职位名称,职位名称表示
类目: Computation and Language (cs.CL)
备注: to be published in Findings of the Association for Computational Linguistics: ACL 2024

点击查看摘要

Abstract:Learning job title representation is a vital process for developing automatic human resource tools. To do so, existing methods primarily rely on learning the title representation through skills extracted from the job description, neglecting the rich and diverse content within. Thus, we propose an alternative framework for learning job titles through their respective job description (JD) and utilize a Job Description Aggregator component to handle the lengthy description and bidirectional contrastive loss to account for the bidirectional relationship between the job title and its description. We evaluated the performance of our method on both in-domain and out-of-domain settings, achieving a superior performance over the skill-based approach.
摘要:学习职位名称表示是开发自动人力资源工具的重要过程。为此,现有的方法主要依赖于通过从职位描述中提取的技能来学习头衔表示,而忽视了其中丰富而多样化的内容。因此,我们提出了一个通过各自的职位描述(JD)学习职位的替代框架,并利用职位描述聚合器组件来处理冗长的描述和双向对比损失,以解释职位名称及其描述之间的双向关系。我们评估了我们的方法在域内和域外设置上的性能,实现了优于基于技能的方法的性能。

[NLP-31] Adversarial Evasion Attack Efficiency against Large Language Models
[NLP-31] 针对大型语言模型的对抗规避攻击效率

链接: https://arxiv.org/abs/2406.08050
作者: João Vitorino,Eva Maia,Isabel Praça
关键词: Large Language Models, Large Language, Language Models, Large, Models
中文关键词: 大型语言模型,大型语言,语言模型,大型,模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 1 table, 2 figures, DCAI 2024 conference

点击查看摘要

Abstract:Large Language Models (LLMs) are valuable for text classification, but their vulnerabilities must not be disregarded. They lack robustness against adversarial examples, so it is pertinent to understand the impacts of different types of perturbations, and assess if those attacks could be replicated by common users with a small amount of perturbations and a small number of queries to a deployed LLM. This work presents an analysis of the effectiveness, efficiency, and practicality of three different types of adversarial attacks against five different LLMs in a sentiment classification task. The obtained results demonstrated the very distinct impacts of the word-level and character-level attacks. The word attacks were more effective, but the character and more constrained attacks were more practical and required a reduced number of perturbations and queries. These differences need to be considered during the development of adversarial defense strategies to train more robust LLMs for intelligent text classification applications.
摘要:大语言模型在文本分类中具有重要的应用价值,但其脆弱性不容忽视。它们对敌意示例缺乏健壮性,因此了解不同类型扰动的影响并评估这些攻击是否可以被普通用户复制,只需少量扰动和对已部署的LLM的少量查询。本文分析了三种不同类型的对抗性攻击在情感分类任务中对五种不同的LLM的有效性、效率和实用性。所获得的结果显示了词级攻击和字级攻击的非常明显的影响。单词攻击更有效,但字符和更受约束的攻击更实用,需要的干扰和查询次数更少。在开发对抗性防御策略以训练更健壮的LLM用于智能文本分类应用时,需要考虑这些差异。

[NLP-32] Blowfish: Topological and statistical signatures for quantifying ambiguity in semantic search
[NLP-32] Blowfish:用于量化语义搜索中模糊性的topics和统计签名

链接: https://arxiv.org/abs/2406.07990
作者: Thomas Roland Barillot,Alex De Castro
关键词: Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, works reports evidence, search and Retrieval
中文关键词: 检索增强生成、增强生成、检索增强、作品报告证据、搜索和检索
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This works reports evidence for the topological signatures of ambiguity in sentence embeddings that could be leveraged for ranking and/or explanation purposes in the context of vector search and Retrieval Augmented Generation (RAG) systems. We proposed a working definition of ambiguity and designed an experiment where we have broken down a proprietary dataset into collections of chunks of varying size - 3, 5, and 10 lines and used the different collections successively as queries and answers sets. It allowed us to test the signatures of ambiguity with removal of confounding factors. Our results show that proxy ambiguous queries (size 10 queries against size 3 documents) display different distributions of homologies 0 and 1 based features than proxy clear queries (size 5 queries against size 10 documents). We then discuss those results in terms increased manifold complexity and/or approximately discontinuous embedding submanifolds. Finally we propose a strategy to leverage those findings as a new scoring strategy of semantic similarities.
摘要:这项工作报告了句子嵌入中的歧义拓扑签名的证据,这些证据可以在向量搜索和检索增强生成(RAG)系统的上下文中用于排名和/或解释目的。我们提出了歧义的工作定义,并设计了一个实验,我们将一个专有数据集分解成不同大小的块的集合-3、5和10行,并连续使用不同的集合作为查询和答案集。它使我们能够在去除混淆因素的情况下测试模糊性的特征。我们的结果表明,代理歧义查询(针对大小为3的文档的大小为10的查询)与代理清晰查询(针对大小为10的文档的大小为5的查询)具有不同的基于同源0和1的特征分布。然后,我们用增加的流形复杂性和/或近似不连续的嵌入子流形来讨论这些结果。最后,我们提出了一种策略来利用这些发现作为一种新的语义相似性评分策略。

[NLP-33] It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
[NLP-33] 它需要两个:论RL HF中奖励和政策模型之间的不确定性

链接: https://arxiv.org/abs/2406.07971
作者: Taiming Lu,Lingfeng Shen,Xinyu Yang,Weiting Tan,Beidi Chen,Huaxiu Yao
关键词: Reinforcement Learning, involves training policy, Human Feedback, align language models, align language
中文关键词: 强化学习,涉及培训政策、人类反馈、对齐语言模型、对齐语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and model augmentation. Our experiments demonstrate that (1) using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2) SEAM-guided model augmentation results in a 4% performance improvement over standard augmentation methods.
摘要:人类反馈强化学习(RLHF)包括训练策略模型(PM)和奖励模型(RMS),以使语言模型与人类的偏好相一致。与其单独关注PM和RMS,我们建议在微调过程中检查它们的交互作用,引入无缝的概念。我们的研究从观察饱和现象开始,其中RM和PM的持续改善并不转化为RLHF的进展。我们的分析表明,RMS未能为PM响应分配适当的分数,导致与人的偏好的失配率为35%,突显了PM和RM之间的显著差异。为了在不需要人工的情况下度量PM和RM之间的无缝程度,我们提出了一个自动度量,Seam。我们的实验表明:(1)使用经过Seam过滤的数据进行RL训练,RLHF的性能提高了4.5%;(2)Seam引导的模型增强比标准增强方法的性能提高了4%。

[NLP-34] Guiding In-Context Learning of LLMs through Quality Estimation for Machine Translation
[NLP-34] 通过机器翻译质量估计指导LLM的上下文学习

链接: https://arxiv.org/abs/2406.07970
作者: Javad Pourmostafa Roshan Sharami,Dimitar Shterionov,Pieter Spronck
关键词: resulting translation quality, large language models, translation quality, output from large, closely tied
中文关键词: 由此产生的翻译质量、大型语言模型、翻译质量、大型输出,密切相关
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The quality of output from large language models (LLMs), particularly in machine translation (MT), is closely tied to the quality of in-context examples (ICEs) provided along with the query, i.e., the text to translate. The effectiveness of these ICEs is influenced by various factors, such as the domain of the source text, the order in which the ICEs are presented, the number of these examples, and the prompt templates used. Naturally, selecting the most impactful ICEs depends on understanding how these affect the resulting translation quality, which ultimately relies on translation references or human judgment. This paper presents a novel methodology for in-context learning (ICL) that relies on a search algorithm guided by domain-specific quality estimation (QE). Leveraging the XGLM model, our methodology estimates the resulting translation quality without the need for translation references, selecting effective ICEs for MT to maximize translation quality. Our results demonstrate significant improvements over existing ICL methods and higher translation performance compared to fine-tuning a pre-trained language model (PLM), specifically mBART-50.
摘要:大型语言模型(LLM)的输出质量,特别是在机器翻译(MT)中的输出质量,与随查询提供的上下文实例(ICE)的质量密切相关,即要翻译的文本。这些ICE的有效性受到各种因素的影响,如源文本的域、ICE的呈现顺序、这些例子的数量以及所使用的提示模板。当然,选择最有影响力的ICE取决于了解这些因素如何影响所产生的翻译质量,这最终取决于翻译参考或人的判断。本文提出了一种新的上下文学习方法,该方法依赖于特定领域质量估计(QE)指导的搜索算法。利用XGLM模型,我们的方法在不需要翻译参考的情况下估计结果的翻译质量,为机器翻译选择有效的ICE来最大化翻译质量。与微调预训练的语言模型(PLM),特别是mBART-50相比,我们的结果显示出比现有的ICL方法有显著的改进和更高的翻译性能。

[NLP-35] Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling
[NLP-35] 比随机更好:通过约束主动采样进行可靠的NLG人体评估

链接: https://arxiv.org/abs/2406.07967
作者: Jie Ruan,Xiao Pu,Mingqi Gao,Xiaojun Wan,Yuesheng Zhu
关键词: expensive and time-consuming, Human evaluation, Constrained Active Sampling, Active Sampling Framework, reliable evaluation method
中文关键词: 昂贵且耗时,人为评估,约束主动抽样,主动抽样框架,可靠的评估方法
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: With Appendix

点击查看摘要

Abstract:Human evaluation is viewed as a reliable evaluation method for NLG which is expensive and time-consuming. To save labor and costs, researchers usually perform human evaluation on a small subset of data sampled from the whole dataset in practice. However, different selection subsets will lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples for getting a more correct inter-system ranking.Experiment results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate CASF receives 93.18% top-ranked system recognition accuracy and ranks first or ranks second on 90.91% of the human metrics with 0.83 overall inter-system ranking Kendall correlation.Code and data are publicly available online.
摘要:人工评价被认为是一种可靠的NLG评价方法,但评价费用高、耗时长。为了节省人力和成本,在实践中,研究人员通常会对从整个数据集中采样的一小部分数据进行人工评估。然而,不同的选择子集将导致不同的系统排名。为了给出更准确的系统间排名,使黄金标准人类评价更可靠,我们提出了一种用于可靠人类判断的约束主动抽样框架(CASF)。CASF通过学习器、系统采样器和约束控制器来选择具有代表性的样本,以获得更准确的系统间排名。在137个实际NLG评估设置上的实验结果表明,CASF获得了93.18%的最高系统识别准确率,在90.91%的人类指标上排名第一或第二,系统间Kendall相关系数为0.83。代码和数据在线公开。

[NLP-36] Political Leaning Inference through Plurinational Scenarios
[NLP-36] 通过多民族场景的政治倾向推理

链接: https://arxiv.org/abs/2406.07964
作者: Joseba Fernandez de Landa,Rodrigo Agerri
关键词: media users express, Social media users, spontaneous declarations, participation in communities, Social media
中文关键词: 媒体用户表达、社交媒体用户、自发声明、参与社区、社交媒体
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Social media users express their political preferences via interaction with other users, by spontaneous declarations or by participation in communities within the network. This makes a social network such as Twitter a valuable data source to study computational science approaches to political learning inference. In this work we focus on three diverse regions in Spain (Basque Country, Catalonia and Galicia) to explore various methods for multi-party categorization, required to analyze evolving and complex political landscapes, and compare it with binary left-right approaches. We use a two-step method involving unsupervised user representations obtained from the retweets and their subsequent use for political leaning detection. Comprehensive experimentation on a newly collected and curated dataset comprising labeled users and their interactions demonstrate the effectiveness of using Relational Embeddings as representation method for political ideology detection in both binary and multi-party frameworks, even with limited training data. Finally, data visualization illustrates the ability of the Relational Embeddings to capture intricate intra-group and inter-group political affinities.
摘要:社交媒体用户通过与其他用户的互动、自发的声明或参与网络内的社区来表达他们的政治偏好。这使得Twitter等社交网络成为研究政治学习推理的计算科学方法的宝贵数据来源。在这项工作中,我们关注西班牙的三个不同地区(巴斯克国家、加泰罗尼亚和加利西亚),以探索多党分类的各种方法,需要分析不断演变和复杂的政治格局,并将其与左翼和右翼的二元方法进行比较。我们使用两步法,涉及从转发中获得的无监督用户表征,并随后将其用于政治倾向检测。在一个新收集和精选的数据集上的综合实验表明,即使在有限的训练数据下,关系嵌入作为政治意识形态检测的表示方法在二元和多方框架中都是有效的。最后,数据可视化说明了关系嵌入捕捉复杂的组内和组间政治亲和力的能力。

[NLP-37] oward a Method to Generate Capability Ontologies from Natural Language Descriptions
[NLP-37] 拥有一种从自然语言描述生成能力实体的方法

链接: https://arxiv.org/abs/2406.07962
作者: Luis Miguel Vieira da Silva,Aljosha Köcher,Felix Gehlhoff,Alexander Fay
关键词: Large Language Models, adaptable system, capability ontology, achieve a flexible, flexible and adaptable
中文关键词: 大型语言模型、适应性系统、能力本体,实现灵活、灵活、适应性强
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:To achieve a flexible and adaptable system, capability ontologies are increasingly leveraged to describe functions in a machine-interpretable way. However, modeling such complex ontological descriptions is still a manual and error-prone task that requires a significant amount of effort and ontology expertise. This contribution presents an innovative method to automate capability ontology modeling using Large Language Models (LLMs), which have proven to be well suited for such tasks. Our approach requires only a natural language description of a capability, which is then automatically inserted into a predefined prompt using a few-shot prompting technique. After prompting an LLM, the resulting capability ontology is automatically verified through various steps in a loop with the LLM to check the overall correctness of the capability ontology. First, a syntax check is performed, then a check for contradictions, and finally a check for hallucinations and missing ontology elements. Our method greatly reduces manual effort, as only the initial natural language description and a final human review and possible correction are necessary, thereby streamlining the capability ontology generation process.
摘要:为了实现一个灵活的、适应性强的系统,能力本体越来越多地被用来以机器可解释的方式描述功能。然而,建模如此复杂的本体描述仍然是一项手动且容易出错的任务,需要大量的工作和本体专业知识。这一贡献提供了一种使用大型语言模型(LLM)自动进行能力本体建模的创新方法,这些模型已被证明非常适合于此类任务。我们的方法只需要功能的自然语言描述,然后使用几个镜头提示技术将其自动插入到预定义的提示中。在提示LLM后,通过与LLM循环的各个步骤自动验证得到的能力本体,以检查能力本体的整体正确性。首先,执行语法检查,然后检查矛盾,最后检查幻觉和遗漏的本体元素。我们的方法大大减少了人工工作,因为只需要初始的自然语言描述和最终的人工审查和可能的更正,从而简化了能力本体的生成过程。

[NLP-38] Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation
[NLP-38] 在人类评估指南中定义和检测脆弱性:可靠NLG评估的初步研究

链接: https://arxiv.org/abs/2406.07935
作者: Jie Ruan,Wenqing Wang,Xiaojun Wan
关键词: Natural Language Generation, Language Generation, quality of Natural, Natural Language, Human evaluation
中文关键词: 自然语言生成,语言生成,自然质量,自然语言,人类评估
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human evaluation serves as the gold standard for assessing the quality of Natural Language Generation (NLG) systems. Nevertheless, the evaluation guideline, as a pivotal element ensuring reliable and reproducible human assessment, has received limited attention.Our investigation revealed that only 29.84% of recent papers involving human evaluation at top conferences release their evaluation guidelines, with vulnerabilities identified in 77.09% of these guidelines. Unreliable evaluation guidelines can yield inaccurate assessment outcomes, potentially impeding the advancement of NLG in the right direction. To address these challenges, we take an initial step towards reliable evaluation guidelines and propose the first human evaluation guideline dataset by collecting annotations of guidelines extracted from existing papers as well as generated via Large Language Models (LLMs). We then introduce a taxonomy of eight vulnerabilities and formulate a principle for composing evaluation guidelines. Furthermore, a method for detecting guideline vulnerabilities has been explored using LLMs, and we offer a set of recommendations to enhance reliability in human evaluation. The annotated human evaluation guideline dataset and code for the vulnerability detection method are publicly available online.
摘要:人类评价是评价自然语言生成系统质量的金标准。然而,作为确保人类评估可靠和可重复性的关键因素,评估指南得到的关注有限。我们的调查显示,最近在顶级会议上涉及人类评估的论文中,只有29.84%发布了评估指南,其中77.09%的指南发现了漏洞。不可靠的评价准则可能会产生不准确的评价结果,有可能阻碍NLG朝着正确的方向前进。为了应对这些挑战,我们朝着可靠的评价指南迈出了第一步,并通过收集从现有论文中提取的指南以及通过大型语言模型(LLMS)生成的指南的注释,提出了第一个人类评估指南数据集。然后,我们介绍了八个漏洞的分类,并制定了组成评估指南的原则。此外,我们还探索了一种使用LLMS来检测指南漏洞的方法,并提出了一套提高人类评估可靠性的建议。带注释的人类评估指南数据集和漏洞检测方法的代码在网上公开可用。

[NLP-39] Large Language Model Unlearning via Embedding-Corrupted Prompts
[NLP-39] 通过嵌入损坏的脚本取消学习大型语言模型

链接: https://arxiv.org/abs/2406.07933
作者: Chris Yuhao Liu,Yaxuan Wang,Jeffrey Flanigan,Yang Liu
关键词: Large language models, Large language, language models, advanced to encompass, encompass extensive knowledge
中文关键词: 大型语言模型,大型语言,语言模型,高级涵盖,涵盖广泛的知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 55 pages, 4 figures, 66 tables

点击查看摘要

Abstract:Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a large language model should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging due to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and the large computational requirements for optimization across state-of-the-art models with hundreds of billions of parameters. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models to address both the challenges of knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. We learn corruptions added to prompt embeddings via zeroth order optimization toward the unlearning objective offline and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output from a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving promising unlearning at nearly zero side effects in general domains and domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases.
摘要:大型语言模型(LLM)已经发展到涵盖跨不同领域的广泛知识。然而,控制大型语言模型不应该知道的内容对于确保一致性和安全使用非常重要。然而,由于保留和遗忘之间的模糊边界造成的潜在附带损害,以及在具有数千亿个参数的最新模型中进行优化的巨大计算需求,准确而有效地忘记LLM中的知识仍然具有挑战性。在这项工作中,我们提出了嵌入-破坏(ECO)提示,一个轻量级的遗忘框架,为大型语言模型,以解决挑战的知识纠缠和遗忘效率。我们不依赖LLM本身来忘记,而是在推理过程中强制处于未学习状态,方法是使用提示分类器来识别和保护忘记提示。我们通过零阶优化向遗忘目标学习添加到提示嵌入的损坏,离线学习和在推理过程中由分类器标记的损坏提示。我们发现,这些嵌入破坏的提示不仅产生了满足遗忘目标的期望输出,而且非常接近于从未对用于遗忘的数据进行训练的模型的输出。通过大量的遗忘实验,我们证明了该方法的优越性,在一般领域和与未学习领域密切相关的领域中实现了几乎为零的副作用。此外,我们强调了我们的方法可以扩展到100个LLM,范围从0.5B到236B参数,不会随着参数数量的增加而产生额外的成本。

[NLP-40] Automated Information Extraction from Thyroid Operation Narrative: A Comparative Study of GPT-4 and Fine-tuned KoELECTRA
[NLP-40] 从甲状腺手术叙述中自动提取信息:GPT-4和微调KoELECTRA的比较研究

链接: https://arxiv.org/abs/2406.07922
作者: Dongsuk Jang,Hyeryun Park,Jiye Son,Hyeonuk Hwang,Sujin Kim,Jinwook Choi
关键词: rapidly evolving field, artificial intelligence, clinical workflows, efficiency and accuracy, rapidly evolving
中文关键词: 快速发展的领域、人工智能、临床工作流程、效率和准确性、快速发展
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 3 tables

点击查看摘要

Abstract:In the rapidly evolving field of healthcare, the integration of artificial intelligence (AI) has become a pivotal component in the automation of clinical workflows, ushering in a new era of efficiency and accuracy. This study focuses on the transformative capabilities of the fine-tuned KoELECTRA model in comparison to the GPT-4 model, aiming to facilitate automated information extraction from thyroid operation narratives. The current research landscape is dominated by traditional methods heavily reliant on regular expressions, which often face challenges in processing free-style text formats containing critical details of operation records, including frozen biopsy reports. Addressing this, the study leverages advanced natural language processing (NLP) techniques to foster a paradigm shift towards more sophisticated data processing systems. Through this comparative study, we aspire to unveil a more streamlined, precise, and efficient approach to document processing in the healthcare domain, potentially revolutionizing the way medical data is handled and analyzed.
摘要:在快速发展的医疗保健领域,人工智能(AI)的集成已经成为临床工作流程自动化的关键组件,开启了一个高效和准确的新时代。这项研究的重点是比较微调的KoELECTRA模型和GPT-4模型的转换能力,旨在促进从甲状腺手术叙述中自动提取信息。目前的研究格局主要是严重依赖正则表达式的传统方法,这些方法在处理包含手术记录关键细节的自由样式文本格式方面经常面临挑战,包括冻结的活检报告。为了解决这一问题,这项研究利用先进的自然语言处理(NLP)技术来促进向更复杂的数据处理系统的范式转变。通过这项比较研究,我们希望在医疗保健领域推出一种更精简、更精确和更高效的文档处理方法,潜在地彻底改变医疗数据的处理和分析方式。

[NLP-41] DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning
[NLP-41] DeTriever:基于解码器表示的检索器,用于改进NL2SQL内上下文学习

链接: https://arxiv.org/abs/2406.07913
作者: Yuxi Feng,Raymond Li,Zhenan Fan,Giuseppe Carenini,Mohammadreza Pourreza,Weiwei Zhang,Yong Zhang
关键词: Structured Query Language, Large Language Models, natural language questions, translating natural language, open research problem
中文关键词: 结构化查询语言、大型语言模型、自然语言问题、翻译自然语言、开放性研究问题
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:While in-context Learning (ICL) has proven to be an effective technique to improve the performance of Large Language Models (LLMs) in a variety of complex tasks, notably in translating natural language questions into Structured Query Language (NL2SQL), the question of how to select the most beneficial demonstration examples remains an open research problem. While prior works often adapted off-the-shelf encoders to retrieve examples dynamically, an inherent discrepancy exists in the representational capacities between the external retrievers and the LLMs. Further, optimizing the selection of examples is a non-trivial task, since there are no straightforward methods to assess the relative benefits of examples without performing pairwise inference. To address these shortcomings, we propose DeTriever, a novel demonstration retrieval framework that learns a weighted combination of LLM hidden states, where rich semantic information is encoded. To train the model, we propose a proxy score that estimates the relative benefits of examples based on the similarities between output queries. Experiments on two popular NL2SQL benchmarks demonstrate that our method significantly outperforms the state-of-the-art baselines on one-shot NL2SQL tasks.
摘要:虽然情境学习(ICL)已经被证明是一种有效的技术,可以提高大型语言模型(LLM)在各种复杂任务中的性能,特别是在将自然语言问题转换为结构化查询语言(NL2SQL)方面,但如何选择最有益的演示示例仍然是一个开放的研究问题。虽然以前的工作通常采用现成的编码器来动态检索样本,但外部检索者和LLMS之间的表征能力存在固有的差异。此外,优化样本的选择不是一项简单的任务,因为没有直接的方法来评估样本的相对效益,而不执行成对推理。针对这些不足,我们提出了一种新颖的演示检索框架DeTriever,该框架学习LLM隐藏状态的加权组合,其中编码了丰富的语义信息。为了训练模型,我们提出了一个代理评分,该评分基于输出查询之间的相似性来估计示例的相对收益。在两个流行的NL2SQL基准测试上的实验表明,我们的方法在单次NL2SQL任务上的性能明显优于最先进的基线。

[NLP-42] Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations
[NLP-42] 探索自我监督多视图对比学习用于有限注释的语音情感识别

链接: https://arxiv.org/abs/2406.07900
作者: Bulat Khaertdinov,Pedro Jeuris,Annanda Sousa,Enrique Hortal
关键词: Speech Emotion Recognition, reaching unprecedented levels, Emotion Recognition, Recent advancements, Self-Supervised Learning
中文关键词: 语音情感识别,达到前所未有的水平,情感识别,最新进展,自我监督学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Recent advancements in Deep and Self-Supervised Learning (SSL) have led to substantial improvements in Speech Emotion Recognition (SER) performance, reaching unprecedented levels. However, obtaining sufficient amounts of accurately labeled data for training or fine-tuning the models remains a costly and challenging task. In this paper, we propose a multi-view SSL pre-training technique that can be applied to various representations of speech, including the ones generated by large speech models, to improve SER performance in scenarios where annotations are limited. Our experiments, based on wav2vec 2.0, spectral and paralinguistic features, demonstrate that the proposed framework boosts the SER performance, by up to 10% in Unweighted Average Recall, in settings with extremely sparse data annotations.
摘要:深度学习和自我监督学习(SSL)的最新进展导致语音情感识别(BER)性能得到了大幅提高,达到了前所未有的水平。然而,获得足够数量的准确标记数据来训练或微调模型仍然是一项昂贵且具有挑战性的任务。在本文中,我们提出了一种多视图SSL预训练技术,该技术可以应用于语音的各种表示,包括由大型语音模型生成的表示,以提高注释有限的场景中的BER性能。我们的实验基于wav2vec 2.0、频谱和双语言特征,证明在数据注释极其稀疏的环境中,拟议的框架在未加权平均召回中将BER性能提高了高达10%。

[NLP-43] An Empirical Study of Mamba-based Language Models
[NLP-43] 基于曼巴语的语言模型的实证研究

链接: https://arxiv.org/abs/2406.07887
作者: Roger Waleffe,Wonmin Byeon,Duncan Riach,Brandon Norick,Vijay Korthikanti,Tri Dao,Albert Gu,Ali Hatamizadeh,Sudhakar Singh,Deepak Narayanan,Garvit Kulshreshtha,Vartika Singh,Jared Casper,Jan Kautz,Mohammad Shoeybi,Bryan Catanzaro
关键词: Selective state-space models, quadratic computational complexity, large inference-time memory, inference-time memory requirements, Selective state-space
中文关键词: 选择性状态空间模型、二次计算复杂性、大推理时内存、推理时内存要求、选择性状态空间
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA’s Megatron-LM project.
摘要:像Mamba这样的选择性状态空间模型(SSM)克服了Transformers的一些缺点,如序列长度的二次计算复杂性和对键值缓存的推理时间存储需求。此外,最近的研究表明,SSM可以媲美或超过Transformers的语言建模能力,使其成为一个有吸引力的替代方案。然而,在受控环境下(例如,相同的数据),到目前为止,研究只提供了将SSM与变压器进行比较的小规模实验。为了了解这些架构在更大规模上的优势和劣势,我们提供了8B参数Mamba、Mamba-2和Transformer模型之间的直接比较,这些模型基于高达3.5T令牌的相同数据集进行训练。我们还将这些模型与由43%的Mamba-2、7%的关注度和50%的MLP层(Mamba-2-Private)组成的混合架构进行了比较。使用一组不同的任务,我们回答了Mamba模型是否可以在更大的培训预算下与变形金刚相匹配的问题。我们的结果表明,尽管纯SSM在许多任务上与Transformers不相上下,甚至超过Transformers,但在需要较强的复制或情境学习能力(如5-shot MMLU、Phonebook)或长情境推理的任务上,他们落后于Transformers。相反,我们发现在我们评估的所有12个标准任务中,8B Mamba-2-Private都超过了8B Transformer(平均+2.65分),并且在推理时生成令牌时预计速度最高可快8倍。为了验证长上下文能力,我们提供了额外的实验,评估Mamba-2混合和Transformer的变体,扩展到支持16K、32K和128K序列。在另外23个长环境任务中,混合模型的平均水平继续接近或超过Transformer。为了便于进一步研究,我们发布了检查点以及用于训练我们的模型的代码,作为NVIDIA威震天项目的一部分。

[NLP-44] Label-aware Hard Negative Sampling Strategies with Momentum Contrastive Learning for Implicit Hate Speech Detection
[NLP-44] 用于内隐仇恨语音检测的标签感知硬负采样策略和动量对比学习

链接: https://arxiv.org/abs/2406.07886
作者: Jaehoon Kim,Seungwan Jin,Sohyun Park,Someen Park,Kyungsik Han
关键词: Detecting implicit hate, directly hateful remains, implicit hate speech, Detecting implicit, implicit hate
中文关键词: 检测隐性仇恨、直接仇恨遗骸、隐性仇恨言论、检测隐性、隐性仇恨
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 Findings

点击查看摘要

Abstract:Detecting implicit hate speech that is not directly hateful remains a challenge. Recent research has attempted to detect implicit hate speech by applying contrastive learning to pre-trained language models such as BERT and RoBERTa, but the proposed models still do not have a significant advantage over cross-entropy loss-based learning. We found that contrastive learning based on randomly sampled batch data does not encourage the model to learn hard negative samples. In this work, we propose Label-aware Hard Negative sampling strategies (LAHN) that encourage the model to learn detailed features from hard negative samples, instead of naive negative samples in random batch, using momentum-integrated contrastive learning. LAHN outperforms the existing models for implicit hate speech detection both in- and cross-datasets. The code is available at this https URL
摘要:检测不直接仇恨的隐性仇恨言论仍然是一个挑战。最近的研究试图通过将对比学习应用于BERT和RoBERTa等预训练的语言模型来检测隐性仇恨言论,但提出的模型仍然没有比基于交叉熵损失的学习具有显着优势。我们发现,基于随机抽样批数据的对比学习并不鼓励模型学习硬负样本。在这项工作中,我们提出了标签感知硬负采样策略(LAHN),鼓励模型使用动量集成对比学习从硬负样本中学习详细特征,而不是随机批中的天真负样本。LAHN在数据集中和跨数据集中的隐式仇恨言论检测方面的表现优于现有模型。该代码可在此https URL中获取

[NLP-45] Designing a Dashboard for Transparency and Control of Conversational AI
[NLP-45] 设计对话人工智能的透明度和控制仪表板

链接: https://arxiv.org/abs/2406.07882
作者: Yida Chen,Aoyu Wu,Trevor DePodesta,Catherine Yeh,Kenneth Li,Nicholas Castillo Marin,Oam Patel,Jan Riecke,Shivam Raval,Olivia Seow,Martin Wattenberg,Fernanda Viégas
关键词: Conversational LLMs function, leaving users guessing, Conversational LLMs, black box systems, function as black
中文关键词: 对话式LLM功能,让用户猜测,对话式LLM,黑匣子系统,功能为黑色
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Project page: this https URL 38 pages, 23 figures

点击查看摘要

Abstract:Conversational LLMs function as black box systems, leaving users guessing about why they see the output they do. This lack of transparency is potentially problematic, especially given concerns around bias and truthfulness. To address this issue, we present an end-to-end prototype-connecting interpretability techniques with user experience design-that seeks to make chatbots more transparent. We begin by showing evidence that a prominent open-source LLM has a “user model”: examining the internal state of the system, we can extract data related to a user’s age, gender, educational level, and socioeconomic status. Next, we describe the design of a dashboard that accompanies the chatbot interface, displaying this user model in real time. The dashboard can also be used to control the user model and the system’s behavior. Finally, we discuss a study in which users conversed with the instrumented system. Our results suggest that users appreciate seeing internal states, which helped them expose biased behavior and increased their sense of control. Participants also made valuable suggestions that point to future directions for both design and machine learning research. The project page and video demo of our TalkTuner system are available at this https URL
摘要:对话型LLM的功能就像黑匣子系统,让用户猜测为什么他们会看到他们所做的输出。这种缺乏透明度可能会带来问题,特别是考虑到人们对偏见和真实性的担忧。为了解决这个问题,我们提出了一个端到端的原型–将可解释性技术与用户体验设计相结合–试图使聊天机器人更加透明。我们首先展示一个著名的开源LLM有一个“用户模型”的证据:检查系统的内部状态,我们可以提取与用户的年龄、性别、教育水平和社会经济地位相关的数据。接下来,我们将描述随Chatbot界面一起使用的仪表板的设计,该仪表板实时显示该用户模型。仪表板还可用于控制用户模型和系统行为。最后,我们讨论了一个用户与仪表化系统对话的研究。我们的结果表明,用户喜欢看到内部状态,这有助于他们揭露有偏见的行为,并增加他们的控制感。与会者还提出了宝贵的建议,为设计和机器学习研究指明了未来的方向。我们的TalkTuner系统的项目页面和视频演示可通过以下HTTPS URL获得

[NLP-46] BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain
[NLP-46] BookSQL:会计领域的大规模文本到SQL数据集

链接: https://arxiv.org/abs/2406.07860
作者: Rahul Kumar,Amar Raja Dibbu,Shrutendra Harsola,Vignesh Subrahmaniam,Ashutosh Modi
关键词: recently been proposed, Spider, natural language interfaces, natural language, accounting
中文关键词: 最近提出的、Spider、自然语言界面、自然语言、会计
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at NAACL 2024; 20 Pages (main + appendix)

点击查看摘要

Abstract:Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language queries-SQL pairs, and accounting databases of 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance gaps, thus pointing towards developing more focused models for this domain.
摘要:几个大规模数据集(例如,最近有人提出了用于开发数据库自然语言接口的WikiSQL、Spider。这些数据集涵盖了广泛的领域,但在一些基本领域(例如金融和会计)方面存在不足。鉴于会计数据库在全球范围内使用,特别是由非技术人员使用,迫切需要开发可以帮助通过自然语言查询从会计数据库中提取信息的模型。在这份资源文件中,我们的目标是通过为会计和金融领域提出一个新的大规模文本到SQL数据集:BookSQL来填补这一空白。该数据集由10万个自然语言查询-SQL对和100万条记录的会计数据库组成。我们实验并分析了BookSQL上的文本到SQL任务的现有最先进模型(包括GPT-4)。我们发现了显着的性能差距,因此需要为此领域开发更有针对性的模型。

[NLP-47] VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
[NLP-47] WAL-E R:通过单调对齐实现稳健高效的零镜头文本到语音合成

链接: https://arxiv.org/abs/2406.07855
作者: Bing Han,Long Zhou,Shujie Liu,Sanyuan Chen,Lingwei Meng,Yanming Qian,Yanqing Liu,Sheng Zhao,Jinyu Li,Furu Wei
关键词: large language models, neural audio codecs, large language, language models, increasingly been recognized
中文关键词: 大型语言模型、神经音频编解码器、大型语言、语言模型,越来越被认可
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings huge computational overhead to the inference process of autoregression. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system, building upon the foundation of VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to downsample the discrete codes in shallow quantization layer, thereby accelerating the decoding speed while preserving the high quality of speech output. Benefiting from these strategies, VALL-E R obtains controllablity over phonemes and demonstrates its strong robustness by approaching the WER of ground truth. In addition, it requires fewer autoregressive steps, with over 60% time reduction during inference. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia. Audio samples will be available at: this https URL.
摘要:在离散神经音频编解码器的帮助下,大语言模型(LLM)越来越被认为是一种很有前途的零镜头文本到语音(TTS)合成方法。然而,基于采样的解码策略在给新一代带来惊人多样性的同时,也带来了诸如打字错误、遗漏和重复等健壮性问题。此外,音频的高采样率也给自回归的推理过程带来了巨大的计算开销。为了解决这些问题,我们在Vall-E的基础上提出了一种健壮而高效的零激发TTS系统Vall-E R。具体地说,我们引入了音素单调对齐策略来加强音素和声学序列之间的联系,通过约束声学标记与其关联的音素匹配来确保更精确的对准。此外,我们还采用了一种编解码合并的方法,对浅量化层的离散编码进行下采样,从而在保持高质量语音输出的同时加快了解码速度。得益于这些策略,VALL-E-R获得了对音素的可控性,并通过逼近基本事实的WER来展示其强大的稳健性。此外,它需要更少的自回归步骤,推理时间减少60%以上。这项研究有可能应用于有意义的项目,包括为失语症患者创造语音。音频样本可在以下网址获得:This HTTPS URL。

[NLP-48] Dynamic Stochastic Decoding Strategy for Open-Domain Dialogue Generation
[NLP-48] 开放领域对话生成的动态随机解码策略

链接: https://arxiv.org/abs/2406.07850
作者: Yiwei Li,Fei Mi,Yitong Li,Yasheng Wang,Bin Sun,Shaoxiong Feng,Kan Li
关键词: dialogue generation task, Stochastic sampling strategies, generation task, sampling strategies, top-k and top-p
中文关键词: 对话生成任务、随机抽样策略、生成任务、抽样策略、top-k和top-p
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2024 Findings

点击查看摘要

Abstract:Stochastic sampling strategies such as top-k and top-p have been widely used in dialogue generation task. However, as an open-domain chatting system, there will be two different conversation scenarios, i.e. chit-chat and knowledge-based question answering. In the former situation, responses diversity is essential due to the one-to-many nature in dialogue. The latter, on the other hand, requires less randomness given that stochastic decoding strategy entails the risk of generating incorrect information. As a result, an adaptive and flexible decoding strategy is needed to cope with these two scenarios simultaneously. To this end, we propose the dynamic decoding strategy (DDS), which can adjust the decoding space w.r.t. different contexts. In DDS, both sequence-level and token-level adaptive search can be achieved to adjust the decoding process in a unified framework. Besides, our adaptive algorithm can not only be used during model inference, but it can also be applied during the model training stage to further enhance the performance. Comprehensive experiments indicate that the proposed decoding strategy can consistently improve the performance of pre-trained dialogue models when coupled with four well-used stochastic decoding algorithms.
摘要:TOP-K和TOP-P等随机抽样策略在对话生成任务中得到了广泛的应用。然而,作为一个开放领域的聊天系统,将有两种不同的对话场景,即Chit-chat和基于知识的问答。在前一种情况下,由于对话中的一对多性质,回应的多样性至关重要。另一方面,后者需要较少的随机性,因为随机解码策略会带来产生错误信息的风险。因此,需要一种自适应和灵活的解码策略来同时应对这两种情况。为此,我们提出了动态译码策略(DDS),该策略可以调整译码空间w.r.t.不同的背景。在DDS中,可以同时实现序列级和令牌级的自适应搜索,从而在统一的框架内调整译码过程。此外,我们的自适应算法不仅可以用于模型推理,还可以应用于模型训练阶段,以进一步提高性能。综合实验表明,该译码策略与四种常用的随机译码算法相结合,能够持续提高预先训练好的对话模型的性能。

[NLP-49] Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
[NLP-49] 用多模式分层交叉注意模型标记在线视频中的喜剧恶作剧内容

链接: https://arxiv.org/abs/2406.07841
作者: Elaheh Baharlouei,Mahsa Shafaei,Yigeng Zhang,Hugo Jair Escalante,Thamar Solorio
关键词: detecting questionable content, comic mischief, comic mischief detection, specifically the subcategory, address the challenge
中文关键词: 检测可疑内容、漫画恶作剧、漫画恶作剧检测,特别是子类别,可以解决挑战
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We address the challenge of detecting questionable content in online media, specifically the subcategory of comic mischief. This type of content combines elements such as violence, adult content, or sarcasm with humor, making it difficult to detect. Employing a multimodal approach is vital to capture the subtle details inherent in comic mischief content. To tackle this problem, we propose a novel end-to-end multimodal system for the task of comic mischief detection. As part of this contribution, we release a novel dataset for the targeted task consisting of three modalities: video, text (video captions and subtitles), and audio. We also design a HIerarchical Cross-attention model with CAPtions (HICCAP) to capture the intricate relationships among these modalities. The results show that the proposed approach makes a significant improvement over robust baselines and state-of-the-art models for comic mischief detection and its type classification. This emphasizes the potential of our system to empower users, to make informed decisions about the online content they choose to see. In addition, we conduct experiments on the UCF101, HMDB51, and XD-Violence datasets, comparing our model against other state-of-the-art approaches showcasing the outstanding performance of our proposed model in various scenarios.
摘要:我们解决了在网络媒体中发现可疑内容的挑战,特别是漫画恶作剧这一子类别。这类内容将暴力、成人内容或讽刺等元素与幽默结合在一起,使其难以被检测到。采用多模式方法对于捕捉喜剧恶作剧内容中固有的微妙细节至关重要。为了解决这一问题,我们提出了一种新的端到端多通道系统用于漫画恶作剧检测任务。作为这一贡献的一部分,我们发布了一个新的数据集,用于目标任务,包括三种形式:视频、文本(视频字幕和字幕)和音频。我们还设计了一个带字幕的分层交叉注意模型(HICCAP)来捕捉这些通道之间的复杂关系。实验结果表明,与稳健基线和现有模型相比,该方法在漫画恶作剧的检测和类型分类上有了明显的改进。这强调了我们的系统的潜力,赋予用户权力,对他们选择观看的在线内容做出明智的决定。此外,我们在UCF101、HMDB51和XD-Violence数据集上进行了实验,将我们的模型与其他最先进的方法进行了比较,展示了我们提出的模型在各种场景中的出色性能。

[NLP-50] SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
[NLP-50] SciRIFF:一个增强科学文献语言模型教学遵循的资源

链接: https://arxiv.org/abs/2406.07835
作者: David Wadden,Kejian Shi,Jacob Morrison,Aakanksha Naik,Shruti Singh,Nitzan Barzilay,Kyle Lo,Tom Hope,Luca Soldaini,Shannon Zejiang Shen,Doug Downey,Hannaneh Hajishirzi,Arman Cohan
关键词: literature understanding capabilities, question answering, claim verification, scientific literature understanding, essential scientific literature
中文关键词: 文献理解能力、问题回答、主张验证、科学文献理解、基本科学文献
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to NeurIPS Datasets and Benchmarks 2024

点击查看摘要

Abstract:We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed task specifications, and complex structured outputs. While instruction-following resources are available in specific domains such as clinical medicine and chemistry, SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields. To demonstrate the utility of SciRIFF, we develop a sample-efficient strategy to adapt a general instruction-following model for science by performing additional finetuning on a mix of general-domain and SciRIFF demonstrations. In evaluations on nine held-out scientific tasks, our model – called SciTulu – improves over a strong LLM baseline by 28.1% and 6.5% at the 7B and 70B scales respectively, while maintaining general instruction-following performance within 2% of the baseline. We are optimistic that SciRIFF will facilitate the development and evaluation of LLMs to help researchers navigate the ever-growing body of scientific literature. We release our dataset, model checkpoints, and data processing and evaluation code to enable further research.
摘要:我们提出了SciRIFF(科学指导和微调资源),这是一个137K指令跟随演示的数据集,用于54个任务,涵盖了五个基本的科学文献理解能力:信息提取、摘要、问题回答、主张验证和分类。SciRIFF演示以其冗长的输入上下文、详细的任务规范和复杂的结构化输出而闻名。虽然在临床医学和化学等特定领域提供了指导资源,但SciRIFF是第一个专注于从广泛科学领域的研究文献中提取和合成信息的数据集。为了证明SciRIFF的有效性,我们开发了一种样本高效的策略,通过对一般领域和SciRIFF演示的混合执行额外的微调,来适应科学的一般指令遵循模型。在对九项悬而未决的科学任务的评估中,我们的模型–名为SciTulu–在7B和70B量表上分别比强大的LLM基线提高了28.1%和6.5%,同时将一般指令遵循的表现保持在基线的2%以内。我们乐观地认为,本论坛将促进LLMS的开发和评估,以帮助研究人员在不断增长的科学文献中导航。我们发布了我们的数据集、模型检查点以及数据处理和评估代码,以便进行进一步的研究。

[NLP-51] PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding
[NLP-51] PRoDeliberation:端到端口语理解的并行稳健审议

链接: https://arxiv.org/abs/2406.07823
作者: Trang Le,Daniel Lazar,Suyoun Kim,Shan Jiang,Duc Le,Adithya Sagar,Aleksandr Livshits,Ahmed Aly,Akshat Shrivastava
关键词: Spoken Language Understanding, Spoken Language, Language Understanding, Connectionist Temporal Classification-based, voice assistants
中文关键词: 口语理解,口语,语言理解,基于连接主义时态分类的语音助手
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation, however these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions of autoregressive deliberation systems. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.
摘要:口语理解(SLU)是语音助手的重要组成部分,它包括将语音转换为语义分析以执行任务。前人的工作探索了端到端模型来提高SLU模型的质量和稳健性,但是这些模型仍然是自回归的,导致了更高的延迟。在这项工作中,我们引入了PRoDeliberation,这是一种新的方法,利用了一种基于连接主义时态分类的解码策略和去噪目标来训练健壮的非自回归审议模型。我们证明了PRoDeliberation实现了并行解码的延迟减少(比自回归模型提高了2-10倍),同时保留了纠正自回归审议系统的自动语音识别(ASR)误译的能力。我们进一步证明了去噪训练的设计使PRoDeliberation能够克服小型ASR设备的局限性,并对系统的每个组件的必要性进行了分析。

[NLP-52] Me Whats Next: Textual Foresight for Generic UI Representations
[NLP-52] 接下来:通用UI表示的文本展望

链接: https://arxiv.org/abs/2406.07822
作者: Andrea Burns,Kate Saenko,Bryan A. Plummer
关键词: automating user commands, user interfaces, app user interfaces, Textual Foresight, user commands
中文关键词: 自动化用户命令、用户界面、应用程序用户界面、文本前瞻、用户命令
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ACL 2024 Findings. Data and code to be released at this https URL

点击查看摘要

Abstract:Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To combat this, we propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. Our approach requires joint reasoning over elements and entire screens, resulting in improved UI features: on generation tasks, UI agents trained with Textual Foresight outperform state-of-the-art by 2% with 28x fewer images. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning. OpenApp enables new baselines, and we find Textual Foresight improves average task performance over them by 5.7% while having access to 2x less data.
摘要:移动应用程序用户界面(UI)包含丰富的动作、文本、结构和图像内容,可用于学习任务的通用UI表示,如自动执行用户命令、总结内容和评估用户界面的可访问性。以前的工作已经学习了具有局部或全局字幕丢失的强视觉表示,但未能保留这两个粒度。为了解决这一问题,我们提出了文本预见,这是一种用于学习UI屏幕表示的新的预训练目标。Text Foresight生成给定当前用户界面和采取的本地操作的未来用户界面状态的全局文本描述。我们的方法需要对元素和整个屏幕进行联合推理,从而改进了用户界面功能:在生成任务方面,使用Text Foresight训练的用户界面代理在图像减少28倍的情况下比最先进的用户界面代理的性能高出2%。我们使用我们新构建的移动应用程序数据集OpenApp进行训练,这将产生第一个用于应用程序UI表示学习的公共数据集。OpenApp支持新的基线,我们发现Text Foresight比它们的平均任务性能提高了5.7%,同时访问的数据减少了2倍。

[NLP-53] Are Large Language Models Good Statisticians?
[NLP-53] 大型语言模型是优秀的统计学家吗?

链接: https://arxiv.org/abs/2406.07815
作者: Yizhang Zhu,Shiyin Du,Boyan Li,Yuyu Luo,Nan Tang
关键词: Large Language Models, Large Language, tasks including mathematics, scientific tasks including, demonstrated impressive capabilities
中文关键词: 大型语言模型、大型语言、包括数学在内的任务、包括科学任务在内的展示了令人印象深刻的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages, 10 figures,19 tables. Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across a range of scientific tasks including mathematics, physics, and chemistry. Despite their successes, the effectiveness of LLMs in handling complex statistical tasks remains systematically under-explored. To bridge this gap, we introduce StatQA, a new benchmark designed for statistical analysis tasks. StatQA comprises 11,623 examples tailored to evaluate LLMs’ proficiency in specialized statistical tasks and their applicability assessment capabilities, particularly for hypothesis testing methods. We systematically experiment with representative LLMs using various prompting strategies and show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%, indicating significant room for improvement. Notably, while open-source LLMs (e.g. LLaMA-3) show limited capability, those fine-tuned ones exhibit marked improvements, outperforming all in-context learning-based methods (e.g. GPT-4o). Moreover, our comparative human experiments highlight a striking contrast in error types between LLMs and humans: LLMs primarily make applicability errors, whereas humans mostly make statistical task confusion errors. This divergence highlights distinct areas of proficiency and deficiency, suggesting that combining LLM and human expertise could lead to complementary strengths, inviting further investigation into their collaborative potential.
摘要:大型语言模型在包括数学、物理和化学在内的一系列科学任务中表现出了令人印象深刻的能力。尽管取得了成功,但小岛屿发展中国家在处理复杂统计任务方面的有效性仍未得到系统的探讨。为了弥补这一差距,我们引入了StatQA,这是一种为统计分析任务设计的新基准。STATQA包括11,623个实例,用于评估LLMS在专门统计任务中的熟练程度及其适用性评估能力,特别是假设检验方法的适用性评估能力。我们使用不同的提示策略对具有代表性的LLMS进行了系统的实验,结果表明,即使是最先进的模型,如GPT-40,其最佳性能也只有64.83%,表明有很大的改进空间。值得注意的是,尽管开源LLMS(例如,LLAMA-3)显示的能力有限,但那些经过微调的LLM显示出显著的改进,表现优于所有基于上下文学习的方法(例如,GPT-40)。此外,我们的对比人体实验突出了LLMS和人类在错误类型上的显著差异:LLM主要犯适用性错误,而人类主要犯统计任务混淆错误。这种差异突出了不同领域的熟练程度和不足之处,表明将LLM和人类的专业知识结合起来可能会导致优势互补,从而促使对它们的合作潜力进行进一步调查。

[NLP-54] Collective Constitutional AI: Aligning a Language Model with Public Input
[NLP-54] 集体宪法人工智能:将语言模型与公共输入保持一致

链接: https://arxiv.org/abs/2406.07814
作者: Saffron Huang,Divya Siddarth,Liane Lovitt,Thomas I. Liao,Esin Durmus,Alex Tamkin,Deep Ganguli
关键词: present Collective Constitutional, growing consensus, sole deciders, methods that enable, enable the broader
中文关键词: 目前的集体宪法、不断增长的共识、唯一的决定者、能够实现、能够实现更广泛的目标的方法
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs-from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from a LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of a refusal. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.
摘要:越来越多的人达成共识,认为语言模型(LM)开发人员不应该是LM行为的唯一决策者,这就产生了对方法的需求,使更广泛的公众能够集体地塑造影响他们的LM系统的行为。为了满足这一需求,我们提出了集体宪法人工智能(CCAI):一个多阶段的过程,用于寻找公共投入并将其整合到LMS中–从确定目标人群到寻找原则,再到训练和评估一个模型。我们通过创建据我们所知的第一个使用集体来源的公共输入进行微调的LM,并根据LM开发人员根据既定原则训练的基线模型来评估此模型,从而展示了这种方法的现实世界实用性。我们的定量评估展示了我们的方法的几个好处:与基线模型相比,CCAI培训的模型在九个社交维度上显示出更低的偏见,同时在语言、数学和有益无害的评估上保持相同的表现。对这些模型的定性比较表明,这些模型因其各自的构成而不同,例如,当提示有争议的话题时,经CCAI训练的模型往往会产生积极的反应,而不是拒绝。这些结果展示了一条通向公开信息的语言模型发展的有希望的、可处理的途径。

[NLP-55] o be Continuous or to be Discrete Those are Bits of Questions
[NLP-55] o是连续的还是离散的这些都是一些问题

链接: https://arxiv.org/abs/2406.07812
作者: Yiran Wang,Masao Utiyama
关键词: Recently, continuous input vectors, binary representation, replace continuous input, representation
中文关键词: 最近,连续输入载体、二进制表示法取代了连续输入、表示法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL-2024

点击查看摘要

Abstract:Recently, binary representation has been proposed as a novel representation that lies between continuous and discrete representations. It exhibits considerable information-preserving capability when being used to replace continuous input vectors. In this paper, we investigate the feasibility of further introducing it to the output side, aiming to allow models to output binary labels instead. To preserve the structural information on the output side along with label information, we extend the previous contrastive hashing method as structured contrastive hashing. More specifically, we upgrade CKY from label-level to bit-level, define a new similarity function with span marginal probabilities, and introduce a novel contrastive loss function with a carefully designed instance selection strategy. Our model achieves competitive performance on various structured prediction tasks, and demonstrates that binary representation can be considered a novel representation that further bridges the gap between the continuous nature of deep learning and the discrete intrinsic property of natural languages.
摘要:最近,二进制表示被提出为介于连续表示和离散表示之间的一种新的表示。当它被用来代替连续输入向量时,它表现出相当大的信息保持能力。在本文中,我们研究了将其进一步引入到输出端的可行性,目的是允许模型输出二进制标签。为了保留输出端的结构信息和标签信息,我们将以前的对比散列方法扩展为结构化对比散列。具体地说,我们将CKY从标签级提升到比特级,定义了一个具有跨度边际概率的新的相似度函数,并引入了一种新的对比损失函数和精心设计的实例选择策略。我们的模型在不同的结构化预测任务上取得了有竞争力的性能,并表明二进制表示可以被认为是一种新的表示,进一步弥合了深度学习的连续性和自然语言的离散固有属性之间的差距。

[NLP-56] PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
[NLP-56] PolySpeech:探索统一的多任务语音模型,以与单任务模型竞争

链接: https://arxiv.org/abs/2406.07801
作者: Runyan Yang,Huibao Yang,Xiqing Zhang,Tiantian Ye,Ying Liu,Yingying Gao,Shilei Zhang,Chao Deng,Junlan Feng
关键词: speech, attempts to integrate, Recently, speech processing tasks, tasks
中文关键词: 语音,试图集成,最近,语音处理任务,任务
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works directly demonstrated that joint optimization of diverse tasks in multitask speech models has positive influence on the performance of individual tasks. In this paper we present a multitask speech model – PolySpeech, which supports speech recognition, speech synthesis, and two speech classification tasks. PolySpeech takes multi-modal language model as its core structure and uses semantic representations as speech inputs. We introduce semantic speech embedding tokenization and speech reconstruction methods to PolySpeech, enabling efficient generation of high-quality speech for any given speaker. PolySpeech shows competitiveness across various tasks compared to single-task models. In our experiments, multitask optimization achieves performance comparable to single-task optimization and is especially beneficial for specific tasks.
摘要:最近,有人试图将各种语音处理任务集成到统一模型中。然而,之前很少有工作直接证明多任务语音模型中不同任务的联合优化对单个任务的性能有积极影响。本文提出了一种多任务语音模型-- PolySpeech,它支持语音识别、语音合成和两种语音分类任务。PolySpeech以多模式语言模型为核心结构,使用语义表示作为语音输入。我们向PolySpeech引入了语义语音嵌入标记化和语音重建方法,能够为任何指定说话者高效生成高质量语音。与单任务模型相比,PolySpeech在各种任务中表现出竞争力。在我们的实验中,多任务优化的性能与单任务优化相当,并且对于特定任务特别有利。

[NLP-57] IndirectRequests: Making Task-Oriented Dialogue Datasets More Natural by Synthetically Generating Indirect User Requests
[NLP-57] 间接请求:通过综合生成间接用户请求,使面向任务的对话数据集更加自然

链接: https://arxiv.org/abs/2406.07794
作者: Amogh Mannekote,Jinseok Nam,Ziming Li,Jian Gao,Kristy Elizabeth Boyer,Bonnie J. Dorr
关键词: Existing benchmark corpora, Existing benchmark, giving template-based goal, template-based goal descriptions, descriptions to crowdworkers
中文关键词: 现有基准库、现有基准、给出基于模板的目标、基于模板的目标描述、对众筹者的描述
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing benchmark corpora of task-oriented dialogue are collected either using a “machines talking to machines” approach or by giving template-based goal descriptions to crowdworkers. These methods, however, often produce utterances that are markedly different from natural human conversations in which people often convey their preferences in indirect ways, such as through small talk. We term such utterances as Indirect User Requests (IURs). Understanding such utterances demands considerable world knowledge and reasoning capabilities on the listener’s part. Our study introduces an LLM-based pipeline to automatically generate realistic, high-quality IURs for a given domain, with the ultimate goal of supporting research in natural language understanding (NLU) and dialogue state tracking (DST) for task-oriented dialogue systems. Our findings show that while large LLMs such as GPT-3.5 and GPT-4 generate high-quality IURs, achieving similar quality with smaller models is more challenging. We release IndirectRequests, a dataset of IURs that advances beyond the initial Schema-Guided Dialog (SGD) dataset in that it provides a challenging testbed for testing the “in the wild” performance of NLU and DST models.
摘要:现有的任务型对话基准语料库的收集要么是使用机器与机器对话的方法,要么是通过向众筹人员提供基于模板的目标描述。然而,这些方法往往产生与自然的人类对话截然不同的话语,在自然对话中,人们经常以间接的方式传达他们的偏好,比如通过闲聊。我们将这种发声称为间接用户请求(IURs)。理解这样的话语需要听者具备相当多的世界知识和推理能力。我们的研究引入了一种基于LLM的管道来自动生成给定领域的真实、高质量的IURs,最终目标是支持面向任务的对话系统的自然语言理解(NLU)和对话状态跟踪(DST)方面的研究。我们的发现表明,虽然GPT-3.5和GPT-4等大型LLM可以生成高质量的IURs,但用较小的模型实现类似质量的IURs更具挑战性。我们发布了IndirectRequest,这是一个IURs的数据集,它超越了初始的架构引导对话(SGD)数据集,因为它为测试NLU和DST模型的“野外”性能提供了一个具有挑战性的试验台。

[NLP-58] Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs
[NLP-58] 评判法官:LLM成对比较评估中立场偏差的系统调查

链接: https://arxiv.org/abs/2406.07791
作者: Lin Shi,Weicheng Ma,Soroush Vosoughi
关键词: compromise its effectiveness, offers a promising, inherent biases, promising alternative, alternative to human
中文关键词: 损害其有效性,提供了一个有希望的、固有的偏见、有希望的替代方案、人类的替代方案
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 70 pages, around 200 figures and subfigures

点击查看摘要

Abstract:LLM-as-a-Judge offers a promising alternative to human judges across various tasks, yet inherent biases, particularly position bias - a systematic preference for answers based on their position in the prompt - compromise its effectiveness. Our study investigates this issue by developing a framework to systematically study and quantify position bias using metrics such as repetitional consistency, positional consistency, and positional fairness. We conduct experiments with 9 judge models across 22 tasks from the MTBench and DevBench benchmarks and nearly 40 answer-generating models, generating approximately 80,000 evaluation instances. This comprehensive assessment reveals significant variations in bias across judges and tasks. Although GPT-4 often excels in positional consistency and fairness, some more cost-effective models perform comparably or even better in specific tasks, highlighting essential trade-offs between consistency, fairness, and cost. Our results also demonstrate high consistency of judgment across repetitions, confirming that position bias is not due to random variations. This research significantly contributes to the field by introducing new concepts for understanding position bias and providing a multi-dimensional framework for evaluation. These insights guide the selection of optimal judge models, enhance benchmark design, and lay the foundation for future research into effective debiasing strategies, ultimately enhancing the reliability of LLM evaluators.
摘要:LLM-as-a-Court为人类法官在各种任务中提供了一种有希望的替代方案,但固有的偏见,特别是立场偏见–根据他们在即时事件中的位置系统性地偏爱答案–影响了它的有效性。我们的研究通过建立一个框架来系统地研究和量化位置偏差,并使用重复一致性、位置一致性和位置公平性来研究这一问题。我们使用9个判断模型对来自MTB边和DevBtch基准测试的22个任务和近40个答案生成模型进行了测试,生成了大约8万个评估实例。这项综合评估显示,不同法官和不同任务之间的偏见差异很大。尽管GPT-4通常在位置一致性和公平性方面表现出色,但一些更具成本效益的模型在特定任务中的表现与之相当,甚至更好,突出了一致性、公平性和成本之间的重要权衡。我们的结果也证明了重复判断的高度一致性,证实了位置偏差不是由于随机变化造成的。本研究通过引入新的概念来理解位置偏差,并为评估提供了一个多维的框架,从而为该领域做出了重大贡献。这些见解指导了最优判断模型的选择,加强了基准设计,并为未来研究有效的去偏策略奠定了基础,最终提高了LLM评估者的可靠性。

[NLP-59] A Critical Look At Tokenwise Reward-Guided Text Generation
[NLP-59] 批判性地审视Tokenwise奖励引导文本生成

链接: https://arxiv.org/abs/2406.07780
作者: Ahmad Rashid,Ruotian Wu,Julia Grosse,Agustinus Kristiadi,Pascal Poupart
关键词: Large language models, so-called reinforcement learning, Large language, human preferences, human feedback
中文关键词: 大型语言模型,所谓的强化学习,大型语言,人类偏好,人类反馈
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can significantly be improved by aligning to human preferences – the so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Due to their ability to bypass LLM finetuning, tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during a tokenwise decoding, in a bid to steer the generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this issue, we propose to explicitly train a Bradley-Terry reward model on partial sequences, and autoregressively sample from the implied tokenwise policy during decoding time. We study the property of this reward model and the implied policy. In particular, we show that this policy is proportional to the ratio of two distinct RLHF policies. We show that our simple approach outperforms previous RGTG methods and achieves similar performance as strong offline baselines but without large-scale LLM finetuning.
摘要:大语言模型(LLM)可以通过与人类偏好保持一致而得到显著改进,这就是所谓的人类反馈强化学习(RLHF)。然而,微调LLM的成本对许多用户来说是令人望而却步的。由于它们能够绕过LLM精调,最近提出了标记式奖励制导文本生成(RGTG)方法。他们使用在全序列上训练的奖励模型在令牌式解码过程中对部分序列进行评分,以期引导这一代人获得高回报的序列。然而,到目前为止,这些方法只是启发式的动机和分析不足。在这项工作中,我们证明了在全序列上训练的奖励模型与评分部分序列不兼容。为了缓解这一问题,我们提出了显式训练部分序列的Bradley-Terry奖励模型,并在解码过程中从隐含的令牌式策略中自动回归采样。我们研究了这一报酬模型的性质及其隐含策略。特别地,我们证明了该策略与两个不同的RLHF策略的比率成正比。我们证明了我们的SIMPLE方法优于以前的RGTG方法,并且获得了与强离线基线相似的性能,但没有大规模的LLM精调。

[NLP-60] On Trojans in Refined Language Models
[NLP-60] 精致语言模型中的特洛伊木马

链接: https://arxiv.org/abs/2406.07778
作者: Jayaram Raghuram,George Kesidis,David J. Miller
关键词: product reviews, determining the sentiment, sentiment of product, language model, Trojan
中文关键词: 产品评论、确定情绪、产品情绪、语言模型、特洛伊木马
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A Trojan in a language model can be inserted when the model is refined for a particular application such as determining the sentiment of product reviews. In this paper, we clarify and empirically explore variations of the data-poisoning threat model. We then empirically assess two simple defenses each for a different defense scenario. Finally, we provide a brief survey of related attacks and defenses.
摘要:当针对特定应用程序改进模型(例如确定产品评论的情绪)时,可以插入语言模型中的特洛伊木马。在本文中,我们澄清并从经验上探讨了数据中毒威胁模型的变体。然后,我们根据经验评估了两种简单的防御方案,每种防御方案都适用于不同的防御方案。最后,我们提供了相关攻击和防御的简要调查。

[NLP-61] LT4SG@SMM4H24: Tweets Classification for Digital Epidemiology of Childhood Health Outcomes Using Pre-Trained Language Models
[NLP-61] LT 4SG@SMM4H24:使用预先训练的语言模型对儿童健康结果的数字流行病学进行推文分类

链接: https://arxiv.org/abs/2406.07759
作者: Dasun Athukoralage,Thushari Atapattu,Menasha Thilakaratne,Katrina Falkner
关键词: English tweets reporting, children medical disorders, Shared Task, tweets reporting children, reporting children medical
中文关键词: 英语推文报告、儿童医疗疾病、共享任务、报告儿童的推文、报告儿童医疗
类目: Computation and Language (cs.CL)
备注: Submitted for the 9th Social Media Mining for Health Research and Applications Workshop and Shared Tasks- Large Language Models (LLMs) and Generalizability for Social Media NLP

点击查看摘要

Abstract:This paper presents our approaches for the SMM4H24 Shared Task 5 on the binary classification of English tweets reporting children’s medical disorders. Our first approach involves fine-tuning a single RoBERTa-large model, while the second approach entails ensembling the results of three fine-tuned BERTweet-large models. We demonstrate that although both approaches exhibit identical performance on validation data, the BERTweet-large ensemble excels on test data. Our best-performing system achieves an F1-score of 0.938 on test data, outperforming the benchmark classifier by 1.18%.
摘要:本文介绍了我们针对SMM4 H24共享任务5对报告儿童医学疾病的英语推文进行二进制分类的方法。我们的第一种方法涉及微调单个RoBERTA大型模型,而第二种方法需要集成三个微调BERTweet大型模型的结果。我们证明,尽管两种方法在验证数据上表现出相同的性能,但BERTweet大型集成在测试数据上表现出色。我们性能最好的系统在测试数据上的F1评分为0.938,比基准分类器高出1.18%。

[NLP-62] he MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition
[NLP-62] MuSe 2024多模式情绪分析挑战:社会感知和幽默识别

链接: https://arxiv.org/abs/2406.07753
作者: Shahin Amiriparian,Lukas Christ,Alexander Kathan,Maurice Gerczuk,Niklas Müller,Steffen Klug,Lukas Stappen,Andreas König,Erik Cambria,Björn Schuller,Simone Eulitz
关键词: Social Perception Sub-Challenge, Social Perception, Multimodal Sentiment Analysis, contemporary multimodal affect, Football Coach Humor
中文关键词: 社会感知子挑战、社会感知、多模式情绪分析、当代多模式情感、足球教练幽默
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems: In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals such as assertiveness, dominance, likability, and sincerity based on the provided audio-visual data. The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset, focusing on the detection of spontaneous humor in a cross-lingual and cross-cultural setting. The main objective of MuSe 2024 is to unite a broad audience from various research domains, including multimodal sentiment analysis, audio-visual affective computing, continuous signal processing, and natural language processing. By fostering collaboration and exchange among experts in these fields, the MuSe 2024 endeavors to advance the understanding and application of sentiment analysis and affective computing across multiple modalities. This baseline paper provides details on each sub-challenge and its corresponding dataset, extracted features from each data modality, and discusses challenge baselines. For our baseline system, we make use of a range of Transformers and expert-designed features and train Gated Recurrent Unit (GRU)-Recurrent Neural Network (RNN) models on them, resulting in a competitive baseline system. On the unseen test datasets of the respective sub-challenges, it achieves a mean Pearson’s Correlation Coefficient ( \rho ) of 0.3573 for MuSe-Perception and an Area Under the Curve (AUC) value of 0.8682 for MuSe-Humor.
摘要:2024年多通道情绪分析挑战赛(MUSE)涉及两个当代的多通道情绪和情绪分析问题:在社会感知分挑战赛(MUSE-Percept)中,参与者将根据提供的视听数据预测个人的16种不同社会属性,如自信、优势、可爱和真诚。跨文化幽默检测子挑战(缪斯-幽默)数据集扩展了帕骚自发足球教练幽默(帕骚-SFCH)数据集,重点关注跨语言和跨文化环境中自发幽默的检测。缪斯2024的主要目标是团结来自不同研究领域的广泛受众,包括多通道情感分析、视听情感计算、连续信号处理和自然语言处理。通过促进这些领域专家之间的合作和交流,MUSE 2024努力促进情感分析和情感计算在多种模式中的理解和应用。这份基线白皮书提供了每个子挑战及其相应数据集的详细信息,从每个数据模式中提取了特征,并讨论了挑战基线。对于我们的基线系统,我们利用了一系列变压器和专家设计的功能,并在它们上训练门控递归单元(GRU)-递归神经网络(RNN)模型,从而产生一个具有竞争力的基线系统。在看不见的各个子挑战的测试数据集上,缪斯感知的平均皮尔逊相关系数(\RHO)为0.3573,缪斯幽默的曲线下面积(AuC值)为0.8682。

[NLP-63] UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
[NLP-63] UICoder:微调大型语言模型以通过自动反馈生成用户界面代码

链接: https://arxiv.org/abs/2406.07739
作者: Jason Wu,Eldon Schoop,Alan Leung,Titus Barik,Jeffrey P. Bigham,Jeffrey Nichols
关键词: visually relevant designs, produces visually relevant, struggle to consistently, relevant designs, visually relevant
中文关键词: 视觉相关设计,产生视觉相关,努力保持一致,相关设计,视觉相关
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: Accepted to NAACL 2024

点击查看摘要

Abstract:Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.
摘要:大型语言模型(LLM)难以始终如一地生成可编译和生成视觉相关设计的UI代码。现有的改进世代的方法依赖于昂贵的人工反馈或提炼出专有模型。在本文中,我们探索使用自动反馈(编译器和多模式模型)来指导LLM生成高质量的UI代码。我们的方法从现有的LLM开始,通过使用原始模型自动生成大型合成数据集,应用自动化工具积极地对数据进行筛选、评分和重复数据删除,从而迭代生成改进的模型。通过对改进后的数据集进行微调,改进了原始的LLM算法。我们将我们的方法应用于几个开源的LLM,并将所得到的性能与具有自动化度量和人工偏好的基线模型进行比较。我们的评估显示,结果模型的表现优于所有其他可下载基准,并接近更大的专有模型的性能。

[NLP-64] MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models
[NLP-64] MultiPragEval:大型语言模型的多语言务实评估

链接: https://arxiv.org/abs/2406.07736
作者: Dojun Park,Jiwoo Lee,Seohyun Park,Hyeyun Jeong,Youngeun Koo,Soonha Hwang,Seonwoo Park,Sungeun Lee
关键词: basic knowledge assessment, higher-level language understanding, focusing on higher-level, Grice Cooperative Principle, increasingly important
中文关键词: 基础知识评估、更高层次的语言理解、专注于更高层次的格赖斯合作原则,越来越重要
类目: Computation and Language (cs.CL)
备注: 8 pages, under review

点击查看摘要

Abstract:As the capabilities of LLMs expand, it becomes increasingly important to evaluate them beyond basic knowledge assessment, focusing on higher-level language understanding. This study introduces MultiPragEval, a robust test suite designed for the multilingual pragmatic evaluation of LLMs across English, German, Korean, and Chinese. Comprising 1200 question units categorized according to Grice’s Cooperative Principle and its four conversational maxims, MultiPragEval enables an in-depth assessment of LLMs’ contextual awareness and their ability to infer implied meanings. Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages, establishing a state-of-the-art in the field. Among open-source models, Solar-10.7B and Qwen1.5-14B emerge as strong competitors. This study not only leads the way in the multilingual evaluation of LLMs in pragmatic inference but also provides valuable insights into the nuanced capabilities necessary for advanced language comprehension in AI systems.
摘要:随着LLMS能力的扩展,对其进行评估变得越来越重要,不再局限于基本知识评估,而是侧重于更高层次的语言理解。本研究介绍了一套健壮的测试套件MultiPragEval,该测试套件旨在对英语、德语、韩语和汉语中的LLMS进行多语种语用评估。根据格赖斯的合作原则和四条会话准则,由1200个问题单元组成的多语用价值评估系统能够深入评估LLMS的语境意识及其推断隐含意义的能力。我们的发现表明,Claude3-Opus在所有测试语言中的表现明显优于其他模型,在该领域建立了最先进的水平。在开源模型中,Solar-10.7B和Qwen1.5-14B成为强大的竞争对手。这项研究不仅在语用推理中对LLMS的多语言评估起到了引领作用,而且对人工智能系统中高级语言理解所需的细微差别能力提供了有价值的见解。

[NLP-65] REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy
[NLP-65] 真实抽样:通过渐进熵提高开放一代的事实性和多样性

链接: https://arxiv.org/abs/2406.07735
作者: Haw-Shiuan Chang,Nanyun Peng,Mohit Bansal,Anil Ramakrishna,Tagyoung Chung
关键词: REAL sampling, large language models, sampling, REAL, REAL sampling predicts
中文关键词: REAL抽样,大型语言模型,抽样,REAL,REAL抽样预测
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Decoding methods for large language models (LLMs) usually struggle with the tradeoff between ensuring factuality and maintaining diversity. For example, a higher p threshold in the nucleus (top-p) sampling increases the diversity but decreases the factuality, and vice versa. In this paper, we propose REAL (Residual Entropy from Asymptotic Line) sampling, a decoding method that achieves improved factuality and diversity over nucleus sampling by predicting an adaptive threshold of p . Specifically, REAL sampling predicts the step-wise likelihood of an LLM to hallucinate, and lowers the p threshold when an LLM is likely to hallucinate. Otherwise, REAL sampling increases the p threshold to boost the diversity. To predict the step-wise hallucination likelihood without supervision, we construct a Token-level Hallucination Forecasting (THF) model to predict the asymptotic entropy (i.e., inherent uncertainty) of the next token by extrapolating the next-token entropies from a series of LLMs with different sizes. If a LLM’s entropy is higher than the asymptotic entropy (i.e., the LLM is more uncertain than it should be), the THF model predicts a high hallucination hazard, which leads to a lower p threshold in REAL sampling. In the FactualityPrompts benchmark, we demonstrate that REAL sampling based on a 70M THF model can substantially improve the factuality and diversity of 7B LLMs simultaneously, judged by both retrieval-based metrics and human evaluation. After combined with contrastive decoding, REAL sampling outperforms 9 sampling methods, and generates texts that are more factual than the greedy sampling and more diverse than the nucleus sampling with p=0.5 . Furthermore, the predicted asymptotic entropy is also a useful unsupervised signal for hallucination detection tasks.
摘要:大型语言模型的译码方法通常需要在保证真实性和保持多样性之间进行权衡。例如,核(top-p)采样中较高的p阈值增加了多样性,但降低了真实性,反之亦然。在本文中,我们提出了实数(来自渐近线的残差熵)采样,这是一种通过预测自适应阈值p来实现比核采样更好的真实性和多样性的译码方法。具体地说,真实抽样预测了LLM产生幻觉的逐步可能性,并在LLM可能出现幻觉时降低了p阈值。否则,实数采样会增加p门限以提高分集。为了在无监督的情况下预测阶梯式幻觉可能性,我们构建了一个令牌级幻觉预测(THF)模型,通过外推一系列不同大小的LLM的下一个令牌熵来预测下一个令牌的渐近熵(即内在不确定性)。如果LLM的熵高于渐近熵(即LLM比其应有的更不确定),THF模型预测有较高的幻觉风险,这导致在实际采样中p阈值较低。在FactualityPrompt基准测试中,我们证明了基于70M THF模型的真实采样可以显著提高7B LLM的真实性和多样性,无论是基于检索的度量还是人工评估。在与对比解码相结合后,实数抽样优于9种抽样方法,并且生成的文本比贪婪抽样更真实,比p=0.5的核抽样更多样化。此外,预测的渐近熵也是用于幻觉检测任务的有用的非监督信号。

[NLP-66] Sustainable self-supervised learning for speech representations
[NLP-66] 语音表示的可持续自我监督学习

链接: https://arxiv.org/abs/2406.07696
作者: Luis Lugo,Valentin Vielzeuf
关键词: artificial intelligence focuses, Sustainable artificial intelligence, make machine learning, machine learning models, focuses on data
中文关键词: 人工智能专注,可持续人工智能,制作机器学习,机器学习模型,专注于数据
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sustainable artificial intelligence focuses on data, hardware, and algorithms to make machine learning models more environmentally responsible. In particular, machine learning models for speech representations are computationally expensive, generating environmental concerns because of their high energy consumption. Thus, we propose a sustainable self-supervised model to learn speech representation, combining optimizations in neural layers and training to reduce computing costs. The proposed model improves over a resource-efficient baseline, reducing both memory usage and computing cost estimations. It pretrains using a single GPU in less than a day. On top of that, it improves the error rate performance of the baseline in downstream task evaluations. When comparing it to large speech representation approaches, there is an order of magnitude reduction in memory usage, while computing cost reductions represent almost three orders of magnitude improvement.
摘要:可持续人工智能专注于数据、硬件和算法,使机器学习模型对环境更加负责。特别是,语音表示的机器学习模型计算成本高昂,并因其高能耗而引发环境问题。因此,我们提出了一种可持续的自我监督模型来学习语音表示,将神经层的优化和训练相结合以降低计算成本。提出的模型在资源高效基线上进行了改进,减少了内存使用和计算成本估计。它在不到一天的时间内使用单个图形处理器进行预训练。最重要的是,它还提高了下游任务评估中基线的错误率性能。当与大型语音表示方法进行比较时,内存使用量减少了一个数量级,而计算成本的减少则代表了近三个数量级的改进。

[NLP-67] A Labelled Dataset for Sentiment Analysis of Videos on YouTube TikTok and Other Sources about the 2024 Outbreak of Measles
[NLP-67] 用于YouTube TikTok和其他来源上有关2024年麻疹爆发的视频情绪分析的标签数据集

链接: https://arxiv.org/abs/2406.07693
作者: Nirmalya Thakur,Vanessa Su,Mingchen Shao,Kesha A. Patel,Hongseok Jeong,Victoria Knieling,Andrew Brian
关键词: internet between January, ongoing outbreak, outbreak of measles, measles published, video
中文关键词: 一月份之间的互联网,持续爆发,麻疹爆发,麻疹发布,视频
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 19 pages

点击查看摘要

Abstract:The work of this paper presents a dataset that contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. The dataset is available at this https URL. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. Finally, this paper also presents a list of open research questions that may be investigated using this dataset.
摘要:本文的工作提供了一个数据集,其中包含2024年1月1日至2024年5月31日期间在互联网264个网站上发布的4011个关于麻疹持续爆发的视频的数据。该数据集可在此HTTPS URL上找到。这些网站主要包括YouTube和TikTok,分别占视频总量的48.6%和15.2%。其余的网站包括Instagram和Facebook,以及各种全球和地方新闻机构的网站。对于这些视频中的每一个,视频的URL、帖子的标题、帖子的描述和视频的发布日期作为单独的属性呈现在数据集中。在开发了该数据集之后,对视频标题和视频描述进行了情感分析(使用Vader)、主观性分析(使用TextBlob)和细粒度情感分析(使用DistilRoBERTa-base)。这包括将每个视频标题和视频描述归入(I)一种情绪类别,即积极、消极或中性,(Ii)一种主观情绪类别,即高度固执、中性或最不固执,以及(Iii)一种细粒度情绪类别,即恐惧、惊讶、喜悦、悲伤、愤怒、厌恶或中性。这些结果在数据集中作为单独的属性呈现,用于机器学习算法的训练和测试,用于执行该领域中的情感分析或主观性分析,以及用于其他应用。最后,本文还列出了可能使用该数据集进行研究的开放研究问题清单。

[NLP-68] ransformer Models in Education: Summarizing Science Textbooks with AraBART MT5 AraT5 and mBART
[NLP-68] 教育中的转换器模型:用AraBART MT 5 AraT 5和mBART总结科学教科书

链接: https://arxiv.org/abs/2406.07692
作者: Sari Masri,Yaqeen Raddad,Fidaa Khandaqji,Huthaifa I. Ashqar,Mohammed Elhenawy
关键词: develop effective tools, increasing amount, urgent to develop, develop effective, effective tools
中文关键词: 开发有效的工具,数量不断增加,迫切需要开发,开发有效的、有效的工具
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Recently, with the rapid development in the fields of technology and the increasing amount of text t available on the internet, it has become urgent to develop effective tools for processing and understanding texts in a way that summaries the content without losing the fundamental essence of the information. Given this challenge, we have developed an advanced text summarization system targeting Arabic textbooks. Relying on modern natu-ral language processing models such as MT5, AraBART, AraT5, and mBART50, this system evaluates and extracts the most important sentences found in biology textbooks for the 11th and 12th grades in the Palestinian curriculum, which enables students and teachers to obtain accurate and useful summaries that help them easily understand the content. We utilized the Rouge metric to evaluate the performance of the trained models. Moreover, experts in education Edu textbook authoring assess the output of the trained models. This approach aims to identify the best solutions and clarify areas needing improvement. This research provides a solution for summarizing Arabic text. It enriches the field by offering results that can open new horizons for research and development in the technologies for understanding and generating the Arabic language. Additionally, it contributes to the field with Arabic texts through creating and compiling schoolbook texts and building a dataset.
摘要:近年来,随着科技领域的快速发展和互联网上可获得的文本数量的不断增加,迫切需要开发一种有效的工具来处理和理解文本,在不丢失信息的基本本质的情况下总结内容。考虑到这一挑战,我们开发了一个针对阿拉伯语教科书的高级文本摘要系统。该系统依靠MT5、AraBART、AraT5和mBART50等现代自然语言处理模型,对巴勒斯坦课程中11年级和12年级的生物教科书中最重要的句子进行评估和提取,使学生和教师能够获得准确和有用的摘要,帮助他们更容易地理解内容。我们使用Rouge度量来评估训练模型的性能。此外,EDU教科书编写方面的专家还会评估训练过的模型的输出情况。这一方法旨在确定最佳解决方案并澄清需要改进的领域。本研究为阿拉伯文本的摘要提供了一种解决方案。它通过提供能够为理解和生成阿拉伯语的技术的研究和开发开辟新的视野的成果,丰富了该领域。此外,它还通过创建和汇编教科书文本以及建立数据集,为阿拉伯文本领域作出贡献。

[NLP-69] Out-Of-Context Prompting Boosts Fairness and Robustness in Large Language Model Predictions
[NLP-69] 上下文外预算提高大型语言模型预测的公平性和稳健性

链接: https://arxiv.org/abs/2406.07685
作者: Leonardo Cotta,Chris J. Maddison
关键词: Large Language Models, Frontier Large Language, Large Language, Language Models, Frontier Large
中文关键词: 大型语言模型,Frontier大型语言,大型语言,语言模型,Frontier大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Frontier Large Language Models (LLMs) are increasingly being deployed for high-stakes decision-making. On the other hand, these models are still consistently making predictions that contradict users’ or society’s expectations, e.g., hallucinating, or discriminating. Thus, it is important that we develop test-time strategies to improve their trustworthiness. Inspired by prior work, we leverage causality as a tool to formally encode two aspects of trustworthiness in LLMs: fairness and robustness. Under this perspective, existing test-time solutions explicitly instructing the model to be fair or robust implicitly depend on the LLM’s causal reasoning capabilities. In this work, we explore the opposite approach. Instead of explicitly asking the LLM for trustworthiness, we design prompts to encode the underlying causal inference algorithm that will, by construction, result in more trustworthy predictions. Concretely, we propose out-of-context prompting as a test-time solution to encourage fairness and robustness in LLMs. Out-of-context prompting leverages the user’s prior knowledge of the task’s causal model to apply (random) counterfactual transformations and improve the model’s trustworthiness. Empirically, we show that out-of-context prompting consistently improves the fairness and robustness of frontier LLMs across five different benchmark datasets without requiring additional data, finetuning or pre-training.
摘要:前沿大语言模型(LLM)越来越多地被用于高风险决策。另一方面,这些模型仍然在始终如一地做出与用户或社会期望相矛盾的预测,例如,幻觉或歧视。因此,重要的是我们制定测试时间策略,以提高他们的可信度。受以前工作的启发,我们利用因果关系作为工具来正式编码LLMS中的可信性的两个方面:公平性和健壮性。在这种观点下,现有的测试时间解决方案显式地指示模型是公平的或健壮的,隐含地依赖于LLM的因果推理能力。在这项工作中,我们探索了相反的方法。我们不是显式地要求LLM提供可信度,而是设计提示来编码潜在的因果推理算法,通过构建,将产生更可靠的预测。具体地说,我们提出了上下文外提示作为一种测试时间解决方案,以鼓励LLMS中的公平性和健壮性。断章取义提示利用用户对任务因果模型的先验知识来应用(随机)反事实转换,并提高模型的可信度。实验表明,断章取义的提示在五个不同的基准数据集上一致地提高了前沿最小二乘模型的公平性和稳健性,而不需要额外的数据、精调或预训练。

[NLP-70] OPTune: Efficient Online Preference Tuning
[NLP-70] OPTune:高效的在线偏好调整

链接: https://arxiv.org/abs/2406.07657
作者: Lichang Chen,Jiuhai Chen,Chenxi Liu,John Kirchenbauer,Davit Soselia,Chen Zhu,Tom Goldstein,Tianyi Zhou,Heng Huang
关键词: Large Language Models, aligning Large Language, Language Models, Large Language, aligning Large
中文关键词: 大型语言模型,对齐大型语言,语言模型,大型语言,对齐大型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:Reinforcement learning with human feedback~(RLHF) is critical for aligning Large Language Models (LLMs) with human preference. Compared to the widely studied offline version of RLHF, \emphe.g. direct preference optimization (DPO), recent works have shown that the online variants achieve even better alignment. However, online alignment requires on-the-fly generation of new training data, which is costly, hard to parallelize, and suffers from varying quality and utility. In this paper, we propose a more efficient data exploration strategy for online preference tuning (OPTune), which does not rely on human-curated or pre-collected teacher responses but dynamically samples informative responses for on-policy preference alignment. During data generation, OPTune only selects prompts whose (re)generated responses can potentially provide more informative and higher-quality training signals than the existing responses. In the training objective, OPTune reweights each generated response (pair) by its utility in improving the alignment so that learning can be focused on the most helpful samples. Throughout our evaluations, OPTune’d LLMs maintain the instruction-following benefits provided by standard preference tuning whilst enjoying 1.27-1.56x faster training speed due to the efficient data exploration strategy.
摘要:带人类反馈的强化学习是使大语言模型(LLM)符合人类偏好的关键。与被广泛研究的RLHF离线版本相比,直接偏好优化(DPO),最近的研究表明,在线变体实现了更好的比对。然而,在线比对需要即时生成新的训练数据,这是昂贵的、难以并行化的,并且存在质量和实用性不同的问题。在本文中,我们提出了一种更有效的在线偏好调整的数据挖掘策略(Optune),该策略不依赖于人工策划或预先收集的教师回复,而是动态采样信息性回复,以进行策略偏好匹配。在数据生成过程中,Optune只选择其(重新)生成的响应可能提供比现有响应更具信息量和更高质量的训练信号的提示。在培训目标中,Optune根据其在改进比对方面的效用对每个生成的响应(对)重新加权,以便学习可以集中在最有帮助的样本上。在我们的整个评估过程中,Optune‘d LLM保持了标准首选项调整提供的指令遵循优势,同时由于高效的数据探索策略,培训速度提高了1.27-1.56倍。

[NLP-71] ag and correct: high precision post-editing approach to correction of speech recognition errors
[NLP-71] AG and correct:高精度后期编辑方法来纠正语音识别错误

链接: https://arxiv.org/abs/2406.07589
作者: Tomasz Ziętkiewicz
关键词: correcting speech recognition, Automatic Speech Recognition, speech recognition errors, speech recognition, correcting speech
中文关键词: 纠正语音识别,自动语音识别,语音识别错误,语音识别,纠正语音
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, Published in Proceedings of the 17th Conference on Computer Science and Intelligence Systems (FedCSIS 2022)

点击查看摘要

Abstract:This paper presents a new approach to the problem of correcting speech recognition errors by means of post-editing. It consists of using a neural sequence tagger that learns how to correct an ASR (Automatic Speech Recognition) hypothesis word by word and a corrector module that applies corrections returned by the tagger. The proposed solution is applicable to any ASR system, regardless of its architecture, and provides high-precision control over errors being corrected. This is especially crucial in production environments, where avoiding the introduction of new mistakes by the error correction model may be more important than the net gain in overall results. The results show that the performance of the proposed error correction models is comparable with previous approaches while requiring much smaller resources to train, which makes it suitable for industrial applications, where both inference latency and training times are critical factors that limit the use of other techniques.
摘要:本文提出了一种新的语音识别纠错方法–后编辑纠错法。它包括使用神经序列标记器和校正器模块,神经序列标记器学习如何逐字纠正ASR(自动语音识别)假设,校正器模块应用标记器返回的校正。建议的解决方案适用于任何ASR系统,无论其架构如何,并提供对被纠正的错误的高精度控制。这在生产环境中尤其重要,在生产环境中,避免通过纠错模型引入新的错误可能比总体结果的净收益更重要。结果表明,所提出的纠错模型的性能与以前的方法相当,而需要训练的资源要少得多,这使得它适合于工业应用,其中推理延迟和训练时间都是限制其他技术使用的关键因素。

[NLP-72] AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning
[NLP-72] 目标:让任何多模式大型语言模型拥抱高效的上下文学习

链接: https://arxiv.org/abs/2406.07588
作者: Jun Gao,Qian Qiao,Ziqiang Cao,Zili Wang,Wenjie Li
关键词: Large Language Models, facilitates Large Language, multi-modal Large Language, Language Models, Large Language
中文关键词: 大型语言模型,促进大型语言,多模式大型语言,语言模型,大型语言
类目: Multimedia (cs.MM); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In-context learning (ICL) facilitates Large Language Models (LLMs) exhibiting emergent ability on downstream tasks without updating billions of parameters. However, in the area of multi-modal Large Language Models (MLLMs), two problems hinder the application of multi-modal ICL: (1) Most primary MLLMs are only trained on single-image datasets, making them unable to read multi-modal demonstrations. (2) With the demonstrations increasing, thousands of visual tokens highly challenge hardware and degrade ICL performance. During preliminary explorations, we discovered that the inner LLM tends to focus more on the linguistic modality within multi-modal demonstrations to generate responses. Therefore, we propose a general and light-weighted framework \textbfAIM to tackle the mentioned problems through \textbfAggregating \textbfImage information of \textbfMultimodal demonstrations to the dense latent space of the corresponding linguistic part. Specifically, AIM first uses the frozen backbone MLLM to read each image-text demonstration and extracts the vector representations on top of the text. These vectors naturally fuse the information of the image-text pair, and AIM transforms them into fused virtual tokens acceptable for the inner LLM via a trainable projection layer. Ultimately, these fused tokens function as variants of multi-modal demonstrations, fed into the MLLM to direct its response to the current query as usual. Because these fused tokens stem from the textual component of the image-text pair, a multi-modal demonstration is nearly reduced to a pure textual demonstration, thus seamlessly applying to any MLLMs. With its de facto MLLM frozen, AIM is parameter-efficient and we train it on public multi-modal web corpora which have nothing to do with downstream test tasks.
摘要:情境学习(ICL)使大型语言模型(LLM)在下游任务中表现出应急能力,而不需要更新数十亿个参数。然而,在多模式大语言模型(MLLMS)领域,有两个问题阻碍了多模式ICL的应用:(1)大多数初级MLLMS只在单图像数据集上进行训练,无法阅读多模式演示。(2)随着演示的增多,数以千计的视觉令牌极大地挑战了硬件,降低了ICL的性能。在初步的探索中,我们发现LLM的内部倾向于更多地关注多通道演示中的语言情态来产生反应。因此,我们提出了一个通用的轻量级框架TextbfAIM,通过将多通道演示的图像信息聚集到相应语言部分的稠密潜在空间来解决上述问题。具体地说,AIM首先使用冻结的主干MLLM来读取每个图文演示,并提取文本顶部的矢量表示。这些向量自然地融合了图文对的信息,而AIM通过一个可训练的投影层将它们转换成内部LLM可接受的融合虚拟标记。最终,这些融合的令牌充当多模式演示的变体,像往常一样输入MLLM以指导其对当前查询的响应。由于这些融合的符号来源于图文对的文本成分,多模式演示几乎被简化为纯文本演示,从而无缝地适用于任何MLLMS。由于其事实上的MLLM被冻结,AIM是参数高效的,并且我们在与下游测试任务无关的公共多模式网络语料库上对其进行训练。

[NLP-73] BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models
[NLP-73] BrainChat:使用视觉语言预训练模型从fMRI中解码语义信息

链接: https://arxiv.org/abs/2406.07584
作者: Wanaiu Huang
关键词: enables non-invasive clinical, non-invasive clinical augmentative, activity enables non-invasive, brain activity enables, semantic information decoding
中文关键词: 实现非侵入性临床、非侵入性临床增强、活动实现非侵入性、大脑活动实现、语义信息解码
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Semantic information is vital for human interaction, and decoding it from brain activity enables non-invasive clinical augmentative and alternative communication. While there has been significant progress in reconstructing visual images, few studies have focused on the language aspect. To address this gap, leveraging the powerful capabilities of the decoder-based vision-language pretrained model CoCa, this paper proposes BrainChat, a simple yet effective generative framework aimed at rapidly accomplishing semantic information decoding tasks from brain activity, including fMRI question answering and fMRI captioning. BrainChat employs the self-supervised approach of Masked Brain Modeling to encode sparse fMRI data, obtaining a more compact embedding representation in the latent space. Subsequently, BrainChat bridges the gap between modalities by applying contrastive loss, resulting in aligned representations of fMRI, image, and text embeddings. Furthermore, the fMRI embeddings are mapped to the generative Brain Decoder via cross-attention layers, where they guide the generation of textual content about fMRI in a regressive manner by minimizing caption loss. Empirically, BrainChat exceeds the performance of existing state-of-the-art methods in the fMRI captioning task and, for the first time, implements fMRI question answering. Additionally, BrainChat is highly flexible and can achieve high performance without image data, making it better suited for real-world scenarios with limited data.
摘要:语义信息对人类交互至关重要,从大脑活动中解码语义信息可以实现非侵入性的临床增强性和替代性交流。虽然在视觉图像重建方面已经有了很大的进展,但很少有研究集中在语言方面。为了弥补这一空白,利用基于解码器的视觉语言预训练模型COCA的强大功能,提出了一种简单而有效的生成式框架BrainChat,旨在快速完成从大脑活动中进行语义信息解码的任务,包括fMRI问答和fMRI字幕。BrainChat使用掩蔽脑建模的自监督方法对稀疏的fMRI数据进行编码,获得了更紧凑的潜在空间嵌入表示。随后,BrainChat通过应用对比损失来弥合通道之间的差距,导致fMRI、图像和文本嵌入的对准表示。此外,通过交叉注意层将fMRI嵌入映射到生成性脑解码器,在那里它们通过最小化字幕损失以一种回归的方式指导关于fMRI的文本内容的生成。从经验来看,BrainChat在fMRI字幕任务中的表现超过了现有最先进的方法,并首次实现了fMRI问答。此外,BrainChat的灵活性很高,在没有图像数据的情况下可以实现高性能,更适合数据有限的现实场景。

[NLP-74] Inference Acceleration for Large Language Models on CPUs
[NLP-74] 处理器上大型语言模型的推理加速

链接: https://arxiv.org/abs/2406.07553
作者: Ditto PS,Jithin VG,Adarsh MS
关键词: demonstrated remarkable performance, large language models, natural language processing, recent years, demonstrated remarkable
中文关键词: 表现出非凡的性能、大型语言模型、自然语言处理、近年来,表现出非凡的性能
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions to handle the computational demands. In this paper, we explore the utilization of CPUs for accelerating the inference of large language models. Specifically, we introduce a parallelized approach to enhance throughput by 1) Exploiting the parallel processing capabilities of modern CPU architectures, 2) Batching the inference request. Our evaluation shows the accelerated inference engine gives an 18-22x improvement in the generated token per sec. The improvement is more with longer sequence and larger models. In addition to this, we can also run multiple workers in the same machine with NUMA node isolation to further improvement in tokens/s. Table 2, we have received 4x additional improvement with 4 workers. This would also make Gen-AI based products and companies environment friendly, our estimates shows that CPU usage for Inference could reduce the power consumption of LLMs by 48.9% while providing production ready throughput and latency.
摘要:近年来,大型语言模型在各种自然语言处理(NLP)任务中表现出了显著的性能。然而,为现实世界的应用程序部署这些模型通常需要有效的推理解决方案来处理计算需求。本文探讨了如何利用CPU来加速大型语言模型的推理。具体地说,我们引入了一种并行化方法来提高吞吐量,方法是1)利用现代CPU体系结构的并行处理能力,2)对推理请求进行批处理。我们的评估表明,加速推理引擎在每秒生成的令牌上提高了18-22倍。更长的序列和更大的型号带来的改善更大。除此之外,我们还可以在NUMA节点隔离的同一台机器上运行多个Worker,以进一步提高令牌/S。表2,我们在4个Worker的情况下获得了4倍的额外改进。这也将使基于Gen-AI的产品和公司对环境友好,我们的估计表明,用于推理的CPU使用率可以降低LLMS的功耗48.9%,同时提供生产就绪型吞吐量和延迟。

[NLP-75] Understanding Sounds Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
[NLP-75] 理解缺少问题的声音:大型音频模型中对象幻觉的挑战

链接: https://arxiv.org/abs/2406.08402
作者: Chun-Yi Kuan,Wei-Ping Huang,Hung-yi Lee
关键词: traditional large language, tackle audio-related tasks, Large audio-language models, large language models, enhance traditional large
中文关键词: 传统大型语言,处理音频相关任务,大型音频语言模型,大型语言模型,增强传统大型
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks. Previous research has primarily focused on assessing the performance of LALMs across various tasks, yet overlooking their reliability, particularly concerning issues like object hallucination. In our study, we introduce methods to assess the extent of object hallucination of publicly available LALMs. Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content, but struggle to answer discriminative questions, specifically those requiring the identification of the presence of particular object sounds within an audio clip. This limitation highlights a critical weakness in current LALMs: their inadequate understanding of discriminative queries. Moreover, we explore the potential of prompt engineering to enhance LALMs’ performance on discriminative questions.
摘要:大型音频语言模型通过集成音频感知能力来增强传统的大型语言模型,使其能够处理与音频相关的任务。以前的研究主要集中在评估LALM在各种任务中的表现,而忽略了它们的可靠性,特别是关于物体幻觉等问题。在我们的研究中,我们介绍了评估公共可用LALM的对象幻觉程度的方法。我们的发现表明,LALM在理解音频内容方面可以与专门的音频字幕模型相媲美,但难以回答歧视性问题,特别是那些需要识别音频片段中是否存在特定对象声音的问题。这一限制突出了当前法律和体制管理的一个严重弱点:它们对歧视性查询的理解不足。此外,我们还探索了即时工程在提高LALMS在区分问题上的性能方面的潜力。

[NLP-76] Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
[NLP-76] 利用ASB文字记录进行语音情感识别:错误率和融合技术的综合研究

链接: https://arxiv.org/abs/2406.08353
作者: Yuanchao Li,Peter Bell,Catherine Lai
关键词: Speech Emotion Recognition, enhance Speech Emotion, Automatic Speech Recognition, Emotion Recognition, Speech Emotion
中文关键词: 语音情感识别,增强语音情感,自动语音识别,情感识别,语音情感
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) on well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes text-only and bimodal SER with diverse fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. This research is expected to provide insights into SER with ASR assistance, especially for real-world applications.
摘要:文本数据通常被用作增强语音情感识别(SER)性能和可靠性的主要输入。然而,在大多数研究中,对人类转录文本的依赖阻碍了实用SER系统的发展,在实验室研究和以自动语音识别(ASR)作为文本来源的现实世界场景之间造成了差距。因此,这项研究在IEMOCAP、CMU-MOSI和MSP-Podcast等知名语料库上,使用具有不同错误率的ASR记录对SER性能进行了基准测试。我们的评估包括纯文本和具有不同融合技术的双峰SER,旨在进行全面分析,揭示当前SER研究面临的新发现和挑战。此外,我们提出了一个集成ASR纠错和通道门控融合的统一ASR容错框架,与性能最佳的ASR转录本相比,实现了更低的WER和更高的SER结果。这项研究有望在ASR的帮助下为SER提供深入的见解,特别是对于现实世界的应用。

[NLP-77] ransformer-based Model for ASR N-Best Rescoring and Rewriting
[NLP-77] 基于转换器的ASB N-Best重新评分和重写模型

链接: https://arxiv.org/abs/2406.08207
作者: Iwen E. Kang,Christophe Van Gysel,Man-Hung Siu
关键词: Automatic Speech Recognition, on-device Automatic Speech, Voice assistants increasingly, Speech Recognition, Automatic Speech
中文关键词: 自动语音识别、设备上自动语音、语音助理日益增多、语音识别、自动语音
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Interspeech '24

点击查看摘要

Abstract:Voice assistants increasingly use on-device Automatic Speech Recognition (ASR) to ensure speed and privacy. However, due to resource constraints on the device, queries pertaining to complex information domains often require further processing by a search engine. For such applications, we propose a novel Transformer based model capable of rescoring and rewriting, by exploring full context of the N-best hypotheses in parallel. We also propose a new discriminative sequence training objective that can work well for both rescore and rewrite tasks. We show that our Rescore+Rewrite model outperforms the Rescore-only baseline, and achieves up to an average 8.6% relative Word Error Rate (WER) reduction over the ASR system by itself.
摘要:语音助理越来越多地使用设备上自动语音识别(ASB)来确保速度和隐私。然而,由于设备上的资源限制,涉及复杂信息域的查询通常需要搜索引擎进一步处理。对于此类应用,我们提出了一种新颖的基于Transformer的模型,能够通过并行探索N个最佳假设的完整上下文来重新排序和重写。我们还提出了一个新的区分序列训练目标,该目标可以很好地适用于重新筛选和重写任务。我们表明,我们的Resore + Rewriter模型优于仅Resore的基线,并且比ASC系统本身平均降低了高达8.6%的相对字错误率(WER)。

[NLP-78] LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
[NLP-78] LibriTTS-P:具有说话风格和说话者身份的数据库,支持文本转语音和风格字幕

链接: https://arxiv.org/abs/2406.07969
作者: Masaya Kawamura,Ryuichi Yamamoto,Yuma Shirahata,Takuya Hasumi,Kentaro Tachibana
关键词: includes utterance-level descriptions, utterance-level descriptions, includes utterance-level, speaker characteristics, speaking style
中文关键词: 包括话语级描述、话语级描述、包括话语级、说话者特征、说话风格
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset. Our corpus, LibriTTS-P, is available at this https URL.
摘要:我们引入LibriTTS-P,这是一个基于LibriTTS-R的新文集,包括话语级描述(即,说话风格的提示)和说话者特征的说话者级提示。我们采用混合方法来构建提示注释:(1)捕捉人类对说话者特征的感知的手动注释和(2)对说话风格的合成注释。与现有的英语提示数据集相比,我们的数据库为LibriTTS-R的所有使用者提供了更多样化的提示注释。基于预算的可控TTC的实验结果表明,使用LibriTTS-P训练的TTC模型比使用传统数据集的模型实现了更高的自然度。此外,风格字幕任务的结果表明,利用LibriTTS-P的模型生成的单词是使用传统数据集的模型的2.5倍。我们的文集LibriTTS-P可在httpsURL上找到。

[NLP-79] Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation
[NLP-79] 使用自我知识蒸馏指导框架级CSC对准

链接: https://arxiv.org/abs/2406.07909
作者: Eungbeom Kim,Hantae Kim,Kyogu Lee
关键词: connectionist temporal classification, automatic speech recognition, Transformer encoder, temporal classification, framework is widely
中文关键词: 连接主义时态分类、自动语音识别、Transformer编码器、时态分类、框架广泛
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD); Machine Learning (stat.ML)
备注: Accepted by Interspeech 2024

点击查看摘要

Abstract:Transformer encoder with connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR displays a problem of disagreement between teacher-student models in frame-level alignment which ultimately hinders it from improving the student model’s performance. In order to resolve this problem, this paper introduces a self-knowledge distillation (SKD) method that guides the frame-level alignment during the training time. In contrast to the conventional method using separate teacher and student models, this study introduces a simple and effective method sharing encoder layers and applying the sub-model as the student model. Overall, our approach is effective in improving both the resource efficiency as well as performance. We also conducted an experimental analysis of the spike timings to illustrate that the proposed method improves performance by reducing the alignment disagreement.
摘要:具有连接主义时态分类(CTC)框架Transformer编码器被广泛用于自动语音识别(ASB)。然而,ASB的知识提炼(KD)表现出师生模型在框架级对齐方面存在分歧的问题,这最终阻碍了其提高学生模型的绩效。为了解决这个问题,本文引入了一种自我知识蒸馏(SKD)方法,在训练期间指导帧级对齐。与使用单独的教师和学生模型的传统方法不同,本研究引入了一种简单有效的方法,共享编码器层并应用子模型作为学生模型。总体而言,我们的方法在提高资源效率和性能方面有效。我们还对峰值时间进行了实验分析,以说明所提出的方法通过减少对齐不一致来提高性能。

[NLP-80] Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions
[NLP-80] 探索儿童与成人二元互动中说话者扩大化的言语基础模型

链接: https://arxiv.org/abs/2406.07890
作者: Anfeng Xu,Kevin Huang,Tiantian Feng,Lue Shen,Helen Tager-Flusberg,Shrikanth Narayanan
关键词: Speech foundation models, opened unique opportunities, addressing challenging low-resource, challenging low-resource speech, foundation models
中文关键词: 演讲基础模型,打开独特的机会,解决具有挑战性的低资源演讲,基础模型
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Interspeech 2024

点击查看摘要

Abstract:Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models with varying the input audio window size, speaker demographics, and training data ratio. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
摘要:在大量数据集上训练的语音基础模型为解决具有挑战性的低资源语音理解(例如儿童语音)开辟了独特的机会。在这项工作中,我们探索了言语基础模型在儿童-成人说话者二元化方面的能力。我们表明,与之前的扬声器拨号方法相比,示例性基础模型可以分别相对降低39.5%和62.3%。此外,我们还通过改变输入音频窗口大小、说话者人口统计数据和训练数据比率来基准和评估语音基础模型的说话者日记化结果。我们的结果强调了理解和采用言语基础模型以促进儿童言语理解的有希望的途径。

[NLP-81] Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR
[NLP-81] 低等级自适应的双管道用于多语言ASB中的新语言集成

链接: https://arxiv.org/abs/2406.07842
作者: Yerbolat Khassanov,Zhipeng Chen,Tianfeng Chen,Tze Yuang Chong,Wei Li,Jun Zhang,Lu Lu,Yuxuan Wang
关键词: automatic speech recognition, paper addresses challenges, multilingual automatic speech, pre-trained multilingual automatic, speech recognition
中文关键词: 自动语音识别,论文解决挑战,多语言自动语音,预训练的多语言自动,语音识别
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 5 pages, 2 figures, 4 tables

点击查看摘要

Abstract:This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable. The proposed method employs a dual-pipeline with low-rank adaptation (LoRA). It maintains two data flow pipelines-one for existing languages and another for new languages. The primary pipeline follows the standard flow through the pre-trained parameters of mASR, while the secondary pipeline additionally utilizes language-specific parameters represented by LoRA and a separate output decoder module. Importantly, the proposed approach minimizes the performance degradation of existing languages and enables a language-agnostic operation mode, facilitated by a decoder selection strategy. We validate the effectiveness of the proposed method by extending the pre-trained Whisper model to 19 new languages from the FLEURS dataset
摘要:本文解决了将新语言集成到预训练的多语言自动语音识别(mASB)系统中所面临的挑战,特别是在现有语言的训练数据有限或不可用的情况下。所提出的方法采用具有低等级自适应(LoRA)的双管道。它维护两个数据流管道-一个用于现有语言,另一个用于新语言。主管道遵循mASB预训练参数的标准流程,而辅助管道还利用LoRA和单独的输出解码器模块表示的语言特定参数。重要的是,所提出的方法最大限度地减少了现有语言的性能下降,并在解码器选择策略的帮助下实现了语言不可知的操作模式。我们通过将预训练的Whisper模型扩展到来自FLEURS数据集的19种新语言来验证所提出方法的有效性

[NLP-82] Spoof Diarization: “What Spoofed When” in Partially Spoofed Audio
[NLP-82] 恶搞日记:部分恶搞音频中的“什么时候被恶搞”

链接: https://arxiv.org/abs/2406.07816
作者: Lin Zhang,Xin Wang,Erica Cooper,Mireia Diez,Federico Landini,Nicholas Evans,Junichi Yamagishi
关键词: paper defines Spoof, Partial Spoof, defines Spoof Diarization, paper defines, Spoof Diarization
中文关键词: 论文定义了Spoof、部分Spoof、定义了Spoof Diaration、论文定义了Spoof Diaration
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:This paper defines Spoof Diarization as a novel task in the Partial Spoof (PS) scenario. It aims to determine what spoofed when, which includes not only locating spoof regions but also clustering them according to different spoofing methods. As a pioneering study in spoof diarization, we focus on defining the task, establishing evaluation metrics, and proposing a benchmark model, namely the Countermeasure-Condition Clustering (3C) model. Utilizing this model, we first explore how to effectively train countermeasures to support spoof diarization using three labeling schemes. We then utilize spoof localization predictions to enhance the diarization performance. This first study reveals the high complexity of the task, even in restricted scenarios where only a single speaker per audio file and an oracle number of spoofing methods are considered. Our code is available at this https URL.
摘要:本文将恶搞日记定义为部分恶搞(PS)场景中的一项新颖任务。它的目的是确定什么时候被欺骗,这不仅包括定位欺骗区域,还包括根据不同的欺骗方法对它们进行聚集。作为恶搞日记化领域的开创性研究,我们专注于定义任务、建立评估指标并提出基准模型,即Countermeasure-Conduct集群(3C)模型。利用这个模型,我们首先探索如何有效地训练对策以支持使用三种标签方案的欺骗日记化。然后,我们利用欺骗定位预测来增强日记化性能。第一项研究揭示了该任务的高度复杂性,即使在每个音频文件只考虑一个扬声器并且Oracle数量的欺骗方法的有限场景中也是如此。我们的代码可在此https URL上找到。

计算机视觉

[CV-0] ICE-G: Image Conditional Editing of 3D Gaussian Splats

链接: https://arxiv.org/abs/2406.08488
作者: Vishnu Jaganathan,Hannah Hanyun Huang,Muhammad Zubair Irshad,Varun Jampani,Amit Raj,Zsolt Kira
关键词: create high quality, emerged to create, create high, Recently, Recently many techniques
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to CVPR AI4CC Workshop 2024. Project page: this https URL

点击查看摘要

Abstract:Recently many techniques have emerged to create high quality 3D assets and scenes. When it comes to editing of these objects, however, existing approaches are either slow, compromise on quality, or do not provide enough customization. We introduce a novel approach to quickly edit a 3D model from a single reference view. Our technique first segments the edit image, and then matches semantically corresponding regions across chosen segmented dataset views using DINO features. A color or texture change from a particular region of the edit image can then be applied to other views automatically in a semantically sensible manner. These edited views act as an updated dataset to further train and re-style the 3D scene. The end-result is therefore an edited 3D model. Our framework enables a wide variety of editing tasks such as manual local edits, correspondence based style transfer from any example image, and a combination of different styles from multiple example images. We use Gaussian Splats as our primary 3D representation due to their speed and ease of local editing, but our technique works for other methods such as NeRFs as well. We show through multiple examples that our method produces higher quality results while offering fine-grained control of editing. Project page: this http URL

[CV-1] Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

链接: https://arxiv.org/abs/2406.08487
作者: Yi-Fan Zhang,Qingsong Wen,Chaoyou Fu,Xue Wang,Zhang Zhang,Liang Wang,Rong Jin
关键词: Large Multimodal Models, Multimodal Models, Large Multimodal, foundation of Large, local
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Seeing clearly with high resolution is a foundation of Large Multimodal Models (LMMs), which has been proven to be vital for visual perception and reasoning. Existing works usually employ a straightforward resolution upscaling method, where the image consists of global and local branches, with the latter being the sliced image patches but resized to the same resolution as the former. This means that higher resolution requires more local patches, resulting in exorbitant computational expenses, and meanwhile, the dominance of local image tokens may diminish the global context. In this paper, we dive into the problems and propose a new framework as well as an elaborate optimization strategy. Specifically, we extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. With regard to local patches, learnable query embeddings are introduced to reduce image tokens, the most important tokens accounting for the user question will be further selected by a similarity-based selector. Our empirical results demonstrate a `less is more’ pattern, where \textitutilizing fewer but more informative local image tokens leads to improved performance. Besides, a significant challenge lies in the training strategy, as simultaneous end-to-end training of the global mining block and local compression block does not yield optimal results. We thus advocate for an alternating training way, ensuring balanced learning between global and local aspects. Finally, we also introduce a challenging dataset with high requirements for image detail, enhancing the training of the local compression layer. The proposed method, termed LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME), achieves leading performance across various benchmarks with only 2 million training data.

[CV-2] Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation

链接: https://arxiv.org/abs/2406.08482
作者: Raphael Tang,Xinyu Zhang,Lixinyu Xu,Yao Lu,Wenyan Li,Pontus Stenetorp,Jimmy Lin,Ferhan Ture
关键词: variability remains understudied, remains understudied, perceptual variability remains, Diffusion models, variability
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: 13 pages, 11 figures

点击查看摘要

Abstract:Diffusion models are the state of the art in text-to-image generation, but their perceptual variability remains understudied. In this paper, we examine how prompts affect image variability in black-box diffusion-based models. We propose W1KP, a human-calibrated measure of variability in a set of images, bootstrapped from existing image-pair perceptual distances. Current datasets do not cover recent diffusion models, thus we curate three test sets for evaluation. Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and our calibration matches graded human judgements 78% of the time. Using W1KP, we study prompt reusability and show that Imagen prompts can be reused for 10-50 random seeds before new images become too similar to already generated images, while Stable Diffusion XL and DALL-E 3 can be reused 50-200 times. Lastly, we analyze 56 linguistic features of real prompts, finding that the prompt’s length, CLIP embedding norm, concreteness, and word senses influence variability most. As far as we are aware, we are the first to analyze diffusion variability from a visuolinguistic perspective. Our project page is at this http URL

[CV-3] Enhancing End-to-End Autonomous Driving with Latent World Model

链接: https://arxiv.org/abs/2406.08481
作者: Yingyan Li,Lue Fan,Jiawei He,Yuqi Wang,Yuntao Chen,Zhaoxiang Zhang,Tieniu Tan
关键词: garnered widespread attention, widespread attention, garnered widespread, autonomous driving, LAtent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:End-to-end autonomous driving has garnered widespread attention. Current end-to-end approaches largely rely on the supervision from perception tasks such as detection, tracking, and map segmentation to aid in learning scene representations. However, these methods require extensive annotations, hindering the data scalability. To address this challenge, we propose a novel self-supervised method to enhance end-to-end driving without the need for costly labels. Specifically, our framework \textbfLAW uses a LAtent World model to predict future latent features based on the predicted ego actions and the latent feature of the current frame. The predicted latent features are supervised by the actually observed features in the future. This supervision jointly optimizes the latent feature learning and action prediction, which greatly enhances the driving performance. As a result, our approach achieves state-of-the-art performance in both open-loop and closed-loop benchmarks without costly annotations.

[CV-4] Real3D: Scaling Up Large Reconstruction Models with Real-World Images

链接: https://arxiv.org/abs/2406.08479
作者: Hanwen Jiang,Qixing Huang,Georgios Pavlakos
关键词: Large Reconstruction Models, single-view Large Reconstruction, Large Reconstruction, fully supervised route, training single-view Large
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:The default strategy for training single-view Large Reconstruction Models (LRMs) follows the fully supervised route using large-scale datasets of synthetic 3D assets or multi-view captures. Although these resources simplify the training procedure, they are hard to scale up beyond the existing datasets and they are not necessarily representative of the real distribution of object shapes. To address these limitations, in this paper, we introduce Real3D, the first LRM system that can be trained using single-view real-world images. Real3D introduces a novel self-training framework that can benefit from both the existing synthetic data and diverse single-view real images. We propose two unsupervised losses that allow us to supervise LRMs at the pixel- and semantic-level, even for training examples without ground-truth 3D or novel views. To further improve performance and scale up the image data, we develop an automatic data curation approach to collect high-quality examples from in-the-wild images. Our experiments show that Real3D consistently outperforms prior work in four diverse evaluation settings that include real and synthetic data, as well as both in-domain and out-of-domain shapes. Code and model can be found here: this https URL

[CV-5] What If We Recaption Billions of Web Images with LLaMA-3?

链接: https://arxiv.org/abs/2406.08478
作者: Xianhang Li,Haoqin Tu,Mude Hui,Zeyu Wang,Bingchen Zhao,Junfei Xiao,Sucheng Ren,Jieru Mei,Qing Liu,Huangjie Zheng,Yuyin Zhou,Cihang Xie
关键词: Web-crawled image-text pairs, Web-crawled image-text, inherently noisy, Web-crawled, image-text pairs
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: * denotes equal contributions

点击查看摘要

Abstract:Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textitopen-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users’ text instructions, especially in following complex queries. Our project page is this https URL

[CV-6] RMem: Restricted Memory Banks Improve Video Object Segmentation

链接: https://arxiv.org/abs/2406.08476
作者: Junbao Zhou,Ziqi Pang,Yu-Xiong Wang
关键词: memory banks, expanding memory banks, benchmarks evolving, memory, VOS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: CVPR 2024, Project Page: this https URL

点击查看摘要

Abstract:With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed “memory deciphering” study offers a pivotal insight underpinning such a strategy: expanding memory banks, while seemingly beneficial, actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally, restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked “temporal positional embedding.” Finally, our insights are embodied in “RMem” (“R” for restricted), a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes new state of the art for object state changes (on the VOST dataset) and long videos (on the Long Videos dataset). Our code and demo are available at this https URL.

[CV-7] Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

链接: https://arxiv.org/abs/2406.08475
作者: Yuxuan Xue,Xianghui Xie,Riccardo Marin,Gerard Pons-Moll
关键词: Creating realistic avatars, Creating realistic, challenging problem, attractive yet challenging, single RGB image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Creating realistic avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful prior from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot provide multi-view shape priors with guaranteed 3D consistency. We propose Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion. Our key insight is that 2D multi-view diffusion and 3D reconstruction models provide complementary information for each other, and by coupling them in a tight manner, we can fully leverage the potential of both models. We introduce a novel image-conditioned generative 3D Gaussian Splats reconstruction model that leverages the priors from 2D multi-view diffusion models, and provides an explicit 3D representation, which further guides the 2D reverse sampling process to have better 3D consistency. Experiments show that our proposed framework outperforms state-of-the-art methods and enables the creation of realistic avatars from a single RGB image, achieving high-fidelity in both geometry and appearance. Extensive ablations also validate the efficacy of our design, (1) multi-view 2D priors conditioning in generative 3D reconstruction and (2) consistency refinement of sampling trajectory via the explicit 3D representation. Our code and models will be released on this https URL.

[CV-8] Real2Code: Reconstruct Articulated Objects via Code Generation

链接: https://arxiv.org/abs/2406.08474
作者: Zhao Mandi,Yijia Weng,Dominik Bauer,Shuran Song
关键词: code generation, reconstructing articulated objects, real world objects, reconstructing articulated, objects
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Real2Code, a novel approach to reconstructing articulated objects via code generation. Given visual observations of an object, we first reconstruct its part geometry using an image segmentation model and a shape completion model. We then represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) to predict joint articulation as code. By leveraging pre-trained vision and language models, our approach scales elegantly with the number of articulated parts, and generalizes from synthetic training data to real world objects in unstructured environments. Experimental results demonstrate that Real2Code significantly outperforms previous state-of-the-art in reconstruction accuracy, and is the first approach to extrapolate beyond objects’ structural complexity in the training set, and reconstructs objects with up to 10 articulated parts. When incorporated with a stereo reconstruction model, Real2Code also generalizes to real world objects from a handful of multi-view RGB images, without the need for depth or camera information.

[CV-9] Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement

链接: https://arxiv.org/abs/2406.08463
作者: Maxime Pietrantoni,Gabriela Csurka,Martin Humenberger,Torsten Sattler
关键词: scene representation, localization techniques rely, scene, techniques rely, Visual localization techniques
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in 3DV24 (highlight)

点击查看摘要

Abstract:Visual localization techniques rely upon some underlying scene representation to localize against. These representations can be explicit such as 3D SFM map or implicit, such as a neural network that learns to encode the scene. The former requires sparse feature extractors and matchers to build the scene representation. The latter might lack geometric grounding not capturing the 3D structure of the scene well enough. This paper proposes to jointly learn the scene representation along with a 3D dense feature field and a 2D feature extractor whose outputs are embedded in the same metric space. Through a contrastive framework we align this volumetric field with the image-based extractor and regularize the latter with a ranking loss from learned surface information. We learn the underlying geometry of the scene with an implicit field through volumetric rendering and design our feature field to leverage intermediate geometric information encoded in the implicit field. The resulting features are discriminative and robust to viewpoint change while maintaining rich encoded information. Visual localization is then achieved by aligning the image-based features and the rendered volumetric features. We show the effectiveness of our approach on real-world scenes, demonstrating that our approach outperforms prior and concurrent work on leveraging implicit scene representations for localization.

[CV-10] ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery

链接: https://arxiv.org/abs/2406.08457
作者: Kam Woh Ng,Xiatian Zhu,Yi-Zhe Song,Tao Xiang
关键词: Existing fine-grained hashing, hashing methods typically, methods typically lack, typically lack code, code bits holistically
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPRW 2024 - FGVC11 best paper award

点击查看摘要

Abstract:Existing fine-grained hashing methods typically lack code interpretability as they compute hash code bits holistically using both global and local features. To address this limitation, we propose ConceptHash, a novel method that achieves sub-code level interpretability. In ConceptHash, each sub-code corresponds to a human-understandable concept, such as an object part, and these concepts are automatically discovered without human annotations. Specifically, we leverage a Vision Transformer architecture and introduce concept tokens as visual prompts, along with image patch tokens as model inputs. Each concept is then mapped to a specific sub-code at the model output, providing natural sub-code interpretability. To capture subtle visual differences among highly similar sub-categories (e.g., bird species), we incorporate language guidance to ensure that the learned hash codes are distinguishable within fine-grained object classes while maintaining semantic alignment. This approach allows us to develop hash codes that exhibit similarity within families of species while remaining distinct from species in other families. Extensive experiments on four fine-grained image retrieval benchmarks demonstrate that ConceptHash outperforms previous methods by a significant margin, offering unique sub-code interpretability as an additional benefit. Code at: this https URL.

[CV-11] GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

链接: https://arxiv.org/abs/2406.08451
作者: Quanfeng Lu,Wenqi Shao,Zitao Liu,Fanqing Meng,Boxuan Li,Botong Chen,Siyuan Huang,Kaipeng Zhang,Yu Qiao,Ping Luo
关键词: social media platforms, Graphical User Interface, Smartphone users, GUI Odyssey, Autonomous Graphical User
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 8 figures, a cross-app GUI navigation dataset

点击查看摘要

Abstract:Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents often trained with datasets comprising simple tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we introduce GUI Odyssey, a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. Leveraging GUI Odyssey, we developed OdysseyAgent, a multimodal cross-app navigation agent by fine-tuning the Qwen-VL model with a history resampling module. Extensive experiments demonstrate OdysseyAgent’s superior accuracy compared to existing models. For instance, OdysseyAgent surpasses fine-tuned Qwen-VL and zero-shot GPT-4V by 1.44% and 55.49% in-domain accuracy, and 2.29% and 48.14% out-of-domain accuracy on average. The dataset and code will be released in \urlthis https URL.

[CV-12] PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement

链接: https://arxiv.org/abs/2406.08444
作者: Wei-Tung Lin,Yong-Xiang Lin,Jyun-Wei Chen,Kai-Lung Hua
关键词: complex color distortions, Underwater Image Enhancement, State Space Models, Image Enhancement, severe blurring
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Underwater Image Enhancement (UIE) is critical for marine research and exploration but hindered by complex color distortions and severe blurring. Recent deep learning-based methods have achieved remarkable results, yet these methods struggle with high computational costs and insufficient global modeling, resulting in locally under- or over- adjusted regions. We present PixMamba, a novel architecture, designed to overcome these challenges by leveraging State Space Models (SSMs) for efficient global dependency modeling. Unlike convolutional neural networks (CNNs) with limited receptive fields and transformer networks with high computational costs, PixMamba efficiently captures global contextual information while maintaining computational efficiency. Our dual-level strategy features the patch-level Efficient Mamba Net (EMNet) for reconstructing enhanced image feature and the pixel-level PixMamba Net (PixNet) to ensure fine-grained feature capturing and global consistency of enhanced image that were previously difficult to obtain. PixMamba achieves state-of-the-art performance across various underwater image datasets and delivers visually superior results. Code is available at: this https URL.

[CV-13] ransformation-Dependent Adversarial Attacks

链接: https://arxiv.org/abs/2406.08443
作者: Yaoteng Tan,Zikui Cai,M. Salman Asif
关键词: single additive perturbation, trigger diverse, single additive, mis-predictions by systematically, systematically transforming
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce transformation-dependent adversarial attacks, a new class of threats where a single additive perturbation can trigger diverse, controllable mis-predictions by systematically transforming the input (e.g., scaling, blurring, compression). Unlike traditional attacks with static effects, our perturbations embed metamorphic properties to enable different adversarial attacks as a function of the transformation parameters. We demonstrate the transformation-dependent vulnerability across models (e.g., convolutional networks and vision transformers) and vision tasks (e.g., image classification and object detection). Our proposed geometric and photometric transformations enable a range of targeted errors from one crafted input (e.g., higher than 90% attack success rate for classifiers). We analyze effects of model architecture and type/variety of transformations on attack effectiveness. This work forces a paradigm shift by redefining adversarial inputs as dynamic, controllable threats. We highlight the need for robust defenses against such multifaceted, chameleon-like perturbations that current techniques are ill-prepared for.

[CV-14] Coherent Optical Modems for Full-Wavefield Lidar

链接: https://arxiv.org/abs/2406.08439
作者: Parsa Mirdehghan,Brandon Buscaino,Maxx Wu,Doug Charlton,Mohammad E. Mousa-Pasandi,Kiriakos N. Kutulakos,David B. Lindell
关键词: multiple polarization states, coherent optical modems, devices that modulate, coherent optical, digital age
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:The advent of the digital age has driven the development of coherent optical modems – devices that modulate the amplitude and phase of light in multiple polarization states. These modems transmit data through fiber optic cables that are thousands of kilometers in length at data rates exceeding one terabit per second. This remarkable technology is made possible through near-THz-rate programmable control and sensing of the full optical wavefield. While coherent optical modems form the backbone of telecommunications networks around the world, their extraordinary capabilities also provide unique opportunities for imaging. Here, we introduce full-wavefield lidar: a new imaging modality that repurposes off-the-shelf coherent optical modems to simultaneously measure distance, axial velocity, and polarization. We demonstrate this modality by combining a 74 GHz-bandwidth coherent optical modem with free-space coupling optics and scanning mirrors. We develop a time-resolved image formation model for this system and formulate a maximum-likelihood reconstruction algorithm to recover depth, velocity, and polarization information at each scene point from the modem’s raw transmitted and received symbols. Compared to existing lidars, full-wavefield lidar promises improved mm-scale ranging accuracy from brief, microsecond exposure times, reliable velocimetry, and robustness to intererence from ambient light or other lidar signals.

[CV-15] Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2406.08431
作者: Benjamin Biggs,Arjun Seshadri,Yang Zou,Achin Jain,Aditya Golatkar,Yusheng Xie,Alessandro Achille,Ashwin Swaminathan,Stefano Soatto
关键词: present Diffusion Soup, Diffusion Soup, compartmentalization method, Diffusion Soup samples, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that Diffusion Soup samples from a point in weight space that approximates the geometric mean of the distributions of constituent datasets, which offers anti-memorization guarantees and enables zero-shot style mixing. Empirically, Diffusion Soup outperforms a paragon model trained on the union of all data shards and achieves a 30% improvement in Image Reward (.34 \to .44) on domain sharded data, and a 59% improvement in IR (.37 \to .59) on aesthetic data. In both cases, souping also prevails in TIFA score (respectively, 85.5 \to 86.5 and 85.6 \to 86.8). We demonstrate robust unlearning – removing any individual domain shard only lowers performance by 1% in IR (.45 \to .44) – and validate our theoretical insights on anti-memorization using real data. Finally, we showcase Diffusion Soup’s ability to blend the distinct styles of models finetuned on different shards, resulting in the zero-shot generation of hybrid styles.

[CV-16] AWGUNET: Attention-Aided Wavelet Guided U-Net for Nuclei Segmentation in Histopathology Images

链接: https://arxiv.org/abs/2406.08425
作者: Ayush Roy,Payel Pramanik,Dmitrii Kaplun,Sergei Antonov,Ram Sarkar
关键词: Accurate nuclei segmentation, Accurate nuclei, automating nuclei segmentation, Accurate, nuclei segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate nuclei segmentation in histopathological images is crucial for cancer diagnosis. Automating this process offers valuable support to clinical experts, as manual annotation is time-consuming and prone to human errors. However, automating nuclei segmentation presents challenges due to uncertain cell boundaries, intricate staining, and diverse structures. In this paper, we present a segmentation approach that combines the U-Net architecture with a DenseNet-121 backbone, harnessing the strengths of both to capture comprehensive contextual and spatial information. Our model introduces the Wavelet-guided channel attention module to enhance cell boundary delineation, along with a learnable weighted global attention module for channel-specific attention. The decoder module, composed of an upsample block and convolution block, further refines segmentation in handling staining patterns. The experimental results conducted on two publicly accessible histopathology datasets, namely Monuseg and TNBC, underscore the superiority of our proposed model, demonstrating its potential to advance histopathological image analysis and cancer diagnosis. The code is made available at: this https URL.

[CV-17] PRIBOOT: A New Data-Driven Expert for Improved Driving Simulations

链接: https://arxiv.org/abs/2406.08421
作者: Daniel Coelho,Miguel Oliveira,Vitor Santos,Antonio M. Lopez
关键词: real-world automotive technologies, advancing real-world automotive, development of Autonomous, CARLA introduced Leaderboard, systems in simulated
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The development of Autonomous Driving (AD) systems in simulated environments like CARLA is crucial for advancing real-world automotive technologies. To drive innovation, CARLA introduced Leaderboard 2.0, significantly more challenging than its predecessor. However, current AD methods have struggled to achieve satisfactory outcomes due to a lack of sufficient ground truth data. Human driving logs provided by CARLA are insufficient, and previously successful expert agents like Autopilot and Roach, used for collecting datasets, have seen reduced effectiveness under these more demanding conditions. To overcome these data limitations, we introduce PRIBOOT, an expert agent that leverages limited human logs with privileged information. We have developed a novel BEV representation specifically tailored to meet the demands of this new benchmark and processed it as an RGB image to facilitate the application of transfer learning techniques, instead of using a set of masks. Additionally, we propose the Infraction Rate Score (IRS), a new evaluation metric designed to provide a more balanced assessment of driving performance over extended routes. PRIBOOT is the first model to achieve a Route Completion (RC) of 75% in Leaderboard 2.0, along with a Driving Score (DS) and IRS of 20% and 45%, respectively. With PRIBOOT, researchers can now generate extensive datasets, potentially solving the data availability issues that have hindered progress in this benchmark.

[CV-18] OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

链接: https://arxiv.org/abs/2406.08418
作者: Qingyun Li,Zhe Chen,Weiyun Wang,Wenhai Wang,Shenglong Ye,Zhenjiang Jin,Guanzhou Chen,Yinan He,Zhangwei Gao,Erfei Cui,Jiashuo Yu,Hao Tian,Jiasheng Zhou,Chao Xu,Bin Wang,Xingjian Wei,Wei Li,Wenjian Zhang,Bo Zhang,Pinlong Cai,Licheng Wen,Xiangchao Yan,Pei Chu,Yi Wang,Min Dou,Changyao Tian,Xizhou Zhu,Lewei Lu,Yushi Chen,Junjun He,Tong Lu,Yali Wang,Limin Wang,Dahua Lin,Yu Qiao,Botian Shi,Conghui He,Jifeng Dai
关键词: human reading habits, closely resembles human, resembles human reading, Image-text interleaved, Image-text interleaved data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at this https URL.

[CV-19] MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

链接: https://arxiv.org/abs/2406.08407
作者: Xuehai He,Weixi Feng,Kaizhi Zheng,Yujie Lu,Wanrong Zhu,Jiachen Li,Yue Fan,Jianfeng Wang,Linjie Li,Zhengyuan Yang,Kevin Lin,William Yang Wang,Lijuan Wang,Xin Eric Wang
关键词: Multimodal Language Language, Language Language Models, Language Language, Multimodal Language, complex real-world dynamics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Multimodal Language Language Models (MLLMs) demonstrate the emerging abilities of “world models” – interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models’ different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

[CV-20] VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

链接: https://arxiv.org/abs/2406.08394
作者: Jiannan Wu,Muyan Zhong,Sen Xing,Zeqiang Lai,Zhaoyang Liu,Wenhai Wang,Zhe Chen,Xizhou Zhu,Lewei Lu,Tong Lu,Ping Luo,Yu Qiao,Jifeng Dai
关键词: generalist multimodal large, unifies visual perception, multimodal large model, generalist multimodal, single framework
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 43 pages

点击查看摘要

Abstract:We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed “super link”, as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and combed training data from hundreds of public vision and vision-language tasks. In this way, our model can be joint-trained end-to-end on hundreds of vision language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.

[CV-21] FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

链接: https://arxiv.org/abs/2406.08392
作者: Xinzhi Mu,Li Chen,Bohan Chen,Shuyang Gu,Jianmin Bao,Dong Chen,Ji Li,Yuhui Yuan
关键词: garnered significant interest, creating artistic fonts, modern diffusion-based, professional designers, significant interest
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project-page: this https URL

点击查看摘要

Abstract:Recently, the application of modern diffusion-based text-to-image generation models for creating artistic fonts, traditionally the domain of professional designers, has garnered significant interest. Diverging from the majority of existing studies that concentrate on generating artistic typography, our research aims to tackle a novel and more demanding challenge: the generation of text effects for multilingual fonts. This task essentially requires generating coherent and consistent visual content within the confines of a font-shaped canvas, as opposed to a traditional rectangular canvas. To address this task, we introduce a novel shape-adaptive diffusion model capable of interpreting the given shape and strategically planning pixel distributions within the irregular canvas. To achieve this, we curate a high-quality shape-adaptive image-text dataset and incorporate the segmentation mask as a visual condition to steer the image generation process within the irregular-canvas. This approach enables the traditionally rectangle canvas-based diffusion model to produce the desired concepts in accordance with the provided geometric shapes. Second, to maintain consistency across multiple letters, we also present a training-free, shape-adaptive effect transfer method for transferring textures from a generated reference letter to others. The key insights are building a font effect noise prior and propagating the font effect information in a concatenated latent space. The efficacy of our FontStudio system is confirmed through user preference studies, which show a marked preference (78% win-rates on aesthetics) for our system even when compared to the latest unrivaled commercial product, Adobe Firefly.

[CV-22] LaneCPP: Continuous 3D Lane Detection using Physical Priors

链接: https://arxiv.org/abs/2406.08381
作者: Maximilian Pittner,Joel Janai,Alexandru P. Condurache
关键词: locating lane markings, autonomous driving, fundamental problem, context of autonomous, comprises the tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

点击查看摘要

Abstract:Monocular 3D lane detection has become a fundamental problem in the context of autonomous driving, which comprises the tasks of finding the road surface and locating lane markings. One major challenge lies in a flexible but robust line representation capable of modeling complex lane structures, while still avoiding unpredictable behavior. While previous methods rely on fully data-driven approaches, we instead introduce a novel approach LaneCPP that uses a continuous 3D lane detection model leveraging physical prior knowledge about the lane structure and road geometry. While our sophisticated lane model is capable of modeling complex road structures, it also shows robust behavior since physical constraints are incorporated by means of a regularization scheme that can be analytically applied to our parametric representation. Moreover, we incorporate prior knowledge about the road geometry into the 3D feature space by modeling geometry-aware spatial features, guiding the network to learn an internal road surface representation. In our experiments, we show the benefits of our contributions and prove the meaningfulness of using priors to make 3D lane detection more robust. The results show that LaneCPP achieves state-of-the-art performance in terms of F-Score and geometric errors.

[CV-23] Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze

链接: https://arxiv.org/abs/2406.08379
作者: Michele Mazzamuto,Antonino Furnari,Giovanni Maria Farinella
关键词: advancing user assistance, smart glasses, detection in egocentric, critical component, component for advancing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we address the challenge of unsupervised mistake detection in egocentric video through the analysis of gaze signals, a critical component for advancing user assistance in smart glasses. Traditional supervised methods, reliant on manually labeled mistakes, suffer from domain-dependence and scalability issues. This research introduces an unsupervised method for detecting mistakes in videos of human activities, overcoming the challenges of domain-specific requirements and the necessity for annotated data. By analyzing unusual gaze patterns that signal user disorientation during tasks, we propose a gaze completion model that forecasts eye gaze trajectories from incomplete inputs. The difference between the anticipated and observed gaze paths acts as an indicator for identifying errors. Our method is validated on the EPIC-Tent dataset, showing its superiority compared to current one-class supervised and unsupervised techniques.

[CV-24] DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

链接: https://arxiv.org/abs/2406.08377
作者: Juncheng Wu,Zhangkai Ni,Hanli Wang,Wenhan Yang,Yuyin Zhou,Shiqi Wang
关键词: deep features extracted, Image deep features, Deep Degradation Response, deep features, informative representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image deep features extracted by pre-trained networks are known to contain rich and informative representations. In this paper, we present Deep Degradation Response (DDR), a method to quantify changes in image deep features under varying degradation conditions. Specifically, our approach facilitates flexible and adaptive degradation, enabling the controlled synthesis of image degradation through text-driven prompts. Extensive evaluations demonstrate the versatility of DDR as an image descriptor, with strong correlations observed with key image attributes such as complexity, colorfulness, sharpness, and overall quality. Moreover, we demonstrate the efficacy of DDR across a spectrum of applications. It excels as a blind image quality assessment metric, outperforming existing methodologies across multiple datasets. Additionally, DDR serves as an effective unsupervised learning objective in image restoration tasks, yielding notable advancements in image deblurring and single-image super-resolution. Our code will be made available.

[CV-25] 2.5D Multi-view Averaging Diffusion Model for 3D Medical Image Translation: Application to Low-count PET Reconstruction with CT-less Attenuation Correction

链接: https://arxiv.org/abs/2406.08374
作者: Tianqi Chen,Jun Hou,Yinchi Zhou,Huidong Xie,Xiongchao Chen,Qiong Liu,Xueqi Guo,Menghua Xia,James S. Duncan,Chi Liu,Bo Zhou
关键词: Positron Emission Tomography, Positron Emission, Emission Tomography, important clinical imaging, clinical imaging tool
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Positron Emission Tomography (PET) is an important clinical imaging tool but inevitably introduces radiation hazards to patients and healthcare providers. Reducing the tracer injection dose and eliminating the CT acquisition for attenuation correction can reduce the overall radiation dose, but often results in PET with high noise and bias. Thus, it is desirable to develop 3D methods to translate the non-attenuation-corrected low-dose PET (NAC-LDPET) into attenuation-corrected standard-dose PET (AC-SDPET). Recently, diffusion models have emerged as a new state-of-the-art deep learning method for image-to-image translation, better than traditional CNN-based methods. However, due to the high computation cost and memory burden, it is largely limited to 2D applications. To address these challenges, we developed a novel 2.5D Multi-view Averaging Diffusion Model (MADM) for 3D image-to-image translation with application on NAC-LDPET to AC-SDPET translation. Specifically, MADM employs separate diffusion models for axial, coronal, and sagittal views, whose outputs are averaged in each sampling step to ensure the 3D generation quality from multiple views. To accelerate the 3D sampling process, we also proposed a strategy to use the CNN-based 3D generation as a prior for the diffusion model. Our experimental results on human patient studies suggested that MADM can generate high-quality 3D translation images, outperforming previous CNN-based and Diffusion-based baseline methods.

[CV-26] APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentatio

链接: https://arxiv.org/abs/2406.08372
作者: Weizhao He,Yang Zhang,Wei Zhuo,Linlin Shen,Jiaqi Yang,Songhe Deng,Liang Sun
关键词: segment unseen classes, labeled samples, unseen classes, Few-shot semantic segmentation, Current FSS methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:Few-shot semantic segmentation (FSS) endeavors to segment unseen classes with only a few labeled samples. Current FSS methods are commonly built on the assumption that their training and application scenarios share similar domains, and their performances degrade significantly while applied to a distinct domain. To this end, we propose to leverage the cutting-edge foundation model, the Segment Anything Model (SAM), for generalization enhancement. The SAM however performs unsatisfactorily on domains that are distinct from its training data, which primarily comprise natural scene images, and it does not support automatic segmentation of specific semantics due to its interactive prompting mechanism. In our work, we introduce APSeg, a novel auto-prompt network for cross-domain few-shot semantic segmentation (CD-FSS), which is designed to be auto-prompted for guiding cross-domain segmentation. Specifically, we propose a Dual Prototype Anchor Transformation (DPAT) module that fuses pseudo query prototypes extracted based on cycle-consistency with support prototypes, allowing features to be transformed into a more stable domain-agnostic space. Additionally, a Meta Prompt Generator (MPG) module is introduced to automatically generate prompt embeddings, eliminating the need for manual visual prompts. We build an efficient model which can be applied directly to target domains without fine-tuning. Extensive experiments on four cross-domain datasets show that our model outperforms the state-of-the-art CD-FSS method by 5.24% and 3.10% in average accuracy on 1-shot and 5-shot settings, respectively.

[CV-27] From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

链接: https://arxiv.org/abs/2406.08358
作者: Shiwei Wu,Chao Zhang,Joya Chen,Tong Xu,Likang Wu,Yao Hu,Enhong Chen
关键词: People social relationships, People social, wedding rings, holding hands, interactions acting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:People’s social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands. This brings unique challenges to recognizing social relationships, requiring understanding and capturing the essence of these contexts from visual appearances. However, current methods of social relationship understanding rely on the basic classification paradigm of detected persons and objects, which fails to understand the comprehensive context and often overlooks decisive social factors, especially subtle visual cues. To highlight the social-aware context and intricate details, we propose a novel approach that recognizes \textbfContextual \textbfSocial \textbfRelationships (\textbfConSoR) from a social cognitive perspective. Specifically, to incorporate social-aware semantics, we build a lightweight adapter upon the frozen CLIP to learn social concepts via our novel multi-modal side adapter tuning mechanism. Further, we construct social-aware descriptive language prompts (e.g., scene, activity, objects, emotions) with social relationships for each image, and then compel ConSoR to concentrate more intensively on the decisive visual social factors via visual-linguistic contrasting. Impressively, ConSoR outperforms previous methods with a 12.2% gain on the People-in-Social-Context (PISC) dataset and a 9.8% increase on the People-in-Photo-Album (PIPA) benchmark. Furthermore, we observe that ConSoR excels at finding critical visual evidence to reveal social relationships.

[CV-28] DocSynthv2: A Practical Autoregressive Modeling for Document Generation

链接: https://arxiv.org/abs/2406.08354
作者: Sanket Biswas,Rajiv Jain,Vlad I. Morariu,Jiuxiang Gu,Puneet Mathur,Curtis Wigington,Tong Sun,Josep Lladós
关键词: extensively explored, comprehensive document generation, document generation encompassing, comprehensive document, complex challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Spotlight (Oral) Acceptance to CVPR 2024 Workshop for Graphic Design Understanding and Generation (GDUG)

点击查看摘要

Abstract:While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks.

[CV-29] Blind Image Deblurring using FFT-ReLU with Deep Learning Pipeline Integration

链接: https://arxiv.org/abs/2406.08344
作者: Abdul Mohaimen Al Radi,Prothito Shovon Majumder,Syed Mumtahin Mahmud,Mahdi Mohd Hossain Noki,Md. Haider Ali,Md. Mosaddek Khan
关键词: Blind image deblurring, blur kernel, Blind image, perform blind image, process of deriving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 13 figures

点击查看摘要

Abstract:Blind image deblurring is the process of deriving a sharp image and a blur kernel from a blurred image. Blurry images are typically modeled as the convolution of a sharp image with a blur kernel, necessitating the estimation of the unknown blur kernel to perform blind image deblurring effectively. Existing approaches primarily focus on domain-specific features of images, such as salient edges, dark channels, and light streaks. These features serve as probabilistic priors to enhance the estimation of the blur kernel. For improved generality, we propose a novel prior (ReLU sparsity prior) that estimates blur kernel effectively across all distributions of images (natural, facial, text, low-light, saturated etc). Our approach demonstrates superior efficiency, with inference times up to three times faster, while maintaining high accuracy in PSNR, SSIM, and error ratio metrics. We also observe noticeable improvement in the performance of the state-of-the-art architectures (in terms of aforementioned metrics) in deep learning based approaches when our method is used as a post-processing unit.

[CV-30] WMAdapter: Adding WaterMark Control to Latent Diffusion Models

链接: https://arxiv.org/abs/2406.08337
作者: Hai Ci,Yiren Song,Pei Yang,Jinheng Xie,Mike Zheng Shou
关键词: crucial for protecting, protecting the copyright, copyright of AI-generated, Abstract, diffusion generation process
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 20 pages, 13 figures

点击查看摘要

Abstract:Watermarking is crucial for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that takes user-specified watermark information and allows for seamless watermark imprinting during the diffusion generation process. WMAdapter is efficient and robust, with a strong emphasis on high generation quality. To achieve this, we make two key designs: (1) We develop a contextual adapter structure that is lightweight and enables effective knowledge transfer from heavily pretrained post-hoc watermarking models. (2) We introduce an extra finetuning step and design a hybrid finetuning strategy to further improve image quality and eliminate tiny artifacts. Empirical results demonstrate that WMAdapter offers strong flexibility, exceptional image generation quality and competitive watermark robustness.

[CV-31] CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

链接: https://arxiv.org/abs/2406.08336
作者: Xueyuan Chen,Dongchao Yang,Dingdong Wang,Xixin Wu,Zhiyong Wu,Helen Meng
关键词: prosody naturalness, aims to transform, Dysarthric speech, transform dysarthric speech, speaker similarity
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*备注: Accepted by Interspeech 2024

点击查看摘要

Abstract:Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (i) a multi-modal content encoder to extract robust phoneme embeddings from dysarthric speech with auxiliary visual inputs; (ii) a speaker codec encoder to extract and normalize the speaker-aware codecs from the dysarthric speech, in order to provide original timbre and normal prosody; (iii) a codec language model based speech decoder to reconstruct the speech based on the extracted phoneme embeddings and normalized codecs. Evaluations on the commonly used UASpeech corpus show that our proposed model can achieve significant improvements in terms of speaker similarity and prosody naturalness.

[CV-32] UDON: Universal Dynamic Online distillatioN for generic image representations

链接: https://arxiv.org/abs/2406.08332
作者: Nikolaos-Antonios Ypsilantis,Kaifeng Chen,André Araujo,Ondřej Chum
关键词: instance-level recognition applications, enabling real-world fine-grained, Universal image representations, recognition applications, image representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Universal image representations are critical in enabling real-world fine-grained and instance-level recognition applications, where objects and entities from any domain must be identified at large scale. Despite recent advances, existing methods fail to capture important domain-specific knowledge, while also ignoring differences in data distribution across different domains. This leads to a large performance gap between efficient universal solutions and expensive approaches utilising a collection of specialist models, one for each domain. In this work, we make significant strides towards closing this gap, by introducing a new learning technique, dubbed UDON (Universal Dynamic Online DistillatioN). UDON employs multi-teacher distillation, where each teacher is specialized in one domain, to transfer detailed domain-specific knowledge into the student universal embedding. UDON’s distillation approach is not only effective, but also very efficient, by sharing most model parameters between the student and all teachers, where all models are jointly trained in an online manner. UDON also comprises a sampling technique which adapts the training process to dynamically allocate batches to domains which are learned slower and require more frequent processing. This boosts significantly the learning of complex domains which are characterised by a large number of classes and long-tail distributions. With comprehensive experiments, we validate each component of UDON, and showcase significant improvements over the state of the art in the recent UnED benchmark. Code: this https URL .

[CV-33] LaMOT: Language-Guided Multi-Object Tracking

链接: https://arxiv.org/abs/2406.08324
作者: Yunhao Li,Xiaoqiong Liu,Luke Liu,Heng Fan,Libo Zhang
关键词: increasing attention recently, drawn increasing attention, crucial tracking problem, attention recently, drawn increasing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-Language MOT is a crucial tracking problem and has drawn increasing attention recently. It aims to track objects based on human language commands, replacing the traditional use of templates or pre-set information from training sets in conventional tracking tasks. Despite various efforts, a key challenge lies in the lack of a clear understanding of why language is used for tracking, which hinders further development in this field. In this paper, we address this challenge by introducing Language-Guided MOT, a unified task framework, along with a corresponding large-scale benchmark, termed LaMOT, which encompasses diverse scenarios and language descriptions. Specially, LaMOT comprises 1,660 sequences from 4 different datasets and aims to unify various Vision-Language MOT tasks while providing a standardized evaluation platform. To ensure high-quality annotations, we manually assign appropriate descriptive texts to each target in every video and conduct careful inspection and correction. To the best of our knowledge, LaMOT is the first benchmark dedicated to Language-Guided MOT. Additionally, we propose a simple yet effective tracker, termed LaMOTer. By establishing a unified task framework, providing challenging benchmarks, and offering insights for future algorithm design and evaluation, we expect to contribute to the advancement of research in Vision-Language MOT. We will release the data at this https URL.

[CV-34] AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer

链接: https://arxiv.org/abs/2406.08298
作者: Yitao Xu,Tong Zhang,Sabine Süsstrunk
关键词: image classification tasks, Neural Cellular Automata, demonstrated remarkable performance, Cellular Automata, classification tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 26 pages, 11 figures

点击查看摘要

Abstract:Vision Transformers (ViTs) have demonstrated remarkable performance in image classification tasks, particularly when equipped with local information via region attention or convolutions. While such architectures improve the feature aggregation from different granularities, they often fail to contribute to the robustness of the networks. Neural Cellular Automata (NCA) enables the modeling of global cell representations through local interactions, with its training strategies and architecture design conferring strong generalization ability and robustness against noisy inputs. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformer that uses NCA as plug-in-play adaptors between ViT layers, enhancing ViT’s performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Furthermore, we develop an algorithm for identifying the most effective insertion points for AdaNCA based on our analysis of AdaNCA placement and robustness improvement. With less than a 3% increase in parameters, AdaNCA contributes to more than 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, we demonstrate with extensive evaluations across 8 robustness benchmarks and 4 ViT architectures that AdaNCA, as a plug-in-play module, consistently improves the robustness of ViTs.

[CV-35] Vessel Re-identification and Activity Detection in Thermal Domain for Maritime Surveillance

链接: https://arxiv.org/abs/2406.08294
作者: Yasod Ginige,Ransika Gunasekara,Darsha Hewavitharana,Manjula Ariyarathne,Ranga Rodrigo,Peshala Jayasekara
关键词: mitigate illegal activities, Maritime surveillance, illegal fishing, drug smuggling, human trafficking
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Maritime surveillance is vital to mitigate illegal activities such as drug smuggling, illegal fishing, and human trafficking. Vision-based maritime surveillance is challenging mainly due to visibility issues at night, which results in failures in re-identifying vessels and detecting suspicious activities. In this paper, we introduce a thermal, vision-based approach for maritime surveillance with object tracking, vessel re-identification, and suspicious activity detection capabilities. For vessel re-identification, we propose a novel viewpoint-independent algorithm which compares features of the sides of the vessel separately (separate side-spaces) leveraging shape information in the absence of color features. We propose techniques to adapt tracking and activity detection algorithms for the thermal domain and train them using a thermal dataset we created. This dataset will be the first publicly available benchmark dataset for thermal maritime surveillance. Our system is capable of re-identifying vessels with an 81.8% Top1 score and identifying suspicious activities with a 72.4% frame mAP score; a new benchmark for each task in the thermal domain.

[CV-36] Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata

链接: https://arxiv.org/abs/2406.08292
作者: Dongsu Zhang,Francis Williams,Zan Gojcic,Karsten Kreis,Sanja Fidler,Young Min Kim,Amlan Kar
关键词: sparse LiDAR scans, large-scale sparse LiDAR, LiDAR scans, abundantly captured, autonomous vehicles
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to CVPR 2024 as highlight

点击查看摘要

Abstract:We aim to generate fine-grained 3D geometry from large-scale sparse LiDAR scans, abundantly captured by autonomous vehicles (AV). Contrary to prior work on AV scene completion, we aim to extrapolate fine geometry from unlabeled and beyond spatial limits of LiDAR scans, taking a step towards generating realistic, high-resolution simulation-ready 3D street environments. We propose hierarchical Generative Cellular Automata (hGCA), a spatially scalable conditional 3D generative model, which grows geometry recursively with local kernels following, in a coarse-to-fine manner, equipped with a light-weight planner to induce global consistency. Experiments on synthetic scenes show that hGCA generates plausible scene geometry with higher fidelity and completeness compared to state-of-the-art baselines. Our model generalizes strongly from sim-to-real, qualitatively outperforming baselines on the Waymo-open dataset. We also show anecdotal evidence of the ability to create novel objects from real-world geometric cues even when trained on limited synthetic content. More results and details can be found on this https URL.

[CV-37] A New Class Biorthogonal Spline Wavelet for Image Edge Detection

链接: https://arxiv.org/abs/2406.08285
作者: Dujuan Zhou,Zizhao Yuan
关键词: shown favorable characteristics, cubic special spline, shown favorable, BCSSW spline wavelet, spline wavelet
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Spline wavelets have shown favorable characteristics for localizing in both time and frequency. In this paper, we propose a new biorthogonal cubic special spline wavelet (BCSSW), based on the Cohen-Daubechies-Feauveau wavelet construction method and the cubic special spline algorithm. BCSSW has better properties in compact support, symmetry, and frequency domain characteristics. However, current mainstream detection operators usually ignore the uncertain representation of regional pixels and global structures. To solve these problems, we propose a structural uncertainty-aware and multi-structure operator fusion detection algorithm (EDBSW) based on a new BCSSW spline wavelet. By constructing a spline wavelet that efficiently handles edge effects, we utilize structural uncertainty-aware modulus maxima to detect highly uncertain edge samples. The proposed wavelet detection operator utilizes the multi-structure morphological operator and fusion reconstruction strategy to effectively address anti-noise processing and edge information of different frequencies. Numerous experiments have demonstrated its excellent performance in reducing noise and capturing edge structure details.

[CV-38] Dataset Enhancement with Instance-Level Augmentations

链接: https://arxiv.org/abs/2406.08249
作者: Orest Kupyn,Christian Rupprecht
关键词: pre-trained latent diffusion, latent diffusion models, incorporating knowledge, wide distribution, distribution of pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a method for expanding a dataset by incorporating knowledge from the wide distribution of pre-trained latent diffusion models. Data augmentations typically incorporate inductive biases about the image formation process into the training (e.g. translation, scaling, colour changes, etc.). Here, we go beyond simple pixel transformations and introduce the concept of instance-level data augmentation by repainting parts of the image at the level of object instances. The method combines a conditional diffusion model with depth and edge maps control conditioning to seamlessly repaint individual objects inside the scene, being applicable to any segmentation or detection dataset. Used as a data augmentation method, it improves the performance and generalization of the state-of-the-art salient object detection, semantic segmentation and object detection models. By redrawing all privacy-sensitive instances (people, license plates, etc.), the method is also applicable for data anonymization. We also release fully synthetic and anonymized expansions for popular datasets: COCO, Pascal VOC and DUTS.

[CV-39] OpenCOLE: Towards Reproducible Automatic Graphic Design Generation

链接: https://arxiv.org/abs/2406.08232
作者: Naoto Inoue,Kento Masui,Wataru Shimoda,Kota Yamaguchi
关键词: received considerable attention, recently received considerable, considerable attention, recently received, received considerable
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: To appear as an extended abstract (EA) in Workshop on Graphic Design Understanding and Generation (in CVPR2024), code: this https URL

点击查看摘要

Abstract:Automatic generation of graphic designs has recently received considerable attention. However, the state-of-the-art approaches are complex and rely on proprietary datasets, which creates reproducibility barriers. In this paper, we propose an open framework for automatic graphic design called OpenCOLE, where we build a modified version of the pioneering COLE and train our model exclusively on publicly available datasets. Based on GPT4V evaluations, our model shows promising performance comparable to the original COLE. We release the pipeline and training results to encourage open development.

[CV-40] Using Deep Convolutional Neural Networks to Detect Rendered Glitches in Video Games

链接: https://arxiv.org/abs/2406.08231
作者: Carlos Garcia Ling,Konrad Tollmar,Linus Gisslen
关键词: Convolutional Neural Networks, Deep Convolutional Neural, Neural Networks, Deep Convolutional, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures, AAIDE conference

点击查看摘要

Abstract:In this paper, we present a method using Deep Convolutional Neural Networks (DCNNs) to detect common glitches in video games. The problem setting consists of an image (800x800 RGB) as input to be classified into one of five defined classes, normal image, or one of four different kinds of glitches (stretched, low resolution, missing and placeholder textures). Using a supervised approach, we train a ShuffleNetV2 using generated data. This work focuses on detecting texture graphical anomalies achieving arguably good performance with an accuracy of 86.8%, detecting 88% of the glitches with a false positive rate of 8.7%, and with the models being able to generalize and detect glitches even in unseen objects. We apply a confidence measure as well to tackle the issue with false positives as well as an effective way of aggregating images to achieve better detection in production. The main use of this work is the partial automatization of graphical testing in the final stages of video game development.

[CV-41] DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

链接: https://arxiv.org/abs/2406.08226
作者: Jordy Van Landeghem,Subhajit Maity,Ayan Banerjee,Matthew Blaschko,Marie-Francine Moens,Josep Lladós,Sanket Biswas
关键词: document image classification, image classification, DIC, document layout analysis, explores knowledge distillation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to ICDAR 2024 (Athens, Greece)

点击查看摘要

Abstract:This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness.

[CV-42] A Sociotechnical Lens for Evaluating Computer Vision Models: A Case Study on Detecting and Reasoning about Gender and Emotion

链接: https://arxiv.org/abs/2406.08222
作者: Sha Luo,Sang Jung Kim,Zening Duan,Kaiping Chen
关键词: evolving landscape, landscape of computer, critical area, models, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:In the evolving landscape of computer vision (CV) technologies, the automatic detection and interpretation of gender and emotion in images is a critical area of study. This paper investigates social biases in CV models, emphasizing the limitations of traditional evaluation metrics such as precision, recall, and accuracy. These metrics often fall short in capturing the complexities of gender and emotion, which are fluid and culturally nuanced constructs. Our study proposes a sociotechnical framework for evaluating CV models, incorporating both technical performance measures and considerations of social fairness. Using a dataset of 5,570 images related to vaccination and climate change, we empirically compared the performance of various CV models, including traditional models like DeepFace and FER, and generative models like GPT-4 Vision. Our analysis involved manually validating the gender and emotional expressions in a subset of images to serve as benchmarks. Our findings reveal that while GPT-4 Vision outperforms other models in technical accuracy for gender classification, it exhibits discriminatory biases, particularly in response to transgender and non-binary personas. Furthermore, the model’s emotion detection skew heavily towards positive emotions, with a notable bias towards associating female images with happiness, especially when prompted by male personas. These findings underscore the necessity of developing more comprehensive evaluation criteria that address both validity and discriminatory biases in CV models. Our proposed framework provides guidelines for researchers to critically assess CV tools, ensuring their application in communication research is both ethical and effective. The significant contribution of this study lies in its emphasis on a sociotechnical approach, advocating for CV technologies that support social good and mitigate biases rather than perpetuate them.

[CV-43] Runtime Freezing: Dynamic Class Loss for Multi-Organ 3D Segmentation

链接: https://arxiv.org/abs/2406.08217
作者: James Willoughby,Irina Voiculescu
关键词: crucial pre-processing step, refined downstream tasks, medical domain, crucial pre-processing, pre-processing step
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 4 Pages. Accepted to ISBI 2024

点击查看摘要

Abstract:Segmentation has become a crucial pre-processing step to many refined downstream tasks, and particularly so in the medical domain. Even with recent improvements in segmentation models, many segmentation tasks remain difficult. When multiple organs are segmented simultaneously, difficulties are due not only to the limited availability of labelled data, but also to class imbalance. In this work we propose dynamic class-based loss strategies to mitigate the effects of highly imbalanced training data. We show how our approach improves segmentation performance on a challenging Multi-Class 3D Abdominal Organ dataset.

[CV-44] Diffusion-Promoted HDR Video Reconstruction

链接: https://arxiv.org/abs/2406.08204
作者: Yuanshen Guan,Ruikang Xu,Mingde Yao,Ruisheng Gao,Lizhi Wang,Zhiwei Xiong
关键词: High dynamic range, low dynamic range, dynamic range, High dynamic, HDR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Arxiv Preprint

点击查看摘要

Abstract:High dynamic range (HDR) video reconstruction aims to generate HDR videos from low dynamic range (LDR) frames captured with alternating exposures. Most existing works solely rely on the regression-based paradigm, leading to adverse effects such as ghosting artifacts and missing details in saturated regions. In this paper, we propose a diffusion-promoted method for HDR video reconstruction, termed HDR-V-Diff, which incorporates a diffusion model to capture the HDR distribution. As such, HDR-V-Diff can reconstruct HDR videos with realistic details while alleviating ghosting artifacts. However, the direct introduction of video diffusion models would impose massive computational burden. Instead, to alleviate this burden, we first propose an HDR Latent Diffusion Model (HDR-LDM) to learn the distribution prior of single HDR frames. Specifically, HDR-LDM incorporates a tonemapping strategy to compress HDR frames into the latent space and a novel exposure embedding to aggregate the exposure information into the diffusion process. We then propose a Temporal-Consistent Alignment Module (TCAM) to learn the temporal information as a complement for HDR-LDM, which conducts coarse-to-fine feature alignment at different scales among video frames. Finally, we design a Zero-Init Cross-Attention (ZiCA) mechanism to effectively integrate the learned distribution prior and temporal information for generating HDR frames. Extensive experiments validate that HDR-V-Diff achieves state-of-the-art results on several representative datasets.

[CV-45] 2nd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

链接: https://arxiv.org/abs/2406.08192
作者: Zhensong Xu,Jiangtao Yao,Chengjing Wu,Ting Liu,Luoqi Liu
关键词: Complex video object, Complex video, automatic data annotation, video editing, fundamental task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5pages, 4 figures, technique report for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

点击查看摘要

Abstract:Complex video object segmentation serves as a fundamental task for a wide range of downstream applications such as video editing and automatic data annotation. Here we present the 2nd place solution in the MOSE track of PVUW 2024. To mitigate problems caused by tiny objects, similar objects and fast movements in MOSE. We use instance segmentation to generate extra pretraining data from the valid and test set of MOSE. The segmented instances are combined with objects extracted from COCO to augment the training data and enhance semantic representation of the baseline model. Besides, motion blur is added during training to increase robustness against image blur induced by motion. Finally, we apply test time augmentation (TTA) and memory strategy to the inference stage. Our method ranked 2nd in the MOSE track of PVUW 2024, with a \mathcalJ of 0.8007, a \mathcalF of 0.8683 and a \mathcalJ \ \mathcalF of 0.8345.

[CV-46] Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment

链接: https://arxiv.org/abs/2406.08176
作者: Taekbeom Lee,Youngseok Jang,H. Jin Kim
关键词: Neural implicit representation, success cases, implicit representation, representation has attracted, attracted attention
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: RA-L. 8 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Neural implicit representation has attracted attention in 3D reconstruction through various success cases. For further applications such as scene understanding or editing, several works have shown progress towards object compositional reconstruction. Despite their superior performance in observed regions, their performance is still limited in reconstructing objects that are partially observed. To better treat this problem, we introduce category-level neural fields that learn meaningful common 3D information among objects belonging to the same category present in the scene. Our key idea is to subcategorize objects based on their observed shape for better training of the category-level model. Then we take advantage of the neural field to conduct the challenging task of registering partially observed objects by selecting and aligning against representative objects selected by ray-based uncertainty. Experiments on both simulation and real-world datasets demonstrate that our method improves the reconstruction of unobserved parts for several categories.

[CV-47] Continuous fake media detection: adapting deepfake detectors to new generative techniques

链接: https://arxiv.org/abs/2406.08171
作者: Francesco Tassone,Luca Maiano,Irene Amerini
关键词: impressively high rate, high rate, continue to evolve, impressively high, Generative techniques continue
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative techniques continue to evolve at an impressively high rate, driven by the hype about these technologies. This rapid advancement severely limits the application of deepfake detectors, which, despite numerous efforts by the scientific community, struggle to achieve sufficiently robust performance against the ever-changing content. To address these limitations, in this paper, we propose an analysis of two continuous learning techniques on a Short and a Long sequence of fake media. Both sequences include a complex and heterogeneous range of deepfakes generated from GANs, computer graphics techniques, and unknown sources. Our study shows that continual learning could be important in mitigating the need for generalizability. In fact, we show that, although with some limitations, continual learning methods help to maintain good performance across the entire training sequence. For these techniques to work in a sufficiently robust way, however, it is necessary that the tasks in the sequence share similarities. In fact, according to our experiments, the order and similarity of the tasks can affect the performance of the models over time. To address this problem, we show that it is possible to group tasks based on their similarity. This small measure allows for a significant improvement even in longer sequences. This result suggests that continual techniques can be combined with the most promising detection methods, allowing them to catch up with the latest generative techniques. In addition to this, we propose an overview of how this learning approach can be integrated into a deepfake detection pipeline for continuous integration and continuous deployment (CI/CD). This allows you to keep track of different funds, such as social networks, new generative tools, or third-party datasets, and through the integration of continuous learning, allows constant maintenance of the detectors.

[CV-48] ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

链接: https://arxiv.org/abs/2406.08164
作者: Irene Huang,Wei Lin,M. Jehanzeb Mirza,Jacob A. Hansen,Sivan Doveh,Victor Ion Butoi,Roei Herzig,Assaf Arbelle,Hilde Kuhene,Trevor Darrel,Chuang Gan,Aude Oliva,Rogerio Feris,Leonid Karlinsky
关键词: Large Language Model, entails grasping, significance of attributes, word order, grasping the significance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The first three authors contributed equally

点击查看摘要

Abstract:Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs’ LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe – a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR QA’. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.

[CV-49] CT3D: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer

链接: https://arxiv.org/abs/2406.08152
作者: Hualian Sheng,Sijia Cai,Na Zhao,Bing Deng,Qiao Liang,Min-Jian Zhao,Jieping Ye
关键词: computer vision, aiming to accurately, three-dimensional space, clouds is rapidly, rapidly advancing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, 8 figures

点击查看摘要

Abstract:The field of 3D object detection from point clouds is rapidly advancing in computer vision, aiming to accurately and efficiently detect and localize objects in three-dimensional space. Current 3D detectors commonly fall short in terms of flexibility and scalability, with ample room for advancements in performance. In this paper, our objective is to address these limitations by introducing two frameworks for 3D object detection with minimal hand-crafted design. Firstly, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder for point features within each proposal. Secondly, we present an enhanced network called CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information. Additionally, CT3D ++ utilizes a point-to-key bidirectional encoder for more efficient feature encoding with reduced computational cost. By replacing the corresponding components of CT3D with these novel modules, CT3D++ achieves state-of-the-art performance on both the KITTI dataset and the large-scale Way-mo Open Dataset. The source code for our frameworks will be made accessible at this https URL.

[CV-50] Universal Scale Laws for Colors and Patterns in Imagery

链接: https://arxiv.org/abs/2406.08149
作者: Rémi Michel,Mohamed Tamaazousti
关键词: Fully Colored Images, adjust spatial resolution, Distribution of colors, Fully Colored, Integral Fluctuation Theorem
类目: Computer Vision and Pattern Recognition (cs.CV); Chaotic Dynamics (nlin.CD)
*备注: 20 pages

点击查看摘要

Abstract:Distribution of colors and patterns in images is observed through cascades that adjust spatial resolution and dynamics. Cascades of colors reveal the emergent universal property that Fully Colored Images (FCIs) of natural scenes adhere to the debated continuous linear log-scale law (slope -2.00 \pm 0.01 ) (L1). Cascades of discrete 2 \times 2 patterns are derived from pixel squares reductions onto the seven unlabeled rotation-free textures (0000, 0001, 0011, 0012, 0101, 0102, 0123). They exhibit an unparalleled universal entropy maximum of 1.74 \pm 0.013 at some dynamics regardless of spatial scale (L2). Patterns also adhere to the Integral Fluctuation Theorem ( 1.00 \pm 0.01 ) (L3), pivotal in studies of chaotic systems. Images with fewer colors exhibit quadratic shift and bias from L1 and L3 but adhere to L2. Randomized Hilbert fractals FCIs better match the laws than basic-to-AI-based simulations. Those results are of interest in Neural Networks, out of equilibrium physics and spectral imagery.

[CV-51] Valeo4Cast: A Modular Approach to End-to-End Forecasting

链接: https://arxiv.org/abs/2406.08113
作者: Yihong Xu,Éloi Zablocki,Alexandre Boulch,Gilles Puy,Mickael Chen,Florent Bartoccioni,Nermin Samet,Oriane Siméoni,Spyros Gidaris,Tuan-Hung Vu,Andrei Bursuc,Eduardo Valle,Renaud Marlet,Matthieu Cord
关键词: autonomous driving systems, Motion forecasting, traffic signals, systems to anticipate, surrounding agents
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Winning solution of the Argoverse 2 “Unified Detection, Tracking, and Forecasting” challenge, held at CVPR 2024 WAD

点击查看摘要

Abstract:Motion forecasting is crucial in autonomous driving systems to anticipate the future trajectories of surrounding agents such as pedestrians, vehicles, and traffic signals. In end-to-end forecasting, the model must jointly detect from sensor data (cameras or LiDARs) the position and past trajectories of the different elements of the scene and predict their future location. We depart from the current trend of tackling this task via end-to-end training from perception to forecasting and we use a modular approach instead. Following a recent study, we individually build and train detection, tracking, and forecasting modules. We then only use consecutive finetuning steps to integrate the modules better and alleviate compounding errors. Our study reveals that this simple yet effective approach significantly improves performance on the end-to-end forecasting benchmark. Consequently, our solution ranks first in the Argoverse 2 end-to-end Forecasting Challenge held at CVPR 2024 Workshop on Autonomous Driving (WAD), with 63.82 mAPf. We surpass forecasting results by +17.1 points over last year’s winner and by +13.3 points over this year’s runner-up. This remarkable performance in forecasting can be explained by our modular paradigm, which integrates finetuning strategies and significantly outperforms the end-to-end-trained counterparts.

[CV-52] Adversarial Patch for 3D Local Feature Extractor

链接: https://arxiv.org/abs/2406.08102
作者: Yu Wen Pao,Li Chang Lai,Hong-Yi Lin
关键词: computer vision tasks, Local feature extractors, vision tasks, computer vision, Local feature
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Local feature extractors are the cornerstone of many computer vision tasks. However, their vulnerability to adversarial attacks can significantly compromise their effectiveness. This paper discusses approaches to attack sophisticated local feature extraction algorithms and models to achieve two distinct goals: (1) forcing a match between originally non-matching image regions, and (2) preventing a match between originally matching regions. At the end of the paper, we discuss the performance and drawbacks of different patch generation methods.

[CV-53] Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

链接: https://arxiv.org/abs/2406.08096
作者: Runyi Yu,Tianyu He,Ailing Zeng,Yuchi Wang,Junliang Guo,Xu Tan,Chang Liu,Jie Chen,Jiang Bian
关键词: visual details preservation, visual details, aim to edit, movements in talking, speech while preserving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages of main text, 23 pages in total, 9 figures

点击查看摘要

Abstract:We aim to edit the lip movements in talking video according to the given speech while preserving the personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis. Current solutions handle the two sub-problems within a single generative model, resulting in a challenging trade-off between lip-sync quality and visual details preservation. Instead, we propose to disentangle the motion and appearance, and then generate them one by one with a speech-to-motion diffusion model and a motion-conditioned appearance generation model. However, there still remain challenges in each stage, such as motion-aware identity preservation in (1) and visual details preservation in (2). Therefore, to preserve personal identity, we adopt landmarks to represent the motion, and further employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders to encode the lip, non-lip appearance and motion, and then integrate them with a learned fusion module. We train MyTalk on a large-scale and diverse dataset. Experiments show that our method generalizes well to the unknown, even out-of-domain person, in terms of both lip sync and visual detail preservation. We encourage the readers to watch the videos on our project page (this https URL).

[CV-54] From Sim-to-Real: Toward General Event-based Low-light Frame Interpolation with Per-scene Optimization

链接: https://arxiv.org/abs/2406.08090
作者: Ziran Zhang,Yongrui Ma,Yueting Chen,Feng Zhang,Jinwei Gu,Tianfan Xue,Shi Guo
关键词: Video Frame Interpolation, frame rate up-conversion, Frame Interpolation, Video Frame, video enhancement
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video Frame Interpolation (VFI) is important for video enhancement, frame rate up-conversion, and slow-motion generation. The introduction of event cameras, which capture per-pixel brightness changes asynchronously, has significantly enhanced VFI capabilities, particularly for high-speed, nonlinear motions. However, these event-based methods encounter challenges in low-light conditions, notably trailing artifacts and signal latency, which hinder their direct applicability and generalization. Addressing these issues, we propose a novel per-scene optimization strategy tailored for low-light conditions. This approach utilizes the internal statistics of a sequence to handle degraded event data under low-light conditions, improving the generalizability to different lighting and camera settings. To evaluate its robustness in low-light condition, we further introduce EVFI-LL, a unique RGB+Event dataset captured under low-light conditions. Our results demonstrate state-of-the-art performance in low-light environments. Both the dataset and the source code will be made publicly available upon publication. Project page: this https URL.

[CV-55] Identification of Conversation Partners from Egocentric Video

链接: https://arxiv.org/abs/2406.08089
作者: Tobias Dorszewski,Søren A. Fuglsang,Jens Hjortkjær
关键词: Communicating in noisy, multi-talker environments, environments is challenging, hearing impairments, people with hearing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: First Joint Egocentric Vision (EgoVis) Workshop at CVPR 2024

点击查看摘要

Abstract:Communicating in noisy, multi-talker environments is challenging, especially for people with hearing impairments. Egocentric video data can potentially be used to identify a user’s conversation partners, which could be used to inform selective acoustic amplification of relevant speakers. Recent introduction of datasets and tasks in computer vision enable progress towards analyzing social interactions from an egocentric perspective. Building on this, we focus on the task of identifying conversation partners from egocentric video and describe a suitable dataset. Our dataset comprises 69 hours of egocentric video of diverse multi-conversation scenarios where each individual was assigned one or more conversation partners, providing the labels for our computer vision task. This dataset enables the development and assessment of algorithms for identifying conversation partners and evaluating related approaches. Here, we describe the dataset alongside initial baseline results of this ongoing work, aiming to contribute to the exciting advancements in egocentric video analysis for social settings.

[CV-56] Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

链接: https://arxiv.org/abs/2406.08085
作者: Haoji Zhang,Yiqin Wang,Yansong Tang,Yong Liu,Jiashi Feng,Jifeng Dai,Xiaojie Jin
关键词: large language models, achieved prominent performance, online video streams, cross-modal alignment, video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:Benefiting from the advancements in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenario. However, online video streams, as one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the ‘dynamic’ nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information, interaction between continuous visual content and ‘asynchronous’ user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the memory mechanism of human. Our model is able to process extremely long video streams in real-time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, which is intimately related to performing understanding of online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenario, we propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method for such challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieves state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at the this https URL

[CV-57] A2-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

链接: https://arxiv.org/abs/2406.08079
作者: Lixian Zhang,Yi Zhao,Runmin Dong,Jinxiao Zhang,Shuai Yuan,Shilei Cao,Mengxuan Chen,Juepeng Zheng,Weijia Li,Wei Liu,Litong Feng,Haohuan Fu
关键词: provide Earth observations, data provide Earth, addressing global-scale challenges, encompassing critical spatial, Vast amounts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vast amounts of remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing critical spatial, temporal, and spectral information which is essential for addressing global-scale challenges such as land use monitoring, disaster prevention, and environmental change mitigation. Despite various pre-training methods tailored to the characteristics of RS data, a key limitation persists: the inability to effectively integrate spatial, temporal, and spectral information within a single unified model. To unlock the potential of RS data, we construct a Spatial-Temporal-Spectral Structured Dataset (STSSD) characterized by the incorporation of multiple RS sources, diverse coverage, unified locations within image sets, and heterogeneity within images. Building upon this structured dataset, we propose an Anchor-Aware Masked AutoEncoder method (A ^2 -MAE), leveraging intrinsic complementary information from the different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase. A ^2 -MAE integrates an anchor-aware masking strategy and a geographic encoding module to comprehensively exploit the properties of RS images. Specifically, the proposed anchor-aware masking strategy dynamically adapts the masking process based on the meta-information of a pre-selected anchor image, thereby facilitating the training on images captured by diverse types of RS sources within one model. Furthermore, we propose a geographic encoding method to leverage accurate spatial patterns, enhancing the model generalization capabilities for downstream applications that are generally location-related. Extensive experiments demonstrate our method achieves comprehensive improvements across various downstream tasks compared with existing RS pre-training methods, including image classification, semantic segmentation, and change detection tasks.

[CV-58] A Concept-Based Explainability Framework for Large Multimodal Models

链接: https://arxiv.org/abs/2406.08074
作者: Jayneel Parekh,Pegah Khayatan,Mustafa Shukor,Alasdair Newson,Matthieu Cord
关键词: large language models, combine unimodal encoders, Large multimodal models, large language, perform multimodal tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as “multi-modal concepts”. We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. We will publicly release our code.

[CV-59] CFG: Manifold-constrained Classifier Free Guidance for Diffusion Models

链接: https://arxiv.org/abs/2406.08070
作者: Hyungjin Chung,Jeongsol Kim,Geon Yeong Park,Hyelin Nam,Jong Chul Ye
关键词: CFG, Classifier-free guidance, modern diffusion models, fundamental tool, tool in modern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation. Although effective, CFG has notable drawbacks. For instance, DDIM with CFG lacks invertibility, complicating image editing; furthermore, high guidance scales, essential for high-quality outputs, frequently result in issues like mode collapse. Contrary to the widespread belief that these are inherent limitations of diffusion models, this paper reveals that the problems actually stem from the off-manifold phenomenon associated with CFG, rather than the diffusion models themselves. More specifically, inspired by the recent advancements of diffusion model-based inverse problem solvers (DIS), we reformulate text-guidance as an inverse problem with a text-conditioned score matching loss, and develop CFG++, a novel approach that tackles the off-manifold challenges inherent in traditional CFG. CFG++ features a surprisingly simple fix to CFG, yet it offers significant improvements, including better sample quality for text-to-image generation, invertibility, smaller guidance scales, reduced mode collapse, etc. Furthermore, CFG++ enables seamless interpolation between unconditional and conditional sampling at lower guidance scales, consistently outperforming traditional CFG at all scales. Experimental results confirm that our method significantly enhances performance in text-to-image generation, DDIM inversion, editing, and solving inverse problems, suggesting a wide-ranging impact and potential applications in various fields that utilize text guidance. Project Page: this https URL.

[CV-60] MWIRSTD: A MWIR Small Target Detection Dataset

链接: https://arxiv.org/abs/2406.08063
作者: Nikhil Kumar,Avinash Upadhyay,Shreya Sharma,Manoj Sharma,Pravendra Singh
关键词: mid-wave infrared, video sequences, sequences containing approximately, paper presents, distinct classes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ICIP2024

点击查看摘要

Abstract:This paper presents a novel mid-wave infrared (MWIR) small target detection dataset (MWIRSTD) comprising 14 video sequences containing approximately 1053 images with annotated targets of three distinct classes of small objects. Captured using cooled MWIR imagers, the dataset offers a unique opportunity for researchers to develop and evaluate state-of-the-art methods for small object detection in realistic MWIR scenes. Unlike existing datasets, which primarily consist of uncooled thermal images or synthetic data with targets superimposed onto the background or vice versa, MWIRSTD provides authentic MWIR data with diverse targets and environments. Extensive experiments on various traditional methods and deep learning-based techniques for small target detection are performed on the proposed dataset, providing valuable insights into their efficacy. The dataset and code are available at this https URL.

[CV-61] A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR

链接: https://arxiv.org/abs/2406.08046
作者: Sasidhar Alavala,Anil Kumar Vadde,Aparnamala Kancheti,Subrahmanyam Gorthi
关键词: Auto WCEBleedGen Challenge, Wireless Capsule Endoscopy, WCEBleedGen Challenge, Auto WCEBleedGen, Adaptive Histogram Equalization
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this paper, we present our approach to the Auto WCEBleedGen Challenge V2 2024. Our solution combines the Swin Transformer for the initial classification of bleeding frames and RT-DETR for further detection of bleeding in Wireless Capsule Endoscopy (WCE), enhanced by a series of image preprocessing steps. These steps include converting images to Lab colour space, applying Contrast Limited Adaptive Histogram Equalization (CLAHE) for better contrast, and using Gaussian blur to suppress artefacts. The Swin Transformer utilizes a tiered architecture with shifted windows to efficiently manage self-attention calculations, focusing on local windows while enabling cross-window interactions. RT-DETR features an efficient hybrid encoder for fast processing of multi-scale features and an uncertainty-minimal query selection for enhanced accuracy. The class activation maps by Ablation-CAM are plausible to the model’s decisions. On the validation set, this approach achieves a classification accuracy of 98.5% (best among the other state-of-the-art models) compared to 91.7% without any pre-processing and an \textAP_50 of 66.7% compared to 65.0% with state-of-the-art YOLOv8. On the test set, this approach achieves a classification accuracy and F1 score of 87.0% and 89.0% respectively.

[CV-62] Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

链接: https://arxiv.org/abs/2406.08037
作者: Xiangyang Yang,Dan Zeng,Xucheng Wang,You Wu,Hengzhou Ye,Shuiwang Li
关键词: Empowered by transformer-based, transformer-based models, visual tracking, efficient visual tracking, tracking
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypassing transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels. Instead, this impact varies based on the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect the tracking accuracy. We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed, which adaptively simplifies the architecture of ViTs and thus speeds up the inference process. To counteract the time cost incurred by the BDMs and further enhance the efficiency of ViTs, we innovatively adapt a pruning technique to reduce the dimension of the latent representation of tokens in each transformer block. Extensive experiments on multiple tracking benchmarks validate the effectiveness and generality of the proposed method and show that it achieves state-of-the-art performance. Code is released at: \hrefthis https URL

[CV-63] LVBench: An Extreme Long Video Understanding Benchmark

链接: https://arxiv.org/abs/2406.08035
作者: Weihan Wang,Zehai He,Wenyi Hong,Yean Cheng,Xiaohan Zhang,Ji Qi,Shiyu Huang,Bin Xu,Yuxiao Dong,Ming Ding,Jie Tang
关键词: Recent progress, multimodal large language, large language models, long video, long video understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: this https URL.

[CV-64] Deep Learning for Slum Mapping in Remote Sensing Images: A Meta-analysis and Review

链接: https://arxiv.org/abs/2406.08031
作者: Anjali Raj,Adway Mitra,Manjira Sinha
关键词: Nations Development Program, United Nations Development, Sustainable Development Goals, major Sustainable Development, Development Goals
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The major Sustainable Development Goals (SDG) 2030, set by the United Nations Development Program (UNDP), include sustainable cities and communities, no poverty, and reduced inequalities. However, millions of people live in slums or informal settlements with poor living conditions in many major cities around the world, especially in less developed countries. To emancipate these settlements and their inhabitants through government intervention, accurate data about slum location and extent is required. While ground survey data is the most reliable, such surveys are costly and time-consuming. An alternative is remotely sensed data obtained from very high-resolution (VHR) imagery. With the advancement of new technology, remote sensing based mapping of slums has emerged as a prominent research area. The parallel rise of Artificial Intelligence, especially Deep Learning has added a new dimension to this field as it allows automated analysis of satellite imagery to identify complex spatial patterns associated with slums. This article offers a detailed review and meta-analysis of research on slum mapping using remote sensing imagery from 2014 to 2024, with a special focus on deep learning approaches. Our analysis reveals a trend towards increasingly complex neural network architectures, with advancements in data preprocessing and model training techniques significantly enhancing slum identification accuracy. We have attempted to identify key methodologies that are effective across diverse geographic contexts. While acknowledging the transformative impact Convolutional Neural Networks (CNNs) in slum detection, our review underscores the absence of a universally optimal model, suggesting the need for context-specific adaptations. We also identify prevailing challenges in this field, such as data limitations and a lack of model explainability and suggest potential strategies for overcoming these.

[CV-65] Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

链接: https://arxiv.org/abs/2406.08024
作者: Shimin Chen,Yitian Yuan,Shaoxiang Chen,Zequn Jie,Lin Ma
关键词: image-based Large Vision-Language, Large Vision-Language Models, Amidst the advancements, image-based Large, Large Vision-Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data. This paper addresses the challenge by leveraging the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data. Our innovative weighted token sampler significantly compresses the visual token numbers of each video frame, effectively cutting computational expenses. We also find that judiciously using just 10% of the video data, compared to prior video-LVLMs, yields impressive results during various training phases. Moreover, we delve into the influence of video instruction data in limited-resource settings, highlighting the significance of incorporating video training data that emphasizes temporal understanding to enhance model performance. The resulting Fewer Tokens and Fewer Videos LVLM (FTFV-LVLM) exhibits exceptional performance across video and image benchmarks, validating our model’s design and training approaches.

[CV-66] Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model

链接: https://arxiv.org/abs/2406.08020
作者: Kyeongjin Ahn,Sungwon Han,Sungwon Park,Jihee Kim,Sangyoon Park,Meeyoung Cha
关键词: natural disasters demand, precise damage assessment, increasing frequency, frequency and intensity, intensity of natural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 4 figures, 2 tables

点击查看摘要

Abstract:The increasing frequency and intensity of natural disasters demand more sophisticated approaches for rapid and precise damage assessment. To tackle this issue, researchers have developed various methods on disaster benchmark datasets from satellite imagery to aid in detecting disaster damage. However, the diverse nature of geographical landscapes and disasters makes it challenging to apply existing methods to regions unseen during training. We present DAVI (Disaster Assessment with VIsion foundation model), which overcomes domain disparities and detects structural damage (e.g., building) without requiring ground-truth labels of the target region. DAVI integrates task-specific knowledge from a model trained on source regions with an image segmentation foundation model to generate pseudo labels of possible damage in the target region. It then employs a two-stage refinement process, targeting both the pixel and overall image, to more accurately pinpoint changes in disaster-struck areas based on before-and-after images. Comprehensive evaluations demonstrate that DAVI achieves exceptional performance across diverse terrains (e.g., USA and Mexico) and disaster types (e.g., wildfires, hurricanes, and earthquakes). This confirms its robustness in assessing disaster impact without dependence on ground-truth labels.

[CV-67] OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding

链接: https://arxiv.org/abs/2406.08009
作者: Yinan Deng,Jiahui Wang,Jingyu Zhao,Jianyu Dou,Yi Yang,Yufeng Yue
关键词: visual language models, showcase remarkable capabilities, recent years, language models, scene reconstruction facilitated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 8 pages, 7figures. Project Url: this https URL

点击查看摘要

Abstract:In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval. However, existing methods face some limitations: they either focus on learning point-wise features, resulting in blurry semantic understanding, or solely tackle object-level reconstruction, thereby overlooking the intricate details of the object’s interior. To address these challenges, we introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object-level. Moreover, we incorporate part-level features into the neural fields, enabling a nuanced representation of object interiors. This approach captures object-level instances while maintaining a fine-grained understanding. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at multiple scales, including global movement and local manipulation.

[CV-68] Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

链接: https://arxiv.org/abs/2406.08001
作者: Jiaxin Deng,Junbiao Pang,Baochang Zhang
关键词: Sharpness-Aware Minimization, Asymptotic Unbiased Sampling, SAM, effectively reducing, Minimization
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) has emerged as a promising approach for effectively reducing the generalization error. However, SAM incurs twice the computational cost compared to base optimizer (e.g., SGD). We propose Asymptotic Unbiased Sampling with respect to iterations to accelerate SAM (AUSAM), which maintains the model’s generalization capacity while significantly enhancing computational efficiency. Concretely, we probabilistically sample a subset of data points beneficial for SAM optimization based on a theoretically guaranteed criterion, i.e., the Gradient Norm of each Sample (GNS). We further approximate the GNS by the difference in loss values before and after perturbation in SAM. As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across a range of tasks and networks, i.e., classification, human pose estimation and network quantization. On CIFAR10/100 and Tiny-ImageNet, AUSAM achieves results comparable to SAM while providing a speedup of over 70%. Compared to recent dynamic data pruning methods, AUSAM is better suited for SAM and excels in maintaining performance. Additionally, AUSAM accelerates optimization in human pose estimation and model quantization without sacrificing performance, demonstrating its broad practicality.

[CV-69] SimSAM: Simple Siamese Representations Based Semantic Affinity Matrix for Unsupervised Image Segmentation

链接: https://arxiv.org/abs/2406.07986
作者: Chanda Grover Kamra,Indra Deep Mastan,Nitin Kumar,Debayan Gupta
关键词: learn data representations, Semantic Affinity Matrix, Recent developments, self-supervised learning, Affinity Matrix
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 Pages-Main Paper , 6 figures, 6Tables (Main Paper), ICIP 2024, 8 Pages: Supplementary

点击查看摘要

Abstract:Recent developments in self-supervised learning (SSL) have made it possible to learn data representations without the need for annotations. Inspired by the non-contrastive SSL approach (SimSiam), we introduce a novel framework SIMSAM to compute the Semantic Affinity Matrix, which is significant for unsupervised image segmentation. Given an image, SIMSAM first extracts features using pre-trained DINO-ViT, then projects the features to predict the correlations of dense features in a non-contrastive way. We show applications of the Semantic Affinity Matrix in object segmentation and semantic segmentation tasks. Our code is available at this https URL.

[CV-70] Real-world Image Dehazing with Coherence-based Label Generator and Cooperative Unfolding Network

链接: https://arxiv.org/abs/2406.07966
作者: Chengyu Fang,Chunming He,Fengyang Xiao,Yulun Zhang,Longxiang Tang,Yuelin Zhang,Kai Li,Xiu Li
关键词: Real-world Image Dehazing, alleviate haze-induced degradation, Image Dehazing, aims to alleviate, real-world settings
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Real-world Image Dehazing (RID) aims to alleviate haze-induced degradation in real-world settings. This task remains challenging due to the complexities in accurately modeling real haze distributions and the scarcity of paired real-world data. To address these challenges, we first introduce a cooperative unfolding network that jointly models atmospheric scattering and image scenes, effectively integrating physical knowledge into deep networks to restore haze-contaminated details. Additionally, we propose the first RID-oriented iterative mean-teacher framework, termed the Coherence-based Label Generator, to generate high-quality pseudo labels for network training. Specifically, we provide an optimal label pool to store the best pseudo-labels during network training, leveraging both global and local coherence to select high-quality candidates and assign weights to prioritize haze-free regions. We verify the effectiveness of our method, with experiments demonstrating that it achieves state-of-the-art performance on RID tasks. Code will be available at \urlthis https URL.

[CV-71] Accurate Explanation Model for Image Classifiers using Class Association Embedding

链接: https://arxiv.org/abs/2406.07961
作者: Ruitao Xie,Jingbang Chen,Limai Jiang,Rui Xiao,Yi Pan,Yunpeng Cai
关键词: data analysis, crucially demanded, image classification tasks, primary task, Image classification
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 40th IEEE International Conference on Data Engineering

点击查看摘要

Abstract:Image classification is a primary task in data analysis where explainable models are crucially demanded in various applications. Although amounts of methods have been proposed to obtain explainable knowledge from the black-box classifiers, these approaches lack the efficiency of extracting global knowledge regarding the classification task, thus is vulnerable to local traps and often leads to poor accuracy. In this study, we propose a generative explanation model that combines the advantages of global and local knowledge for explaining image classifiers. We develop a representation learning method called class association embedding (CAE), which encodes each sample into a pair of separated class-associated and individual codes. Recombining the individual code of a given sample with altered class-associated code leads to a synthetic real-looking sample with preserved individual characters but modified class-associated features and possibly flipped class assignments. A building-block coherency feature extraction algorithm is proposed that efficiently separates class-associated features from individual ones. The extracted feature space forms a low-dimensional manifold that visualizes the classification decision patterns. Explanation on each individual sample can be then achieved in a counter-factual generation manner which continuously modifies the sample in one direction, by shifting its class-associated code along a guided path, until its classification outcome is changed. We compare our method with state-of-the-art ones on explaining image classification tasks in the form of saliency maps, demonstrating that our method achieves higher accuracies. The code is available at this https URL.

[CV-72] DemosaicFormer: Coarse-to-Fine Demosaicing Network for HybridEVS Camera

链接: https://arxiv.org/abs/2406.07951
作者: Senyan Xu,Zhijing Sun,Jiaying Zhu,Yurui Zhu,Xueyang Fu,Zheng-Jun Zha
关键词: Hybrid Event-Based Vision, Event-Based Vision Sensor, sensor integrating traditional, high dynamic range, offering substantial benefits
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hybrid Event-Based Vision Sensor (HybridEVS) is a novel sensor integrating traditional frame-based and event-based sensors, offering substantial benefits for applications requiring low-light, high dynamic range, and low-latency environments, such as smartphones and wearable devices. Despite its potential, the lack of Image signal processing (ISP) pipeline specifically designed for HybridEVS poses a significant challenge. To address this challenge, in this study, we propose a coarse-to-fine framework named DemosaicFormer which comprises coarse demosaicing and pixel correction. Coarse demosaicing network is designed to produce a preliminary high-quality estimate of the RGB image from the HybridEVS raw data while the pixel correction network enhances the performance of image restoration and mitigates the impact of defective pixels. Our key innovation is the design of a Multi-Scale Gating Module (MSGM) applying the integration of cross-scale features, which allows feature information to flow between different scales. Additionally, the adoption of progressive training and data augmentation strategies further improves model’s robustness and effectiveness. Experimental results show superior performance against the existing methods both qualitatively and visually, and our DemosaicFormer achieves the best performance in terms of all the evaluation metrics in the MIPI 2024 challenge on Demosaic for Hybridevs Camera. The code is available at this https URL.

[CV-73] Multi-Teacher Multi-Objective Meta-Learning for Zero-Shot Hyperspectral Band Selection

链接: https://arxiv.org/abs/2406.07949
作者: Jie Feng,Xiaojian Zhong,Di Li,Weisheng Dong,Ronghua Shang,Licheng Jiao
关键词: Band selection, hyperspectral band selection, Band selection plays, zero-shot hyperspectral band, multiple band selection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Band selection plays a crucial role in hyperspectral image classification by removing redundant and noisy bands and retaining discriminative ones. However, most existing deep learning-based methods are aimed at dealing with a specific band selection dataset, and need to retrain parameters for new datasets, which significantly limits their this http URL address this issue, a novel multi-teacher multi-objective meta-learning network (M ^3 BS) is proposed for zero-shot hyperspectral band selection. In M ^3 BS, a generalizable graph convolution network (GCN) is constructed to generate dataset-agnostic base, and extract compatible meta-knowledge from multiple band selection tasks. To enhance the ability of meta-knowledge extraction, multiple band selection teachers are introduced to provide diverse high-quality experiences.strategy Finally, subsequent classification tasks are attached and jointly optimized with multi-teacher band selection tasks through multi-objective meta-learning in an end-to-end trainable way. Multi-objective meta-learning guarantees to coordinate diverse optimization objectives automatically and adapt to various datasets simultaneously. Once the optimization is accomplished, the acquired meta-knowledge can be directly transferred to unseen datasets without any retraining or fine-tuning. Experimental results demonstrate the effectiveness and efficiency of our proposed method on par with state-of-the-art baselines for zero-shot hyperspectral band selection.

[CV-74] IFTD: Image Feature Triangle Descriptor for Loop Detection in Driving Scenes

链接: https://arxiv.org/abs/2406.07937
作者: Fengtian Lang,Ruiye Ming,Zikang Yuan,Xin Yang
关键词: Feature Triangle Descriptor, robust Image Feature, Image Feature Triangle, Triangle Descriptor, aimed at improving
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In this work, we propose a fast and robust Image Feature Triangle Descriptor (IFTD) based on the STD method, aimed at improving the efficiency and accuracy of place recognition in driving scenarios. We extract keypoints from BEV projection image of point cloud and construct these keypoints into triangle descriptors. By matching these feature triangles, we achieved precise place recognition and calculated the 4-DOF pose estimation between two keyframes. Furthermore, we employ image similarity inspection to perform the final place recognition. Experimental results on three public datasets demonstrate that our IFTD can achieve greater robustness and accuracy than state-of-the-art methods with low computational overhead.

[CV-75] Emotional Conversation: Empowering Talking Faces with Cohesive Expression Gaze and Pose Generation

链接: https://arxiv.org/abs/2406.07895
作者: Jiadong Liang,Feng Lu
关键词: diverse multimedia domains, holds immense potential, immense potential applications, Vivid talking face, generation holds immense
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.

[CV-76] A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges Solutions and Future Prospects

链接: https://arxiv.org/abs/2406.07880
作者: Jun Bai,Di Wu,Tristan Shelley,Peter Schubel,David Twine,John Russell,Xuesen Zeng,Ji Zhang
关键词: affecting product performance, challenge affecting product, primary challenge affecting, affecting product, product performance
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Material defects (MD) represent a primary challenge affecting product performance and giving rise to safety issues in related products. The rapid and accurate identification and localization of MD constitute crucial research endeavours in addressing contemporary challenges associated with MD. Although conventional non-destructive testing methods such as ultrasonic and X-ray approaches have mitigated issues related to low efficiency in manual inspections, they struggle to meet the diverse requirements of high precision, real-time speed, automation, and intelligence. In recent years, propelled by the swift advancement of machine learning (ML) technologies, particularly exemplified by deep learning, ML has swiftly emerged as the core technology and a prominent research direction for material defect detection (MDD). Through a comprehensive review of the latest literature, we systematically survey the ML techniques applied in MDD into five categories: unsupervised learning, supervised learning, semi-supervised learning, reinforcement learning, and generative learning. We provide a detailed analysis of the main principles and techniques used, together with the advantages and potential challenges associated with these techniques. Furthermore, the survey focuses on the techniques for defect detection in composite materials, which are important types of materials enjoying increasingly wide application in various industries such as aerospace, automotive, construction, and renewable energy. Finally, the survey explores potential future directions in MDD utilizing ML technologies. This comprehensive survey not only consolidates existing literature on ML-based MDD technologies but also serves as a foundational reference for future researchers and industrial practitioners, providing valuable insights and guidance in developing advanced and efficient MDD systems.

[CV-77] KernelWarehouse: Rethinking the Design of Dynamic Convolution

链接: https://arxiv.org/abs/2406.07879
作者: Chao Li,Anbang Yao
关键词: demonstrating superior performance, static kernels weighted, Dynamic convolution learns, Dynamic convolution, demonstrating superior
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is accepted to ICML 2024. The project page: this https URL . arXiv admin note: substantial text overlap with arXiv:2308.08361

点击查看摘要

Abstract:Dynamic convolution learns a linear mixture of n static kernels weighted with their input-dependent attentions, demonstrating superior performance than normal convolution. However, it increases the number of convolutional parameters by n times, and thus is not parameter efficient. This leads to no research progress that can allow researchers to explore the setting n100 (an order of magnitude larger than the typical setting n10) for pushing forward the performance boundary of dynamic convolution while enjoying parameter efficiency. To fill this gap, in this paper, we propose KernelWarehouse, a more general form of dynamic convolution, which redefines the basic concepts of kernels", assembling kernels" and ``attention function" through the lens of exploiting convolutional parameter dependencies within the same layer and across neighboring layers of a ConvNet. We testify the effectiveness of KernelWarehouse on ImageNet and MS-COCO datasets using various ConvNet architectures. Intriguingly, KernelWarehouse is also applicable to Vision Transformers, and it can even reduce the model size of a backbone while improving the model accuracy. For instance, KernelWarehouse (n=4) achieves 5.61%|3.90%|4.38% absolute top-1 accuracy gain on the ResNet18|MobileNetV2|DeiT-Tiny backbone, and KernelWarehouse (n=1/4) with 65.10% model size reduction still achieves 2.29% gain on the ResNet18 backbone. The code and models are available at this https URL.

[CV-78] Small Scale Data-Free Knowledge Distillation

链接: https://arxiv.org/abs/2406.07876
作者: He Liu,Yikai Wang,Huaping Liu,Fuchun Sun,Anbang Yao
关键词: Data-free knowledge distillation, smaller student network, knowledge distillation, large teacher network, Data-free knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is accepted to CVPR 2024. The project page: this https URL

点击查看摘要

Abstract:Data-free knowledge distillation is able to utilize the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data, avoiding privacy, security, and proprietary risks in real applications. In this line of research, existing methods typically follow an inversion-and-distillation paradigm in which a generative adversarial network on-the-fly trained with the guidance of the pre-trained teacher network is used to synthesize a large-scale sample set for knowledge distillation. In this paper, we reexamine this common data-free knowledge distillation paradigm, showing that there is considerable room to improve the overall training efficiency through a lens of ``small-scale inverted data for knowledge distillation". In light of three empirical observations indicating the importance of how to balance class distributions in terms of synthetic sample diversity and difficulty during both data inversion and distillation processes, we propose Small Scale Data-free Knowledge Distillation SSD-KD. In formulation, SSD-KD introduces a modulating function to balance synthetic samples and a priority sampling function to select proper samples, facilitated by a dynamic replay buffer and a reinforcement learning strategy. As a result, SSD-KD can perform distillation training conditioned on an extremely small scale of synthetic samples (e.g., 10X less than the original training data scale), making the overall training efficiency one or two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance, as demonstrated on popular image classification and semantic segmentation benchmarks. The code is available at this https URL.

[CV-79] Robust 3D Face Alignment with Multi-Path Neural Architecture Search

链接: https://arxiv.org/abs/2406.07873
作者: Zhichao Jiang,Hongsong Wang,Xi Teng,Baopu Li
关键词: Neural Architecture Search, Multi-path One-shot Search, computer vision, Architecture Search, challenging and fundamental
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D face alignment is a very challenging and fundamental problem in computer vision. Existing deep learning-based methods manually design different networks to regress either parameters of a 3D face model or 3D positions of face vertices. However, designing such networks relies on expert knowledge, and these methods often struggle to produce consistent results across various face poses. To address this limitation, we employ Neural Architecture Search (NAS) to automatically discover the optimal architecture for 3D face alignment. We propose a novel Multi-path One-shot Neural Architecture Search (MONAS) framework that leverages multi-scale features and contextual information to enhance face alignment across various poses. The MONAS comprises two key algorithms: Multi-path Networks Unbiased Sampling Based Training and Simulated Annealing based Multi-path One-shot Search. Experimental results on three popular benchmarks demonstrate the superior performance of the MONAS for both sparse alignment and dense alignment.

[CV-80] Flexible Music-Conditioned Dance Generation with Style Description Prompts

链接: https://arxiv.org/abs/2406.07871
作者: Hongsong Wang,Yin Zhu,Xin Geng
关键词: Style Description Prompts, dance generation, Dance, Flexible Dance Generation, human culture
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Dance plays an important role as an artistic form and expression in human culture, yet the creation of dance remains a challenging task. Most dance generation methods primarily rely solely on music, seldom taking into consideration intrinsic attributes such as music style or genre. In this work, we introduce Flexible Dance Generation with Style Description Prompts (DGSDP), a diffusion-based framework suitable for diversified tasks of dance generation by fully leveraging the semantics of music style. The core component of this framework is Music-Conditioned Style-Aware Diffusion (MCSAD), which comprises a Transformer-based network and a music Style Modulation module. The MCSAD seemly integrates music conditions and style description prompts into the dance generation framework, ensuring that generated dances are consistent with the music content and style. To facilitate flexible dance generation and accommodate different tasks, a spatial-temporal masking strategy is effectively applied in the backward diffusion process. The proposed framework successfully generates realistic dance sequences that are accurately aligned with music for a variety of tasks such as long-term generation, dance in-betweening, dance inpainting, and etc. We hope that this work has the potential to inspire dance generation and creation, with promising applications in entertainment, art, and education.

[CV-81] Unveiling the Power of Wavelets: A Wavelet-based Kolmogorov-Arnold Network for Hyperspectral Image Classification

链接: https://arxiv.org/abs/2406.07869
作者: Seyd Teymoor Seydi
关键词: challenging task due, complex spatial-spectral correlations, spatial-spectral correlations inherent, Hyperspectral image classification, Wavelet-based Kolmogorov-Arnold Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hyperspectral image classification is a crucial but challenging task due to the high dimensionality and complex spatial-spectral correlations inherent in hyperspectral data. This paper employs Wavelet-based Kolmogorov-Arnold Network (wav-kan) architecture tailored for efficient modeling of these intricate dependencies. Inspired by the Kolmogorov-Arnold representation theorem, Wav-KAN incorporates wavelet functions as learnable activation functions, enabling non-linear mapping of the input spectral signatures. The wavelet-based activation allows Wav-KAN to effectively capture multi-scale spatial and spectral patterns through dilations and translations. Experimental evaluation on three benchmark hyperspectral datasets (Salinas, Pavia, Indian Pines) demonstrates the superior performance of Wav-KAN compared to traditional multilayer perceptrons (MLPs) and the recently proposed Spline-based KAN (Spline-KAN) model. In this work we are: (1) conducting more experiments on additional hyperspectral datasets (Pavia University, WHU-Hi, and Urban Hyperspectral Image) to further validate the generalizability of Wav-KAN; (2) developing a multiresolution Wav-KAN architecture to capture scale-invariant features; (3) analyzing the effect of dimensional reduction techniques on classification performance; (4) exploring optimization methods for tuning the hyperparameters of KAN models; and (5) comparing Wav-KAN with other state-of-the-art models in hyperspectral image classification.

[CV-82] Lets Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

链接: https://arxiv.org/abs/2406.07867
作者: Se Jin Park,Chae Won Kim,Hyeongseop Rha,Minsu Kim,Joanna Hong,Jeong Hun Yeo,Yong Man Ro
关键词: spoken dialogue, spoken dialogue model, dialogue, spoken, audio-visual spoken dialogue
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Accepted to ACL 2024

点击查看摘要

Abstract:In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. The MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. Demo and data are available at this https URL and this https URL, respectively.

[CV-83] FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image

链接: https://arxiv.org/abs/2406.07865
作者: Rupayan Mallick,Amr Abdalla,Sarah Adel Bargal
关键词: diffusion-based inpainting object, inpainting object completion, object completion approach, diffusion-based inpainting, completion approach
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present FaithFill, a diffusion-based inpainting object completion approach for realistic generation of missing object parts. Typically, multiple reference images are needed to achieve such realistic generation, otherwise the generation would not faithfully preserve shape, texture, color, and background. In this work, we propose a pipeline that utilizes only a single input reference image -having varying lighting, background, object pose, and/or viewpoint. The singular reference image is used to generate multiple views of the object to be inpainted. We demonstrate that FaithFill produces faithful generation of the object’s missing parts, together with background/scene preservation, from a single reference image. This is demonstrated through standard similarity metrics, human judgement, and GPT evaluation. Our results are presented on the DreamBooth dataset, and a novel proposed dataset.

[CV-84] Self-Distillation Learning Based on Temporal-Spatial Consistency for Spiking Neural Networks

链接: https://arxiv.org/abs/2406.07862
作者: Lin Zuo,Yongqi Ding,Mengmeng Jing,Kunshan Yang,Yunqian Yu
关键词: Spiking neural networks, high biological interpretability, attracted considerable attention, Spiking neural, low-power characteristics
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Spiking neural networks (SNNs) have attracted considerable attention for their event-driven, low-power characteristics and high biological interpretability. Inspired by knowledge distillation (KD), recent research has improved the performance of the SNN model with a pre-trained teacher model. However, additional teacher models require significant computational resources, and it is tedious to manually define the appropriate teacher network architecture. In this paper, we explore cost-effective self-distillation learning of SNNs to circumvent these concerns. Without an explicit defined teacher, the SNN generates pseudo-labels and learns consistency during training. On the one hand, we extend the timestep of the SNN during training to create an implicit temporal teacher" that guides the learning of the original student", i.e., the temporal self-distillation. On the other hand, we guide the output of the weak classifier at the intermediate stage by the final output of the SNN, i.e., the spatial self-distillation. Our temporal-spatial self-distillation (TSSD) learning method does not introduce any inference overhead and has excellent generalization ability. Extensive experiments on the static image datasets CIFAR10/100 and ImageNet as well as the neuromorphic datasets CIFAR10-DVS and DVS-Gesture validate the superior performance of the TSSD method. This paper presents a novel manner of fusing SNNs with KD, providing insights into high-performance SNN learning methods.

[CV-85] DiffPop: Plausibility-Guided Object Placement Diffusion for Image Composition

链接: https://arxiv.org/abs/2406.07852
作者: Jiacheng Liu,Hang Zhou,Shida Wei,Rui Ma
关键词: realistic image composition, plausible object placement, object placement, address the problem, image composition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we address the problem of plausible object placement for the challenging task of realistic image composition. We propose DiffPop, the first framework that utilizes plausibility-guided denoising diffusion probabilistic model to learn the scale and spatial relations among multiple objects and the corresponding scene image. First, we train an unguided diffusion model to directly learn the object placement parameters in a self-supervised manner. Then, we develop a human-in-the-loop pipeline which exploits human labeling on the diffusion-generated composite images to provide the weak supervision for training a structural plausibility classifier. The classifier is further used to guide the diffusion sampling process towards generating the plausible object placement. Experimental results verify the superiority of our method for producing plausible and diverse composite images on the new Cityscapes-OP dataset and the public OPA dataset, as well as demonstrate its potential in applications such as data augmentation and multi-object placement tasks. Our dataset and code will be released.

[CV-86] A Labeled Array Distance Metric for Measuring Image Segmentation Quality

链接: https://arxiv.org/abs/2406.07851
作者: Maryam Berijanian,Katrina Gensterblum,Doruk Alp Mutlu,Katelyn Reagan,Andrew Hart,Dirk Colbry
关键词: segmentation, work introduces, comparing labeled arrays, segmentation algorithms, labeled
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to: Electronic Letters on Computer Vision and Image Analysis

点击查看摘要

Abstract:This work introduces two new distance metrics for comparing labeled arrays, which are common outputs of image segmentation algorithms. Each pixel in an image is assigned a label, with binary segmentation providing only two labels (‘foreground’ and ‘background’). These can be represented by a simple binary matrix and compared using pixel differences. However, many segmentation algorithms output multiple regions in a labeled array. We propose two distance metrics, named LAD and MADLAD, that calculate the distance between two labeled images. By doing so, the accuracy of different image segmentation algorithms can be evaluated by measuring their outputs against a ‘ground truth’ labeling. Both proposed metrics, operating with a complexity of O(N) for images with N pixels, are designed to quickly identify similar labeled arrays, even when different labeling methods are used. Comparisons are made between images labeled manually and those labeled by segmentation algorithms. This evaluation is crucial when searching through a space of segmentation algorithms and their hyperparameters via a genetic algorithm to identify the optimal solution for automated segmentation, which is the goal in our lab, SEE-Insight. By measuring the distance from the ground truth, these metrics help determine which algorithm provides the most accurate segmentation.

[CV-87] Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models

链接: https://arxiv.org/abs/2406.07844
作者: Arman Zarei,Keivan Rezaei,Samyadeep Basu,Mehrdad Saberi,Mazda Moayeri,Priyatham Kattakinda,Soheil Feizi
关键词: image generation benchmarks, challenging image generation, diffusion-based generative models, low FID scores, generate highly detailed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent text-to-image diffusion-based generative models have the stunning ability to generate highly detailed and photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate this compositionality-based failure mode and highlight that imperfect text conditioning with CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes which shows that the output space of the CLIP text-encoder is sub-optimal, and (ii) we observe that the final token embeddings in CLIP are erroneous as they often include attention contributions from unrelated tokens in compositional prompts. Our main finding shows that the best compositional improvements can be achieved (without harming the model’s FID scores) by fine-tuning \it only a simple linear projection on CLIP’s representation space in Stable-Diffusion variants using a small set of compositional image-text pairs. This result demonstrates that the sub-optimality of the CLIP’s output space is a major error source. We also show that re-weighting the erroneous attention contributions in CLIP can also lead to improved compositional performances, however these improvements are often less significant than those achieved by solely learning a linear projection head, highlighting erroneous attentions to be only a minor error source.

[CV-88] Incremental Learning and Self-Attention Mechanisms Improve Neural System Identification

链接: https://arxiv.org/abs/2406.07843
作者: Isaac Lin,Tianye Wang,Shang Gao,Shiming Tang,Tai Sing Lee
关键词: Convolutional neural networks, visual cortical neurons, Convolutional neural, primary visual cortex, cortical neurons
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注: Preprint NeurIPS 2024

点击查看摘要

Abstract:Convolutional neural networks (CNNs) have been shown to be the state-of-the-art approach for modeling the transfer functions of visual cortical neurons. Cortical neurons in the primary visual cortex are are sensitive to contextual information mediated by extensive horizontal and feedback connections. Standard CNNs can integrate global spatial image information to model such contextual modulation via two mechanisms: successive rounds of convolutions and a fully connected readout layer. In this paper, we find that non-local networks or self-attention (SA) mechanisms, theoretically related to context-dependent flexible gating mechanisms observed in the primary visual cortex, improve neural response predictions over parameter-matched CNNs in two key metrics: tuning curve correlation and tuning peak. We factorize networks to determine the relative contribution of each context mechanism. This reveals that information in the local receptive field is most important for modeling the overall tuning curve, but surround information is critically necessary for characterizing the tuning peak. We find that self-attention can replace subsequent spatial-integration convolutions when learned in an incremental manner, and is further enhanced in the presence of a fully connected readout layer, suggesting that the two context mechanisms are complementary. Finally, we find that learning a receptive-field-centric model with self-attention, before incrementally learning a fully connected readout, yields a more biologically realistic model in terms of center-surround contributions.

[CV-89] Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model

链接: https://arxiv.org/abs/2406.07841
作者: Elaheh Baharlouei,Mahsa Shafaei,Yigeng Zhang,Hugo Jair Escalante,Thamar Solorio
关键词: detecting questionable content, comic mischief, comic mischief detection, specifically the subcategory, address the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We address the challenge of detecting questionable content in online media, specifically the subcategory of comic mischief. This type of content combines elements such as violence, adult content, or sarcasm with humor, making it difficult to detect. Employing a multimodal approach is vital to capture the subtle details inherent in comic mischief content. To tackle this problem, we propose a novel end-to-end multimodal system for the task of comic mischief detection. As part of this contribution, we release a novel dataset for the targeted task consisting of three modalities: video, text (video captions and subtitles), and audio. We also design a HIerarchical Cross-attention model with CAPtions (HICCAP) to capture the intricate relationships among these modalities. The results show that the proposed approach makes a significant improvement over robust baselines and state-of-the-art models for comic mischief detection and its type classification. This emphasizes the potential of our system to empower users, to make informed decisions about the online content they choose to see. In addition, we conduct experiments on the UCF101, HMDB51, and XD-Violence datasets, comparing our model against other state-of-the-art approaches showcasing the outstanding performance of our proposed model in various scenarios.

[CV-90] SynthForge: Synthesizing High-Quality Face Dataset with Controllable 3D Generative Models

链接: https://arxiv.org/abs/2406.07840
作者: Abhay Rawat,Shubham Dokania,Astitva Srivastava,Shuaib Ahmed,Haiwen Feng,Rahul Tallamraju
关键词: Recent advancements, render photo-realistic data, unlocked the capabilities, capabilities to render, render photo-realistic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 4 figures, 3 tables. Under Review

点击查看摘要

Abstract:Recent advancements in generative models have unlocked the capabilities to render photo-realistic data in a controllable fashion. Trained on the real data, these generative models are capable of producing realistic samples with minimal to no domain gap, as compared to the traditional graphics rendering. However, using the data generated using such models for training downstream tasks remains under-explored, mainly due to the lack of 3D consistent annotations. Moreover, controllable generative models are learned from massive data and their latent space is often too vast to obtain meaningful sample distributions for downstream task with limited generation. To overcome these challenges, we extract 3D consistent annotations from an existing controllable generative model, making the data useful for downstream tasks. Our experiments show competitive performance against state-of-the-art models using only generated synthetic data, demonstrating potential for solving downstream tasks. Project page: this https URL

[CV-91] Sense Less Generate More: Pre-training LiDAR Perception with Masked Autoencoders for Ultra-Efficient 3D Sensing

链接: https://arxiv.org/abs/2406.07833
作者: Sina Tayebati,Theja Tulabandhula,Amit R. Trivedi
关键词: disruptively frugal LiDAR, frugal LiDAR perception, LiDAR perception dataflow, propose a disruptively, disruptively frugal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we propose a disruptively frugal LiDAR perception dataflow that generates rather than senses parts of the environment that are either predictable based on the extensive training of the environment or have limited consequence to the overall prediction accuracy. Therefore, the proposed methodology trades off sensing energy with training data for low-power robotics and autonomous navigation to operate frugally with sensors, extending their lifetime on a single battery charge. Our proposed generative pre-training strategy for this purpose, called as radially masked autoencoding (R-MAE), can also be readily implemented in a typical LiDAR system by selectively activating and controlling the laser power for randomly generated angular regions during on-field operations. Our extensive evaluations show that pre-training with R-MAE enables focusing on the radial segments of the data, thereby capturing spatial relationships and distances between objects more effectively than conventional procedures. Therefore, the proposed methodology not only reduces sensing energy but also improves prediction accuracy. For example, our extensive evaluations on Waymo, nuScenes, and KITTI datasets show that the approach achieves over a 5% average precision improvement in detection tasks across datasets and over a 4% accuracy improvement in transferring domains from Waymo and nuScenes to KITTI. In 3D object detection, it enhances small object detection by up to 4.37% in AP at moderate difficulty levels in the KITTI dataset. Even with 90% radial masking, it surpasses baseline models by up to 5.59% in mAP/mAPH across all object classes in the Waymo dataset. Additionally, our method achieves up to 3.17% and 2.31% improvements in mAP and NDS, respectively, on the nuScenes dataset, demonstrating its effectiveness with both single and fused LiDAR-camera modalities. this https URL.

[CV-92] Spatial Annealing Smoothing for Efficient Few-shot Neural Rendering

链接: https://arxiv.org/abs/2406.07828
作者: Yuru Xiao,Xianming Liu,Deming Zhai,Kui Jiang,Junjun Jiang,Xiangyang Ji
关键词: Neural Radiance Fields, Radiance Fields, shown impressive capabilities, delivering high efficiency, Neural Radiance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) with hybrid representations have shown impressive capabilities in reconstructing scenes for view synthesis, delivering high efficiency. Nonetheless, their performance significantly drops with sparse view inputs, due to the issue of overfitting. While various regularization strategies have been devised to address these challenges, they often depend on inefficient assumptions or are not compatible with hybrid models. There is a clear need for a method that maintains efficiency and improves resilience to sparse views within a hybrid framework. In this paper, we introduce an accurate and efficient few-shot neural rendering method named Spatial Annealing smoothing regularized NeRF (SANeRF), which is specifically designed for a pre-filtering-driven hybrid representation architecture. We implement an exponential reduction of the sample space size from an initially large value. This methodology is crucial for stabilizing the early stages of the training phase and significantly contributes to the enhancement of the subsequent process of detail refinement. Our extensive experiments reveal that, by adding merely one line of code, SANeRF delivers superior rendering quality and much faster reconstruction speed compared to current few-shot NeRF methods. Notably, SANeRF outperforms FreeNeRF by 0.3 dB in PSNR on the Blender dataset, while achieving 700x faster reconstruction speed.

[CV-93] Me Whats Next: Textual Foresight for Generic UI Representations

链接: https://arxiv.org/abs/2406.07822
作者: Andrea Burns,Kate Saenko,Bryan A. Plummer
关键词: automating user commands, user interfaces, app user interfaces, Textual Foresight, user commands
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted to ACL 2024 Findings. Data and code to be released at this https URL

点击查看摘要

Abstract:Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To combat this, we propose Textual Foresight, a novel pretraining objective for learning UI screen representations. Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken. Our approach requires joint reasoning over elements and entire screens, resulting in improved UI features: on generation tasks, UI agents trained with Textual Foresight outperform state-of-the-art by 2% with 28x fewer images. We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning. OpenApp enables new baselines, and we find Textual Foresight improves average task performance over them by 5.7% while having access to 2x less data.

[CV-94] Are Objective Explanatory Evaluation metrics Trustworthy? An Adversarial Analysis

链接: https://arxiv.org/abs/2406.07820
作者: Prithwijit Chowdhury,Mohit Prabhushankar,Ghassan AlRegib,Mohamed Deriche
关键词: neural network models, network models, deep learning, learning by empowering, trust in neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) has revolutionized the field of deep learning by empowering users to have more trust in neural network models. The field of XAI allows users to probe the inner workings of these algorithms to elucidate their decision-making processes. The rise in popularity of XAI has led to the advent of different strategies to produce explanations, all of which only occasionally agree. Thus several objective evaluation metrics have been devised to decide which of these modules give the best explanation for specific scenarios. The goal of the paper is twofold: (i) we employ the notions of necessity and sufficiency from causal literature to come up with a novel explanatory technique called SHifted Adversaries using Pixel Elimination(SHAPE) which satisfies all the theoretical and mathematical criteria of being a valid explanation, (ii) we show that SHAPE is, infact, an adversarial explanation that fools causal metrics that are employed to measure the robustness and reliability of popular importance based visual XAI methods. Our analysis shows that SHAPE outperforms popular explanatory techniques like GradCAM and GradCAM++ in these tests and is comparable to RISE, raising questions about the sanity of these metrics and the need for human involvement for an overall better evaluation.

[CV-95] Hierarchical Patch Diffusion Models for High-Resolution Video Generation

链接: https://arxiv.org/abs/2406.07792
作者: Ivan Skorokhodov,Willi Menapace,Aliaksandr Siarohin,Sergey Tulyakov
关键词: demonstrated remarkable performance, demonstrated remarkable, remarkable performance, diffusion pipeline, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR 2024

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion – an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 256^2 , surpassing recent methods by more than 100%. Then, we show that it can be rapidly fine-tuned from a base 36\times 64 low-resolution generator for high-resolution 64 \times 288 \times 512 text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture which is trained on such high resolutions entirely end-to-end. Project webpage: this https URL.

[CV-96] From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers

链接: https://arxiv.org/abs/2406.07785
作者: Swaminathan Gurumurthy,Karnik Ram,Bingqing Chen,Zachary Manchester,Zico Kolter
关键词: pose estimation, correspondence estimation problem, squares optimization problem, weighted least squares, estimation problem
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at CVPR 2024

点击查看摘要

Abstract:Various pose estimation and tracking problems in robotics can be decomposed into a correspondence estimation problem (often computed using a deep network) followed by a weighted least squares optimization problem to solve for the poses. Recent work has shown that coupling the two problems by iteratively refining one conditioned on the other’s output yields SOTA results across domains. However, training these models has proved challenging, requiring a litany of tricks to stabilize and speed up training. In this work, we take the visual odometry problem as an example and identify three plausible causes: (1) flow loss interference, (2) linearization errors in the bundle adjustment (BA) layer, and (3) dependence of weight gradients on the BA residual. We show how these issues result in noisy and higher variance gradients, potentially leading to a slow down in training and instabilities. We then propose a simple, yet effective solution to reduce the gradient variance by using the weights predicted by the network in the inner optimization loop to weight the correspondence objective in the training problem. This helps the training objective `focus’ on the more important points, thereby reducing the variance and mitigating the influence of outliers. We show that the resulting method leads to faster training and can be more flexibly trained in varying training setups without sacrificing performance. In particular we show 2 – 2.5\times training speedups over a baseline visual odometry model we modify.

[CV-97] HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

链接: https://arxiv.org/abs/2406.07754
作者: Zihui Xue,Mi Luo,Changan Chen,Kristen Grauman
关键词: reference object image, user-provided reference object, precisely swapping objects, study the problem, problem of precisely
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project website: this https URL

点击查看摘要

Abstract:We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, these models often fall short in handling the intricacies of hand-object interactions (HOI), failing to produce realistic edits – especially when object swapping results in object shape or functionality changes. To bridge this gap, we present HOI-Swap, a novel diffusion-based video editing framework trained in a self-supervised manner. Designed in two stages, the first stage focuses on object swapping in a single frame with HOI awareness; the model learns to adjust the interaction patterns, such as the hand grasp, based on changes in the object’s properties. The second stage extends the single-frame edit across the entire sequence; we achieve controllable motion alignment with the original video by: (1) warping a new sequence from the stage-I edited frame based on sampled motion points and (2) conditioning video generation on the warped sequence. Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs.

[CV-98] C3DAG: Controlled 3D Animal Generation using 3D pose guidance

链接: https://arxiv.org/abs/2406.07742
作者: Sandeep Mishra,Oindrila Saha,Alan C. Bovik
关键词: Recent advancements, demonstrated the ability, high quality, generate high quality, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in text-to-3D generation have demonstrated the ability to generate high quality 3D assets. However while generating animals these methods underperform, often portraying inaccurate anatomy and geometry. Towards ameliorating this defect, we present C3DAG, a novel pose-Controlled text-to-3D Animal Generation framework which generates a high quality 3D animal consistent with a given pose. We also introduce an automatic 3D shape creator tool, that allows dynamic pose generation and modification via a web-based tool, and that generates a 3D balloon animal using simple geometries. A NeRF is then initialized using this 3D shape using depth-controlled SDS. In the next stage, the pre-trained NeRF is fine-tuned using quadruped-pose-controlled SDS. The pipeline that we have developed not only produces geometrically and anatomically consistent results, but also renders highly controlled 3D animals, unlike prior methods which do not allow fine-grained pose control.

[CV-99] Back to the Color: Learning Depth to Specific Color Transformation for Unsupervised Depth Estimation

链接: https://arxiv.org/abs/2406.07741
作者: Yufan Zhu,Chongzhi Ran,Mingtao Feng,Weisheng Dong,Antonio M. López,Guangming Shi
关键词: Virtual engines, generate dense depth, dense depth maps, depth estimation, depth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Virtual engines have the capability to generate dense depth maps for various synthetic scenes, making them invaluable for training depth estimation models. However, synthetic colors often exhibit significant discrepancies compared to real-world colors, thereby posing challenges for depth estimation in real-world scenes, particularly in complex and uncertain environments encountered in unsupervised monocular depth estimation tasks. To address this issue, we propose Back2Color, a framework that predicts realistic colors from depth utilizing a model trained on real-world data, thus facilitating the transformation of synthetic colors into real-world counterparts. Additionally, by employing the Syn-Real CutMix method for joint training with both real-world unsupervised and synthetic supervised depth samples, we achieve improved performance in monocular depth estimation for real-world scenes. Moreover, to comprehensively address the impact of non-rigid motions on depth estimation, we propose an auto-learning uncertainty temporal-spatial fusion method (Auto-UTSF), which integrates the benefits of unsupervised learning in both temporal and spatial dimensions. Furthermore, we design a depth estimation network (VADepth) based on the Vision Attention Network. Our Back2Color framework demonstrates state-of-the-art performance, as evidenced by improvements in performance metrics and the production of fine-grained details in our predictions, particularly on challenging datasets such as Cityscapes for unsupervised depth estimation.

[CV-100] On the Application of Egocentric Computer Vision to Industrial Scenarios

链接: https://arxiv.org/abs/2406.07738
作者: Vivek Chavan,Oliver Heimann,Jörg Krüger
关键词: Egocentric vision aims, first-person perspective, aims to capture, capture and analyse, analyse the world
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To be presented at the First Joint Egocentric Vision (EgoVis) Workshop, held in conjunction with CVPR 2024

点击查看摘要

Abstract:Egocentric vision aims to capture and analyse the world from the first-person perspective. We explore the possibilities for egocentric wearable devices to improve and enhance industrial use cases w.r.t. data collection, annotation, labelling and downstream applications. This would contribute to easier data collection and allow users to provide additional context. We envision that this approach could serve as a supplement to the traditional industrial Machine Vision workflow. Code, Dataset and related resources will be available at: this https URL

[CV-101] Unleashing the Power of Transfer Learning Model for Sophisticated Insect Detection: Revolutionizing Insect Classification

链接: https://arxiv.org/abs/2406.07716
作者: Md. Mahmudul Hasan,SM Shaqib,Ms. Sharmin Akter,Rabiul Alam,Afraz Ul Haque,Shahrun akter khushbu
关键词: Plant Health, identify insect infestations, farming areas, optimal plant health, infestations in farming
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The purpose of the Insect Detection System for Crop and Plant Health is to keep an eye out for and identify insect infestations in farming areas. By utilizing cutting-edge technology like computer vision and machine learning, the system seeks to identify hazardous insects early and accurately. This would enable prompt response to save crops and maintain optimal plant health. The Method of this study includes Data Acquisition, Preprocessing, Data splitting, Model Implementation and Model evaluation. Different models like MobileNetV2, ResNet152V2, Xecption, Custom CNN was used in this study. In order to categorize insect photos, a Convolutional Neural Network (CNN) based on the ResNet152V2 architecture is constructed and evaluated in this work. Achieving 99% training accuracy and 97% testing accuracy, ResNet152V2 demonstrates superior performance among four implemented models. The results highlight its potential for real-world applications in insect classification and entomology studies, emphasizing efficiency and accuracy. To ensure food security and sustain agricultural output globally, finding insects is crucial. Cutting-edge technology, such as ResNet152V2 models, greatly influence automating and improving the accuracy of insect identification. Efficient insect detection not only minimizes crop losses but also enhances agricultural productivity, contributing to sustainable food production. This underscores the pivotal role of technology in addressing challenges related to global food security.

[CV-102] Vehicle Speed Detection System Utilizing YOLOv8: Enhancing Road Safety and Traffic Management for Metropolitan Areas

链接: https://arxiv.org/abs/2406.07710
作者: SM Shaqib,Alaya Parvin Alo,Shahriar Sultan Ramit,Afraz Ul Haque Rupak,Sadman Sadik Khan,Mr. Md. Sadekur Rahman
关键词: Bangladesh Passenger Welfare, vehicle speed detection, order to ensure, reduction in fatalities, Passenger Welfare Association
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In order to ensure traffic safety through a reduction in fatalities and accidents, vehicle speed detection is essential. Relentless driving practices are discouraged by the enforcement of speed restrictions, which are made possible by accurate monitoring of vehicle speeds. Road accidents remain one of the leading causes of death in Bangladesh. The Bangladesh Passenger Welfare Association stated in 2023 that 7,902 individuals lost their lives in traffic accidents during the course of the year. Efficient vehicle speed detection is essential to maintaining traffic safety. Reliable speed detection can also help gather important traffic data, which makes it easier to optimize traffic flow and provide safer road infrastructure. The YOLOv8 model can recognize and track cars in videos with greater speed and accuracy when trained under close supervision. By providing insights into the application of supervised learning in object identification for vehicle speed estimation and concentrating on the particular traffic conditions and safety concerns in Bangladesh, this work represents a noteworthy contribution to the area. The MAE was 3.5 and RMSE was 4.22 between the predicted speed of our model and the actual speed or the ground truth measured by the speedometer Promising increased efficiency and wider applicability in a variety of traffic conditions, the suggested solution offers a financially viable substitute for conventional approaches.

[CV-103] A Deep Learning Approach to Detect Complete Safety Equipment For Construction Workers Based On YOLOv7

链接: https://arxiv.org/abs/2406.07707
作者: Md. Shariful Islam,SM Shaqib,Shahriar Sultan Ramit,Shahrun Akter Khushbu,Mr. Abdus Sattar,Dr. Sheak Rashed Haider Noor
关键词: ensuring worker safety, safety equipment, safety, utmost significance, ensuring worker
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the construction sector, ensuring worker safety is of the utmost significance. In this study, a deep learning-based technique is presented for identifying safety gear worn by construction workers, such as helmets, goggles, jackets, gloves, and footwears. The recommended approach uses the YOLO v7 (You Only Look Once) object detection algorithm to precisely locate these safety items. The dataset utilized in this work consists of labeled images split into training, testing and validation sets. Each image has bounding box labels that indicate where the safety equipment is located within the image. The model is trained to identify and categorize the safety equipment based on the labeled dataset through an iterative training approach. We used custom dataset to train this model. Our trained model performed admirably well, with good precision, recall, and F1-score for safety equipment recognition. Also, the model’s evaluation produced encouraging results, with a mAP@0.5 score of 87.7%. The model performs effectively, making it possible to quickly identify safety equipment violations on building sites. A thorough evaluation of the outcomes reveals the model’s advantages and points up potential areas for development. By offering an automatic and trustworthy method for safety equipment detection, this research makes a contribution to the fields of computer vision and workplace safety. The proposed deep learning-based approach will increase safety compliance and reduce the risk of accidents in the construction industry

[CV-104] Object-level Scene Deocclusion

链接: https://arxiv.org/abs/2406.07706
作者: Zhengzhe Liu,Qing Liu,Chirui Chang,Jianming Zhang,Daniil Pakhomov,Haitian Zheng,Zhe Lin,Daniel Cohen-Or,Chi-Wing Fu
关键词: Deoccluding the hidden, formidable task, full-view feature map, hidden portions, addressing real-world scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: SIGGRAPH 2024. A foundation model for category-agnostic object deocclusion

点击查看摘要

Abstract:Deoccluding the hidden portions of objects in a scene is a formidable task, particularly when addressing real-world scenes. In this paper, we present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, a foundation model for object-level scene deocclusion. Leveraging the rich prior of pre-trained models, we first design the parallel variational autoencoder, which produces a full-view feature map that simultaneously encodes multiple complete objects, and the visible-to-complete latent generator, which learns to implicitly predict the full-view feature map from partial-view feature map and text prompts extracted from the incomplete objects in the input image. To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning, avoiding tedious annotations of the amodal masks and occluded regions. At inference, we devise a layer-wise deocclusion strategy to improve efficiency while maintaining the deocclusion quality. Extensive experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the arts by a large margin. Our method can also be extended to cross-domain scenes and novel categories that are not covered by the training set. Further, we demonstrate the deocclusion applicability of PACO in single-view 3D scene reconstruction and object recomposition.

[CV-105] Graphical Perception of Saliency-based Model Explanations

链接: https://arxiv.org/abs/2406.07702
作者: Yayan Zhao,Mingwei Li,Matthew Berger
关键词: deep learning-based models, recent years, explaining predictive, deep learning-based, devoted to explaining
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, considerable work has been devoted to explaining predictive, deep learning-based models, and in turn how to evaluate explanations. An important class of evaluation methods are ones that are human-centered, which typically require the communication of explanations through visualizations. And while visualization plays a critical role in perceiving and understanding model explanations, how visualization design impacts human perception of explanations remains poorly understood. In this work, we study the graphical perception of model explanations, specifically, saliency-based explanations for visual recognition models. We propose an experimental design to investigate how human perception is influenced by visualization design, wherein we study the task of alignment assessment, or whether a saliency map aligns with an object in an image. Our findings show that factors related to visualization design decisions, the type of alignment, and qualities of the saliency map all play important roles in how humans perceive saliency-based visual explanations.

[CV-106] CUPID: Contextual Understanding of Prompt-conditioned Image Distributions

链接: https://arxiv.org/abs/2406.07699
作者: Yayan Zhao,Mingwei Li,Matthew Berger
关键词: CUPID, understanding of prompt-conditioned, present CUPID, prompt-conditioned image distributions, prompt-conditioned image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present CUPID: a visualization method for the contextual understanding of prompt-conditioned image distributions. CUPID targets the visual analysis of distributions produced by modern text-to-image generative models, wherein a user can specify a scene via natural language, and the model generates a set of images, each intended to satisfy the user’s description. CUPID is designed to help understand the resulting distribution, using contextual cues to facilitate analysis: objects mentioned in the prompt, novel, synthesized objects not explicitly mentioned, and their potential relationships. Central to CUPID is a novel method for visualizing high-dimensional distributions, wherein contextualized embeddings of objects, those found within images, are mapped to a low-dimensional space via density-based embeddings. We show how such embeddings allows one to discover salient styles of objects within a distribution, as well as identify anomalous, or rare, object styles. Moreover, we introduce conditional density embeddings, whereby conditioning on a given object allows one to compare object dependencies within the distribution. We employ CUPID for analyzing image distributions produced by large-scale diffusion models, where our experimental results offer insights on language misunderstanding from such models and biases in object composition, while also providing an interface for discovery of typical, or rare, synthesized scenes.

[CV-107] A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection

链接: https://arxiv.org/abs/2406.07694
作者: Can Akbas,Irem Su Arin,Sinan Onal
关键词: effective defect detection, Recent advancements, defect detection, Cylindrical Defect Detection, Defect Detection Dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: One figure and one table

点击查看摘要

Abstract:Recent advancements in quality control across various industries have increasingly utilized the integration of video cameras and image processing for effective defect detection. A critical barrier to progress is the scarcity of comprehensive datasets featuring annotated defects, which are essential for developing and refining automated defect detection models. This systematic review, spanning from 2015 to 2023, identifies 15 publicly available datasets and critically examines them to assess their effectiveness and applicability for benchmarking and model development. Our findings reveal a diverse landscape of datasets, such as NEU-CLS, NEU-DET, DAGM, KolektorSDD, PCB Defect Dataset, and the Hollow Cylindrical Defect Detection Dataset, each with unique strengths and limitations in terms of image quality, defect type representation, and real-world applicability. The goal of this systematic review is to consolidate these datasets in a single location, providing researchers who seek such publicly available resources with a comprehensive reference.

[CV-108] AI Radiologist: Revolutionizing Liver Tissue Segmentation with Convolutional Neural Networks and a Clinician-Friendly GUI

链接: https://arxiv.org/abs/2406.07688
作者: Ayman Al-Kababji,Faycal Bensaali,Sarada Prasad Dakua,Yassine Himeur
关键词: Artificial Intelligence, pervasive research topic, permeating various sectors, liver tissues, liver
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 38 pages, 19 figures, 7 tables submitted to journal

点击查看摘要

Abstract:Artificial Intelligence (AI) is a pervasive research topic, permeating various sectors and applications. In this study, we harness the power of AI, specifically convolutional neural networks (ConvNets), for segmenting liver tissues. It also focuses on developing a user-friendly graphical user interface (GUI) tool, “AI Radiologist”, enabling clinicians to effectively delineate different liver tissues (parenchyma, tumors, and vessels), thereby saving lives. This endeavor bridges the gap between academic research and practical, industrial applications. The GUI is a single-page application and is designed using the PyQt5 Python framework. The offline-available AI Radiologist resorts to three ConvNet models trained to segment all liver tissues. With respect to the Dice metric, the best liver ConvNet scores 98.16%, the best tumor ConvNet scores 65.95%, and the best vessel ConvNet scores 51.94%. It outputs 2D slices of the liver, tumors, and vessels, along with 3D interpolations in .obj and .mtl formats, which can be visualized/printed using any 3D-compatible software. Thus, the AI Radiologist offers a convenient tool for clinicians to perform liver tissue segmentation and 3D interpolation employing state-of-the-art models for tissues segmentation. With the provided capacity to select the volumes and pre-trained models, the clinicians can leave the rest to the AI Radiologist.

[CV-109] AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

链接: https://arxiv.org/abs/2406.07686
作者: Kai Wang,Shijian Deng,Jing Shi,Dimitrios Hatzinakos,Yapeng Tian
关键词: Recent Diffusion Transformers, shown impressive capabilities, Recent Diffusion, generating high-quality single-modality, high-quality single-modality content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it is still under-explored whether the transformer-based diffuser can efficiently denoise the Gaussian noises towards superb multimodal content creation. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with both visual and audio tracks. To minimize model complexity and computational costs, AV-DiT utilizes a shared DiT backbone pre-trained on image-only data, with only lightweight, newly inserted adapters being trainable. This shared backbone facilitates both audio and video generation. Specifically, the video branch incorporates a trainable temporal attention layer into a frozen pre-trained DiT block for temporal consistency. Additionally, a small number of trainable parameters adapt the image-based DiT block for audio generation. An extra shared DiT block, equipped with lightweight parameters, facilitates feature interaction between audio and visual modalities, ensuring alignment. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters. Furthermore, our results highlight that a single shared image generative backbone with modality-specific adaptations is sufficient for constructing a joint audio-video generator. Our source code and pre-trained models will be released.

[CV-110] Watching Swarm Dynamics from Above: A Framework for Advanced Object Tracking in Drone Videos

链接: https://arxiv.org/abs/2406.07680
作者: Duc Pham,Matthew Hansen,Félicie Dhellemmens,Jens Krause,Pia Bideau
关键词: Easily accessible sensors, greatly expanded studying, Easily accessible, expanded studying animal, diverse onboard sensors
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPRW: Workshop paper appearing in CV4Animals

点击查看摘要

Abstract:Easily accessible sensors, like drones with diverse onboard sensors, have greatly expanded studying animal behavior in natural environments. Yet, analyzing vast, unlabeled video data, often spanning hours, remains a challenge for machine learning, especially in computer vision. Existing approaches often analyze only a few frames. Our focus is on long-term animal behavior analysis. To address this challenge, we utilize classical probabilistic methods for state estimation, such as particle filtering. By incorporating recent advancements in semantic object segmentation, we enable continuous tracking of rapidly evolving object formations, even in scenarios with limited data availability. Particle filters offer a provably optimal algorithmic structure for recursively adding new incoming information. We propose a novel approach for tracking schools of fish in the open ocean from drone videos. Our framework not only performs classical object tracking in 2D, instead it tracks the position and spatial expansion of the fish school in world coordinates by fusing video data and the drone’s on board sensor information (GPS and IMU). The presented framework for the first time allows researchers to study collective behavior of fish schools in its natural social and environmental context in a non-invasive and scalable way.

[CV-111] Automated Pavement Cracks Detection and Classification Using Deep Learning

链接: https://arxiv.org/abs/2406.07674
作者: Selvia Nafaa,Hafsa Essam,Karim Ashour,Doaa Emad,Rana Mohamed,Mohammed Elhenawy,Huthaifa I. Ashqar,Abdallah A. Hassan,Taqwa I. Alhadidi
关键词: building efficient transportation, transportation asset management, Monitoring asset conditions, efficient transportation asset, Monitoring asset
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Monitoring asset conditions is a crucial factor in building efficient transportation asset management. Because of substantial advances in image processing, traditional manual classification has been largely replaced by semi-automatic/automatic techniques. As a result, automated asset detection and classification techniques are required. This paper proposes a methodology to detect and classify roadway pavement cracks using the well-known You Only Look Once (YOLO) version five (YOLOv5) and version 8 (YOLOv8) algorithms. Experimental results indicated that the precision of pavement crack detection reaches up to 67.3% under different illumination conditions and image sizes. The findings of this study can assist highway agencies in accurately detecting and classifying asset conditions under different illumination conditions. This will reduce the cost and time that are associated with manual inspection, which can greatly reduce the cost of highway asset maintenance.

[CV-112] PLT-D3: A High-fidelity Dynamic Driving Simulation Dataset for Stereo Depth and Scene Flow

链接: https://arxiv.org/abs/2406.07667
作者: Joshua Tokarsky,Ibrahim Abdulhafiz,Satya Ayyalasomayajula,Mostafa Mohsen,Navya G. Rao,Adam Forbes
关键词: experienced remarkable progress, deep learning methodologies, sophisticated deep learning, Autonomous driving, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autonomous driving has experienced remarkable progress, bolstered by innovations in computational hardware and sophisticated deep learning methodologies. The foundation of these advancements rests on the availability and quality of datasets, which are crucial for the development and refinement of dependable and versatile autonomous driving algorithms. While numerous datasets have been developed to support the evolution of autonomous driving perception technologies, few offer the diversity required to thoroughly test and enhance system robustness under varied weather conditions. Many public datasets lack the comprehensive coverage of challenging weather scenarios and detailed, high-resolution data, which are critical for training and validating advanced autonomous-driving perception models. In this paper, we introduce PLT-D3; a Dynamic-weather Driving Dataset, designed specifically to enhance autonomous driving systems’ adaptability to diverse weather conditions. PLT-D3 provides high-fidelity stereo depth and scene flow ground truth data generated using Unreal Engine 5. In particular, this dataset includes synchronized high-resolution stereo image sequences that replicate a wide array of dynamic weather scenarios including rain, snow, fog, and diverse lighting conditions, offering an unprecedented level of realism in simulation-based testing. The primary aim of PLT-D3 is to address the scarcity of comprehensive training and testing resources that can simulate real-world weather variations. Benchmarks have been established for several critical autonomous driving tasks using PLT-D3, such as depth estimation, optical flow and scene-flow to measure and enhance the performance of state-of-the-art models.

[CV-113] A Unified Framework for Integer Programming Formulation of Graph Matching Problems

链接: https://arxiv.org/abs/2406.07666
作者: Bahram Alidaee,Haibo Wang,Hugh Sloan
关键词: problems, Graph, formulation, Graph theory, powerful tool
类目: Data Structures and Algorithms (cs.DS); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
*备注: 34 pages

点击查看摘要

Abstract:Graph theory has been a powerful tool in solving difficult and complex problems arising in all disciplines. In particular, graph matching is a classical problem in pattern analysis with enormous applications. Many graph problems have been formulated as a mathematical program and then solved using exact, heuristic, and/or approximated-guaranteed procedures. On the other hand, graph theory has been a powerful tool in visualizing and understanding complex mathematical programming problems, especially integer programs. Formulating a graph problem as a natural integer program (IP) is often a challenging task. However, an IP formulation of the problem has many advantages. Several researchers have noted the need for natural IP formulation of graph theoretic problems. The present study aims to provide a unified framework for IP formulation of graph-matching problems. Although there are many surveys on graph matching problems, none is concerned with IP formulation. This paper is the first to provide a comprehensive IP formulation for such problems. The framework includes a variety of graph optimization problems in the literature. While these problems have been studied by different research communities, however, the framework presented here helps to bring efforts from different disciplines to tackle such diverse and complex problems. We hope the present study can significantly help to simplify some of the difficult problems arising in practice, especially in pattern analysis.

[CV-114] ROADWork Dataset: Learning to Recognize Observe Analyze and Drive Through Work Zones

链接: https://arxiv.org/abs/2406.07661
作者: Anurag Ghosh,Robert Tamburo,Shen Zheng,Juan R. Alvarez-Padilla,Hailiang Zhu,Michael Cardei,Nicholas Dunn,Christoph Mertz,Srinivasa G. Narasimhan
关键词: Perceiving and navigating, work zones, challenging and under-explored, self-driving research, major strides
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Perceiving and navigating through work zones is challenging and under-explored, even with major strides in self-driving research. An important reason is the lack of open datasets for developing new algorithms to address this long-tailed scenario. We propose the ROADWork dataset to learn how to recognize, observe and analyze and drive through work zones. We find that state-of-the-art foundation models perform poorly on work zones. With our dataset, we improve upon detecting work zone objects (+26.2 AP), while discovering work zones with higher precision (+32.5%) at a much higher discovery rate (12.8 times), significantly improve detecting (+23.9 AP) and reading (+14.2% 1-NED) work zone signs and describing work zones (+36.7 SPICE). We also compute drivable paths from work zone navigation videos and show that it is possible to predict navigational goals and pathways such that 53.6% goals have angular error (AE) 0.5 degrees (+9.9 %) and 75.3% pathways have AE 0.5 degrees (+8.1 %).

[CV-115] M-LRM: Multi-view Large Reconstruction Model

链接: https://arxiv.org/abs/2406.07648
作者: Mengfei Li,Xiaoxiao Long,Yixun Liang,Weiyu Li,Yuan Liu,Peng Li,Xiaowei Chi,Xingqun Qi,Wei Xue,Wenhan Luo,Qifeng Liu,Yike Guo
关键词: Large Reconstruction Model, demonstrating impressive results, Multi-view Large Reconstruction, Large Reconstruction, slower convergence speed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite recent advancements in the Large Reconstruction Model (LRM) demonstrating impressive results, when extending its input from single image to multiple images, it exhibits inefficiencies, subpar geometric and texture quality, as well as slower convergence speed than expected. It is attributed to that, LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to efficiently reconstruct high-quality 3D shapes from multi-views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images. Moreover, we employ the 3D priors of the input multi-view images to initialize the tri-plane tokens. Compared to LRM, the proposed M-LRM can produce a tri-plane NeRF with 128 \times 128 resolution and generate 3D shapes of high fidelity. Experimental studies demonstrate that our model achieves a significant performance gain and faster training convergence than LRM. Project page: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2406.07648 [cs.CV] (or arXiv:2406.07648v1 [cs.CV] for this version)

[CV-116] SSNVC: Single Stream Neural Video Compression with Implicit Temporal Information

链接: https://arxiv.org/abs/2406.07645
作者: Feng Wang,Haihang Ruan,Zhihuang Xie,Ronggang Wang,Xiangyu Yue
关键词: Neural Video Compression, lossy video codec, traditional lossy video, Neural Video, video codec
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Accepted by DCC 2024 as Poster. This is the full paper

点击查看摘要

Abstract:Recently, Neural Video Compression (NVC) techniques have achieved remarkable performance, even surpassing the best traditional lossy video codec. However, most existing NVC methods heavily rely on transmitting Motion Vector (MV) to generate accurate contextual features, which has the following drawbacks. (1) Compressing and transmitting MV requires specialized MV encoder and decoder, which makes modules redundant. (2) Due to the existence of MV Encoder-Decoder, the training strategy is complex. In this paper, we present a noval Single Stream NVC framework (SSNVC), which removes complex MV Encoder-Decoder structure and uses a one-stage training strategy. SSNVC implicitly use temporal information by adding previous entropy model feature to current entropy model and using previous two frame to generate predicted motion information at the decoder side. Besides, we enhance the frame generator to generate higher quality reconstructed frame. Experiments demonstrate that SSNVC can achieve state-of-the-art performance on multiple benchmarks, and can greatly simplify compression process as well as training process.

[CV-117] BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models

链接: https://arxiv.org/abs/2406.07584
作者: Wanaiu Huang
关键词: enables non-invasive clinical, non-invasive clinical augmentative, activity enables non-invasive, brain activity enables, semantic information decoding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Semantic information is vital for human interaction, and decoding it from brain activity enables non-invasive clinical augmentative and alternative communication. While there has been significant progress in reconstructing visual images, few studies have focused on the language aspect. To address this gap, leveraging the powerful capabilities of the decoder-based vision-language pretrained model CoCa, this paper proposes BrainChat, a simple yet effective generative framework aimed at rapidly accomplishing semantic information decoding tasks from brain activity, including fMRI question answering and fMRI captioning. BrainChat employs the self-supervised approach of Masked Brain Modeling to encode sparse fMRI data, obtaining a more compact embedding representation in the latent space. Subsequently, BrainChat bridges the gap between modalities by applying contrastive loss, resulting in aligned representations of fMRI, image, and text embeddings. Furthermore, the fMRI embeddings are mapped to the generative Brain Decoder via cross-attention layers, where they guide the generation of textual content about fMRI in a regressive manner by minimizing caption loss. Empirically, BrainChat exceeds the performance of existing state-of-the-art methods in the fMRI captioning task and, for the first time, implements fMRI question answering. Additionally, BrainChat is highly flexible and can achieve high performance without image data, making it better suited for real-world scenarios with limited data.

[CV-118] A novel method for identifying rice seed purity based on hybrid machine learning algorithms

链接: https://arxiv.org/abs/2406.07581
作者: Phan Thi-Thu-Hong,Vo Quoc-Trinh,Nguyen Huu-Du
关键词: rice seed purity, grain industry, seed purity, crucial task, factor in evaluating
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:In the grain industry, the identification of seed purity is a crucial task as it is an important factor in evaluating the quality of seeds. For rice seeds, this property allows for the reduction of unexpected influences of other varieties on rice yield, nutrient composition, and price. However, in practice, they are often mixed with seeds from others. This study proposes a novel method for automatically identifying the rice seed purity of a certain rice variety based on hybrid machine learning algorithms. The main idea is to use deep learning architectures for extracting important features from the raw data and then use machine learning algorithms for classification. Several experiments are conducted following a practical implementation to evaluate the performance of the proposed model. The obtained results show that the novel method improves significantly the performance of existing methods. Thus, it can be applied to design effective identification systems for rice seed purity.

[CV-119] Detection of Moving Objects in Earth Observation Satellite Images

链接: https://arxiv.org/abs/2406.07566
作者: Eric Keto,Wesley Andres Watters
关键词: push broom scanning, Earth observation satellites, Earth observation, made by Earth, multi-spectral images made
类目: Computer Vision and Pattern Recognition (cs.CV); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*备注:

点击查看摘要

Abstract:Moving objects have characteristic signatures in multi-spectral images made by Earth observation satellites that use push broom scanning. While the general concept is applicable to all satellites of this type, each satellite design has its own unique imaging system and requires unique methods to analyze the characteristic signatures. We assess the feasibility of detecting moving objects and measuring their velocities in one particular archive of satellite images made by Planet Labs Corporation with their constellation of SuperDove satellites. Planet Labs data presents a particular challenge in that the images in the archive are mosaics of individual exposures and therefore do not have unique time stamps. We explain how the timing information can be restored indirectly. Our results indicate that the movement of common transportation vehicles, airplanes, cars, and boats, can be detected and measured.

[CV-120] A Large Medical Model based on Visual Physiological Monitoring for Public Health

链接: https://arxiv.org/abs/2406.07558
作者: Bin Huang,Changchen Zhao,Zimeng Liu,Shenda Hong,Baochang Zhang,Wenjin Wang,Hui Liu
关键词: public health, pandemic has sounded, health, widespread outbreak, sounded a warning
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:The widespread outbreak of the COVID-19 pandemic has sounded a warning about the globalization challenges in public health. In this context, the establishment of large-scale public health datasets, of medical models, and of decision-making systems with a human-centric approach holds strategic significance. Recently, groundbreaking advancements have emerged in AI methods for physiological signal monitoring and disease diagnosis based on camera sensors. These approaches, requiring no specialized medical equipment, offer convenient manners of collecting large-scale medical data in response to public health events. Not only do these methods facilitate the acquisition of unbiased datasets, but also enable the development of fair large medical models (LMMs). Therefore, we outline a prospective framework and heuristic vision for a public health large medical model (PHLMM) utilizing visual-based physiological monitoring (VBPM) technology. The PHLMM can be considered as a “convenient and universal” framework for public health, advancing the United Nations’ “Sustainable Development Goals 2030”, particularly in its promotion of Universal Health Coverage (UHC) in low- and middle-income countries. Furthermore, this paper provides an outlook on the crucial application prospects of PHLMM in response to public health challenges and its significant role in the field of AI for medicine (AI4medicine). In summary, PHLMM serves as a solution for constructing a large-scale medical database and LMM, eliminating the issue of dataset bias and unfairness in AI models. The outcomes will contribute to the establishment of an LMM framework for public health, acting as a crucial bridge for advancing AI4medicine.

[CV-121] On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models

链接: https://arxiv.org/abs/2406.08486
作者: Hashmat Shadab Malik,Numan Saeed,Asif Hanif,Muzammal Naseer,Mohammad Yaqub,Salman Khan,Fahad Shahbaz Khan
关键词: achieved significant success, tumor-based segmentation tasks, recent years, achieved significant, significant success
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Volumetric medical segmentation models have achieved significant success on organ and tumor-based segmentation tasks in recent years. However, their vulnerability to adversarial attacks remains largely unexplored, raising serious concerns regarding the real-world deployment of tools employing such models in the healthcare sector. This underscores the importance of investigating the robustness of existing models. In this context, our work aims to empirically examine the adversarial robustness across current volumetric segmentation architectures, encompassing Convolutional, Transformer, and Mamba-based models. We extend this investigation across four volumetric segmentation datasets, evaluating robustness under both white box and black box adversarial attacks. Overall, we observe that while both pixel and frequency-based attacks perform reasonably well under white box setting, the latter performs significantly better under transfer-based black box attacks. Across our experiments, we observe transformer-based models show higher robustness than convolution-based models with Mamba-based models being the most vulnerable. Additionally, we show that large-scale training of volumetric segmentation models improves the model’s robustness against adversarial attacks. The code and pretrained models will be made available at this https URL.

[CV-122] From Chaos to Clarity: 3DGS in the Dark

链接: https://arxiv.org/abs/2406.08300
作者: Zhihao Li,Yufei Wang,Alex Kot,Bihan Wen
关键词: dynamic range RGB, superior high dynamic, range RGB images, high dynamic range, low dynamic range
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Novel view synthesis from raw images provides superior high dynamic range (HDR) information compared to reconstructions from low dynamic range RGB images. However, the inherent noise in unprocessed raw images compromises the accuracy of 3D scene representation. Our study reveals that 3D Gaussian Splatting (3DGS) is particularly susceptible to this noise, leading to numerous elongated Gaussian shapes that overfit the noise, thereby significantly degrading reconstruction quality and reducing inference speed, especially in scenarios with limited views. To address these issues, we introduce a novel self-supervised learning framework designed to reconstruct HDR 3DGS from a limited number of noisy raw images. This framework enhances 3DGS by integrating a noise extractor and employing a noise-robust reconstruction loss that leverages a noise distribution prior. Experimental results show that our method outperforms LDR/HDR 3DGS and previous state-of-the-art (SOTA) self-supervised and supervised pre-trained models in both reconstruction quality and inference speed on the RawNeRF dataset across a broad range of training views. Code can be found in \urlthis https URL.

[CV-123] Interpretable Representation Learning of Cardiac MRI via Attribute Regularization

链接: https://arxiv.org/abs/2406.08282
作者: Maxime Di Folco,Cosmin I. Bercea,Julia A. Schnabel
关键词: artificial intelligence models, trust artificial intelligence, intelligence models, Variational AutoEncoder, essential in medical
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2312.08915

点击查看摘要

Abstract:Interpretability is essential in medical imaging to ensure that clinicians can comprehend and trust artificial intelligence models. Several approaches have been recently considered to encode attributes in the latent space to enhance its interpretability. Notably, attribute regularization aims to encode a set of attributes along the dimensions of a latent representation. However, this approach is based on Variational AutoEncoder and suffers from blurry reconstruction. In this paper, we propose an Attributed-regularized Soft Introspective Variational Autoencoder that combines attribute regularization of the latent space within the framework of an adversarially trained variational autoencoder. We demonstrate on short-axis cardiac Magnetic Resonance images of the UK Biobank the ability of the proposed method to address blurry reconstruction issues of variational autoencoder methods while preserving the latent space interpretability.

[CV-124] One-Step Effective Diffusion Network for Real-World Image Super-Resolution

链接: https://arxiv.org/abs/2406.08177
作者: Rongyuan Wu,Lingchen Sun,Zhiyuan Ma,Lei Zhang
关键词: real-world image super-resolution, generative image priors, powerful generative image, image, increasingly employed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The pre-trained text-to-image diffusion models have been increasingly employed to tackle the real-world image super-resolution (Real-ISR) problem due to their powerful generative image priors. Most of the existing methods start from random noise to reconstruct the high-quality (HQ) image under the guidance of the given low-quality (LQ) image. While promising results have been achieved, such Real- ISR methods require multiple diffusion steps to reproduce the HQ image, increasing the computational cost. Meanwhile, the random noise introduces uncertainty in the output, which is unfriendly to image restoration tasks. To address these issues, we propose a one-step effective diffusion network, namely OSEDiff, for the Real- ISR problem. We argue that the LQ image contains rich information to restore its HQ counterpart, and hence the given LQ image can be directly taken as the starting point for diffusion, eliminating the uncertainty introduced by random noise sampling. We finetune the pre-trained diffusion network with trainable layers to adapt it to complex image degradations. To ensure that the one-step diffusion model could yield HQ Real-ISR output, we apply variational score distillation in the latent space to conduct KL-divergence regularization. As a result, our OSEDiff model can efficiently and effectively generate HQ images in just one diffusion step. Our experiments demonstrate that OSEDiff achieves comparable or even better Real-ISR results, in terms of both objective metrics and subjective evaluations, than previous diffusion model based Real-ISR methods that require dozens or hundreds of steps. The source codes will be released at this https URL.

[CV-125] he impact of deep learning aid on the workload and interpretation accuracy of radiologists on chest computed tomography: a cross-over reader study

链接: https://arxiv.org/abs/2406.08137
作者: Anvar Kurmukov,Valeria Chernina,Regina Gareeva,Maria Dugova,Ekaterina Petrash,Olga Aleshina,Maxim Pisov,Boris Shirokikh,Valentin Samokhin,Vladislav Proskurov,Stanislav Shimovolos,Maria Basova,Mikhail Goncahrov,Eugenia Soboleva,Maria Donskova,Farukh Yaushev,Alexey Shevtsov,Alexey Zakharov,Talgat Saparov,Victor Gombolevskiy,Mikhail Belyaev
关键词: chest computed tomography, DLA, experimental arm, arm, experimental
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 6 figures, 8 tables

点击查看摘要

Abstract:Interpretation of chest computed tomography (CT) is time-consuming. Previous studies have measured the time-saving effect of using a deep-learning-based aid (DLA) for CT interpretation. We evaluated the joint impact of a multi-pathology DLA on the time and accuracy of radiologists’ reading. 40 radiologists were randomly split into three experimental arms: control (10), who interpret studies without assistance; informed group (10), who were briefed about DLA pathologies, but performed readings without it; and the experimental group (20), who interpreted half studies with DLA, and half without. Every arm used the same 200 CT studies retrospectively collected from BIMCV-COVID19 dataset; each radiologist provided readings for 20 CT studies. We compared interpretation time, and accuracy of participants diagnostic report with respect to 12 pathological findings. Mean reading time per study was 15.6 minutes [SD 8.5] in the control arm, 13.2 minutes [SD 8.7] in the informed arm, 14.4 [SD 10.3] in the experimental arm without DLA, and 11.4 minutes [SD 7.8] in the experimental arm with DLA. Mean sensitivity and specificity were 41.5 [SD 30.4], 86.8 [SD 28.3] in the control arm; 53.5 [SD 22.7], 92.3 [SD 9.4] in the informed non-assisted arm; 63.2 [SD 16.4], 92.3 [SD 8.2] in the experimental arm without DLA; and 91.6 [SD 7.2], 89.9 [SD 6.0] in the experimental arm with DLA. DLA speed up interpretation time per study by 2.9 minutes (CI95 [1.7, 4.3], p0.0005), increased sensitivity by 28.4 (CI95 [23.4, 33.4], p0.0005), and decreased specificity by 2.4 (CI95 [0.6, 4.3], p=0.13). Of 20 radiologists in the experimental arm, 16 have improved reading time and sensitivity, two improved their time with a marginal drop in sensitivity, and two participants improved sensitivity with increased time. Overall, DLA introduction decreased reading time by 20.6%. Comments: 17 pages, 6 figures, 8 tables Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2406.08137 [eess.IV] (or arXiv:2406.08137v1 [eess.IV] for this version) Submission history From: Anvar Kurmukov [view email] [v1] Wed, 12 Jun 2024 12:26:26 UTC (8,321 KB)

[CV-126] 3D CBCT Challenge 2024: Improved Cone Beam CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement

链接: https://arxiv.org/abs/2406.08048
作者: Sasidhar Alavala,Subrahmanyam Gorthi
关键词: ICASSP SP Grand, Beam Computed Tomography, Cone Beam Computed, part of ICASSP, Swin Image Restoration
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we present our approach to the 3D CBCT Challenge 2024, a part of ICASSP SP Grand Challenges 2024. Improvement in Cone Beam Computed Tomography (CBCT) reconstruction has been achieved by integrating Swin Image Restoration (SwinIR) based sinogram and image enhancement modules. The proposed methodology uses Nesterov Accelerated Gradient Descent (NAG) to solve the least squares (NAG-LS) problem in CT image reconstruction. The integration of sinogram and image enhancement modules aims to enhance image clarity and preserve fine details, offering a promising solution for both low dose and clinical dose CBCT reconstruction. The averaged mean squared error (MSE) over the validation dataset has decreased significantly, in the case of low dose by one-fifth and clinical dose by one-tenth. Our solution is one of the top 5 approaches in this challenge.

[CV-127] Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation

链接: https://arxiv.org/abs/2406.07952
作者: Zhenhuan Zhou,Along He,Yanlin Wu,Rui Yao,Xueshuo Xie,Tao Li
关键词: manifest significant differences, medical image segmentation, medical image, types of lesions, lesions often manifest
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages

点击查看摘要

Abstract:In medical images, various types of lesions often manifest significant differences in their shape and texture. Accurate medical image segmentation demands deep learning models with robust capabilities in multi-scale and boundary feature learning. However, previous networks still have limitations in addressing the above issues. Firstly, previous networks simultaneously fuse multi-level features or employ deep supervision to enhance multi-scale learning. However, this may lead to feature redundancy and excessive computational overhead, which is not conducive to network training and clinical deployment. Secondly, the majority of medical image segmentation networks exclusively learn features in the spatial domain, disregarding the abundant global information in the frequency domain. This results in a bias towards low-frequency components, neglecting crucial high-frequency information. To address these problems, we introduce SF-UNet, a spatial-frequency dual-domain attention network. It comprises two main components: the Multi-scale Progressive Channel Attention (MPCA) block, which progressively extract multi-scale features across adjacent encoder layers, and the lightweight Frequency-Spatial Attention (FSA) block, with only 0.05M parameters, enabling concurrent learning of texture and boundary features from both spatial and frequency domains. We validate the effectiveness of the proposed SF-UNet on three public datasets. Experimental results show that compared to previous state-of-the-art (SOTA) medical image segmentation networks, SF-UNet achieves the best performance, and achieves up to 9.4% and 10.78% improvement in DSC and IOU. Codes will be released at this https URL.

[CV-128] Evaluating the Impact of Sequence Combinations on Breast Tumor Segmentation in Multiparametric MRI

链接: https://arxiv.org/abs/2406.07813
作者: Hang Min,Gorane Santamaria Hormaechea,Prabhakar Ramachandran,Jason Dowling
关键词: Multiparametric magnetic resonance, Multiparametric magnetic, magnetic resonance imaging, magnetic resonance, key tool
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multiparametric magnetic resonance imaging (mpMRI) is a key tool for assessing breast cancer progression. Although deep learning has been applied to automate tumor segmentation in breast MRI, the effect of sequence combinations in mpMRI remains under-investigated. This study explores the impact of different combinations of T2-weighted (T2w), dynamic contrast-enhanced MRI (DCE-MRI) and diffusion-weighted imaging (DWI) with apparent diffusion coefficient (ADC) map on breast tumor segmentation using nnU-Net. Evaluated on a multicenter mpMRI dataset, the nnU-Net model using DCE sequences achieved a Dice similarity coefficient (DSC) of 0.69 \pm 0.18 for functional tumor volume (FTV) segmentation. For whole tumor mask (WTM) segmentation, adding the predicted FTV to DWI and ADC map improved the DSC from 0.57 \pm 0.24 to 0.60 \pm 0.21. Adding T2w did not yield significant improvement, which still requires further investigation under a more standardized imaging protocol. This study serves as a foundation for future work on predicting breast cancer treatment response using mpMRI.

[CV-129] Gene-Level Representation Learning via Interventional Style Transfer in Optical Pooled Screening

链接: https://arxiv.org/abs/2406.07763
作者: Mahtab Bigverdi,Burkhard Hockendorf,Heming Yao,Phil Hanslovsky,Romain Lopez,David Richmond
关键词: Optical pooled screening, combines automated microscopy, Optical pooled, pooled screening, combines automated
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 5 figures, CVPR workshop paper

点击查看摘要

Abstract:Optical pooled screening (OPS) combines automated microscopy and genetic perturbations to systematically study gene function in a scalable and cost-effective way. Leveraging the resulting data requires extracting biologically informative representations of cellular perturbation phenotypes from images. We employ a style-transfer approach to learn gene-level feature representations from images of genetically perturbed cells obtained via OPS. Our method outperforms widely used engineered features in clustering gene representations according to gene function, demonstrating its utility for uncovering latent biological relationships. This approach offers a promising alternative to investigate the role of genes in health and disease.

[CV-130] Progress Towards Decoding Visual Imagery via fNIRS

链接: https://arxiv.org/abs/2406.07662
作者: Michel Adamic(1),Wellington Avelino(1),Anna Brandenberger(2),Bryan Chiang(3),Hunter Davis,Stephen Fay(1),Andrew Gregory,Aayush Gupta,Raphael Hotter,Grace Jiang,Fiona Leng,Stephen Polcyn,Thomas Ribeiro(1),Paul Scotti(4),Michelle Wang(1),Marley Xiong,Jonathan Xu(5) ((1) McGill University, (2) Massachusetts Institute of Technology, (3) Stanford University, (4) Princeton University, (5) University of Waterloo)
关键词: fNIRS brain activity, required specs, demonstrate the possibility, possibility of reconstructing, brain activity
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:We demonstrate the possibility of reconstructing images from fNIRS brain activity and start building a prototype to match the required specs. By training an image reconstruction model on downsampled fMRI data, we discovered that cm-scale spatial resolution is sufficient for image generation. We obtained 71% retrieval accuracy with 1-cm resolution, compared to 93% on the full-resolution fMRI, and 20% with 2-cm resolution. With simulations and high-density tomography, we found that time-domain fNIRS can achieve 1-cm resolution, compared to 2-cm resolution for continuous-wave fNIRS. Lastly, we share designs for a prototype time-domain fNIRS device, consisting of a laser driver, a single photon detector, and a time-to-digital converter system.

机器学习

[LG-0] ICE-G: Image Conditional Editing of 3D Gaussian Splats

链接: https://arxiv.org/abs/2406.08488
作者: Vishnu Jaganathan,Hannah Hanyun Huang,Muhammad Zubair Irshad,Varun Jampani,Amit Raj,Zsolt Kira
关键词: create high quality, emerged to create, create high, Recently, Recently many techniques
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to CVPR AI4CC Workshop 2024. Project page: this https URL

点击查看摘要

Abstract:Recently many techniques have emerged to create high quality 3D assets and scenes. When it comes to editing of these objects, however, existing approaches are either slow, compromise on quality, or do not provide enough customization. We introduce a novel approach to quickly edit a 3D model from a single reference view. Our technique first segments the edit image, and then matches semantically corresponding regions across chosen segmented dataset views using DINO features. A color or texture change from a particular region of the edit image can then be applied to other views automatically in a semantically sensible manner. These edited views act as an updated dataset to further train and re-style the 3D scene. The end-result is therefore an edited 3D model. Our framework enables a wide variety of editing tasks such as manual local edits, correspondence based style transfer from any example image, and a combination of different styles from multiple example images. We use Gaussian Splats as our primary 3D representation due to their speed and ease of local editing, but our technique works for other methods such as NeRFs as well. We show through multiple examples that our method produces higher quality results while offering fine-grained control of editing. Project page: this http URL

[LG-1] Real2Code: Reconstruct Articulated Objects via Code Generation

链接: https://arxiv.org/abs/2406.08474
作者: Zhao Mandi,Yijia Weng,Dominik Bauer,Shuran Song
关键词: code generation, reconstructing articulated objects, real world objects, reconstructing articulated, objects
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Real2Code, a novel approach to reconstructing articulated objects via code generation. Given visual observations of an object, we first reconstruct its part geometry using an image segmentation model and a shape completion model. We then represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) to predict joint articulation as code. By leveraging pre-trained vision and language models, our approach scales elegantly with the number of articulated parts, and generalizes from synthetic training data to real world objects in unstructured environments. Experimental results demonstrate that Real2Code significantly outperforms previous state-of-the-art in reconstruction accuracy, and is the first approach to extrapolate beyond objects’ structural complexity in the training set, and reconstructs objects with up to 10 articulated parts. When incorporated with a stereo reconstruction model, Real2Code also generalizes to real world objects from a handful of multi-view RGB images, without the need for depth or camera information.

[LG-2] Strategies for Pretraining Neural Operators

链接: https://arxiv.org/abs/2406.08473
作者: Anthony Zhou,Cooper Lorsung,AmirPouya Hemmasian,Amir Barati Farimani
关键词: partial differential equation, recently shown promise, neural operators, Pretraining, scaling neural operators
类目: Machine Learning (cs.LG)
*备注: 25 pages, 5 figures

点击查看摘要

Abstract:Pretraining for partial differential equation (PDE) modeling has recently shown promise in scaling neural operators across datasets to improve generalizability and performance. Despite these advances, our understanding of how pretraining affects neural operators is still limited; studies generally propose tailored architectures and datasets that make it challenging to compare or examine different pretraining frameworks. To address this, we compare various pretraining methods without optimizing architecture choices to characterize pretraining dynamics on different models and datasets as well as to understand its scaling and generalization behavior. We find that pretraining is highly dependent on model and dataset choices, but in general transfer learning or physics-based pretraining strategies work best. In addition, pretraining performance can be further improved by using data augmentations. Lastly, pretraining is additionally beneficial when fine-tuning in scarce data regimes or when generalizing to downstream data similar to the pretraining distribution. Through providing insights into pretraining neural operators for physics prediction, we hope to motivate future work in developing and evaluating pretraining methods for PDEs.

[LG-3] RILe: Reinforced Imitation Learning

链接: https://arxiv.org/abs/2406.08472
作者: Mert Albaba,Sammy Christen,Christoph Gebhardt,Thomas Langarek,Michael J. Black,Otmar Hilliges
关键词: Inverse Reinforcement Learning, achieved significant success, generating complex behavior, Reinforcement Learning, Reinforcement Learning offer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement Learning has achieved significant success in generating complex behavior but often requires extensive reward function engineering. Adversarial variants of Imitation Learning and Inverse Reinforcement Learning offer an alternative by learning policies from expert demonstrations via a discriminator. Employing discriminators increases their data- and computational efficiency over the standard approaches; however, results in sensitivity to imperfections in expert data. We propose RILe, a teacher-student system that achieves both robustness to imperfect data and efficiency. In RILe, the student learns an action policy while the teacher dynamically adjusts a reward function based on the student’s performance and its alignment with expert demonstrations. By tailoring the reward function to both performance of the student and expert similarity, our system reduces dependence on the discriminator and, hence, increases robustness against data imperfections. Experiments show that RILe outperforms existing methods by 2x in settings with limited or noisy expert data.

[LG-4] PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences

链接: https://arxiv.org/abs/2406.08469
作者: Daiwei Chen,Yi Chen,Aniket Rege,Ramya Korlakai Vinayak
关键词: raw web-scale data, Large foundation models, pretrained on raw, raw web-scale, readily deployable
类目: Machine Learning (cs.LG)
*备注: 22 pages, 14 figures, 5 tables

点击查看摘要

Abstract:Large foundation models pretrained on raw web-scale data are not readily deployable without additional step of extensive alignment to human preferences. Such alignment is typically done by collecting large amounts of pairwise comparisons from humans (“Do you prefer output A or B?”) and learning a reward model or a policy with the Bradley-Terry-Luce (BTL) model as a proxy for a human’s underlying implicit preferences. These methods generally suffer from assuming a universal preference shared by all humans, which lacks the flexibility of adapting to plurality of opinions and preferences. In this work, we propose PAL, a framework to model human preference complementary to existing pretraining strategies, which incorporates plurality from the ground up. We propose using the ideal point model as a lens to view alignment using preference comparisons. Together with our novel reformulation and using mixture modeling, our framework captures the plurality of population preferences while simultaneously learning a common preference latent space across different preferences, which can few-shot generalize to new, unseen users. Our approach enables us to use the penultimate-layer representation of large foundation models and simple MLP layers to learn reward functions that are on-par with the existing large state-of-the-art reward models, thereby enhancing efficiency of reward modeling significantly. We show that PAL achieves competitive reward model accuracy compared to strong baselines on 1) Language models with Summary dataset ; 2) Image Generative models with Pick-a-Pic dataset ; 3) A new semisynthetic heterogeneous dataset generated using Anthropic Personas. Finally, our experiments also highlight the shortcoming of current preference datasets that are created using rigid rubrics which wash away heterogeneity, and call for more nuanced data collection approaches.

[LG-5] DafnyBench: A Benchmark for Formal Software Verification

链接: https://arxiv.org/abs/2406.08467
作者: Chloe Loughridge,Qinyi Sun,Seth Ahrenbach,Federico Cassano,Chuyue Sun,Ying Sheng,Anish Mudide,Md Rakib Hossain Misu,Nada Amin,Max Tegmark
关键词: evaluating machine learning, machine learning systems, formal software verification, Dafny formal verification, largest benchmark
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: Code dataset available at: this https URL

点击查看摘要

Abstract:We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.

[LG-6] Scaling Laws in Linear Regression: Compute Parameters and Data

链接: https://arxiv.org/abs/2406.08466
作者: Licong Lin,Jingfeng Wu,Sham M. Kakade,Peter L. Bartlett,Jason D. Lee
关键词: large-scale deep learning, neural scaling laws, deep learning models, model improves polynomially, model size
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with M parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using N data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree a1 , we show that the reducible part of the test error is \Theta(M^-(a-1) + N^-(a-1)/a) . The variance error, which increases with M , is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML) Cite as: arXiv:2406.08466 [cs.LG] (or arXiv:2406.08466v1 [cs.LG] for this version)

[LG-7] Nonconvex Federated Learning on Compact Smooth Submanifolds With Heterogeneous Data

链接: https://arxiv.org/abs/2406.08465
作者: Jiaojiao Zhang,Jiang Hu,Anthony Man-Cho So,Mikael Johansson
关键词: low-rank matrix completion, machine learning tasks, principal component analysis, manifold optimization problems, matrix completion
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Many machine learning tasks, such as principal component analysis and low-rank matrix completion, give rise to manifold optimization problems. Although there is a large body of work studying the design and analysis of algorithms for manifold optimization in the centralized setting, there are currently very few works addressing the federated setting. In this paper, we consider nonconvex federated learning over a compact smooth submanifold in the setting of heterogeneous client data. We propose an algorithm that leverages stochastic Riemannian gradients and a manifold projection operator to improve computational efficiency, uses local updates to improve communication efficiency, and avoids client drift. Theoretically, we show that our proposed algorithm converges sub-linearly to a neighborhood of a first-order optimal solution by using a novel analysis that jointly exploits the manifold structure and properties of the loss functions. Numerical experiments demonstrate that our algorithm has significantly smaller computational and communication overhead than existing methods.

[LG-8] he Impact of Initialization on LoRA Finetuning Dynamics

链接: https://arxiv.org/abs/2406.08447
作者: Soufiane Hayou,Nikhil Ghosh,Bin Yu
关键词: Low Rank Adaptation, Rank Adaptation, Low Rank, study the role, originally introduced
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: TDLR: Different Initializations lead to completely different finetuning dynamics. One initialization (set A random and B zero) is generally better than the natural opposite initialization. arXiv admin note: text overlap with arXiv:2402.12354

点击查看摘要

Abstract:In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in-principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance compared to the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs.

[LG-9] ransformation-Dependent Adversarial Attacks

链接: https://arxiv.org/abs/2406.08443
作者: Yaoteng Tan,Zikui Cai,M. Salman Asif
关键词: single additive perturbation, trigger diverse, single additive, mis-predictions by systematically, systematically transforming
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce transformation-dependent adversarial attacks, a new class of threats where a single additive perturbation can trigger diverse, controllable mis-predictions by systematically transforming the input (e.g., scaling, blurring, compression). Unlike traditional attacks with static effects, our perturbations embed metamorphic properties to enable different adversarial attacks as a function of the transformation parameters. We demonstrate the transformation-dependent vulnerability across models (e.g., convolutional networks and vision transformers) and vision tasks (e.g., image classification and object detection). Our proposed geometric and photometric transformations enable a range of targeted errors from one crafted input (e.g., higher than 90% attack success rate for classifiers). We analyze effects of model architecture and type/variety of transformations on attack effectiveness. This work forces a paradigm shift by redefining adversarial inputs as dynamic, controllable threats. We highlight the need for robust defenses against such multifaceted, chameleon-like perturbations that current techniques are ill-prepared for.

[LG-10] Adaptive Swarm Mesh Refinement using Deep Reinforcement Learning with Local Rewards

链接: https://arxiv.org/abs/2406.08440
作者: Niklas Freymuth,Philipp Dahlinger,Tobias Würth,Simon Reisch,Luise Kärger,Gerhard Neumann
关键词: Simulating physical systems, Simulating physical, essential in engineering, analytical solutions, solutions are limited
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Submitted to Journal of Machine Learning Research (JMLR)

点击查看摘要

Abstract:Simulating physical systems is essential in engineering, but analytical solutions are limited to straightforward problems. Consequently, numerical methods like the Finite Element Method (FEM) are widely used. However, the FEM becomes computationally expensive as problem complexity and accuracy demands increase. Adaptive Mesh Refinement (AMR) improves the FEM by dynamically allocating mesh elements on the domain, balancing computational speed and accuracy. Classical AMR depends on heuristics or expensive error estimators, limiting its use in complex simulations. While learning-based AMR methods are promising, they currently only scale to simple problems. In this work, we formulate AMR as a system of collaborating, homogeneous agents that iteratively split into multiple new agents. This agent-wise perspective enables a spatial reward formulation focused on reducing the maximum mesh element error. Our approach, Adaptive Swarm Mesh Refinement (ASMR), offers efficient, stable optimization and generates highly adaptive meshes at user-defined resolution during inference. Extensive experiments, including volumetric meshes and Neumann boundary conditions, demonstrate that ASMR exceeds heuristic approaches and learned baselines, matching the performance of expensive error-based oracle AMR strategies. ASMR additionally generalizes to different domains during inference, and produces meshes that simulate up to 2 orders of magnitude faster than uniform refinements in more demanding settings.

[LG-11] Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2406.08431
作者: Benjamin Biggs,Arjun Seshadri,Yang Zou,Achin Jain,Aditya Golatkar,Yusheng Xie,Alessandro Achille,Ashwin Swaminathan,Stefano Soatto
关键词: present Diffusion Soup, Diffusion Soup, compartmentalization method, Diffusion Soup samples, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that Diffusion Soup samples from a point in weight space that approximates the geometric mean of the distributions of constituent datasets, which offers anti-memorization guarantees and enables zero-shot style mixing. Empirically, Diffusion Soup outperforms a paragon model trained on the union of all data shards and achieves a 30% improvement in Image Reward (.34 \to .44) on domain sharded data, and a 59% improvement in IR (.37 \to .59) on aesthetic data. In both cases, souping also prevails in TIFA score (respectively, 85.5 \to 86.5 and 85.6 \to 86.8). We demonstrate robust unlearning – removing any individual domain shard only lowers performance by 1% in IR (.45 \to .44) – and validate our theoretical insights on anti-memorization using real data. Finally, we showcase Diffusion Soup’s ability to blend the distinct styles of models finetuned on different shards, resulting in the zero-shot generation of hybrid styles.

[LG-12] Improving Noise Robustness through Abstractions and its Impact on Machine Learning

链接: https://arxiv.org/abs/2406.08428
作者: Alfredo Ibias(1),Karol Capala(1),Varun Ravi Varma(1),Anna Drozdz(1),Jose Sousa(1) ((1) Personal Health Data Science, Sano - Centre for Computational Personalised Medicine)
关键词: Machine Learning, application of Machine, learning theory, world data tendency, real world data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Noise is a fundamental problem in learning theory with huge effects in the application of Machine Learning (ML) methods, due to real world data tendency to be noisy. Additionally, introduction of malicious noise can make ML methods fail critically, as is the case with adversarial attacks. Thus, finding and developing alternatives to improve robustness to noise is a fundamental problem in ML. In this paper, we propose a method to deal with noise: mitigating its effect through the use of data abstractions. The goal is to reduce the effect of noise over the model’s performance through the loss of information produced by the abstraction. However, this information loss comes with a cost: it can result in an accuracy reduction due to the missing information. First, we explored multiple methodologies to create abstractions, using the training dataset, for the specific case of numerical data and binary classification tasks. We also tested how these abstractions can affect robustness to noise with several experiments that explore the robustness of an Artificial Neural Network to noise when trained using raw data \emphvs when trained using abstracted data. The results clearly show that using abstractions is a viable approach for developing noise robust ML methods.

[LG-13] State Soup: In-Context Skill Learning Retrieval and Mixing

链接: https://arxiv.org/abs/2406.08423
作者: Maciej Pióro,Maciej Wołczyk,Razvan Pascanu,Johannes von Oswald,João Sacramento
关键词: sequence modeling problems, gated-linear recurrent neural, recurrent neural networks, networks has reached, modeling problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.

[LG-14] Discovering Preference Optimization Algorithms with and for Large Language Models

链接: https://arxiv.org/abs/2406.08414
作者: Chris Lu,Samuel Holt,Claudio Fanconi,Alex J. Chan,Jakob Foerster,Mihaela van der Schaar,Robert Tjarko Lange
关键词: Large Language Model, Language Model, Large Language, preference optimization, Offline preference optimization
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually-crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains under explored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously-evaluated performance metrics. This process leads to the discovery of previously-unknown and performant preference optimization algorithms. The best performing of these we call Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.

[LG-15] Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

链接: https://arxiv.org/abs/2406.08413
作者: Christopher Wolters,Xiaoxuan Yang,Ulf Schlichtmann,Toyotaro Suzumura
关键词: Large language models, transformed natural language, recently transformed natural, generate human-like text, natural language processing
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. This development necessitates speed, efficiency, and accessibility in LLM inference as the computational and memory requirements of these systems grow exponentially. Meanwhile, advancements in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore’s law. With LLMs exceeding the capacity of single GPUs, they require complex, expert-level configurations for parallel processing. Memory accesses become significantly more expensive than computation, posing a challenge for efficient scaling, known as the memory wall. Here, compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory, potentially reducing latency and power consumption. By closely integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency. This survey paper provides an overview and analysis of transformer-based models, reviewing various CIM architectures and exploring how they can address the imminent challenges of modern AI computing systems. We discuss transformer-related operators and their hardware acceleration schemes and highlight challenges, trends, and insights in corresponding CIM designs.

[LG-16] RRLS : Robust Reinforcement Learning Suite

链接: https://arxiv.org/abs/2406.08406
作者: Adil Zouitine,David Bertoin,Pierre Clavier,Matthieu Geist,Emmanuel Rachelson
关键词: optimal worst-case performance, Robust reinforcement learning, provide optimal worst-case, Robust reinforcement, reinforcement learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robust reinforcement learning is the problem of learning control policies that provide optimal worst-case performance against a span of adversarial environments. It is a crucial ingredient for deploying algorithms in real-world scenarios with prevalent environmental uncertainties and has been a long-standing object of attention in the community, without a standardized set of benchmarks. This contribution endeavors to fill this gap. We introduce the Robust Reinforcement Learning Suite (RRLS), a benchmark suite based on Mujoco environments. RRLS provides six continuous control tasks with two types of uncertainty sets for training and evaluation. Our benchmark aims to standardize robust reinforcement learning tasks, facilitating reproducible and comparable experiments, in particular those from recent state-of-the-art contributions, for which we demonstrate the use of RRLS. It is also designed to be easily expandable to new environments. The source code is available at \hrefthis https URLthis https URL.

[LG-17] Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

链接: https://arxiv.org/abs/2406.08404
作者: Yuhui Wang,Qingyuan Wu,Weida Li,Dylan R. Ashley,Francesco Faccio,Chao Huang,Jürgen Schmidhuber
关键词: Iteration Network, performs value iteration, latent MDP, differentiable architecture, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a 100\times 100 maze – a task which typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module’s depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introducing an “adaptive highway loss” that constructs skip connections to improve gradient flow. We evaluate our method on both 2D maze navigation environments and the ViZDoom 3D navigation benchmark. We find that our new method, named Dynamic Transition VIN (DT-VIN), easily scales to 5000 layers and casually solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments.

[LG-18] cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers

链接: https://arxiv.org/abs/2406.08398
作者: Anirudh Sundar,Jin Xu,William Gay,Christopher Richardson,Larry Heck
关键词: multimodal interactive conversations, interactive conversations, emerging area, situated and multimodal, multimodal interactive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 1 figure

点击查看摘要

Abstract:An emerging area of research in situated and multimodal interactive conversations (SIMMC) includes interactions in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interactions required by research scientists. This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers grounded in these paper components and their associated references from scientific documents available on arXiv. We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from LaTeX source files. Additionally, we present a series of baseline approaches utilizing Large Language Models (LLMs) in both zero-shot and fine-tuned configurations to address the cPAPERS dataset.

[LG-19] me-Constrained Robust MDPs

链接: https://arxiv.org/abs/2406.08395
作者: Adil Zouitine,David Bertoin,Pierre Clavier,Matthieu Geist,Emmanuel Rachelson
关键词: deploying reinforcement learning, Robust reinforcement learning, reinforcement learning, environmental uncertainty predominates, uncertainty predominates
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robust reinforcement learning is essential for deploying reinforcement learning algorithms in real-world scenarios where environmental uncertainty predominates. Traditional robust reinforcement learning often depends on rectangularity assumptions, where adverse probability measures of outcome states are assumed to be independent across different states and actions. This assumption, rarely fulfilled in practice, leads to overly conservative policies. To address this problem, we introduce a new time-constrained robust MDP (TC-RMDP) formulation that considers multifactorial, correlated, and time-dependent disturbances, thus more accurately reflecting real-world dynamics. This formulation goes beyond the conventional rectangularity paradigm, offering new perspectives and expanding the analytical framework for robust RL. We propose three distinct algorithms, each using varying levels of environmental information, and evaluate them extensively on continuous control benchmarks. Our results demonstrate that these algorithms yield an efficient tradeoff between performance and robustness, outperforming traditional deep robust RL methods in time-constrained environments while preserving robustness in classical benchmarks. This study revisits the prevailing assumptions in robust RL and opens new avenues for developing more practical and realistic RL applications.

[LG-20] Large Language Models Must Be Taught to Know What They Dont Know

链接: https://arxiv.org/abs/2406.08391
作者: Sanyam Kapoor,Nate Gruver,Manley Roberts,Katherine Collins,Arka Pal,Umang Bhatt,Adrian Weller,Samuel Dooley,Micah Goldblum,Andrew Gordon Wilson
关键词: high-stakes applications, trust their predictions, large language models, argue that prompting, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: Code available at: this https URL

点击查看摘要

Abstract:When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study.

[LG-21] Deep Learning Based Joint Multi-User MISO Power Allocation and Beamforming Design

链接: https://arxiv.org/abs/2406.08373
作者: Cemil Vahapoglu,Timothy J. O’Shea,Tamoghna Roy,Sennur Ulukus
关键词: wireless resource management, higher data rates, provide higher data, wireless communication networks, resource management solutions
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The evolution of fifth generation (5G) wireless communication networks has led to an increased need for wireless resource management solutions that provide higher data rates, wide coverage, low latency, and power efficiency. Yet, many of existing traditional approaches remain non-practical due to computational limitations, and unrealistic presumptions of static network conditions and algorithm initialization dependencies. This creates an important gap between theoretical analysis and real-time processing of algorithms. To bridge this gap, deep learning based techniques offer promising solutions with their representational capabilities for universal function approximation. We propose a novel unsupervised deep learning based joint power allocation and beamforming design for multi-user multiple-input single-output (MU-MISO) system. The objective is to enhance the spectral efficiency by maximizing the sum-rate with the proposed joint design framework, NNBF-P while also offering computationally efficient solution in contrast to conventional approaches. We conduct experiments for diverse settings to compare the performance of NNBF-P with zero-forcing beamforming (ZFBF), minimum mean square error (MMSE) beamforming, and NNBF, which is also our deep learning based beamforming design without joint power allocation scheme. Experiment results demonstrate the superiority of NNBF-P compared to ZFBF, and MMSE while NNBF can have lower performances than MMSE and ZFBF in some experiment settings. It can also demonstrate the effectiveness of joint design framework with respect to NNBF.

[LG-22] DocSynthv2: A Practical Autoregressive Modeling for Document Generation

链接: https://arxiv.org/abs/2406.08354
作者: Sanket Biswas,Rajiv Jain,Vlad I. Morariu,Jiuxiang Gu,Puneet Mathur,Curtis Wigington,Tong Sun,Josep Lladós
关键词: extensively explored, comprehensive document generation, document generation encompassing, comprehensive document, complex challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Spotlight (Oral) Acceptance to CVPR 2024 Workshop for Graphic Design Understanding and Generation (GDUG)

点击查看摘要

Abstract:While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks.

[LG-23] A Survey of Pipeline Tools for Data Engineering

链接: https://arxiv.org/abs/2406.08335
作者: Anthony Mbata,Yaji Sripada,Mingjun Zhong
关键词: data engineering, data, tools, data engineering tasks, pipeline tools
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
*备注: 18 pages, 7 figures

点击查看摘要

Abstract:Currently, a variety of pipeline tools are available for use in data engineering. Data scientists can use these tools to resolve data wrangling issues associated with data and accomplish some data engineering tasks from data ingestion through data preparation to utilization as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform desired data engineering operations. While some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories and examples of pipeline tools based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT), pipelines for Data Integration, Ingestion, and Transformation, Data Pipeline Orchestration and Workflow Management, and Machine Learning Pipelines. The survey also provides a broad outline of the utilization with examples within these broad groups and finally, a discussion is presented with case studies indicating the usage of pipeline tools for data engineering. The studies present some first-user application experiences with sample data, some complexities of the applied pipeline, and a summary note of approaches to using these tools to prepare data for machine learning.

[LG-24] ProTrain: Efficient LLM Training via Memory-Aware Techniques

链接: https://arxiv.org/abs/2406.08334
作者: Hanmei Yang,Jin Zhou,Yao Fu,Xiaoqun Wang,Ramine Roane,Hui Guan,Tongping Liu
关键词: Large Language Models, train Large Language, Large Language, Language Models, train Large
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:It is extremely memory-hungry to train Large Language Models (LLM). To solve this problem, existing work exploits the combination of CPU and GPU for the training process, such as ZeRO-Offload. Such a technique largely democratizes billion-scale model training, making it possible to train with few consumer graphics cards. However, based on our observation, existing frameworks often provide coarse-grained memory management and require experienced experts in configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43 \times to 2.71 \times compared to the SOTA training systems.

[LG-25] Genetic Column Generation for Computing Lower Bounds for Adversarial Classification

链接: https://arxiv.org/abs/2406.08331
作者: Maximilian Penka
关键词: Recent theoretical results, Recent theoretical, formulation of Wasserstein-barycenter, multi-class classification showed, theoretical results
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent theoretical results on adversarial multi-class classification showed a similarity to the multi-marginal formulation of Wasserstein-barycenter in optimal transport. Unfortunately, both problems suffer from the curse of dimension, making it hard to exploit the nice linear program structure of the problems for numerical calculations. We investigate how ideas from Genetic Column Generation for multi-marginal optimal transport can be used to overcome the curse of dimension in computing the minimal adversarial risk in multi-class classification.

[LG-26] Its all about PR – Smart Benchmarking AI Accelerators using Performance Representatives

链接: https://arxiv.org/abs/2406.08330
作者: Alexander Louis-Ferdinand Jung,Jannik Steinmetz,Jonathan Gietz,Konstantin Lübeck,Oliver Bringmann
关键词: statistical performance models, statistical performance, hardware accelerators, COTS, training samples
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted version for: SAMOS’24

点击查看摘要

Abstract:Statistical models are widely used to estimate the performance of commercial off-the-shelf (COTS) AI hardware accelerators. However, training of statistical performance models often requires vast amounts of data, leading to a significant time investment and can be difficult in case of limited hardware availability. To alleviate this problem, we propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy. Our approach leverages knowledge of the target hardware architecture and initial parameter sweeps to identify a set of Performance Representatives (PR) for deep neural network (DNN) layers. These PRs are then used for benchmarking, building a statistical performance model, and making estimations. This targeted approach drastically reduces the number of training samples needed, opposed to random sampling, to achieve a better estimation accuracy. We achieve a Mean Absolute Percentage Error (MAPE) of as low as 0.02% for single-layer estimations and 0.68% for whole DNN estimations with less than 10000 training samples. The results demonstrate the superiority of our method for single-layer estimations compared to models trained with randomly sampled datasets of the same size.

[LG-27] Is Programming by Example solved by LLMs?

链接: https://arxiv.org/abs/2406.08316
作者: Wen-Ding Li,Kevin Ellis
关键词: aims to generate, generate an algorithm, algorithm from input-output, PBE, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of Large Language Models (LLMs) in code-generation tasks, we investigate here the extent to which LLMs can be said to have `solved’ PBE. We experiment on classic domains such as lists and strings, and an uncommon graphics programming domain not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.

[LG-28] Improving Policy Optimization via varepsilon-Retrain

链接: https://arxiv.org/abs/2406.08315
作者: Luca Marzari,Changliu Liu,Priya L. Donti,Enrico Marchesini
关键词: exploration strategy designed, monotonic improvement guarantees, exploration strategy, strategy designed, designed to encourage
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present \varepsilon -retrain, an exploration strategy designed to encourage a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas – parts of the state space where an agent did not follow the behavioral preference. Our method then switches between the typical uniform restart state distribution and the retrain areas using a decaying factor \varepsilon , allowing agents to retrain on situations where they violated the preference. Experiments over hundreds of seeds across locomotion, navigation, and power network tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements. Moreover, we employ formal verification of neural networks to provably quantify the degree to which agents adhere to behavioral preferences.

[LG-29] Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework

链接: https://arxiv.org/abs/2406.08311
作者: Ruibo Tu,Zineb Senane,Lele Cao,Cheng Zhang,Hedvig Kjellström,Gustav Eje Henter
关键词: Tabular synthesis models, Tabular synthesis, models remain ineffective, capturing complex dependencies, automated decision-making
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tabular synthesis models remain ineffective at capturing complex dependencies, and the quality of synthetic data is still insufficient for comprehensive downstream tasks, such as prediction under distribution shifts, automated decision-making, and cross-table understanding. A major challenge is the lack of prior knowledge about underlying structures and high-order relationships in tabular data. We argue that a systematic evaluation on high-order structural information for tabular data synthesis is the first step towards solving the problem. In this paper, we introduce high-order structural causal information as natural prior knowledge and provide a benchmark framework for the evaluation of tabular synthesis models. The framework allows us to generate benchmark datasets with a flexible range of data generation processes and to train tabular synthesis models using these datasets for further evaluation. We propose multiple benchmark tasks, high-order metrics, and causal inference tasks as downstream tasks for evaluating the quality of synthetic data generated by the trained models. Our experiments demonstrate to leverage the benchmark framework for evaluating the model capability of capturing high-order structural causal information. Furthermore, our benchmarking results provide an initial assessment of state-of-the-art tabular synthesis models. They have clearly revealed significant gaps between ideal and actual performance and how baseline methods differ. Our benchmark framework is available at URL this https URL.

[LG-30] GraphFM: A Comprehensive Benchmark for Graph Foundation Model

链接: https://arxiv.org/abs/2406.08310
作者: Yuhao Xu,Xinqi Liu,Keyu Duan,Yi Fang,Yu-Neng Chuang,Daochen Zha,Qiaoyu Tan
关键词: artificial intelligence systems, offering broad potential, Graph Foundation Models, Foundation Models, intelligence systems
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation Models (FMs) serve as a general class for the development of artificial intelligence systems, offering broad potential for generalization across a spectrum of downstream tasks. Despite extensive research into self-supervised learning as the cornerstone of FMs, several outstanding issues persist in Graph Foundation Models that rely on graph self-supervised learning, namely: 1) Homogenization. The extent of generalization capability on downstream tasks remains unclear. 2) Scalability. It is unknown how effectively these models can scale to large datasets. 3) Efficiency. The training time and memory usage of these models require evaluation. 4) Training Stop Criteria. Determining the optimal stopping strategy for pre-training across multiple tasks to maximize performance on downstream tasks. To address these questions, we have constructed a rigorous benchmark that thoroughly analyzes and studies the generalization and scalability of self-supervised Graph Neural Network (GNN) models. Regarding generalization, we have implemented and compared the performance of various self-supervised GNN models, trained to generate node representations, across tasks such as node classification, link prediction, and node clustering. For scalability, we have compared the performance of various models after training using full-batch and mini-batch strategies. Additionally, we have assessed the training efficiency of these models by conducting experiments to test their GPU memory usage and throughput. Through these experiments, we aim to provide insights to motivate future research. The code for this benchmark is publicly available at this https URL.

[LG-31] Vessel Re-identification and Activity Detection in Thermal Domain for Maritime Surveillance

链接: https://arxiv.org/abs/2406.08294
作者: Yasod Ginige,Ransika Gunasekara,Darsha Hewavitharana,Manjula Ariyarathne,Ranga Rodrigo,Peshala Jayasekara
关键词: mitigate illegal activities, Maritime surveillance, illegal fishing, drug smuggling, human trafficking
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Maritime surveillance is vital to mitigate illegal activities such as drug smuggling, illegal fishing, and human trafficking. Vision-based maritime surveillance is challenging mainly due to visibility issues at night, which results in failures in re-identifying vessels and detecting suspicious activities. In this paper, we introduce a thermal, vision-based approach for maritime surveillance with object tracking, vessel re-identification, and suspicious activity detection capabilities. For vessel re-identification, we propose a novel viewpoint-independent algorithm which compares features of the sides of the vessel separately (separate side-spaces) leveraging shape information in the absence of color features. We propose techniques to adapt tracking and activity detection algorithms for the thermal domain and train them using a thermal dataset we created. This dataset will be the first publicly available benchmark dataset for thermal maritime surveillance. Our system is capable of re-identifying vessels with an 81.8% Top1 score and identifying suspicious activities with a 72.4% frame mAP score; a new benchmark for each task in the thermal domain.

[LG-32] Decoupling the Class Label and the Target Concept in Machine Unlearning

链接: https://arxiv.org/abs/2406.08288
作者: Jianing Zhu,Bo Han,Jiangchao Yao,Jianliang Xu,Gang Niu,Masashi Sugiyama
关键词: emerging research topic, Machine unlearning, aims to adjust, emerging research, research topic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning as an emerging research topic for data regulations, aims to adjust a trained model to approximate a retrained one that excludes a portion of training data. Previous studies showed that class-wise unlearning is successful in forgetting the knowledge of a target class, through gradient ascent on the forgetting data or fine-tuning with the remaining data. However, while these methods are useful, they are insufficient as the class label and the target concept are often considered to coincide. In this work, we decouple them by considering the label domain mismatch and investigate three problems beyond the conventional all matched forgetting, e.g., target mismatch, model mismatch, and data mismatch forgetting. We systematically analyze the new challenges in restrictively forgetting the target concept and also reveal crucial forgetting dynamics in the representation level to realize these tasks. Based on that, we propose a general framework, namely, TARget-aware Forgetting (TARF). It enables the additional tasks to actively forget the target concept while maintaining the rest part, by simultaneously conducting annealed gradient ascent on the forgetting data and selected gradient descent on the hard-to-affect remaining data. Empirically, various experiments under the newly introduced settings are conducted to demonstrate the effectiveness of our TARF.

[LG-33] Pre-Training Identification of Graph Winning Tickets in Adaptive Spatial-Temporal Graph Neural Networks

链接: https://arxiv.org/abs/2406.08287
作者: Wenying Duan,Tianxiang Fang,Hong Rao,Xiaoxi He
关键词: Lottery Ticket Hypothesis, Graph Winning Ticket, Ticket Hypothesis, Winning Ticket, Lottery Ticket
类目: Machine Learning (cs.LG)
*备注: Conference paper, accepted by KDD’ 24

点击查看摘要

Abstract:In this paper, we present a novel method to significantly enhance the computational efficiency of Adaptive Spatial-Temporal Graph Neural Networks (ASTGNNs) by introducing the concept of the Graph Winning Ticket (GWT), derived from the Lottery Ticket Hypothesis (LTH). By adopting a pre-determined star topology as a GWT prior to training, we balance edge reduction with efficient information propagation, reducing computational demands while maintaining high model performance. Both the time and memory computational complexity of generating adaptive spatial-temporal graphs is significantly reduced from \mathcalO(N^2) to \mathcalO(N) . Our approach streamlines the ASTGNN deployment by eliminating the need for exhaustive training, pruning, and retraining cycles, and demonstrates empirically across various datasets that it is possible to achieve comparable performance to full models with substantially lower computational costs. Specifically, our approach enables training ASTGNNs on the largest scale spatial-temporal dataset using a single A6000 equipped with 48 GB of memory, overcoming the out-of-memory issue encountered during original training and even achieving state-of-the-art performance. Furthermore, we delve into the effectiveness of the GWT from the perspective of spectral graph theory, providing substantial theoretical support. This advancement not only proves the existence of efficient sub-networks within ASTGNNs but also broadens the applicability of the LTH in resource-constrained settings, marking a significant step forward in the field of graph neural networks. Code is available at https://anonymous.4open.science/r/paper-1430.

[LG-34] Conformal Load Prediction with Transductive Graph Autoencoders

链接: https://arxiv.org/abs/2406.08281
作者: Rui Luo,Nicolo Colombo
关键词: Predicting edge weights, Graph Neural Network, Predicting edge, social networks, Neural Network
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Predicting edge weights on graphs has various applications, from transportation systems to social networks. This paper describes a Graph Neural Network (GNN) approach for edge weight prediction with guaranteed coverage. We leverage conformal prediction to calibrate the GNN outputs and produce valid prediction intervals. We handle data heteroscedasticity through error reweighting and Conformalized Quantile Regression (CQR). We compare the performance of our method against baseline techniques on real-world transportation datasets. Our approach has better coverage and efficiency than all baselines and showcases robustness and adaptability.

[LG-35] he Importance of Positional Encoding Initialization in Transformers for Relational Reasoning

链接: https://arxiv.org/abs/2406.08272
作者: Takuya Ito,Luca Cocchi,Tim Klinger,Parikshit Ram,Murray Campbell,Luke Hearne
关键词: Relational reasoning refers, Relational reasoning, relational reasoning tasks, multiple entities, infer and understand
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Relational reasoning refers to the ability to infer and understand the relations between multiple entities. In humans, this ability underpins many higher cognitive functions, such as problem solving and decision-making, and has been reliably linked to fluid intelligence. Despite machine learning models making impressive advances across various domains, such as natural language processing and vision, the extent to which such models can perform relational reasoning tasks remains unclear. Here we study the importance of positional encoding (PE) for relational reasoning in the Transformer, and find that a learnable PE outperforms all other commonly-used PEs (e.g., absolute, relative, rotary, etc.). Moreover, we find that when using a PE with a learnable parameter, the choice of initialization greatly influences the learned representations and its downstream generalization performance. Specifically, we find that a learned PE initialized from a small-norm distribution can 1) uncover ground-truth position information, 2) generalize in the presence of noisy inputs, and 3) produce behavioral patterns that are consistent with human performance. Our results shed light on the importance of learning high-performing and robust PEs during relational reasoning tasks, which will prove useful for tasks in which ground truth positions are not provided or not known.

[LG-36] Analyzing constrained LLM through PDFA-learning

链接: https://arxiv.org/abs/2406.08269
作者: Matías Carrasco,Franz Mayr,Sergio Yovine,Johny Kidd,Martín Iturbide,Juan Pedro da Silva,Alejo Garat
关键词: null next-symbol probabilities, text generation, copes with null, null next-symbol, next-symbol probabilities
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Workshop Paper

点击查看摘要

Abstract:We define a congruence that copes with null next-symbol probabilities that arise when the output of a language model is constrained by some means during text generation. We develop an algorithm for efficiently learning the quotient with respect to this congruence and evaluate it on case studies for analyzing statistical properties of LLM.

[LG-37] A deep cut into Split Federated Self-supervised Learning

链接: https://arxiv.org/abs/2406.08267
作者: Marcin Przewięźlikowski,Marcin Osial,Bartosz Zieliński,Marek Śmieja
关键词: Collaborative self-supervised learning, Collaborative self-supervised, highly distributed environments, central server, recently become feasible
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted to European Conference on Machine Learning (ECML) 2024

点击查看摘要

Abstract:Collaborative self-supervised learning has recently become feasible in highly distributed environments by dividing the network layers between client devices and a central server. However, state-of-the-art methods, such as MocoSFL, are optimized for network division at the initial layers, which decreases the protection of the client data and increases communication overhead. In this paper, we demonstrate that splitting depth is crucial for maintaining privacy and communication efficiency in distributed training. We also show that MocoSFL suffers from a catastrophic quality deterioration for the minimal communication overhead. As a remedy, we introduce Momentum-Aligned contrastive Split Federated Learning (MonAcoSFL), which aligns online and momentum client models during training procedure. Consequently, we achieve state-of-the-art accuracy while significantly reducing the communication overhead, making MonAcoSFL more practical in real-world scenarios.

[LG-38] Dataset Enhancement with Instance-Level Augmentations

链接: https://arxiv.org/abs/2406.08249
作者: Orest Kupyn,Christian Rupprecht
关键词: pre-trained latent diffusion, latent diffusion models, incorporating knowledge, wide distribution, distribution of pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a method for expanding a dataset by incorporating knowledge from the wide distribution of pre-trained latent diffusion models. Data augmentations typically incorporate inductive biases about the image formation process into the training (e.g. translation, scaling, colour changes, etc.). Here, we go beyond simple pixel transformations and introduce the concept of instance-level data augmentation by repainting parts of the image at the level of object instances. The method combines a conditional diffusion model with depth and edge maps control conditioning to seamlessly repaint individual objects inside the scene, being applicable to any segmentation or detection dataset. Used as a data augmentation method, it improves the performance and generalization of the state-of-the-art salient object detection, semantic segmentation and object detection models. By redrawing all privacy-sensitive instances (people, license plates, etc.), the method is also applicable for data anonymization. We also release fully synthetic and anonymized expansions for popular datasets: COCO, Pascal VOC and DUTS.

[LG-39] Leveraging Large Language Models for Web Scraping

链接: https://arxiv.org/abs/2406.08246
作者: Aman Ahluwalia,Suhrud Wani
关键词: demonstrate remarkable capabilities, replicating human tasks, demonstrate remarkable, boosting productivity, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities in replicating human tasks and boosting productivity. However, their direct application for data extraction presents limitations due to a prioritisation of fluency over factual accuracy and a restricted ability to manipulate specific information. Therefore to overcome these limitations, this research leverages the knowledge representation power of pre-trained LLMs and the targeted information access enabled by RAG models, this research investigates a general-purpose accurate data scraping recipe for RAG models designed for language generation. To capture knowledge in a more modular and interpretable way, we use pre trained language models with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus. We utilised RAG model architecture and did an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Chunking HTML text for effective understanding, and (iii) comparing results from different LLMs and ranking algorithms. While previous work has developed dedicated architectures and training procedures for HTML understanding and extraction, we show that LLMs pre-trained on standard natural language with an addition of effective chunking, searching and ranking algorithms, can prove to be efficient data scraping tool to extract complex data from unstructured text. Future research directions include addressing the challenges of provenance tracking and dynamic knowledge updates within the proposed RAG-based data extraction framework. By overcoming these limitations, this approach holds the potential to revolutionise data extraction from vast repositories of textual information.

[LG-40] Residual Learning and Context Encoding for Adaptive Offline-to-Online Reinforcement Learning

链接: https://arxiv.org/abs/2406.08238
作者: Mohammadreza Nakhaei,Aidan Scannell,Joni Pajarinen
关键词: learning sequential behavior, sequential behavior, behavior from fixed, Offline reinforcement learning, Offline
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 10 pages, 5 figures, 1 table. Accepted at L4DC 2024

点击查看摘要

Abstract:Offline reinforcement learning (RL) allows learning sequential behavior from fixed datasets. Since offline datasets do not cover all possible situations, many methods collect additional data during online fine-tuning to improve performance. In general, these methods assume that the transition dynamics remain the same during both the offline and online phases of training. However, in many real-world applications, such as outdoor construction and navigation over rough terrain, it is common for the transition dynamics to vary between the offline and online phases. Moreover, the dynamics may vary during the online fine-tuning. To address this problem of changing dynamics from offline to online RL we propose a residual learning approach that infers dynamics changes to correct the outputs of the offline solution. At the online fine-tuning phase, we train a context encoder to learn a representation that is consistent inside the current online learning environment while being able to predict dynamic transitions. Experiments in D4RL MuJoCo environments, modified to support dynamics’ changes upon environment resets, show that our approach can adapt to these dynamic changes and generalize to unseen perturbations in a sample-efficient way, whilst comparison methods cannot.

[LG-41] MaIL: Improving Imitation Learning with Mamba

链接: https://arxiv.org/abs/2406.08234
作者: Xiaogang Jia,Qian Wang,Atalay Donat,Bowen Xing,Ge Li,Hongyi Zhou,Onur Celik,Denis Blessing,Rudolf Lioutikov,Gerhard Neumann
关键词: Mamba Imitation Learning, Imitation Learning, introduces Mamba Imitation, computationally efficient alternative, Mamba Imitation
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This work introduces Mamba Imitation Learning (MaIL), a novel imitation learning (IL) architecture that offers a computationally efficient alternative to state-of-the-art (SoTA) Transformer policies. Transformer-based policies have achieved remarkable results due to their ability in handling human-recorded data with inherently non-Markovian behavior. However, their high performance comes with the drawback of large models that complicate effective training. While state space models (SSMs) have been known for their efficiency, they were not able to match the performance of Transformers. Mamba significantly improves the performance of SSMs and rivals against Transformers, positioning it as an appealing alternative for IL policies. MaIL leverages Mamba as a backbone and introduces a formalism that allows using Mamba in the encoder-decoder structure. This formalism makes it a versatile architecture that can be used as a standalone policy or as part of a more advanced architecture, such as a diffuser in the diffusion process. Extensive evaluations on the LIBERO IL benchmark and three real robot experiments show that MaIL: i) outperforms Transformers in all LIBERO tasks, ii) achieves good performance even with small datasets, iii) is able to effectively process multi-modal sensory inputs, iv) is more robust to input noise compared to Transformers.

[LG-42] GPT4Rec: Graph Prompt Tuning for Streaming Recommendation

链接: https://arxiv.org/abs/2406.08229
作者: Peiyan Zhang,Yuchen Yan,Xi Zhang,Liying Kang,Chaozhuo Li,Feiran Huang,Senzhang Wang,Sunghun Kim
关键词: personalized recommender systems, evolving user preferences, user preferences, recommender systems, items is paramount
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by SIGIR 2024. arXiv admin note: text overlap with arXiv:2303.11700 by other authors

点击查看摘要

Abstract:In the realm of personalized recommender systems, the challenge of adapting to evolving user preferences and the continuous influx of new users and items is paramount. Conventional models, typically reliant on a static training-test approach, struggle to keep pace with these dynamic demands. Streaming recommendation, particularly through continual graph learning, has emerged as a novel solution. However, existing methods in this area either rely on historical data replay, which is increasingly impractical due to stringent data privacy regulations; or are inability to effectively address the over-stability issue; or depend on model-isolation and expansion strategies. To tackle these difficulties, we present GPT4Rec, a Graph Prompt Tuning method for streaming Recommendation. Given the evolving user-item interaction graph, GPT4Rec first disentangles the graph patterns into multiple views. After isolating specific interaction patterns and relationships in different views, GPT4Rec utilizes lightweight graph prompts to efficiently guide the model across varying interaction patterns within the user-item graph. Firstly, node-level prompts are employed to instruct the model to adapt to changes in the attributes or properties of individual nodes within the graph. Secondly, structure-level prompts guide the model in adapting to broader patterns of connectivity and relationships within the graph. Finally, view-level prompts are innovatively designed to facilitate the aggregation of information from multiple disentangled views. These prompt designs allow GPT4Rec to synthesize a comprehensive understanding of the graph, ensuring that all vital aspects of the user-item interactions are considered and effectively integrated. Experiments on four diverse real-world datasets demonstrate the effectiveness and efficiency of our proposal.

[LG-43] DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

链接: https://arxiv.org/abs/2406.08226
作者: Jordy Van Landeghem,Subhajit Maity,Ayan Banerjee,Matthew Blaschko,Marie-Francine Moens,Josep Lladós,Sanket Biswas
关键词: document image classification, image classification, DIC, document layout analysis, explores knowledge distillation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to ICDAR 2024 (Athens, Greece)

点击查看摘要

Abstract:This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness.

[LG-44] Runtime Freezing: Dynamic Class Loss for Multi-Organ 3D Segmentation

链接: https://arxiv.org/abs/2406.08217
作者: James Willoughby,Irina Voiculescu
关键词: crucial pre-processing step, refined downstream tasks, medical domain, crucial pre-processing, pre-processing step
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 4 Pages. Accepted to ISBI 2024

点击查看摘要

Abstract:Segmentation has become a crucial pre-processing step to many refined downstream tasks, and particularly so in the medical domain. Even with recent improvements in segmentation models, many segmentation tasks remain difficult. When multiple organs are segmented simultaneously, difficulties are due not only to the limited availability of labelled data, but also to class imbalance. In this work we propose dynamic class-based loss strategies to mitigate the effects of highly imbalanced training data. We show how our approach improves segmentation performance on a challenging Multi-Class 3D Abdominal Organ dataset.

[LG-45] Expressivity and Generalization: Fragment-Biases for Molecular GNNs

链接: https://arxiv.org/abs/2406.08210
作者: Tom Wollschläger,Niklas Kemper,Leon Hetzel,Johanna Sommer,Stephan Günnemann
关键词: Graph Neural Networks, higher-order Graph Neural, Neural Networks, property predictive performance, Graph Neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although recent advances in higher-order Graph Neural Networks (GNNs) improve the theoretical expressiveness and molecular property predictive performance, they often fall short of the empirical performance of models that explicitly use fragment information as inductive bias. However, for these approaches, there exists no theoretic expressivity study. In this work, we propose the Fragment-WL test, an extension to the well-known Weisfeiler Leman (WL) test, which enables the theoretic analysis of these fragment-biased GNNs. Building on the insights gained from the Fragment-WL test, we develop a new GNN architecture and a fragmentation with infinite vocabulary that significantly boosts expressiveness. We show the effectiveness of our model on synthetic and real-world data where we outperform all GNNs on Peptides and have 12% lower error than all GNNs on ZINC and 34% lower error than other fragment-biased models. Furthermore, we show that our model exhibits superior generalization capabilities compared to the latest transformer-based architectures, positioning it as a robust solution for a range of molecular modeling tasks.

[LG-46] Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation

链接: https://arxiv.org/abs/2406.08206
作者: Christopher Bockel-Rickermann,Toon Vanderschueren,Tim Verdonck,Wouter Verbeke
关键词: Estimating conditional average, Estimating conditional, average dose responses, conditional average dose, conditional average
类目: Machine Learning (cs.LG)
*备注: 25 pages, 9 figures

点击查看摘要

Abstract:Estimating conditional average dose responses (CADR) is an important but challenging problem. Estimators must correctly model the potentially complex relationships between covariates, interventions, doses, and outcomes. In recent years, the machine learning community has shown great interest in developing tailored CADR estimators that target specific challenges. Their performance is typically evaluated against other methods on (semi-) synthetic benchmark datasets. Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance. Established benchmarks entail multiple challenges, whose impacts must be disentangled. Therefore, we propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance. We apply this scheme to eight popular CADR estimators on four widely-used benchmark datasets, running nearly 1,500 individual experiments. Our results reveal that most established benchmarks are challenging for reasons different from their creators’ claims. Notably, confounding, the key challenge tackled by most estimators, is not an issue in any of the considered datasets. We discuss the major implications of our findings and present directions for future research.

[LG-47] What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims

链接: https://arxiv.org/abs/2406.08205
作者: Jason Jones,Wenxin Jiang,Nicholas Synovic,George K. Thiruvathukal,James C. Davis
关键词: Collaborative Software Package, Toggle, Collaborative Software, Software Package Registries, synthesizes SPR package
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Background: Collaborative Software Package Registries (SPRs) are an integral part of the software supply chain. Much engineering work synthesizes SPR package into applications. Prior research has examined SPRs for traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. Aims: Recent empirical research has examined PTM registries in ways such as vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes them to provide a systematic understanding of the current knowledge. Some of the existing research includes qualitative claims lacking quantitative analysis. Our research fills these gaps by providing a knowledge synthesis and quantitative analyses. Methods: We first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims. Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our findings are: (1) PTMs have a much higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) There is a strong correlation between documentation quality and PTM popularity. Conclusions: We confirm qualitative research claims with concrete metrics, supporting prior qualitative and case study research. Our measures show further dynamics of PTM reuse, inspiring research infrastructure and new measures. Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG) Cite as: arXiv:2406.08205 [cs.SE] (or arXiv:2406.08205v1 [cs.SE] for this version) Submission history From: James Davis [view email] [v1] Wed, 12 Jun 2024 13:38:48 UTC (4,469 KB) Full-text links: Access Paper: View a PDF of the paper titled What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims, by Jason Jones and 4 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.SE prev | next new | recent | 2024-06 Change to browse by: cs cs.LG References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[LG-48] Attention-Based Learning for Fluid State Interpolation and Editing in a Time-Continuous Framework

链接: https://arxiv.org/abs/2406.08188
作者: Bruno Roy
关键词: introduce FluidsFormer, continuous-time framework, transformer-based approach, Abstract, residual neural network
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注: 5 pages, 3 figures, submitted and accepted to SIGGRAPH

点击查看摘要

Abstract:In this work, we introduce FluidsFormer: a transformer-based approach for fluid interpolation within a continuous-time framework. By combining the capabilities of PITT and a residual neural network (RNN), we analytically predict the physical properties of the fluid state. This enables us to interpolate substep frames between simulated keyframes, enhancing the temporal smoothness and sharpness of animations. We demonstrate promising results for smoke interpolation and conduct initial experiments on liquids.

[LG-49] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark

链接: https://arxiv.org/abs/2406.08155
作者: Pingzhi Li,Xiaolong Jin,Yu Cheng,Tianlong Chen
关键词: Large Language Models, natural language processing, Large Language, demonstrating performance improvements, language processing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Our code for reproducing all our experiments is provided at this https URL

点击查看摘要

Abstract:Large Language Models~(LLMs) have become foundational in the realm of natural language processing, demonstrating performance improvements as model sizes increase. The Mixture-of-Experts~(MoE) approach offers a promising way to scale LLMs more efficiently by using fewer computational FLOPs through sparse activation. However, it suffers from significant memory overheads, necessitating model compression techniques. Post-training quantization, a popular method for model compression, proves less effective when directly applied to MoE models due to MoE’s overlooked inherent sparsity. This paper explores several MoE structure-aware quantization heuristics, ranging from coarse to fine granularity, from MoE block to individual linear weight. Our investigations reveal critical principles: different MoE structures (i.e., blocks, experts, linear layers) require varying numbers of weight bits for effective and efficient quantization. Conclusions are supported by extensive benchmarking across two representative MoE models and six tasks. We further introduce novel enhancements to more accurately identify the most critical weights in MoE quantization that necessitate higher bit allocations, including the linear weight outlier scorer and MoE block scorer. Additionally, subsequent experiments validate our findings in the context of both weight and activation quantization.

[LG-50] Probing Implicit Bias in Semi-gradient Q-learning: Visualizing the Effective Loss Landscapes via the Fokker–Planck Equation

链接: https://arxiv.org/abs/2406.08148
作者: Shuyu Yin,Fei Wen,Peilin Liu,Tao Luo
关键词: explicit loss function, effective loss landscape, loss landscape, effective loss, studying its dynamics
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Semi-gradient Q-learning is applied in many fields, but due to the absence of an explicit loss function, studying its dynamics and implicit bias in the parameter space is challenging. This paper introduces the Fokker–Planck equation and employs partial data obtained through sampling to construct and visualize the effective loss landscape within a two-dimensional parameter space. This visualization reveals how the global minima in the loss landscape can transform into saddle points in the effective loss landscape, as well as the implicit bias of the semi-gradient method. Additionally, we demonstrate that saddle points, originating from the global minima in loss landscape, still exist in the effective loss landscape under high-dimensional parameter spaces and neural network settings. This paper develop a novel approach for probing implicit bias in semi-gradient Q-learning.

[LG-51] Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences

链接: https://arxiv.org/abs/2406.08128
作者: Zicheng Liu,Siyuan Li,Li Wang,Zedong Wang,Yunfan Liu,Stan Z. Li
关键词: state space models, utilizes computation tricks, attention utilizes computation, linear attention, linear attention utilizes
类目: Machine Learning (cs.LG)
*备注: ICML 2024. arXiv admin note: text overlap with arXiv:2404.11163 ; text overlap with arXiv:2212.08136 by other authors

点击查看摘要

Abstract:To mitigate the computational complexity in the self-attention mechanism on long sequences, linear attention utilizes computation tricks to achieve linear complexity, while state space models (SSMs) popularize a favorable practice of using non-data-dependent memory pattern, i.e., emphasize the near and neglect the distant, to processing sequences. Recent studies have shown the priorities by combining them as one. However, the efficiency of linear attention remains only at the theoretical level in a causal setting, and SSMs require various designed constraints to operate effectively on specific data. Therefore, in order to unveil the true power of the hybrid design, the following two issues need to be addressed: (1) hardware-efficient implementation for linear attention and (2) stabilization of SSMs. To achieve this, we leverage the thought of tiling and hierarchy to propose CHELA (short-long Convolutions with Hardware-Efficient Linear Attention), which replaces SSMs with short-long convolutions and implements linear attention in a divide-and-conquer manner. This approach enjoys global abstraction and data-dependent selection from stable SSM and linear attention while maintaining real linear complexity. Our comprehensive experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.

[LG-52] Counterfactual-based Root Cause Analysis for Dynamical Systems

链接: https://arxiv.org/abs/2406.08106
作者: Juliane Weilbach,Sebastian Gerwinn,Karim Barsim,Martin Fränzle
关键词: numerous industrial applications, failing dynamic process, Identifying the underlying, fundamental challenge, industrial applications
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying the underlying reason for a failing dynamic process or otherwise anomalous observation is a fundamental challenge, yet has numerous industrial applications. Identifying the failure-causing sub-system using causal inference, one can ask the question: “Would the observed failure also occur, if we had replaced the behaviour of a sub-system at a certain point in time with its normal behaviour?” To this end, a formal description of behaviour of the full system is needed in which such counterfactual questions can be answered. However, existing causal methods for root cause identification are typically limited to static settings and focusing on additive external influences causing failures rather than structural influences. In this paper, we address these problems by modelling the dynamic causal system using a Residual Neural Network and deriving corresponding counterfactual distributions over trajectories. We show quantitatively that more root causes are identified when an intervention is performed on the structural equation and the external influence, compared to an intervention on the external influence only. By employing an efficient approximation to a corresponding Shapley value, we also obtain a ranking between the different subsystems at different points in time being responsible for an observed failure, which is applicable in settings with large number of variables. We illustrate the effectiveness of the proposed method on a benchmark dynamic system as well as on a real world river dataset.

[LG-53] Confidence Interval Estimation of Predictive Performance in the Context of AutoML

链接: https://arxiv.org/abs/2406.08099
作者: Konstantinos Paraschakis,Andrea Castellani,Giorgos Borboudakis,Ioannis Tsamardinos
关键词: supervised machine learning, machine learning analysis, analysis is required, predictive performance, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: Accepted at AutoML 2024 conference

点击查看摘要

Abstract:Any supervised machine learning analysis is required to provide an estimate of the out-of-sample predictive performance. However, it is imperative to also provide a quantification of the uncertainty of this performance in the form of a confidence or credible interval (CI) and not just a point estimate. In an AutoML setting, estimating the CI is challenging due to the ``winner’s curse", i.e., the bias of estimation due to cross-validating several machine learning pipelines and selecting the winning one. In this work, we perform a comparative evaluation of 9 state-of-the-art methods and variants in CI estimation in an AutoML setting on a corpus of real and simulated datasets. The methods are compared in terms of inclusion percentage (does a 95% CI include the true performance at least 95% of the time), CI tightness (tighter CIs are preferable as being more informative), and execution time. The evaluation is the first one that covers most, if not all, such methods and extends previous work to imbalanced and small-sample tasks. In addition, we present a variant, called BBC-F, of an existing method (the Bootstrap Bias Correction, or BBC) that maintains the statistical properties of the BBC but is more computationally efficient. The results support that BBC-F and BBC dominate the other methods in all metrics measured.

[LG-54] Inductive Global and Local Manifold Approximation and Projection

链接: https://arxiv.org/abs/2406.08097
作者: Jungeum Kim,Xiao Wang
关键词: Nonlinear dimensional reduction, Nonlinear dimensional, high-dimensional data analysis, proven its usefulness, wide range
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Nonlinear dimensional reduction with the manifold assumption, often called manifold learning, has proven its usefulness in a wide range of high-dimensional data analysis. The significant impact of t-SNE and UMAP has catalyzed intense research interest, seeking further innovations toward visualizing not only the local but also the global structure information of the data. Moreover, there have been consistent efforts toward generalizable dimensional reduction that handles unseen data. In this paper, we first propose GLoMAP, a novel manifold learning method for dimensional reduction and high-dimensional data visualization. GLoMAP preserves locally and globally meaningful distance estimates and displays a progression from global to local formation during the course of optimization. Furthermore, we extend GLoMAP to its inductive version, iGLoMAP, which utilizes a deep neural network to map data to its lower-dimensional representation. This allows iGLoMAP to provide lower-dimensional embeddings for unseen points without needing to re-train the algorithm. iGLoMAP is also well-suited for mini-batch learning, enabling large-scale, accelerated gradient calculations. We have successfully applied both GLoMAP and iGLoMAP to the simulated and real-data settings, with competitive experiments against the state-of-the-art methods.

[LG-55] Learnable Interpretable Model Combination in Dynamic Systems Modeling

链接: https://arxiv.org/abs/2406.08093
作者: Tobias Thummerer,Lars Mikelsons
关键词: every-day dynamic systems, dynamic systems modeling, systems modeling, intuitively in every-day, every-day dynamic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the core concepts in science, and something that happens intuitively in every-day dynamic systems modeling, is the combination of models or methods. Especially in dynamical systems modeling, often two or more structures are combined to obtain a more powerful or efficient architecture regarding a specific application (area). Further, even physical simulations are combined with machine learning architectures, to increase prediction accuracy or optimize the computational performance. In this work, we shortly discuss, which types of models are usually combined and propose a model interface that is capable of expressing a width variety of mixed algebraic, discrete and differential equation based models. Further, we examine different established, as well as new ways of combining these models from a system theoretical point of view and highlight two challenges - algebraic loops and local event affect functions in discontinuous models - that require a special approach. Finally, we propose a new wildcard topology, that is capable of describing the generic connection between two combined models in an easy to interpret fashion that can be learned as part of a gradient based optimization procedure. The contributions of this paper are highlighted at a proof of concept: Different connection topologies between two models are learned, interpreted and compared applying the proposed methodology and software implementation.

[LG-56] Balancing Molecular Information and Empirical Data in the Prediction of Physico-Chemical Properties

链接: https://arxiv.org/abs/2406.08075
作者: Johannes Zenn,Dominik Gond,Fabian Jirasek,Robert Bamler
关键词: task in thermodynamics, pure substances, central task, Predicting the physico-chemical, molecular descriptors
类目: Machine Learning (cs.LG)
*备注: 14 pages, including 10 pages of main text and 2 pages of appendix

点击查看摘要

Abstract:Predicting the physico-chemical properties of pure substances and mixtures is a central task in thermodynamics. Established prediction methods range from fully physics-based ab-initio calculations, which are only feasible for very simple systems, over descriptor-based methods that use some information on the molecules to be modeled together with fitted model parameters (e.g., quantitative-structure-property relationship methods or classical group contribution methods), to representation-learning methods, which may, in extreme cases, completely ignore molecular descriptors and extrapolate only from existing data on the property to be modeled (e.g., matrix completion methods). In this work, we propose a general method for combining molecular descriptors with representation learning using the so-called expectation maximization algorithm from the probabilistic machine learning literature, which uses uncertainty estimates to trade off between the two approaches. The proposed hybrid model exploits chemical structure information using graph neural networks, but it automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions that can better specialize to unusual cases. The effectiveness of the proposed method is demonstrated using the prediction of activity coefficients in binary mixtures as an example. The results are compelling, as the method significantly improves predictive accuracy over the current state of the art, showcasing its potential to advance the prediction of physico-chemical properties in general.

[LG-57] A Concept-Based Explainability Framework for Large Multimodal Models

链接: https://arxiv.org/abs/2406.08074
作者: Jayneel Parekh,Pegah Khayatan,Mustafa Shukor,Alasdair Newson,Matthieu Cord
关键词: large language models, combine unimodal encoders, Large multimodal models, large language, perform multimodal tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as “multi-modal concepts”. We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. We will publicly release our code.

[LG-58] CFG: Manifold-constrained Classifier Free Guidance for Diffusion Models

链接: https://arxiv.org/abs/2406.08070
作者: Hyungjin Chung,Jeongsol Kim,Geon Yeong Park,Hyelin Nam,Jong Chul Ye
关键词: CFG, Classifier-free guidance, modern diffusion models, fundamental tool, tool in modern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation. Although effective, CFG has notable drawbacks. For instance, DDIM with CFG lacks invertibility, complicating image editing; furthermore, high guidance scales, essential for high-quality outputs, frequently result in issues like mode collapse. Contrary to the widespread belief that these are inherent limitations of diffusion models, this paper reveals that the problems actually stem from the off-manifold phenomenon associated with CFG, rather than the diffusion models themselves. More specifically, inspired by the recent advancements of diffusion model-based inverse problem solvers (DIS), we reformulate text-guidance as an inverse problem with a text-conditioned score matching loss, and develop CFG++, a novel approach that tackles the off-manifold challenges inherent in traditional CFG. CFG++ features a surprisingly simple fix to CFG, yet it offers significant improvements, including better sample quality for text-to-image generation, invertibility, smaller guidance scales, reduced mode collapse, etc. Furthermore, CFG++ enables seamless interpolation between unconditional and conditional sampling at lower guidance scales, consistently outperforming traditional CFG at all scales. Experimental results confirm that our method significantly enhances performance in text-to-image generation, DDIM inversion, editing, and solving inverse problems, suggesting a wide-ranging impact and potential applications in various fields that utilize text guidance. Project Page: this https URL.

[LG-59] Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2406.08069
作者: Max Weltevrede,Felix Kaubek,Matthijs T.J. Spaan,Wendelin Böhmer
关键词: encounter once deployed, remaining challenges, develop agents, agent, generalise
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the remaining challenges in reinforcement learning is to develop agents that can generalise to novel scenarios they might encounter once deployed. This challenge is often framed in a multi-task setting where agents train on a fixed set of tasks and have to generalise to new tasks. Recent work has shown that in this setting increased exploration during training can be leveraged to increase the generalisation performance of the agent. This makes sense when the states encountered during testing can actually be explored during training. In this paper, we provide intuition why exploration can also benefit generalisation to states that cannot be explicitly encountered during training. Additionally, we propose a novel method Explore-Go that exploits this intuition by increasing the number of states on which the agent trains. Explore-Go effectively increases the starting state distribution of the agent and as a result can be used in conjunction with most existing on-policy or off-policy reinforcement learning algorithms. We show empirically that our method can increase generalisation performance in an illustrative environment and on the Procgen benchmark.

[LG-60] Adversarial Evasion Attack Efficiency against Large Language Models

链接: https://arxiv.org/abs/2406.08050
作者: João Vitorino,Eva Maia,Isabel Praça
关键词: Large Language Models, Large Language, Language Models, Large, Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 9 pages, 1 table, 2 figures, DCAI 2024 conference

点击查看摘要

Abstract:Large Language Models (LLMs) are valuable for text classification, but their vulnerabilities must not be disregarded. They lack robustness against adversarial examples, so it is pertinent to understand the impacts of different types of perturbations, and assess if those attacks could be replicated by common users with a small amount of perturbations and a small number of queries to a deployed LLM. This work presents an analysis of the effectiveness, efficiency, and practicality of three different types of adversarial attacks against five different LLMs in a sentiment classification task. The obtained results demonstrated the very distinct impacts of the word-level and character-level attacks. The word attacks were more effective, but the character and more constrained attacks were more practical and required a reduced number of perturbations and queries. These differences need to be considered during the development of adversarial defense strategies to train more robust LLMs for intelligent text classification applications.

[LG-61] A novel approach to graph distinction through GENEOs and permutants

链接: https://arxiv.org/abs/2406.08045
作者: Giovanni Bocchi,Massimo Ferri,Patrizio Frosini
关键词: Group Equivariant Non-Expansive, Topological Data Analysis, Equivariant Non-Expansive Operators, Group Equivariant, theory of Group
类目: Machine Learning (cs.LG); Group Theory (math.GR)
*备注:

点击查看摘要

Abstract:The theory of Group Equivariant Non-Expansive Operators (GENEOs) was initially developed in Topological Data Analysis for the geometric approximation of data observers, including their invariances and symmetries. This paper departs from that line of research and explores the use of GENEOs for distinguishing r -regular graphs up to isomorphisms. In doing so, we aim to test the capabilities and flexibility of these operators. Our experiments show that GENEOs offer a good compromise between efficiency and computational cost in comparing r -regular graphs, while their actions on data are easily interpretable. This supports the idea that GENEOs could be a general-purpose approach to discriminative problems in Machine Learning when some structural information about data and observers is explicitly given.

[LG-62] Efficient Network Traffic Feature Sets for IoT Intrusion Detection

链接: https://arxiv.org/abs/2406.08042
作者: Miguel Silva,João Vitorino,Eva Maia,Isabel Praça
关键词: Machine Learning, cybersecurity solutions requires, solutions requires high-quality, requires high-quality data, Recursive Feature Elimination
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 10 pages, 9 tables, DCAI 2024 conference

点击查看摘要

Abstract:The use of Machine Learning (ML) models in cybersecurity solutions requires high-quality data that is stripped of redundant, missing, and noisy information. By selecting the most relevant features, data integrity and model efficiency can be significantly improved. This work evaluates the feature sets provided by a combination of different feature selection methods, namely Information Gain, Chi-Squared Test, Recursive Feature Elimination, Mean Absolute Deviation, and Dispersion Ratio, in multiple IoT network datasets. The influence of the smaller feature sets on both the classification performance and the training time of ML models is compared, with the aim of increasing the computational efficiency of IoT intrusion detection. Overall, the most impactful features of each dataset were identified, and the ML models obtained higher computational efficiency while preserving a good generalization, showing little to no difference between the sets.

[LG-63] Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning

链接: https://arxiv.org/abs/2406.08039
作者: Dariush Wahdany,Matthew Jagielski,Adam Dziedzic,Franziska Boenisch
关键词: leak private information, Machine learning, private, shown to leak, Machine
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Submitted to NeurIPS 2024

点击查看摘要

Abstract:Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differential private stochastic gradient descent algorithm (DP-SGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still usually struggle in the high privacy ( \varepsilon\le1) and low data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of pure DP. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalancedness regimes demonstrate DPPL’s high performance under strong privacy guarantees in challenging private learning setups.

[LG-64] A Self-boosted Framework for Calibrated Ranking

链接: https://arxiv.org/abs/2406.08010
作者: Shunyu Zhang,Hu Liu,Wentian Bao,Enyun Yu,Yang Song
关键词: Scale-calibrated ranking systems, real-world applications nowadays, pursue accurate ranking, accurate ranking quality, Scale-calibrated ranking
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: KDD 2024

点击查看摘要

Abstract:Scale-calibrated ranking systems are ubiquitous in real-world applications nowadays, which pursue accurate ranking quality and calibrated probabilistic predictions simultaneously. For instance, in the advertising ranking system, the predicted click-through rate (CTR) is utilized for ranking and required to be calibrated for the downstream cost-per-click ads bidding. Recently, multi-objective based methods have been wildly adopted as a standard approach for Calibrated Ranking, which incorporates the combination of two loss functions: a pointwise loss that focuses on calibrated absolute values and a ranking loss that emphasizes relative orderings. However, when applied to industrial online applications, existing multi-objective CR approaches still suffer from two crucial limitations. First, previous methods need to aggregate the full candidate list within a single mini-batch to compute the ranking loss. Such aggregation strategy violates extensive data shuffling which has long been proven beneficial for preventing overfitting, and thus degrades the training effectiveness. Second, existing multi-objective methods apply the two inherently conflicting loss functions on a single probabilistic prediction, which results in a sub-optimal trade-off between calibration and ranking. To tackle the two limitations, we propose a Self-Boosted framework for Calibrated Ranking (SBCR).

[LG-65] Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

链接: https://arxiv.org/abs/2406.08001
作者: Jiaxin Deng,Junbiao Pang,Baochang Zhang
关键词: Sharpness-Aware Minimization, Asymptotic Unbiased Sampling, SAM, effectively reducing, Minimization
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) has emerged as a promising approach for effectively reducing the generalization error. However, SAM incurs twice the computational cost compared to base optimizer (e.g., SGD). We propose Asymptotic Unbiased Sampling with respect to iterations to accelerate SAM (AUSAM), which maintains the model’s generalization capacity while significantly enhancing computational efficiency. Concretely, we probabilistically sample a subset of data points beneficial for SAM optimization based on a theoretically guaranteed criterion, i.e., the Gradient Norm of each Sample (GNS). We further approximate the GNS by the difference in loss values before and after perturbation in SAM. As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across a range of tasks and networks, i.e., classification, human pose estimation and network quantization. On CIFAR10/100 and Tiny-ImageNet, AUSAM achieves results comparable to SAM while providing a speedup of over 70%. Compared to recent dynamic data pruning methods, AUSAM is better suited for SAM and excels in maintaining performance. Additionally, AUSAM accelerates optimization in human pose estimation and model quantization without sacrificing performance, demonstrating its broad practicality.

[LG-66] A Federated Online Restless Bandit Framework for Cooperative Resource Allocation

链接: https://arxiv.org/abs/2406.07992
作者: Jingwen Tong,Xinran Li,Liqun Fu,Jun Zhang,Khaled B. Letaief
关键词: Restless multi-armed bandits, Markov reward processes, Restless multi-armed, online RMAB problem, RMAB problem
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework to mitigate the communication overhead and data privacy issue by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys a high communication and computation efficiency, and a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm on the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of \mathcalO(\sqrtT\log(T)) and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.

[LG-67] Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis

链接: https://arxiv.org/abs/2406.07991
作者: Paolo Bonetti,Alberto Maria Metelli,Marcello Restelli
关键词: leverage shared knowledge, machine learning paradigm, learning paradigm designed, powerful machine learning, Multi-task learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance. Previous works have proposed approaches to MTL that can be divided into feature learning, focused on the identification of a common feature representation, and task clustering, where similar tasks are grouped together. In this paper, we propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features. First, we propose a bias-variance analysis for regression models with additive Gaussian noise, where we provide a general expression of the asymptotic bias and variance of a task, considering a linear regression trained on aggregated input features and an aggregated target. Then, we exploit this analysis to provide a two-phase MTL algorithm (NonLinCTFA). Firstly, this method partitions the tasks into clusters and aggregates each obtained group of targets with their mean. Then, for each aggregated task, it aggregates subsets of features with their mean in a dimensionality reduction fashion. In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is further motivated by applications to Earth science. Finally, we validate the algorithms on synthetic data, showing the effect of different parameters and real-world datasets, exploring the validity of the proposed methodology on classical datasets, recent baselines, and Earth science applications.

[LG-68] Blowfish: Topological and statistical signatures for quantifying ambiguity in semantic search

链接: https://arxiv.org/abs/2406.07990
作者: Thomas Roland Barillot,Alex De Castro
关键词: Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, works reports evidence, search and Retrieval
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This works reports evidence for the topological signatures of ambiguity in sentence embeddings that could be leveraged for ranking and/or explanation purposes in the context of vector search and Retrieval Augmented Generation (RAG) systems. We proposed a working definition of ambiguity and designed an experiment where we have broken down a proprietary dataset into collections of chunks of varying size - 3, 5, and 10 lines and used the different collections successively as queries and answers sets. It allowed us to test the signatures of ambiguity with removal of confounding factors. Our results show that proxy ambiguous queries (size 10 queries against size 3 documents) display different distributions of homologies 0 and 1 based features than proxy clear queries (size 5 queries against size 10 documents). We then discuss those results in terms increased manifold complexity and/or approximately discontinuous embedding submanifolds. Finally we propose a strategy to leverage those findings as a new scoring strategy of semantic similarities.

[LG-69] Meta-Learning Neural Procedural Biases

链接: https://arxiv.org/abs/2406.07983
作者: Christian Raymond,Qi Chen,Bing Xue,Mengjie Zhan
关键词: achieve high performance, learning, generalize and achieve, achieve high, limited number
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The goal of few-shot learning is to generalize and achieve high performance on new unseen learning tasks, where each task has only a limited number of examples available. Gradient-based meta-learning attempts to address this challenging task by learning how to learn new tasks by embedding inductive biases informed by prior learning experiences into the components of the learning algorithm. In this work, we build upon prior research and propose Neural Procedural Bias Meta-Learning (NPBML), a novel framework designed to meta-learn task-adaptive procedural biases. Our approach aims to consolidate recent advancements in meta-learned initializations, optimizers, and loss functions by learning them simultaneously and making them adapt to each individual task to maximize the strength of the learned inductive biases. This imbues each learning task with a unique set of procedural biases which is specifically designed and selected to attain strong learning performance in only a few gradient steps. The experimental results show that by meta-learning the procedural biases of a neural network, we can induce strong inductive biases towards a distribution of learning tasks, enabling robust learning performance across many well-established few-shot learning benchmarks.

[LG-70] Reinforcement Learning for High-Level Strategic Control in Tower Defense Games

链接: https://arxiv.org/abs/2406.07980
作者: Joakim Bergdahl,Alessandro Sestini,Linus Gisslén
关键词: important aspects, design is maintaining, maintaining a sense, sense of challenge, strategy games
类目: Machine Learning (cs.LG)
*备注: Published at CoG 2024

点击查看摘要

Abstract:In strategy games, one of the most important aspects of game design is maintaining a sense of challenge for players. Many mobile titles feature quick gameplay loops that allow players to progress steadily, requiring an abundance of levels and puzzles to prevent them from reaching the end too quickly. As with any content creation, testing and validation are essential to ensure engaging gameplay mechanics, enjoyable game assets, and playable levels. In this paper, we propose an automated approach that can be leveraged for gameplay testing and validation that combines traditional scripted methods with reinforcement learning, reaping the benefits of both approaches while adapting to new situations similarly to how a human player would. We test our solution on a popular tower defense game, Plants vs. Zombies. The results show that combining a learned approach, such as reinforcement learning, with a scripted AI produces a higher-performing and more robust agent than using only heuristic AI, achieving a 57.12% success rate compared to 47.95% in a set of 40 levels. Moreover, the results demonstrate the difficulty of training a general agent for this type of puzzle-like game.

[LG-71] Heuristic Learning with Graph Neural Networks: A Unified Framework for Link Prediction

链接: https://arxiv.org/abs/2406.07979
作者: Juzhen Zhang,Lanning Wei,Zhen Xu,Quanming Yao
关键词: inherently shaped, Link prediction, Learning Graph Neural, Graph Neural Network, fundamental task
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Link prediction is a fundamental task in graph learning, inherently shaped by the topology of the graph. While traditional heuristics are grounded in graph topology, they encounter challenges in generalizing across diverse graphs. Recent research efforts have aimed to leverage the potential of heuristics, yet a unified formulation accommodating both local and global heuristics remains undiscovered. Drawing insights from the fact that both local and global heuristics can be represented by adjacency matrix multiplications, we propose a unified matrix formulation to accommodate and generalize various heuristics. We further propose the Heuristic Learning Graph Neural Network (HL-GNN) to efficiently implement the formulation. HL-GNN adopts intra-layer propagation and inter-layer connections, allowing it to reach a depth of around 20 layers with lower time complexity than GCN. HL-GNN is proven to be more expressive than heuristics and conventional GNNs, and it can adaptively trade-off between node features and topological information. Extensive experiments on the Planetoid, Amazon, and OGB datasets underscore the effectiveness and efficiency of HL-GNN. It outperforms existing methods by a large margin in prediction performance. Additionally, HL-GNN is several orders of magnitude faster than heuristic-inspired methods while requiring only a few trainable parameters. The case study further demonstrates that the generalized heuristics and learned weights are highly interpretable.

[LG-72] It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

链接: https://arxiv.org/abs/2406.07971
作者: Taiming Lu,Lingfeng Shen,Xinyu Yang,Weiting Tan,Beidi Chen,Huaxiu Yao
关键词: Reinforcement Learning, involves training policy, Human Feedback, align language models, align language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and model augmentation. Our experiments demonstrate that (1) using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2) SEAM-guided model augmentation results in a 4% performance improvement over standard augmentation methods.

[LG-73] Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling

链接: https://arxiv.org/abs/2406.07967
作者: Jie Ruan,Xiao Pu,Mingqi Gao,Xiaojun Wan,Yuesheng Zhu
关键词: expensive and time-consuming, Human evaluation, Constrained Active Sampling, Active Sampling Framework, reliable evaluation method
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: With Appendix

点击查看摘要

Abstract:Human evaluation is viewed as a reliable evaluation method for NLG which is expensive and time-consuming. To save labor and costs, researchers usually perform human evaluation on a small subset of data sampled from the whole dataset in practice. However, different selection subsets will lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples for getting a more correct inter-system ranking.Experiment results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate CASF receives 93.18% top-ranked system recognition accuracy and ranks first or ranks second on 90.91% of the human metrics with 0.83 overall inter-system ranking Kendall correlation.Code and data are publicly available online.

[LG-74] How Interpretable Are Interpretable Graph Neural Networks?

链接: https://arxiv.org/abs/2406.07955
作者: Yongqiang Chen,Yatao Bian,Bo Han,James Cheng
关键词: involving graph-structured data, scientific applications involving, applications involving graph-structured, graph neural networks, Interpretable graph neural
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML2024, 44 pages, 21 figures, 12 tables

点击查看摘要

Abstract:Interpretable graph neural networks (XGNNs ) are widely adopted in various scientific applications involving graph-structured data. Existing XGNNs predominantly adopt the attention-based mechanism to learn edge or node importance for extracting and making predictions with the interpretable subgraph. However, the representational properties and limitations of these methods remain inadequately explored. In this work, we present a theoretical framework that formulates interpretable subgraph learning with the multilinear extension of the subgraph distribution, coined as subgraph multilinear extension (SubMT). Extracting the desired interpretable subgraph requires an accurate approximation of SubMT, yet we find that the existing XGNNs can have a huge gap in fitting SubMT. Consequently, the SubMT approximation failure will lead to the degenerated interpretability of the extracted subgraphs. To mitigate the issue, we design a new XGNN architecture called Graph Multilinear neT (GMT), which is provably more powerful in approximating SubMT. We empirically validate our theoretical findings on a number of graph classification benchmarks. The results demonstrate that GMT outperforms the state-of-the-art up to 10% in terms of both interpretability and generalizability across 12 regular and geometric graph benchmarks.

[LG-75] DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report)

链接: https://arxiv.org/abs/2406.07953
作者: Yiping Wang,Yanhao Wang,Cen Chen
关键词: computation captures scenarios, sliding window model, sliding window, computation captures, captures scenarios
类目: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted for publication at KDD 2024

点击查看摘要

Abstract:The sliding window model of computation captures scenarios in which data are continually arriving in the form of a stream, and only the most recent w items are used for analysis. In this setting, an algorithm needs to accurately track some desired statistics over the sliding window using a small space. When data streams contain sensitive information about individuals, the algorithm is also urgently needed to provide a provable guarantee of privacy. In this paper, we focus on the two fundamental problems of privately (1) estimating the frequency of an arbitrary item and (2) identifying the most frequent items (i.e., \emphheavy hitters), in the sliding window model. We propose \textscDPSW-Sketch, a sliding window framework based on the count-min sketch that not only satisfies differential privacy over the stream but also approximates the results for frequency and heavy-hitter queries within bounded errors in sublinear time and space w.r.t.~ w . Extensive experiments on five real-world and synthetic datasets show that \textscDPSW-Sketch provides significantly better utility-privacy trade-offs than state-of-the-art methods.

[LG-76] Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation

链接: https://arxiv.org/abs/2406.07935
作者: Jie Ruan,Wenqing Wang,Xiaojun Wan
关键词: Natural Language Generation, Language Generation, quality of Natural, Natural Language, Human evaluation
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human evaluation serves as the gold standard for assessing the quality of Natural Language Generation (NLG) systems. Nevertheless, the evaluation guideline, as a pivotal element ensuring reliable and reproducible human assessment, has received limited attention.Our investigation revealed that only 29.84% of recent papers involving human evaluation at top conferences release their evaluation guidelines, with vulnerabilities identified in 77.09% of these guidelines. Unreliable evaluation guidelines can yield inaccurate assessment outcomes, potentially impeding the advancement of NLG in the right direction. To address these challenges, we take an initial step towards reliable evaluation guidelines and propose the first human evaluation guideline dataset by collecting annotations of guidelines extracted from existing papers as well as generated via Large Language Models (LLMs). We then introduce a taxonomy of eight vulnerabilities and formulate a principle for composing evaluation guidelines. Furthermore, a method for detecting guideline vulnerabilities has been explored using LLMs, and we offer a set of recommendations to enhance reliability in human evaluation. The annotated human evaluation guideline dataset and code for the vulnerability detection method are publicly available online.

[LG-77] Large Language Model Unlearning via Embedding-Corrupted Prompts

链接: https://arxiv.org/abs/2406.07933
作者: Chris Yuhao Liu,Yaxuan Wang,Jeffrey Flanigan,Yang Liu
关键词: Large language models, Large language, language models, advanced to encompass, encompass extensive knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 55 pages, 4 figures, 66 tables

点击查看摘要

Abstract:Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a large language model should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging due to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and the large computational requirements for optimization across state-of-the-art models with hundreds of billions of parameters. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models to address both the challenges of knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. We learn corruptions added to prompt embeddings via zeroth order optimization toward the unlearning objective offline and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output from a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving promising unlearning at nearly zero side effects in general domains and domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases.

[LG-78] A Generic Layer Pruning Method for Signal Modulation Recognition Deep Learning Models

链接: https://arxiv.org/abs/2406.07929
作者: Yao Lu,Yutao Zhu,Yuqi Li,Dongwei Xu,Yun Lin,Qi Xuan,Xiaoniu Yang
关键词: deep neural networks, signal classification, deep learning, deep neural, successful application
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the successful application of deep learning in communications systems, deep neural networks are becoming the preferred method for signal classification. Although these models yield impressive results, they often come with high computational complexity and large model sizes, which hinders their practical deployment in communication systems. To address this challenge, we propose a novel layer pruning method. Specifically, we decompose the model into several consecutive blocks, each containing consecutive layers with similar semantics. Then, we identify layers that need to be preserved within each block based on their contribution. Finally, we reassemble the pruned blocks and fine-tune the compact model. Extensive experiments on five datasets demonstrate the efficiency and effectiveness of our method over a variety of state-of-the-art baselines, including layer pruning and channel pruning methods.

[LG-79] Efficient Neural Common Neighbor for Temporal Graph Link Prediction

链接: https://arxiv.org/abs/2406.07926
作者: Xiaohui Zhang,Yanbo Wang,Xiyuan Wang,Muhan Zhang
关键词: trade and transportation, Temporal, social network, Temporal Graph Benchmark, Temporal graphs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Temporal graphs are ubiquitous in real-world scenarios, such as social network, trade and transportation. Predicting dynamic links between nodes in a temporal graph is of vital importance. Traditional methods usually leverage the temporal neighborhood of interaction history to generate node embeddings first and then aggregate the source and target node embeddings to predict the link. However, such methods focus on learning individual node representations, but overlook the pairwise representation learning nature of link prediction and fail to capture the important pairwise features of links such as common neighbors (CN). Motivated by the success of Neural Common Neighbor (NCN) for static graph link prediction, we propose TNCN, a temporal version of NCN for link prediction in temporal graphs. TNCN dynamically updates a temporal neighbor dictionary for each node, and utilizes multi-hop common neighbors between the source and target node to learn a more effective pairwise representation. We validate our model on five large-scale real-world datasets from the Temporal Graph Benchmark (TGB), and find that it achieves new state-of-the-art performance on three of them. Additionally, TNCN demonstrates excellent scalability on large datasets, outperforming popular GNN baselines by up to 6.4 times in speed. Our code is available at https: //github.com/GraphPKU/TNCN.

[LG-80] Near-Optimal Learning and Planning in Separated Latent MDPs

链接: https://arxiv.org/abs/2406.07920
作者: Fan Chen,Constantinos Daskalakis,Noah Golowich,Alexander Rakhlin
关键词: Markov Decision Processes, Latent Markov Decision, learning Latent Markov, Decision Processes, Latent Markov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: COLT 2024

点击查看摘要

Abstract:We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs). In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs. To sidestep known impossibility results, we consider several notions of separation of the constituent MDPs. The main thrust of this paper is in establishing a nearly-sharp statistical threshold for the horizon length necessary for efficient learning. On the computational side, we show that under a weaker assumption of separability under the optimal policy, there is a quasi-polynomial algorithm with time complexity scaling in terms of the statistical threshold. We further show a near-matching time complexity lower bound under the exponential time hypothesis.

[LG-81] Graph Transductive Defense: a Two-Stage Defense for Graph Membership Inference Attacks

链接: https://arxiv.org/abs/2406.07917
作者: Peizhi Niu,Chao Pan,Siheng Chen,Olgica Milenkovic
关键词: diverse real-world applications, medical data analysis, offering powerful graph, Graph neural networks, real-world applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have become instrumental in diverse real-world applications, offering powerful graph learning capabilities for tasks such as social networks and medical data analysis. Despite their successes, GNNs are vulnerable to adversarial attacks, including membership inference attacks (MIA), which threaten privacy by identifying whether a record was part of the model’s training data. While existing research has explored MIA in GNNs under graph inductive learning settings, the more common and challenging graph transductive learning setting remains understudied in this context. This paper addresses this gap and proposes an effective two-stage defense, Graph Transductive Defense (GTD), tailored to graph transductive learning characteristics. The gist of our approach is a combination of a train-test alternate training schedule and flattening strategy, which successfully reduces the difference between the training and testing loss distributions. Extensive empirical results demonstrate the superior performance of our method (a decrease in attack AUROC by 9.42% and an increase in utility performance by 18.08% on average compared to LBP), highlighting its potential for seamless integration into various classification models with minimal overhead.

[LG-82] Ablation Based Counterfactuals

链接: https://arxiv.org/abs/2406.07908
作者: Zheng Dai,David K Gifford
关键词: generate high-quality samples, Diffusion models, class of generative, generate high-quality, difficult to characterize
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 11 pages, 7 figures, appendix included

点击查看摘要

Abstract:Diffusion models are a class of generative models that generate high-quality samples, but at present it is difficult to characterize how they depend upon their training data. This difficulty raises scientific and regulatory questions, and is a consequence of the complexity of diffusion models and their sampling process. To analyze this dependence, we introduce Ablation Based Counterfactuals (ABC), a method of performing counterfactual analysis that relies on model ablation rather than model retraining. In our approach, we train independent components of a model on different but overlapping splits of a training set. These components are then combined into a single model, from which the causal influence of any training sample can be removed by ablating a combination of model components. We demonstrate how we can construct a model like this using an ensemble of diffusion models. We then use this model to study the limits of training data attribution by enumerating full counterfactual landscapes, and show that single source attributability diminishes with increasing training data size. Finally, we demonstrate the existence of unattributable samples.

[LG-83] Grounding Multimodal Large Language Models in Actions

链接: https://arxiv.org/abs/2406.07904
作者: Andrew Szot,Bogdan Mazoure,Harsh Agrawal,Devon Hjelm,Zsolt Kira,Alexander Toshev
关键词: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adaptors. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.

[LG-84] When Do Skills Help Reinforcement Learning? A Theoretical Analysis of Temporal Abstractions

链接: https://arxiv.org/abs/2406.07897
作者: Zhening Li,Gabriel Poesia,Armando Solar-Lezama
关键词: improve reinforcement learning, temporal abstractions, intended to improve, improve reinforcement, Skills
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 29 pages, 1 figure. Accepted to ICML 2024

点击查看摘要

Abstract:Skills are temporal abstractions that are intended to improve reinforcement learning (RL) performance through hierarchical RL. Despite our intuition about the properties of an environment that make skills useful, a precise characterization has been absent. We provide the first such characterization, focusing on the utility of deterministic skills in deterministic sparse-reward environments with finite action spaces. We show theoretically and empirically that RL performance gain from skills is worse in environments where solutions to states are less compressible. Additional theoretical results suggest that skills benefit exploration more than they benefit learning from existing experience, and that using unexpressive skills such as macroactions may worsen RL performance. We hope our findings can guide research on automatic skill discovery and help RL practitioners better decide when and how to use skills.

[LG-85] Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP

链接: https://arxiv.org/abs/2406.07892
作者: Tejaram Sangadi,L. A. Prashanth,Krishna Jagannathan
关键词: Markov decision process, reward Markov decision, discounted reward Markov, reinforcement learning scenarios, risk-sensitive reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Motivated by risk-sensitive reinforcement learning scenarios, we consider the problem of policy evaluation for variance in a discounted reward Markov decision process (MDP). For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature, though only asymptotic guarantees are available for this algorithm. We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed with/without regularization. Our bounds exhibit exponential decay for the initial error, while the overall bound is O(1/t) , where t is the number of update iterations of the TD algorithm. Further, the bound for the regularized TD variant is for a universal step size. Our bounds open avenues for analysis of actor-critic algorithms for mean-variance optimization in a discounted MDP.

[LG-86] An Empirical Study of Mamba-based Language Models

链接: https://arxiv.org/abs/2406.07887
作者: Roger Waleffe,Wonmin Byeon,Duncan Riach,Brandon Norick,Vijay Korthikanti,Tri Dao,Albert Gu,Ali Hatamizadeh,Sudhakar Singh,Deepak Narayanan,Garvit Kulshreshtha,Vartika Singh,Jared Casper,Jan Kautz,Mohammad Shoeybi,Bryan Catanzaro
关键词: Selective state-space models, quadratic computational complexity, large inference-time memory, inference-time memory requirements, Selective state-space
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA’s Megatron-LM project.

[LG-87] GENIU: A Restricted Data Access Unlearning for Imbalanced Data

链接: https://arxiv.org/abs/2406.07885
作者: Chenhao Zhang,Shaofei Shen,Yawen Zhao,Weitong Tony Chen,Miao Xu
关键词: data, unlearning, Class unlearning, restricted data access, Class
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the increasing emphasis on data privacy, the significance of machine unlearning has grown substantially. Class unlearning, which involves enabling a trained model to forget data belonging to a specific class learned before, is important as classification tasks account for the majority of today’s machine learning as a service (MLaaS). Retraining the model on the original data, excluding the data to be forgotten (a.k.a forgetting data), is a common approach to class unlearning. However, the availability of original data during the unlearning phase is not always guaranteed, leading to the exploration of class unlearning with restricted data access. While current unlearning methods with restricted data access usually generate proxy sample via the trained neural network classifier, they typically focus on training and forgetting balanced data. However, the imbalanced original data can cause trouble for these proxies and unlearning, particularly when the forgetting data consists predominantly of the majority class. To address this issue, we propose the GENerative Imbalanced Unlearning (GENIU) framework. GENIU utilizes a Variational Autoencoder (VAE) to concurrently train a proxy generator alongside the original model. These generated proxies accurately represent each class and are leveraged in the unlearning phase, eliminating the reliance on the original training data. To further mitigate the performance degradation resulting from forgetting the majority class, we introduce an in-batch tuning strategy that works with the generated proxies. GENIU is the first practical framework for class unlearning in imbalanced data settings and restricted data access, ensuring the preservation of essential information for future unlearning. Experimental results confirm the superiority of GENIU over existing methods, establishing its effectiveness in empirical scenarios.

[LG-88] KernelWarehouse: Rethinking the Design of Dynamic Convolution

链接: https://arxiv.org/abs/2406.07879
作者: Chao Li,Anbang Yao
关键词: demonstrating superior performance, static kernels weighted, Dynamic convolution learns, Dynamic convolution, demonstrating superior
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is accepted to ICML 2024. The project page: this https URL . arXiv admin note: substantial text overlap with arXiv:2308.08361

点击查看摘要

Abstract:Dynamic convolution learns a linear mixture of n static kernels weighted with their input-dependent attentions, demonstrating superior performance than normal convolution. However, it increases the number of convolutional parameters by n times, and thus is not parameter efficient. This leads to no research progress that can allow researchers to explore the setting n100 (an order of magnitude larger than the typical setting n10) for pushing forward the performance boundary of dynamic convolution while enjoying parameter efficiency. To fill this gap, in this paper, we propose KernelWarehouse, a more general form of dynamic convolution, which redefines the basic concepts of kernels", assembling kernels" and ``attention function" through the lens of exploiting convolutional parameter dependencies within the same layer and across neighboring layers of a ConvNet. We testify the effectiveness of KernelWarehouse on ImageNet and MS-COCO datasets using various ConvNet architectures. Intriguingly, KernelWarehouse is also applicable to Vision Transformers, and it can even reduce the model size of a backbone while improving the model accuracy. For instance, KernelWarehouse (n=4) achieves 5.61%|3.90%|4.38% absolute top-1 accuracy gain on the ResNet18|MobileNetV2|DeiT-Tiny backbone, and KernelWarehouse (n=1/4) with 65.10% model size reduction still achieves 2.29% gain on the ResNet18 backbone. The code and models are available at this https URL.

[LG-89] Hierarchical Reinforcement Learning for Swarm Confrontation with High Uncertainty

链接: https://arxiv.org/abs/2406.07877
作者: Qizhen Wu,Kexin Liu,Lei Chen,Jinhu Lv
关键词: key scenario, pursuit-evasion game, swarm robotics, hybrid decision process, hybrid process
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In swarm robotics, confrontation including the pursuit-evasion game is a key scenario. High uncertainty caused by unknown opponents’ strategies and dynamic obstacles complicates the action space into a hybrid decision process. Although the deep reinforcement learning method is significant for swarm confrontation since it can handle various sizes, as an end-to-end implementation, it cannot deal with the hybrid process. Here, we propose a novel hierarchical reinforcement learning approach consisting of a target allocation layer, a path planning layer, and the underlying dynamic interaction mechanism between the two layers, which indicates the quantified uncertainty. It decouples the hybrid process into discrete allocation and continuous planning layers, with a probabilistic ensemble model to quantify the uncertainty and regulate the interaction frequency adaptively. Furthermore, to overcome the unstable training process introduced by the two layers, we design an integration training method including pre-training and cross-training, which enhances the training efficiency and stability. Experiment results in both comparison and ablation studies validate the effectiveness and generalization performance of our proposed approach.

[LG-90] Small Scale Data-Free Knowledge Distillation

链接: https://arxiv.org/abs/2406.07876
作者: He Liu,Yikai Wang,Huaping Liu,Fuchun Sun,Anbang Yao
关键词: Data-free knowledge distillation, smaller student network, knowledge distillation, large teacher network, Data-free knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work is accepted to CVPR 2024. The project page: this https URL

点击查看摘要

Abstract:Data-free knowledge distillation is able to utilize the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data, avoiding privacy, security, and proprietary risks in real applications. In this line of research, existing methods typically follow an inversion-and-distillation paradigm in which a generative adversarial network on-the-fly trained with the guidance of the pre-trained teacher network is used to synthesize a large-scale sample set for knowledge distillation. In this paper, we reexamine this common data-free knowledge distillation paradigm, showing that there is considerable room to improve the overall training efficiency through a lens of ``small-scale inverted data for knowledge distillation". In light of three empirical observations indicating the importance of how to balance class distributions in terms of synthetic sample diversity and difficulty during both data inversion and distillation processes, we propose Small Scale Data-free Knowledge Distillation SSD-KD. In formulation, SSD-KD introduces a modulating function to balance synthetic samples and a priority sampling function to select proper samples, facilitated by a dynamic replay buffer and a reinforcement learning strategy. As a result, SSD-KD can perform distillation training conditioned on an extremely small scale of synthetic samples (e.g., 10X less than the original training data scale), making the overall training efficiency one or two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance, as demonstrated on popular image classification and semantic segmentation benchmarks. The code is available at this https URL.

[LG-91] Carbon Market Simulation with Adaptive Mechanism Design

链接: https://arxiv.org/abs/2406.07875
作者: Han Wang,Wenhao Li,Hongyuan Zha,Baoxiang Wang
关键词: Toggle, reducing carbon emissions, carbon, tackle climate change, Code
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:A carbon market is a market-based tool that incentivizes economic agents to align individual profits with the global utility, i.e., reducing carbon emissions to tackle climate change. \textitCap and trade stands as a critical principle based on allocating and trading carbon allowances (carbon emission credit), enabling economic agents to follow planned emissions and penalizing excess emissions. A central authority is responsible for introducing and allocating those allowances in cap and trade. However, the complexity of carbon market dynamics makes accurate simulation intractable, which in turn hinders the design of effective allocation strategies. To address this, we propose an adaptive mechanism design framework, simulating the market using hierarchical, model-free multi-agent reinforcement learning (MARL). Government agents allocate carbon credits, while enterprises engage in economic activities and carbon trading. This framework illustrates agents’ behavior comprehensively. Numerical results show MARL enables government agents to balance productivity, equality, and carbon emissions. Our project is available at \urlthis https URL. Comments: 10 pages, 4 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2406.07875 [cs.LG] (or arXiv:2406.07875v1 [cs.LG] for this version) Submission history From: Han Wang [view email] [v1] Wed, 12 Jun 2024 05:08:51 UTC (1,504 KB) Full-text links: Access Paper: View a PDF of the paper titled Carbon Market Simulation with Adaptive Mechanism Design, by Han Wang and 3 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.LG prev | next new | recent | 2024-06 Change to browse by: cs cs.AI cs.MA References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[LG-92] Asymptotically Optimal Regret for Black-Box Predict-then-Optimize

链接: https://arxiv.org/abs/2406.07866
作者: Samuel Tan,Peter I. Frazier
关键词: make future binary, future binary decisions, make future, future binary, model predicted reward
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 15 pages, 2 figures, 3 tables

点击查看摘要

Abstract:We consider the predict-then-optimize paradigm for decision-making in which a practitioner (1) trains a supervised learning model on historical data of decisions, contexts, and rewards, and then (2) uses the resulting model to make future binary decisions for new contexts by finding the decision that maximizes the model’s predicted reward. This approach is common in industry. Past analysis assumes that rewards are observed for all actions for all historical contexts, which is possible only in problems with special structure. Motivated by problems from ads targeting and recommender systems, we study new black-box predict-then-optimize problems that lack this special structure and where we only observe the reward from the action taken. We present a novel loss function, which we call Empirical Soft Regret (ESR), designed to significantly improve reward when used in training compared to classical accuracy-based metrics like mean-squared error. This loss function targets the regret achieved when taking a suboptimal decision; because the regret is generally not differentiable, we propose a differentiable “soft” regret term that allows the use of neural networks and other flexible machine learning models dependent on gradient-based training. In the particular case of paired data, we show theoretically that optimizing our loss function yields asymptotically optimal regret within the class of supervised learning models. We also show our approach significantly outperforms state-of-the-art algorithms on real-world decision-making problems in news recommendation and personalized healthcare compared to benchmark methods from contextual bandits and conditional average treatment effect estimation.

[LG-93] FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image

链接: https://arxiv.org/abs/2406.07865
作者: Rupayan Mallick,Amr Abdalla,Sarah Adel Bargal
关键词: diffusion-based inpainting object, inpainting object completion, object completion approach, diffusion-based inpainting, completion approach
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present FaithFill, a diffusion-based inpainting object completion approach for realistic generation of missing object parts. Typically, multiple reference images are needed to achieve such realistic generation, otherwise the generation would not faithfully preserve shape, texture, color, and background. In this work, we propose a pipeline that utilizes only a single input reference image -having varying lighting, background, object pose, and/or viewpoint. The singular reference image is used to generate multiple views of the object to be inpainted. We demonstrate that FaithFill produces faithful generation of the object’s missing parts, together with background/scene preservation, from a single reference image. This is demonstrated through standard similarity metrics, human judgement, and GPT evaluation. Our results are presented on the DreamBooth dataset, and a novel proposed dataset.

[LG-94] Self-Distillation Learning Based on Temporal-Spatial Consistency for Spiking Neural Networks

链接: https://arxiv.org/abs/2406.07862
作者: Lin Zuo,Yongqi Ding,Mengmeng Jing,Kunshan Yang,Yunqian Yu
关键词: Spiking neural networks, high biological interpretability, attracted considerable attention, Spiking neural, low-power characteristics
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Spiking neural networks (SNNs) have attracted considerable attention for their event-driven, low-power characteristics and high biological interpretability. Inspired by knowledge distillation (KD), recent research has improved the performance of the SNN model with a pre-trained teacher model. However, additional teacher models require significant computational resources, and it is tedious to manually define the appropriate teacher network architecture. In this paper, we explore cost-effective self-distillation learning of SNNs to circumvent these concerns. Without an explicit defined teacher, the SNN generates pseudo-labels and learns consistency during training. On the one hand, we extend the timestep of the SNN during training to create an implicit temporal teacher" that guides the learning of the original student", i.e., the temporal self-distillation. On the other hand, we guide the output of the weak classifier at the intermediate stage by the final output of the SNN, i.e., the spatial self-distillation. Our temporal-spatial self-distillation (TSSD) learning method does not introduce any inference overhead and has excellent generalization ability. Extensive experiments on the static image datasets CIFAR10/100 and ImageNet as well as the neuromorphic datasets CIFAR10-DVS and DVS-Gesture validate the superior performance of the TSSD method. This paper presents a novel manner of fusing SNNs with KD, providing insights into high-performance SNN learning methods.

[LG-95] BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain

链接: https://arxiv.org/abs/2406.07860
作者: Rahul Kumar,Amar Raja Dibbu,Shrutendra Harsola,Vignesh Subrahmaniam,Ashutosh Modi
关键词: recently been proposed, Spider, natural language interfaces, natural language, accounting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at NAACL 2024; 20 Pages (main + appendix)

点击查看摘要

Abstract:Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language queries-SQL pairs, and accounting databases of 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance gaps, thus pointing towards developing more focused models for this domain.

[LG-96] oward Enhanced Reinforcement Learning-Based Resource Management via Digital Twin: Opportunities Applications and Challenges

链接: https://arxiv.org/abs/2406.07857
作者: Nan Cheng,Xiucheng Wang,Zan Li Zhisheng Yin,Tom Luan,Xuemin Shen
关键词: enhanced reinforcement learning, including limited exploration, limited exploration efficiency, network resource management, RL-based resource management
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 7pages, 6figures

点击查看摘要

Abstract:This article presents a digital twin (DT)-enhanced reinforcement learning (RL) framework aimed at optimizing performance and reliability in network resource management, since the traditional RL methods face several unified challenges when applied to physical networks, including limited exploration efficiency, slow convergence, poor long-term performance, and safety concerns during the exploration phase. To deal with the above challenges, a comprehensive DT-based framework is proposed to enhance the convergence speed and performance for unified RL-based resource management. The proposed framework provides safe action exploration, more accurate estimates of long-term returns, faster training convergence, higher convergence performance, and real-time adaptation to varying network conditions. Then, two case studies on ultra-reliable and low-latency communication (URLLC) services and multiple unmanned aerial vehicles (UAV) network are presented, demonstrating improvements of the proposed framework in performance, convergence speed, and training cost reduction both on traditional RL and neural network based Deep RL (DRL). Finally, the article identifies and explores some of the research challenges and open issues in this rapidly evolving field.

[LG-97] ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

链接: https://arxiv.org/abs/2406.07831
作者: Xiang Meng,Kayhan Behdin,Haoyue Wang,Rahul Mazumder
关键词: natural language processing, language processing tasks, Large Language Models, vast computational resources, Large Language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.

[LG-98] he Max-Min Formulation of Multi-Objective Reinforcement Learning: From Theory to a Model-Free Algorithm

链接: https://arxiv.org/abs/2406.07826
作者: Giseung Park,Woohyeon Byeon,Seongmin Kim,Elad Havakuk,Amir Leshem,Youngchul Sung
关键词: multi-objective reinforcement learning, multiple optimization goals, reinforcement learning, real-world problems, multi-objective reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:In this paper, we consider multi-objective reinforcement learning, which arises in many real-world problems with multiple optimization goals. We approach the problem with a max-min framework focusing on fairness among the multiple goals and develop a relevant theory and a practical model-free algorithm under the max-min framework. The developed theory provides a theoretical advance in multi-objective reinforcement learning, and the proposed algorithm demonstrates a notable performance improvement over existing baseline methods.

[LG-99] Are Objective Explanatory Evaluation metrics Trustworthy? An Adversarial Analysis

链接: https://arxiv.org/abs/2406.07820
作者: Prithwijit Chowdhury,Mohit Prabhushankar,Ghassan AlRegib,Mohamed Deriche
关键词: neural network models, network models, deep learning, learning by empowering, trust in neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) has revolutionized the field of deep learning by empowering users to have more trust in neural network models. The field of XAI allows users to probe the inner workings of these algorithms to elucidate their decision-making processes. The rise in popularity of XAI has led to the advent of different strategies to produce explanations, all of which only occasionally agree. Thus several objective evaluation metrics have been devised to decide which of these modules give the best explanation for specific scenarios. The goal of the paper is twofold: (i) we employ the notions of necessity and sufficiency from causal literature to come up with a novel explanatory technique called SHifted Adversaries using Pixel Elimination(SHAPE) which satisfies all the theoretical and mathematical criteria of being a valid explanation, (ii) we show that SHAPE is, infact, an adversarial explanation that fools causal metrics that are employed to measure the robustness and reliability of popular importance based visual XAI methods. Our analysis shows that SHAPE outperforms popular explanatory techniques like GradCAM and GradCAM++ in these tests and is comparable to RISE, raising questions about the sanity of these metrics and the need for human involvement for an overall better evaluation.

[LG-100] o be Continuous or to be Discrete Those are Bits of Questions

链接: https://arxiv.org/abs/2406.07812
作者: Yiran Wang,Masao Utiyama
关键词: Recently, continuous input vectors, binary representation, replace continuous input, representation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ACL-2024

点击查看摘要

Abstract:Recently, binary representation has been proposed as a novel representation that lies between continuous and discrete representations. It exhibits considerable information-preserving capability when being used to replace continuous input vectors. In this paper, we investigate the feasibility of further introducing it to the output side, aiming to allow models to output binary labels instead. To preserve the structural information on the output side along with label information, we extend the previous contrastive hashing method as structured contrastive hashing. More specifically, we upgrade CKY from label-level to bit-level, define a new similarity function with span marginal probabilities, and introduce a novel contrastive loss function with a carefully designed instance selection strategy. Our model achieves competitive performance on various structured prediction tasks, and demonstrates that binary representation can be considered a novel representation that further bridges the gap between the continuous nature of deep learning and the discrete intrinsic property of natural languages.

[LG-101] Evolutionary Computation and Explainable AI: A Roadmap to Transparent Intelligent Systems

链接: https://arxiv.org/abs/2406.07811
作者: Ryan Zhou,Jaume Bacardit,Alexander Brownlee,Stefano Cagnoni,Martin Fyvie,Giovanni Iacca,John McCall,Niki van Stein,David Walker,Ting Hu
关键词: accountability and trust, XAI, finding an increasing, increasing number, black-box nature
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 29 pages, 4 figures. arXiv admin note: substantial text overlap with arXiv:2306.14786

点击查看摘要

Abstract:AI methods are finding an increasing number of applications, but their often black-box nature has raised concerns about accountability and trust. The field of explainable artificial intelligence (XAI) has emerged in response to the need for human understanding of AI models. Evolutionary computation (EC), as a family of powerful optimization and learning tools, has significant potential to contribute to XAI. In this paper, we provide an introduction to XAI and review various techniques in current use for explaining machine learning (ML) models. We then focus on how EC can be used in XAI, and review some XAI approaches which incorporate EC techniques. Additionally, we discuss the application of XAI principles within EC itself, examining how these principles can shed some light on the behavior and outcomes of EC algorithms in general, on the (automatic) configuration of these algorithms, and on the underlying problem landscapes that these algorithms optimize. Finally, we discuss some open challenges in XAI and opportunities for future research in this field using EC. Our aim is to demonstrate that EC is well-suited for addressing current problems in explainability and to encourage further exploration of these methods to contribute to the development of more transparent and trustworthy ML models and EC algorithms.

[LG-102] Regularizing and Aggregating Clients with Class Distribution for Personalized Federated Learning

链接: https://arxiv.org/abs/2406.07800
作者: Gyuejeong Lee,Daeyoung Choi
关键词: Personalized federated learning, Personalized federated, Class-wise Federated Averaging, Federated Averaging, existing PFL methods
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Personalized federated learning (PFL) enables customized models for clients with varying data distributions. However, existing PFL methods often incur high computational and communication costs, limiting their practical application. This paper proposes a novel PFL method, Class-wise Federated Averaging (cwFedAVG), that performs Federated Averaging (FedAVG) class-wise, creating multiple global models per class on the server. Each local model integrates these global models weighted by its estimated local class distribution, derived from the L2-norms of deep network weights, avoiding privacy violations. Afterward, each global model does the same with local models using the same method. We also newly designed Weight Distribution Regularizer (WDR) to further enhance the accuracy of estimating a local class distribution by minimizing the Euclidean distance between the class distribution and the weight norms’ distribution. Experimental results demonstrate that cwFedAVG matches or outperforms several existing PFL methods. Notably, cwFedAVG is conceptually simple yet computationally efficient as it mitigates the need for extensive calculation to collaborate between clients by leveraging shared global models. Visualizations provide insights into how cwFedAVG enables local model specialization on respective class distributions while global models capture class-relevant information across clients.

[LG-103] From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers

链接: https://arxiv.org/abs/2406.07785
作者: Swaminathan Gurumurthy,Karnik Ram,Bingqing Chen,Zachary Manchester,Zico Kolter
关键词: pose estimation, correspondence estimation problem, squares optimization problem, weighted least squares, estimation problem
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at CVPR 2024

点击查看摘要

Abstract:Various pose estimation and tracking problems in robotics can be decomposed into a correspondence estimation problem (often computed using a deep network) followed by a weighted least squares optimization problem to solve for the poses. Recent work has shown that coupling the two problems by iteratively refining one conditioned on the other’s output yields SOTA results across domains. However, training these models has proved challenging, requiring a litany of tricks to stabilize and speed up training. In this work, we take the visual odometry problem as an example and identify three plausible causes: (1) flow loss interference, (2) linearization errors in the bundle adjustment (BA) layer, and (3) dependence of weight gradients on the BA residual. We show how these issues result in noisy and higher variance gradients, potentially leading to a slow down in training and instabilities. We then propose a simple, yet effective solution to reduce the gradient variance by using the weights predicted by the network in the inner optimization loop to weight the correspondence objective in the training problem. This helps the training objective `focus’ on the more important points, thereby reducing the variance and mitigating the influence of outliers. We show that the resulting method leads to faster training and can be more flexibly trained in varying training setups without sacrificing performance. In particular we show 2 – 2.5\times training speedups over a baseline visual odometry model we modify.

[LG-104] A Critical Look At Tokenwise Reward-Guided Text Generation

链接: https://arxiv.org/abs/2406.07780
作者: Ahmad Rashid,Ruotian Wu,Julia Grosse,Agustinus Kristiadi,Pascal Poupart
关键词: Large language models, so-called reinforcement learning, Large language, human preferences, human feedback
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) can significantly be improved by aligning to human preferences – the so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Due to their ability to bypass LLM finetuning, tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during a tokenwise decoding, in a bid to steer the generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this issue, we propose to explicitly train a Bradley-Terry reward model on partial sequences, and autoregressively sample from the implied tokenwise policy during decoding time. We study the property of this reward model and the implied policy. In particular, we show that this policy is proportional to the ratio of two distinct RLHF policies. We show that our simple approach outperforms previous RGTG methods and achieves similar performance as strong offline baselines but without large-scale LLM finetuning.

[LG-105] On Trojans in Refined Language Models

链接: https://arxiv.org/abs/2406.07778
作者: Jayaram Raghuram,George Kesidis,David J. Miller
关键词: product reviews, determining the sentiment, sentiment of product, language model, Trojan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A Trojan in a language model can be inserted when the model is refined for a particular application such as determining the sentiment of product reviews. In this paper, we clarify and empirically explore variations of the data-poisoning threat model. We then empirically assess two simple defenses each for a different defense scenario. Finally, we provide a brief survey of related attacks and defenses.

[LG-106] Unifying Interpretability and Explainability for Alzheimers Disease Progression Prediction

链接: https://arxiv.org/abs/2406.07777
作者: Raja Farrukh Ali,Stephanie Milani,John Woods,Emmanuel Adenij,Ayesha Farooq,Clayton Mansel,Jeffrey Burns,William Hsu
关键词: recently shown promise, Reinforcement learning, model domain knowledge, domain knowledge, recently shown
类目: Machine Learning (cs.LG)
*备注: Previous versions accepted to NeurIPS 2023’s XAIA and AAAI 2024’s XAI4DRL workshops

点击查看摘要

Abstract:Reinforcement learning (RL) has recently shown promise in predicting Alzheimer’s disease (AD) progression due to its unique ability to model domain knowledge. However, it is not clear which RL algorithms are well-suited for this task. Furthermore, these methods are not inherently explainable, limiting their applicability in real-world clinical scenarios. Our work addresses these two important questions. Using a causal, interpretable model of AD, we first compare the performance of four contemporary RL algorithms in predicting brain cognition over 10 years using only baseline (year 0) data. We then apply SHAP (SHapley Additive exPlanations) to explain the decisions made by each algorithm in the model. Our approach combines interpretability with explainability to provide insights into the key factors influencing AD progression, offering both global and individual, patient-level analysis. Our findings show that only one of the RL methods is able to satisfactorily model disease progression, but the post-hoc explanations indicate that all methods fail to properly capture the importance of amyloid accumulation, one of the pathological hallmarks of Alzheimer’s disease. Our work aims to merge predictive accuracy with transparency, assisting clinicians and researchers in enhancing disease progression modeling for informed healthcare decisions. Code is available at this https URL.

[LG-107] Self-attention-based non-linear basis transformations for compact latent space modelling of dynamic optical fibre transmission matrices

链接: https://arxiv.org/abs/2406.07775
作者: Yijie Zheng,Robert J. Kilpatrick,David B. Phillips,George S.D. Gordon
关键词: Multimode optical fibres, Multimode optical, efficiently transport light, hair-thin strands, strands of glass
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimode optical fibres are hair-thin strands of glass that efficiently transport light. They promise next-generation medical endoscopes that provide unprecedented sub-cellular image resolution deep inside the body. However, confining light to such fibres means that images are inherently scrambled in transit. Conventionally, this scrambling has been compensated by pre-calibrating how a specific fibre scrambles light and solving a stationary linear matrix equation that represents a physical model of the fibre. However, as the technology develops towards real-world deployment, the unscrambling process must account for dynamic changes in the matrix representing the fibre’s effect on light, due to factors such as movement and temperature shifts, and non-linearities resulting from the inaccessibility of the fibre tip when inside the body. Such complex, dynamic and nonlinear behaviour is well-suited to approximation by neural networks, but most leading image reconstruction networks rely on convolutional layers, which assume strong correlations between adjacent pixels, a strong inductive bias that is inappropriate for fibre matrices which may be expressed in a range of arbitrary coordinate representations with long-range correlations. We introduce a new concept that uses self-attention layers to dynamically transform the coordinate representations of varying fibre matrices to a basis that admits compact, low-dimensional representations suitable for further processing. We demonstrate the effectiveness of this approach on diverse fibre matrix datasets. We show our models significantly improve the sparsity of fibre bases in their transformed bases with a participation ratio, p, as a measure of sparsity, of between 0.01 and 0.11. Further, we show that these transformed representations admit reconstruction of the original matrices with 10% reconstruction error, demonstrating the invertibility.

[LG-108] DualBind: A Dual-Loss Framework for Protein-Ligand Binding Affinity Prediction

链接: https://arxiv.org/abs/2406.07770
作者: Meng Liu,Saee Gopal Paliwal
关键词: drug development, crucial for drug, Abstract, protein-ligand binding affinities, protein-ligand binding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注: Preprint, work in progress

点击查看摘要

Abstract:Accurate prediction of protein-ligand binding affinities is crucial for drug development. Recent advances in machine learning show promising results on this task. However, these methods typically rely heavily on labeled data, which can be scarce or unreliable, or they rely on assumptions like Boltzmann-distributed data that may not hold true in practice. Here, we present DualBind, a novel framework that integrates supervised mean squared error (MSE) with unsupervised denoising score matching (DSM) to accurately learn the binding energy function. DualBind not only addresses the limitations of DSM-only models by providing more accurate absolute affinity predictions but also improves generalizability and reduces reliance on labeled data compared to MSE-only models. Our experimental results demonstrate that DualBind excels in predicting binding affinities and can effectively utilize both labeled and unlabeled data to enhance performance.

[LG-109] Personalized Product Assortment with Real-time 3D Perception and Bayesian Payoff Estimation

链接: https://arxiv.org/abs/2406.07769
作者: Porter Jenkins,Michael Selander,J. Stockton Jenkins,Andrew Merrill,Kyle Armstrong
关键词: facing physical retailers, critical challenge facing, challenge facing physical, Product assortment selection, physical retailers
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Accepted to KDD 2024

点击查看摘要

Abstract:Product assortment selection is a critical challenge facing physical retailers. Effectively aligning inventory with the preferences of shoppers can increase sales and decrease out-of-stocks. However, in real-world settings the problem is challenging due to the combinatorial explosion of product assortment possibilities. Consumer preferences are typically heterogeneous across space and time, making inventory-preference alignment challenging. Additionally, existing strategies rely on syndicated data, which tends to be aggregated, low resolution, and suffer from high latency. To solve these challenges we introduce a real-time recommendation system, which we call \ours. Our system utilizes recent advances in 3D computer vision for perception and automatic, fine grained sales estimation. These perceptual components run on the edge of the network and facilitate real-time reward signals. Additionally, we develop a Bayesian payoff model to account for noisy estimates from 3D LIDAR data. We rely on spatial clustering to allow the system to adapt to heterogeneous consumer preferences, and a graph-based candidate generation algorithm to address the combinatorial search problem. We test our system in real-world stores across two, 6-8 week A/B tests with beverage products and demonstrate a 35% and 27% increase in sales respectively. Finally, we monitor the deployed system for a period of 28 weeks with an observational study and show a 9.4% increase in sales.

[LG-110] Conformalized Teleoperation: Confidently Mapping Human Inputs to High-Dimensional Robot Actions

链接: https://arxiv.org/abs/2406.07767
作者: Michelle Zhao,Reid Simmons,Henny Admoni,Andrea Bajcsy
关键词: Assistive robotic arms, low-dimensional human inputs, robotic arms, teleoperator can control, human teleoperator
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Assistive robotic arms often have more degrees-of-freedom than a human teleoperator can control with a low-dimensional input, like a joystick. To overcome this challenge, existing approaches use data-driven methods to learn a mapping from low-dimensional human inputs to high-dimensional robot actions. However, determining if such a black-box mapping can confidently infer a user’s intended high-dimensional action from low-dimensional inputs remains an open problem. Our key idea is to adapt the assistive map at training time to additionally estimate high-dimensional action quantiles, and then calibrate these quantiles via rigorous uncertainty quantification methods. Specifically, we leverage adaptive conformal prediction which adjusts the intervals over time, reducing the uncertainty bounds when the mapping is performant and increasing the bounds when the mapping consistently mis-predicts. Furthermore, we propose an uncertainty-interval-based mechanism for detecting high-uncertainty user inputs and robot states. We evaluate the efficacy of our proposed approach in a 2D assistive navigation task and two 7DOF Kinova Jaco tasks involving assistive cup grasping and goal reaching. Our findings demonstrate that conformalized assistive teleoperation manages to detect (but not differentiate between) high uncertainty induced by diverse preferences and induced by low-precision trajectories in the mapping’s training dataset. On the whole, we see this work as a key step towards enabling robots to quantify their own uncertainty and proactively seek intervention when needed.

[LG-111] he Future of Software Engineering in an AI-Driven World

链接: https://arxiv.org/abs/2406.07737
作者: Valerio Terragni,Partha Roop,Kelly Blincoe
关键词: gaining increasing importance, LLMs gaining increasing, software development productivity, improving software development, Software Engineering
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: This paper was accepted at the “International Workshop on Software Engineering in 2030,” co-located with FSE 2024. It was also invited to the special issue of ACM TOSEM

点击查看摘要

Abstract:A paradigm shift is underway in Software Engineering, with AI systems such as LLMs gaining increasing importance for improving software development productivity. This trend is anticipated to persist. In the next five years, we will likely see an increasing symbiotic partnership between human developers and AI. The Software Engineering research community cannot afford to overlook this trend; we must address the key research challenges posed by the integration of AI into the software development process. In this paper, we present our vision of the future of software development in an AI-Driven world and explore the key challenges that our research community should address to realize this vision.

[LG-112] REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy

链接: https://arxiv.org/abs/2406.07735
作者: Haw-Shiuan Chang,Nanyun Peng,Mohit Bansal,Anil Ramakrishna,Tagyoung Chung
关键词: REAL sampling, large language models, sampling, REAL, REAL sampling predicts
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decoding methods for large language models (LLMs) usually struggle with the tradeoff between ensuring factuality and maintaining diversity. For example, a higher p threshold in the nucleus (top-p) sampling increases the diversity but decreases the factuality, and vice versa. In this paper, we propose REAL (Residual Entropy from Asymptotic Line) sampling, a decoding method that achieves improved factuality and diversity over nucleus sampling by predicting an adaptive threshold of p . Specifically, REAL sampling predicts the step-wise likelihood of an LLM to hallucinate, and lowers the p threshold when an LLM is likely to hallucinate. Otherwise, REAL sampling increases the p threshold to boost the diversity. To predict the step-wise hallucination likelihood without supervision, we construct a Token-level Hallucination Forecasting (THF) model to predict the asymptotic entropy (i.e., inherent uncertainty) of the next token by extrapolating the next-token entropies from a series of LLMs with different sizes. If a LLM’s entropy is higher than the asymptotic entropy (i.e., the LLM is more uncertain than it should be), the THF model predicts a high hallucination hazard, which leads to a lower p threshold in REAL sampling. In the FactualityPrompts benchmark, we demonstrate that REAL sampling based on a 70M THF model can substantially improve the factuality and diversity of 7B LLMs simultaneously, judged by both retrieval-based metrics and human evaluation. After combined with contrastive decoding, REAL sampling outperforms 9 sampling methods, and generates texts that are more factual than the greedy sampling and more diverse than the nucleus sampling with p=0.5 . Furthermore, the predicted asymptotic entropy is also a useful unsupervised signal for hallucination detection tasks.

[LG-113] Efficient Parallel Multi-Hop Reasoning: A Scalable Approach for Knowledge Graph Analysis

链接: https://arxiv.org/abs/2406.07727
作者: Jesmin Jahan Tithi,Fabio Checconi,Fabrizio Petrini
关键词: natural language processing, make multiple inferential, multiple inferential steps, Multi-hop reasoning, artificial intelligence
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 11 Pages with references

点击查看摘要

Abstract:Multi-hop reasoning (MHR) is a process in artificial intelligence and natural language processing where a system needs to make multiple inferential steps to arrive at a conclusion or answer. In the context of knowledge graphs or databases, it involves traversing multiple linked entities and relationships to understand complex queries or perform tasks requiring a deeper understanding. Multi-hop reasoning is a critical function in various applications, including question answering, knowledge base completion, and link prediction. It has garnered significant interest in artificial intelligence, machine learning, and graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale graphs, diverging from the traditional emphasis on accuracy which is an orthogonal goal. We introduce a novel parallel algorithm that harnesses domain-specific learned embeddings to efficiently identify the top K paths between vertices in a knowledge graph to find the best answers to a three-hop query. Our contributions are: (1) We present a new parallel algorithm to enhance MHR performance, scalability and efficiency. (2) We demonstrate the algorithm’s superior performance on leading-edge Intel and AMD architectures through empirical results. We showcase the algorithm’s practicality through a case study on identifying academic affiliations of potential Turing Award laureates in Deep Learning, highlighting its capability to handle intricate entity relationships. This demonstrates the potential of our approach to enabling high-performance MHR, useful to navigate the growing complexity of modern knowledge graphs. Comments: 11 Pages with references Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Performance (cs.PF) ACMclasses: H.4; C.4 Cite as: arXiv:2406.07727 [cs.AI] (or arXiv:2406.07727v1 [cs.AI] for this version)

[LG-114] A Concise Mathematical Description of Active Inference in Discrete Time

链接: https://arxiv.org/abs/2406.07726
作者: Jesse van Oostrum,Carlotta Langer,Nihat Ay
关键词: concise mathematical description, discrete time, present a concise, concise mathematical, mathematical description
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:In this paper we present a concise mathematical description of active inference in discrete time. The main part of the paper serves as a general introduction to the topic, including an example illustrating the theory on action selection. In the appendix the more subtle mathematical details are discussed. This part is aimed at readers who have already studied the active inference literature but struggle to make sense of the mathematical details and derivations. Throughout the whole manuscript, special attention has been paid to adopting notation that is both precise and in line with standard mathematical texts. All equations and derivations are linked to specific equation numbers in other popular text on the topic. Furthermore, Python code is provided that implements the action selection mechanism described in this paper and is compatible with pymdp environments.

[LG-115] Loss Gradient Gaussian Width based Generalization and Optimization Guarantees

链接: https://arxiv.org/abs/2406.07712
作者: Arindam Banerjee,Qiaobo Li,Yingxue Zhou
关键词: uniform convergence based, Loss Gradient Gaussian, optimization guarantees, LGGW, machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generalization and optimization guarantees on the population loss in machine learning often rely on uniform convergence based analysis, typically based on the Rademacher complexity of the predictors. The rich representation power of modern models has led to concerns about this approach. In this paper, we present generalization and optimization guarantees in terms of the complexity of the gradients, as measured by the Loss Gradient Gaussian Width (LGGW). First, we introduce generalization guarantees directly in terms of the LGGW under a flexible gradient domination condition, which we demonstrate to hold empirically for deep models. Second, we show that sample reuse in finite sum (stochastic) optimization does not make the empirical gradient deviate from the population gradient as long as the LGGW is small. Third, focusing on deep networks, we present results showing how to bound their LGGW under mild assumptions. In particular, we show that their LGGW can be bounded (a) by the L_2 -norm of the loss Hessian eigenvalues, which has been empirically shown to be \tildeO(1) for commonly used deep models; and (b) in terms of the Gaussian width of the featurizer, i.e., the output of the last-but-one layer. To our knowledge, our generalization and optimization guarantees in terms of LGGW are the first results of its kind, avoid the pitfalls of predictor Rademacher complexity based analysis, and hold considerable promise towards quantitatively tight bounds for deep models.

[LG-116] Diagnosing and fixing common problems in Bayesian optimization for molecule design

链接: https://arxiv.org/abs/2406.07709
作者: Austin Tripp,José Miguel Hernández-Lobato
关键词: Bayesian optimization, molecular design tasks, principled approach, approach to molecular, design tasks
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Machine Learning (stat.ML)
*备注: 8 pages, 4 figures. Code at: this https URL

点击查看摘要

Abstract:Bayesian optimization (BO) is a principled approach to molecular design tasks. In this paper we explain three pitfalls of BO which can cause poor empirical performance: an incorrect prior width, over-smoothing, and inadequate acquisition function maximization. We show that with these issues addressed, even a basic BO setup is able to achieve the highest overall performance on the PMO benchmark for molecule design (Gao et al, 2022). These results suggest that BO may benefit from more attention in the machine learning for molecules community.

[LG-117] A Deep Learning Approach to Detect Complete Safety Equipment For Construction Workers Based On YOLOv7

链接: https://arxiv.org/abs/2406.07707
作者: Md. Shariful Islam,SM Shaqib,Shahriar Sultan Ramit,Shahrun Akter Khushbu,Mr. Abdus Sattar,Dr. Sheak Rashed Haider Noor
关键词: ensuring worker safety, safety equipment, safety, utmost significance, ensuring worker
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the construction sector, ensuring worker safety is of the utmost significance. In this study, a deep learning-based technique is presented for identifying safety gear worn by construction workers, such as helmets, goggles, jackets, gloves, and footwears. The recommended approach uses the YOLO v7 (You Only Look Once) object detection algorithm to precisely locate these safety items. The dataset utilized in this work consists of labeled images split into training, testing and validation sets. Each image has bounding box labels that indicate where the safety equipment is located within the image. The model is trained to identify and categorize the safety equipment based on the labeled dataset through an iterative training approach. We used custom dataset to train this model. Our trained model performed admirably well, with good precision, recall, and F1-score for safety equipment recognition. Also, the model’s evaluation produced encouraging results, with a mAP@0.5 score of 87.7%. The model performs effectively, making it possible to quickly identify safety equipment violations on building sites. A thorough evaluation of the outcomes reveals the model’s advantages and points up potential areas for development. By offering an automatic and trustworthy method for safety equipment detection, this research makes a contribution to the fields of computer vision and workplace safety. The proposed deep learning-based approach will increase safety compliance and reduce the risk of accidents in the construction industry

[LG-118] Label Smoothing Improves Machine Unlearning

链接: https://arxiv.org/abs/2406.07698
作者: Zonglin Di,Zhaowei Zhu,Jinghan Jia,Jiancheng Liu,Zafar Takhirov,Bo Jiang,Yuanshun Yao,Sijia Liu,Yang Liu
关键词: eliminate previously learned, previously learned data, objective of machine, eliminate previously, previously learned
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The objective of machine unlearning (MU) is to eliminate previously learned data from a model. However, it is challenging to strike a balance between computation cost and performance when using existing MU techniques. Taking inspiration from the influence of label smoothing on model confidence and differential privacy, we propose a simple gradient-based MU approach that uses an inverse process of label smoothing. This work introduces UGradSL, a simple, plug-and-play MU approach that uses smoothed labels. We provide theoretical analyses demonstrating why properly introducing label smoothing improves MU performance. We conducted extensive experiments on six datasets of various sizes and different modalities, demonstrating the effectiveness and robustness of our proposed method. The consistent improvement in MU performance is only at a marginal cost of additional computations. For instance, UGradSL improves over the gradient ascent MU baseline by 66% unlearning accuracy without sacrificing unlearning efficiency.

[LG-119] A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection

链接: https://arxiv.org/abs/2406.07694
作者: Can Akbas,Irem Su Arin,Sinan Onal
关键词: effective defect detection, Recent advancements, defect detection, Cylindrical Defect Detection, Defect Detection Dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: One figure and one table

点击查看摘要

Abstract:Recent advancements in quality control across various industries have increasingly utilized the integration of video cameras and image processing for effective defect detection. A critical barrier to progress is the scarcity of comprehensive datasets featuring annotated defects, which are essential for developing and refining automated defect detection models. This systematic review, spanning from 2015 to 2023, identifies 15 publicly available datasets and critically examines them to assess their effectiveness and applicability for benchmarking and model development. Our findings reveal a diverse landscape of datasets, such as NEU-CLS, NEU-DET, DAGM, KolektorSDD, PCB Defect Dataset, and the Hollow Cylindrical Defect Detection Dataset, each with unique strengths and limitations in terms of image quality, defect type representation, and real-world applicability. The goal of this systematic review is to consolidate these datasets in a single location, providing researchers who seek such publicly available resources with a comprehensive reference.

[LG-120] A Labelled Dataset for Sentiment Analysis of Videos on YouTube TikTok and Other Sources about the 2024 Outbreak of Measles

链接: https://arxiv.org/abs/2406.07693
作者: Nirmalya Thakur,Vanessa Su,Mingchen Shao,Kesha A. Patel,Hongseok Jeong,Victoria Knieling,Andrew Brian
关键词: internet between January, ongoing outbreak, outbreak of measles, measles published, video
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 19 pages

点击查看摘要

Abstract:The work of this paper presents a dataset that contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. The dataset is available at this https URL. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. Finally, this paper also presents a list of open research questions that may be investigated using this dataset.

[LG-121] AI Radiologist: Revolutionizing Liver Tissue Segmentation with Convolutional Neural Networks and a Clinician-Friendly GUI

链接: https://arxiv.org/abs/2406.07688
作者: Ayman Al-Kababji,Faycal Bensaali,Sarada Prasad Dakua,Yassine Himeur
关键词: Artificial Intelligence, pervasive research topic, permeating various sectors, liver tissues, liver
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 38 pages, 19 figures, 7 tables submitted to journal

点击查看摘要

Abstract:Artificial Intelligence (AI) is a pervasive research topic, permeating various sectors and applications. In this study, we harness the power of AI, specifically convolutional neural networks (ConvNets), for segmenting liver tissues. It also focuses on developing a user-friendly graphical user interface (GUI) tool, “AI Radiologist”, enabling clinicians to effectively delineate different liver tissues (parenchyma, tumors, and vessels), thereby saving lives. This endeavor bridges the gap between academic research and practical, industrial applications. The GUI is a single-page application and is designed using the PyQt5 Python framework. The offline-available AI Radiologist resorts to three ConvNet models trained to segment all liver tissues. With respect to the Dice metric, the best liver ConvNet scores 98.16%, the best tumor ConvNet scores 65.95%, and the best vessel ConvNet scores 51.94%. It outputs 2D slices of the liver, tumors, and vessels, along with 3D interpolations in .obj and .mtl formats, which can be visualized/printed using any 3D-compatible software. Thus, the AI Radiologist offers a convenient tool for clinicians to perform liver tissue segmentation and 3D interpolation employing state-of-the-art models for tissues segmentation. With the provided capacity to select the volumes and pre-trained models, the clinicians can leave the rest to the AI Radiologist.

[LG-122] Adversarial Machine Unlearning

链接: https://arxiv.org/abs/2406.07687
作者: Zonglin Di,Sixie Yu,Yevgeniy Vorobeychik,Yang Liu
关键词: specific training data, unlearning algorithms, unlearning, machine unlearning, machine learning models
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:This paper focuses on the challenge of machine unlearning, aiming to remove the influence of specific training data on machine learning models. Traditionally, the development of unlearning algorithms runs parallel with that of membership inference attacks (MIA), a type of privacy threat to determine whether a data instance was used for training. However, the two strands are intimately connected: one can view machine unlearning through the lens of MIA success with respect to removed data. Recognizing this connection, we propose a game-theoretic framework that integrates MIAs into the design of unlearning algorithms. Specifically, we model the unlearning problem as a Stackelberg game in which an unlearner strives to unlearn specific training data from a model, while an auditor employs MIAs to detect the traces of the ostensibly removed data. Adopting this adversarial perspective allows the utilization of new attack advancements, facilitating the design of unlearning algorithms. Our framework stands out in two ways. First, it takes an adversarial approach and proactively incorporates the attacks into the design of unlearning algorithms. Secondly, it uses implicit differentiation to obtain the gradients that limit the attacker’s success, thus benefiting the process of unlearning. We present empirical results to demonstrate the effectiveness of the proposed approach for machine unlearning.

[LG-123] FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

链接: https://arxiv.org/abs/2406.07676
作者: Swarup Ranjan Behera,Abhishek Dhiman,Karthik Gowda,Aalekhya Satya Narayani
关键词: Audio Spectrogram Transformer, Spectrogram Transformer, play a crucial, crucial role, role in efficient
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we introduce FastAST, a framework that integrates Token Merging (ToMe) into the AST framework. FastAST enhances inference speed without requiring extensive retraining by merging similar tokens in audio spectrograms. Furthermore, during training, FastAST brings about significant speed improvements. The experiments indicate that FastAST can increase audio classification throughput with minimal impact on accuracy. To mitigate the accuracy impact, we integrate Cross-Model Knowledge Distillation (CMKD) into the FastAST framework. Integrating ToMe and CMKD into AST results in improved accuracy compared to AST while maintaining faster inference speeds. FastAST represents a step towards real-time, resource-efficient audio analysis.

[LG-124] reeffuser: Probabilistic Predictions via Conditional Diffusions with Gradient-Boosted Trees

链接: https://arxiv.org/abs/2406.07658
作者: Nicolas Beltran-Velez,Alessandro Antonio Grande,Achille Nazaret,Alp Kucukelbir,David Blei
关键词: Probabilistic prediction aims, Treeffuser, Gaussian or Poisson, Probabilistic, compute predictive distributions
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Probabilistic prediction aims to compute predictive distributions rather than single-point predictions. These distributions enable practitioners to quantify uncertainty, compute risk, and detect outliers. However, most probabilistic methods assume parametric responses, such as Gaussian or Poisson distributions. When these assumptions fail, such models lead to bad predictions and poorly calibrated uncertainty. In this paper, we propose Treeffuser, an easy-to-use method for probabilistic prediction on tabular data. The idea is to learn a conditional diffusion model where the score function is estimated using gradient-boosted trees. The conditional diffusion model makes Treeffuser flexible and non-parametric, while the gradient-boosted trees make it robust and easy to train on CPUs. Treeffuser learns well-calibrated predictive distributions and can handle a wide range of regression tasks – including those with multivariate, multimodal, and skewed responses. % , as well as categorical predictors and missing data We study Treeffuser on synthetic and real data and show that it outperforms existing methods, providing better-calibrated probabilistic predictions. We further demonstrate its versatility with an application to inventory allocation under uncertainty using sales data from Walmart. We implement Treeffuser in \hrefthis https URLthis https URL.

[LG-125] OPTune: Efficient Online Preference Tuning

链接: https://arxiv.org/abs/2406.07657
作者: Lichang Chen,Jiuhai Chen,Chenxi Liu,John Kirchenbauer,Davit Soselia,Chen Zhu,Tom Goldstein,Tianyi Zhou,Heng Huang
关键词: Large Language Models, aligning Large Language, Language Models, Large Language, aligning Large
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:Reinforcement learning with human feedback~(RLHF) is critical for aligning Large Language Models (LLMs) with human preference. Compared to the widely studied offline version of RLHF, \emphe.g. direct preference optimization (DPO), recent works have shown that the online variants achieve even better alignment. However, online alignment requires on-the-fly generation of new training data, which is costly, hard to parallelize, and suffers from varying quality and utility. In this paper, we propose a more efficient data exploration strategy for online preference tuning (OPTune), which does not rely on human-curated or pre-collected teacher responses but dynamically samples informative responses for on-policy preference alignment. During data generation, OPTune only selects prompts whose (re)generated responses can potentially provide more informative and higher-quality training signals than the existing responses. In the training objective, OPTune reweights each generated response (pair) by its utility in improving the alignment so that learning can be focused on the most helpful samples. Throughout our evaluations, OPTune’d LLMs maintain the instruction-following benefits provided by standard preference tuning whilst enjoying 1.27-1.56x faster training speed due to the efficient data exploration strategy.

[LG-126] Pre-training Feature Guided Diffusion Model for Speech Enhancement

链接: https://arxiv.org/abs/2406.07646
作者: Yiyuan Yang,Niki Trigoni,Andrew Markham
关键词: Speech enhancement significantly, enhancement significantly improves, noisy environments, improving communication, listening experiences
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted by Interspeech 2024 Conference

点击查看摘要

Abstract:Speech enhancement significantly improves the clarity and intelligibility of speech in noisy environments, improving communication and listening experiences. In this paper, we introduce a novel pretraining feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE) and leveraging pre-trained features for guidance during the reverse process, coupled with the utilization of the deterministic discrete integration method (DDIM) to streamline sampling steps, our model improves efficiency and speech enhancement quality. Demonstrating state-of-the-art results on two public datasets with different SNRs, our model outshines other baselines in efficiency and robustness. The proposed method not only optimizes performance but also enhances practical deployment capabilities, without increasing computational demands.

[LG-127] Generating Human Understandable Explanations for Node Embeddings

链接: https://arxiv.org/abs/2406.07642
作者: Zohair Shafi,Ayan Chatterjee,Tina Eliassi-Rad
关键词: low-dimensional latent representations, produce low-dimensional latent, Node embedding algorithms, low-dimensional latent, latent representations
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Node embedding algorithms produce low-dimensional latent representations of nodes in a graph. These embeddings are often used for downstream tasks, such as node classification and link prediction. In this paper, we investigate the following two questions: (Q1) Can we explain each embedding dimension with human-understandable graph features (e.g. degree, clustering coefficient and PageRank). (Q2) How can we modify existing node embedding algorithms to produce embeddings that can be easily explained by human-understandable graph features? We find that the answer to Q1 is yes and introduce a new framework called XM (short for eXplain eMbedding) to answer Q2. A key aspect of XM involves minimizing the nuclear norm of the generated explanations. We show that by minimizing the nuclear norm, we minimize the lower bound on the entropy of the generated explanations. We test XM on a variety of real-world graphs and show that XM not only preserves the performance of existing node embedding methods, but also enhances their explainability.

[LG-128] When is an Embedding Model More Promising than Another?

链接: https://arxiv.org/abs/2406.07640
作者: Maxime Darrin,Philippe Formont,Ismail Ben Ayed,Jackie CK Cheung,Pablo Piantanida
关键词: downstream tasks, machine learning, projecting any object, play a central, central role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Embedders play a central role in machine learning, projecting any object into numerical representations that can, in turn, be leveraged to perform various downstream tasks. The evaluation of embedding models typically depends on domain-specific empirical approaches utilizing downstream tasks, primarily because of the lack of a standardized framework for comparison. However, acquiring adequately large and representative datasets for conducting these assessments is not always viable and can prove to be prohibitively expensive and time-consuming. In this paper, we present a unified approach to evaluate embedders. First, we establish theoretical foundations for comparing embedding models, drawing upon the concepts of sufficiency and informativeness. We then leverage these concepts to devise a tractable comparison criterion (information sufficiency), leading to a task-agnostic and self-supervised ranking procedure. We demonstrate experimentally that our approach aligns closely with the capability of embedding models to facilitate various downstream tasks in both natural language processing and molecular biology. This effectively offers practitioners a valuable tool for prioritizing model trials.

[LG-129] Equivariance via Minimal Frame Averaging for More Symmetries and Efficiency

链接: https://arxiv.org/abs/2406.07598
作者: Yuchao Lin,Jacob Helwig,Shurui Gui,Shuiwang Ji
关键词: machine learning systems, frame averaging, machine learning, learning systems, Minimal Frame Averaging
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider achieving equivariance in machine learning systems via frame averaging. Current frame averaging methods involve a costly sum over large frames or rely on sampling-based approaches that only yield approximate equivariance. Here, we propose Minimal Frame Averaging (MFA), a mathematical framework for constructing provably minimal frames that are exactly equivariant. The general foundations of MFA also allow us to extend frame averaging to more groups than previously considered, including the Lorentz group for describing symmetries in space-time, and the unitary group for complex-valued domains. Results demonstrate the efficiency and effectiveness of encoding symmetries via MFA across a diverse range of tasks, including n -body simulation, top tagging in collider physics, and relaxed energy prediction. Our code is available at this https URL.

[LG-130] MambaLRP: Explaining Selective State Space Sequence Models

链接: https://arxiv.org/abs/2406.07592
作者: Farnoush Rezaei Jafari,Grégoire Montavon,Klaus-Robert Müller,Oliver Eberle
关键词: Selective State Space, State Space Sequence, Recent sequence modeling, Selective State, State Space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent sequence modeling approaches using Selective State Space Sequence Models, referred to as Mamba models, have seen a surge of interest. These models allow efficient processing of long sequences in linear time and are rapidly being adopted in a wide range of applications such as language modeling, demonstrating promising performance. To foster their reliable use in real-world scenarios, it is crucial to augment their transparency. Our work bridges this critical gap by bringing explainability, particularly Layer-wise Relevance Propagation (LRP), to the Mamba architecture. Guided by the axiom of relevance conservation, we identify specific components in the Mamba architecture, which cause unfaithful explanations. To remedy this issue, we propose MambaLRP, a novel algorithm within the LRP framework, which ensures a more stable and reliable relevance propagation through these components. Our proposed method is theoretically sound and excels in achieving state-of-the-art explanation performance across a diverse range of models and datasets. Moreover, MambaLRP facilitates a deeper inspection of Mamba architectures, uncovering various biases and evaluating their significance. It also enables the analysis of previous speculations regarding the long-range capabilities of Mamba models.

[LG-131] StreamPrompt: Learnable Prompt-guided Data Selection for Efficient Stream Learning

链接: https://arxiv.org/abs/2406.07590
作者: Tongjun Shi,Shuhao Zhang
关键词: traditional Continual Learning, traditional Continual, Continual Learning, data, continuous data streams
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Stream Learning (SL) requires models to rapidly adapt to continuous data streams, setting it apart from traditional Continual Learning (CL). Recent SL methods emphasize efficiency by selecting data subsets for training, but they often struggle due to their reliance on static, rule-based selection algorithms that cannot effectively adapt to the changing importance of data. In this work, we introduce StreamPrompt, a method that enhances data selection through dynamic, learnable prompts. These dynamic prompts serve two purposes beyond guiding model inference: 1) optimizing data selection, and 2) guiding updates to the rehearsal buffer. This approach addresses the challenges of adaptability and computational efficiency in processing continuous data streams. Moreover, StreamPrompt introduces Prompt Attunement,a mechanism that enhances the efficiency of prompt learning. By leveraging attention layers from vision transformers and softly combining their outputs with a gate unit, Prompt Attunementrefines prompts with minimal computational resources. Comprehensive evaluations demonstrate StreamPrompts superior performance over state-of-the-art, with significant improvements in accuracy and reductions in training time. These results underscore the efficacy and efficiency of StreamPrompt, establishing its potential as a scalable and effective solution for the evolving demands of SL. Our code is available at this https URL.

[LG-132] A novel method for identifying rice seed purity based on hybrid machine learning algorithms

链接: https://arxiv.org/abs/2406.07581
作者: Phan Thi-Thu-Hong,Vo Quoc-Trinh,Nguyen Huu-Du
关键词: rice seed purity, grain industry, seed purity, crucial task, factor in evaluating
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:In the grain industry, the identification of seed purity is a crucial task as it is an important factor in evaluating the quality of seeds. For rice seeds, this property allows for the reduction of unexpected influences of other varieties on rice yield, nutrient composition, and price. However, in practice, they are often mixed with seeds from others. This study proposes a novel method for automatically identifying the rice seed purity of a certain rice variety based on hybrid machine learning algorithms. The main idea is to use deep learning architectures for extracting important features from the raw data and then use machine learning algorithms for classification. Several experiments are conducted following a practical implementation to evaluate the performance of the proposed model. The obtained results show that the novel method improves significantly the performance of existing methods. Thus, it can be applied to design effective identification systems for rice seed purity.

[LG-133] DMS: Addressing Information Loss with More Steps for Pragmatic Adversarial Attacks

链接: https://arxiv.org/abs/2406.07580
作者: Zhiyu Zhu,Jiayu Zhang,Xinyi Wang,Zhibo Jin,Huaming Chen
关键词: deep neural networks, neural networks, deep neural, tasks related, information loss
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the exceptional performance of deep neural networks (DNNs) across different domains, they are vulnerable to adversarial samples, in particular for tasks related to computer vision. Such vulnerability is further influenced by the digital container formats used in computers, where the discrete numerical values are commonly used for storing the pixel values. This paper examines how information loss in file formats impacts the effectiveness of adversarial attacks. Notably, we observe a pronounced hindrance to the adversarial attack performance due to the information loss of the non-integer pixel values. To address this issue, we explore to leverage the gradient information of the attack samples within the model to mitigate the information loss. We introduce the Do More Steps (DMS) algorithm, which hinges on two core techniques: gradient ascent-based \textitadversarial integerization (DMS-AI) and integrated gradients-based \textitattribution selection (DMS-AS). Our goal is to alleviate such lossy process to retain the attack performance when storing these adversarial samples digitally. In particular, DMS-AI integerizes the non-integer pixel values according to the gradient direction, and DMS-AS selects the non-integer pixels by comparing attribution results. We conduct thorough experiments to assess the effectiveness of our approach, including the implementations of the DMS-AI and DMS-AS on two large-scale datasets with various latest gradient-based attack methods. Our empirical findings conclusively demonstrate the superiority of our proposed DMS-AI and DMS-AS pixel integerization methods over the standardised methods, such as rounding, truncating and upper approaches, in maintaining attack integrity.

[LG-134] GFPack: Improving 2D Irregular Packing by Learning Gradient Field with Attention

链接: https://arxiv.org/abs/2406.07579
作者: Tianyang Xue,Lin Lu,Yang Liu,Mingdong Wu,Hao Dong,Yanbin Zhang,Renmin Han,Baoquan Chen
关键词: texture atlas generation, classic combinatorial optimization, combinatorial optimization problem, atlas generation, classic combinatorial
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:2D irregular packing is a classic combinatorial optimization problem with various applications, such as material utilization and texture atlas generation. This NP-hard problem requires efficient algorithms to optimize space utilization. Conventional numerical methods suffer from slow convergence and high computational cost. Existing learning-based methods, such as the score-based diffusion model, also have limitations, such as no rotation support, frequent collisions, and poor adaptability to arbitrary boundaries, and slow inferring. The difficulty of learning from teacher packing is to capture the complex geometric relationships among packing examples, which include the spatial (position, orientation) relationships of objects, their geometric features, and container boundary conditions. Representing these relationships in latent space is challenging. We propose GFPack++, an attention-based gradient field learning approach that addresses this challenge. It consists of two pivotal strategies: \emphattention-based geometry encoding for effective feature encoding and \emphattention-based relation encoding for learning complex relationships. We investigate the utilization distribution between the teacher and inference data and design a weighting function to prioritize tighter teacher data during training, enhancing learning effectiveness. Our diffusion model supports continuous rotation and outperforms existing methods on various datasets. We achieve higher space utilization over several widely used baselines, one-order faster than the previous diffusion-based method, and promising generalization for arbitrary boundaries. We plan to release our source code and datasets to support further research in this direction.

[LG-135] Biharmonic Distance of Graphs and its Higher-Order Variants: Theoretical Properties with Applications to Centrality and Clustering

链接: https://arxiv.org/abs/2406.07574
作者: Mitchell Black,Lucy Lin,Amir Nayyeri,Weng-Keen Wong
关键词: biharmonic distance, Effective resistance, biharmonic, distance, theoretically interesting
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:Effective resistance is a distance between vertices of a graph that is both theoretically interesting and useful in applications. We study a variant of effective resistance called the biharmonic distance. While the effective resistance measures how well-connected two vertices are, we prove several theoretical results supporting the idea that the biharmonic distance measures how important an edge is to the global topology of the graph. Our theoretical results connect the biharmonic distance to well-known measures of connectivity of a graph like its total resistance and sparsity. Based on these results, we introduce two clustering algorithms using the biharmonic distance. Finally, we introduce a further generalization of the biharmonic distance that we call the k -harmonic distance. We empirically study the utility of biharmonic and k -harmonic distance for edge centrality and graph clustering.

[LG-136] Investigating the Potential of Using Large Language Models for Scheduling

链接: https://arxiv.org/abs/2406.07573
作者: Deddy Jobson,Yilin Li
关键词: inaugural ACM International, ACM International Conference, AI-powered Software introduced, ACM International, explore AI-driven tools
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The inaugural ACM International Conference on AI-powered Software introduced the AIware Challenge, prompting researchers to explore AI-driven tools for optimizing conference programs through constrained optimization. We investigate the use of Large Language Models (LLMs) for program scheduling, focusing on zero-shot learning and integer programming to measure paper similarity. Our study reveals that LLMs, even under zero-shot settings, create reasonably good first drafts of conference schedules. When clustering papers, using only titles as LLM inputs produces results closer to human categorization than using titles and abstracts with TFIDF. The code has been made publicly available.

[LG-137] Domain-specific ReAct for physics-integrated iterative modeling: A case study of LLM agents for gas path analysis of gas turbines

链接: https://arxiv.org/abs/2406.07572
作者: Tao Song,Yuwei Fan,Chenlong Feng,Keyu Song,Chao Liu,Dongxiang Jiang
关键词: power engineering domain, gas path analysis, large language models, gas turbines, engineering domain
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores the application of large language models (LLMs) with callable tools in energy and power engineering domain, focusing on gas path analysis of gas turbines. We developed a dual-agent tool-calling process to integrate expert knowledge, predefined tools, and LLM reasoning. We evaluated various LLMs, including LLama3, Qwen1.5 and GPT. Smaller models struggled with tool usage and parameter extraction, while larger models demonstrated favorable capabilities. All models faced challenges with complex, multi-component problems. Based on the test results, we infer that LLMs with nearly 100 billion parameters could meet professional scenario requirements with fine-tuning and advanced prompt design. Continued development are likely to enhance their accuracy and effectiveness, paving the way for more robust AI-driven solutions.

[LG-138] Reinforcement Learning Based Escape Route Generation in Low Visibility Environments

链接: https://arxiv.org/abs/2406.07568
作者: Hari Srikanth
关键词: fire-related deaths nationwide, Structure fires, deaths nationwide, fires are responsible, majority of fire-related
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Structure fires are responsible for the majority of fire-related deaths nationwide. In order to assist with the rapid evacuation of trapped people, this paper proposes the use of a system that determines optimal search paths for firefighters and exit paths for civilians in real time based on environmental measurements. Through the use of a LiDAR mapping system evaluated and verified by a trust range derived from sonar and smoke concentration data, a proposed solution to low visibility mapping is tested. These independent point clouds are then used to create distinct maps, which are merged through the use of a RANSAC based alignment methodology and simplified into a visibility graph. Temperature and humidity data are then used to label each node with a danger score, creating an environment tensor. After demonstrating how a Linear Function Approximation based Natural Policy Gradient RL methodology outperforms more complex competitors with respect to robustness and speed, this paper outlines two systems (savior and refugee) that process the environment tensor to create safe rescue and escape routes, respectively.

[LG-139] SVSNet: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

链接: https://arxiv.org/abs/2406.08445
作者: Chun Yin,Tai-Shih Chi,Yu Tsao,Hsin-Min Wang
关键词: shown impressive performance, pre-trained speech foundation, pre-trained SFM representations, speech foundation models, Voice Conversion Challenge
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.

[LG-140] Understanding Sounds Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

链接: https://arxiv.org/abs/2406.08402
作者: Chun-Yi Kuan,Wei-Ping Huang,Hung-yi Lee
关键词: traditional large language, tackle audio-related tasks, Large audio-language models, large language models, enhance traditional large
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks. Previous research has primarily focused on assessing the performance of LALMs across various tasks, yet overlooking their reliability, particularly concerning issues like object hallucination. In our study, we introduce methods to assess the extent of object hallucination of publicly available LALMs. Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content, but struggle to answer discriminative questions, specifically those requiring the identification of the presence of particular object sounds within an audio clip. This limitation highlights a critical weakness in current LALMs: their inadequate understanding of discriminative queries. Moreover, we explore the potential of prompt engineering to enhance LALMs’ performance on discriminative questions.

[LG-141] Nystr"om Kernel Stein Discrepancy

链接: https://arxiv.org/abs/2406.08401
作者: Florian Kalinke,Zoltan Szabo,Bharath K. Sriperumbudur
关键词: reproducing kernel Hilbert, kernel Hilbert space, representing probability measures, Kernel methods underpin, Hilbert space
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein’s method with kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the construction of powerful goodness-of-fit tests where it is sufficient to know the target distribution up to a multiplicative constant. However, the typical U- and V-statistic-based KSD estimators suffer from a quadratic runtime complexity, which hinders their application in large-scale settings. In this work, we propose a Nyström-based KSD acceleration – with runtime \mathcal O!\left(mn+m^3\right) for n samples and m\ll n Nyström points – , show its \sqrtn -consistency under the null with a classical sub-Gaussian assumption, and demonstrate its applicability for goodness-of-fit testing on a suite of benchmarks.

[LG-142] Differentiable Cost-Parameterized Monge Map Estimators

链接: https://arxiv.org/abs/2406.08399
作者: Samuel Howard,George Deligiannidis,Patrick Rebeschini,James Thornton
关键词: transport map corresponds, real-world applications, optimal transport, field of optimal, crucial to ensuring
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Within the field of optimal transport (OT), the choice of ground cost is crucial to ensuring that the optimality of a transport map corresponds to usefulness in real-world applications. It is therefore desirable to use known information to tailor cost functions and hence learn OT maps which are adapted to the problem at hand. By considering a class of neural ground costs whose Monge maps have a known form, we construct a differentiable Monge map estimator which can be optimized to be consistent with known information about an OT map. In doing so, we simultaneously learn both an OT map estimator and a corresponding adapted cost function. Through suitable choices of loss function, our method provides a general approach for incorporating prior information about the Monge map itself when learning adapted OT maps and cost functions.

[LG-143] MMIL: A novel algorithm for disease associated cell type discovery

链接: https://arxiv.org/abs/2406.08322
作者: Erin Craig,Timothy Keyes,Jolanda Sarno,Maxim Zaslavsky,Garry Nolan,Kara Davis,Trevor Hastie,Robert Tibshirani
关键词: Multiple Instance Learning, Single-cell datasets, lack individual cell, making it challenging, datasets often lack
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Erin Craig and Timothy Keyes contributed equally to this work

点击查看摘要

Abstract:Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality.

[LG-144] Deep learning from strongly mixing observations: Sparse-penalized regularization and minimax optimality

链接: https://arxiv.org/abs/2406.08321
作者: William Kengne,Modou Wade
关键词: considerable progress recently, made considerable progress, deep neural network, progress recently, made considerable
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The explicit regularization and optimality of deep neural networks estimators from independent data have made considerable progress recently. The study of such properties on dependent data is still a challenge. In this paper, we carry out deep learning from strongly mixing observations, and deal with the squared and a broad class of loss functions. We consider sparse-penalized regularization for deep neural network predictor. For a general framework that includes, regression estimation, classification, time series prediction, \cdots , oracle inequality for the expected excess risk is established and a bound on the class of Hölder smooth functions is provided. For nonparametric regression from strong mixing data and sub-exponentially error, we provide an oracle inequality for the L_2 error and investigate an upper bound of this error on a class of Hölder composition functions. For the specific case of nonparametric autoregression with Gaussian and Laplace errors, a lower bound of the L_2 error on this Hölder composition class is established. Up to logarithmic factor, this bound matches its upper bound; so, the deep neural network estimator attains the minimax optimal rate.

[LG-145] Invariant multiscale neural networks for data-scarce scientific applications

链接: https://arxiv.org/abs/2406.08318
作者: I. Schurov,D. Alforov,M. Katsnelson,A. Bagrov,A. Itin
关键词: Success of machine, machine learning, modern world, world is largely, largely determined
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Optics (physics.optics)
*备注: 14 pages, 10 figures

点击查看摘要

Abstract:Success of machine learning (ML) in the modern world is largely determined by abundance of data. However at many industrial and scientific problems, amount of data is limited. Application of ML methods to data-scarce scientific problems can be made more effective via several routes, one of them is equivariant neural networks possessing knowledge of symmetries. Here we suggest that combination of symmetry-aware invariant architectures and stacks of dilated convolutions is a very effective and easy to implement receipt allowing sizable improvements in accuracy over standard approaches. We apply it to representative physical problems from different realms: prediction of bandgaps of photonic crystals, and network approximations of magnetic ground states. The suggested invariant multiscale architectures increase expressibility of networks, which allow them to perform better in all considered cases.

[LG-146] Measuring model variability using robust non-parametric testing

链接: https://arxiv.org/abs/2406.08307
作者: Sinjini Banerjee,Tim Marrinan,Reilly Cannon,Tony Chiang,Anand D. Sarwate
关键词: involves stochastic optimization, meaning each run, random seed, involves stochastic, run will produce
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training a deep neural network often involves stochastic optimization, meaning each run will produce a different model. The seed used to initialize random elements of the optimization procedure heavily influences the quality of a trained model, which may be obscure from many commonly reported summary statistics, like accuracy. However, random seed is often not included in hyper-parameter optimization, perhaps because the relationship between seed and model quality is hard to describe. This work attempts to describe the relationship between deep net models trained with different random seeds and the behavior of the expected model. We adopt robust hypothesis testing to propose a novel summary statistic for network similarity, referred to as the \alpha -trimming level. We use the \alpha -trimming level to show that the empirical cumulative distribution function of an ensemble model created from a collection of trained models with different random seeds approximates the average of these functions as the number of models in the collection grows large. This insight provides guidance for how many random seeds should be sampled to ensure that an ensemble of these trained models is a reliable representative. We also show that the \alpha -trimming level is more expressive than different performance metrics like validation accuracy, churn, or expected calibration error when taken alone and may help with random seed selection in a more principled fashion. We demonstrate the value of the proposed statistic in real experiments and illustrate the advantage of fine-tuning over random seed with an experiment in transfer learning.

[LG-147] Forward-Euler time-discretization for Wasserstein gradient flows can be wrong

链接: https://arxiv.org/abs/2406.08209
作者: Yewei Xu,Qin Li
关键词: Wasserstein gradient flows, simulating Wasserstein gradient, simulating Wasserstein, Wasserstein gradient, gradient flows
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In this note, we examine the forward-Euler discretization for simulating Wasserstein gradient flows. We provide two counter-examples showcasing the failure of this discretization even for a simple case where the energy functional is defined as the KL divergence against some nicely structured probability densities. A simple explanation of this failure is also discussed.

[LG-148] ransformer-based Model for ASR N-Best Rescoring and Rewriting

链接: https://arxiv.org/abs/2406.08207
作者: Iwen E. Kang,Christophe Van Gysel,Man-Hung Siu
关键词: Automatic Speech Recognition, on-device Automatic Speech, Voice assistants increasingly, Speech Recognition, Automatic Speech
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Interspeech '24

点击查看摘要

Abstract:Voice assistants increasingly use on-device Automatic Speech Recognition (ASR) to ensure speed and privacy. However, due to resource constraints on the device, queries pertaining to complex information domains often require further processing by a search engine. For such applications, we propose a novel Transformer based model capable of rescoring and rewriting, by exploring full context of the N-best hypotheses in parallel. We also propose a new discriminative sequence training objective that can work well for both rescore and rewrite tasks. We show that our Rescore+Rewrite model outperforms the Rescore-only baseline, and achieves up to an average 8.6% relative Word Error Rate (WER) reduction over the ASR system by itself.

[LG-149] Minimal Communication-Cost Statistical Learning

链接: https://arxiv.org/abs/2406.08193
作者: Milad Sefidgaran,Abdellatif Zaidi,Piotr Krasnowski
关键词: training data samples, data samples, obtain a statistical, server devices share, statistical hypothesis
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted at ISIT 2024

点击查看摘要

Abstract:A client device which has access to n training data samples needs to obtain a statistical hypothesis or model W and then to send it to a remote server. The client and the server devices share some common randomness sequence as well as a prior on the hypothesis space. In this problem a suitable hypothesis or model W should meet two distinct design criteria simultaneously: (i) small (population) risk during the inference phase and (ii) small ‘complexity’ for it to be conveyed to the server with minimum communication cost. In this paper, we propose a joint training and source coding scheme with provable in-expectation guarantees, where the expectation is over the encoder’s output message. Specifically, we show that by imposing a constraint on a suitable Kullback-Leibler divergence between the conditional distribution induced by a compressed learning model \widehatW given W and the prior, one guarantees simultaneously small average empirical risk (aka training loss), small average generalization error and small average communication cost. We also consider a one-shot scenario in which the guarantees on the empirical risk and generalization error are obtained for every encoder’s output message.

[LG-150] Strong and Weak Random Walks on Signed Networks

链接: https://arxiv.org/abs/2406.08034
作者: Shazia’Ayn Babul,Yu Tian,Renaud Lambiotte
关键词: signed network random, Random walks play, Random walks, network random walks, network random
类目: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Random walks play an important role in probing the structure of complex networks. On traditional networks, they can be used to extract community structure, understand node centrality, perform link prediction, or capture the similarity between nodes. On signed networks, where the edge weights can be either positive or negative, it is non-trivial to design a random walk which can be used to extract information about the signed structure of the network, in particular the ability to partition the graph into communities with positive edges inside and negative edges in between. Prior works on signed network random walks focus on the case where there are only two such communities (strong balance), which is rarely the case in empirical networks. In this paper, we propose a signed network random walk which can capture the structure of a network with more than two such communities (weak balance). The walk results in a similarity matrix which can be used to cluster the nodes into antagonistic communities. We compare the characteristics of the so-called strong and weak random walks, in terms of walk length and stationarity. We show through a series of experiments on synthetic and empirical networks that the similarity matrix based on weak walks can be used for both unsupervised and semi-supervised clustering, outperforming the same similarity matrix based on strong walks when the graph has more than two communities, or exhibits asymmetry in the density of links. These results suggest that other random-walk based algorithms for signed networks could be improved simply by running them with weak walks instead of strong walks.

[LG-151] Fault detection in propulsion motors in the presence of concept drift

链接: https://arxiv.org/abs/2406.08030
作者: Martin Tveten,Morten Stakkeland
关键词: Machine learning, learning and statistical, enhance monitoring, monitoring and fault, fault prediction
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Machine learning and statistical methods can be used to enhance monitoring and fault prediction in marine systems. These methods rely on a dataset with records of historical system behaviour, potentially containing periods of both fault-free and faulty operation. An unexpected change in the underlying system, called a concept drift, may impact the performance of these methods, triggering the need for model retraining or other adaptations. In this article, we present an approach for detecting overheating in stator windings of marine propulsion motors that is able to successfully operate during concept drift without the need for full model retraining. Two distinct approaches are presented and tested. All models are trained and verified using a dataset from operational propulsion motors, with known, sudden concept drifts.

[LG-152] LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

链接: https://arxiv.org/abs/2406.07969
作者: Masaya Kawamura,Ryuichi Yamamoto,Yuma Shirahata,Takuya Hasumi,Kentaro Tachibana
关键词: includes utterance-level descriptions, utterance-level descriptions, includes utterance-level, speaker characteristics, speaking style
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset. Our corpus, LibriTTS-P, is available at this https URL.

[LG-153] Simple yet Sharp Sensitivity Analysis for Any Contrast Under Unmeasured Confounding

链接: https://arxiv.org/abs/2406.07940
作者: Jose M. Peña
关键词: extend our previous, previous work, work on sensitivity, sensitivity analysis, risk ratio
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We extend our previous work on sensitivity analysis for the risk ratio and difference contrasts under unmeasured confounding to any contrast. We prove that the bounds produced are still arbitrarily sharp, i.e. practically attainable. We illustrate the usability of the bounds with real data.

[LG-154] Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions

链接: https://arxiv.org/abs/2406.07890
作者: Anfeng Xu,Kevin Huang,Tiantian Feng,Lue Shen,Helen Tager-Flusberg,Shrikanth Narayanan
关键词: Speech foundation models, opened unique opportunities, addressing challenging low-resource, challenging low-resource speech, foundation models
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Interspeech 2024

点击查看摘要

Abstract:Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models with varying the input audio window size, speaker demographics, and training data ratio. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.

[LG-155] Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations

链接: https://arxiv.org/abs/2406.07884
作者: Pavel Tashev,Stefan Petrov,Friederike Metz,Marin Bukov
关键词: largely unexplored paradigm, address outstanding challenges, quantum interactive dynamics, preparation and compression, partial knowledge
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: The source code as well as a demo in the form of an interactive Jupyter notebook are available on Github: this https URL

点击查看摘要

Abstract:Using partial knowledge of a quantum state to control multiqubit entanglement is a largely unexplored paradigm in the emerging field of quantum interactive dynamics with the potential to address outstanding challenges in quantum state preparation and compression, quantum control, and quantum complexity. We present a deep reinforcement learning (RL) approach to constructing short disentangling circuits for arbitrary 4-, 5-, and 6-qubit states using an actor-critic algorithm. With access to only two-qubit reduced density matrices, our agent decides which pairs of qubits to apply two-qubit gates on; requiring only local information makes it directly applicable on modern NISQ devices. Utilizing a permutation-equivariant transformer architecture, the agent can autonomously identify qubit permutations within the state, and adjusts the disentangling protocol accordingly. Once trained, it provides circuits from different initial states without further optimization. We demonstrate the agent’s ability to identify and exploit the entanglement structure of multiqubit states. For 4-, 5-, and 6-qubit Haar-random states, the agent learns to construct disentangling circuits that exhibit strong correlations both between consecutive gates and among the qubits involved. Through extensive benchmarking, we show the efficacy of the RL approach to find disentangling protocols with minimal gate resources. We explore the resilience of our trained agents to noise, highlighting their potential for real-world quantum computing applications. Analyzing optimal disentangling protocols, we report a general circuit to prepare an arbitrary 4-qubit state using at most 5 two-qubit (10 CNOT) gates.

[LG-156] Fully Adaptive Regret-Guaranteed Algorithm for Control of Linear Quadratic Systems

链接: https://arxiv.org/abs/2406.07746
作者: Jafar Abbaszadeh Chekan,Cedric Langbort
关键词: Linear Quadratic, unknown system model, Abbasi-Yadkori and Szepesvári, system model, unknown system
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The first algorithm for the Linear Quadratic (LQ) control problem with an unknown system model, featuring a regret of \mathcalO(\sqrtT) , was introduced by Abbasi-Yadkori and Szepesvári (2011). Recognizing the computational complexity of this algorithm, subsequent efforts (see Cohen et al. (2019), Mania et al. (2019), Faradonbeh et al. (2020a), and Kargin et al.(2022)) have been dedicated to proposing algorithms that are computationally tractable while preserving this order of regret. Although successful, the existing works in the literature lack a fully adaptive exploration-exploitation trade-off adjustment and require a user-defined value, which can lead to overall regret bound growth with some factors. In this work, noticing this gap, we propose the first fully adaptive algorithm that controls the number of policy updates (i.e., tunes the exploration-exploitation trade-off) and optimizes the upper-bound of regret adaptively. Our proposed algorithm builds on the SDP-based approach of Cohen et al. (2019) and relaxes its need for a horizon-dependant warm-up phase by appropriately tuning the regularization parameter and adding an adaptive input perturbation. We further show that through careful exploration-exploitation trade-off adjustment there is no need to commit to the widely-used notion of strong sequential stability, which is restrictive and can introduce complexities in initialization.

[LG-157] Progress Towards Decoding Visual Imagery via fNIRS

链接: https://arxiv.org/abs/2406.07662
作者: Michel Adamic(1),Wellington Avelino(1),Anna Brandenberger(2),Bryan Chiang(3),Hunter Davis,Stephen Fay(1),Andrew Gregory,Aayush Gupta,Raphael Hotter,Grace Jiang,Fiona Leng,Stephen Polcyn,Thomas Ribeiro(1),Paul Scotti(4),Michelle Wang(1),Marley Xiong,Jonathan Xu(5) ((1) McGill University, (2) Massachusetts Institute of Technology, (3) Stanford University, (4) Princeton University, (5) University of Waterloo)
关键词: fNIRS brain activity, required specs, demonstrate the possibility, possibility of reconstructing, brain activity
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:We demonstrate the possibility of reconstructing images from fNIRS brain activity and start building a prototype to match the required specs. By training an image reconstruction model on downsampled fMRI data, we discovered that cm-scale spatial resolution is sufficient for image generation. We obtained 71% retrieval accuracy with 1-cm resolution, compared to 93% on the full-resolution fMRI, and 20% with 2-cm resolution. With simulations and high-density tomography, we found that time-domain fNIRS can achieve 1-cm resolution, compared to 2-cm resolution for continuous-wave fNIRS. Lastly, we share designs for a prototype time-domain fNIRS device, consisting of a laser driver, a single photon detector, and a time-to-digital converter system.

[LG-158] Rate-Preserving Reductions for Blackwell Approachability

链接: https://arxiv.org/abs/2406.07585
作者: Christoph Dann,Yishay Mansour,Mehryar Mohri,Jon Schneider,Balasubramanian Sivan
关键词: no-regret learning instance, no-regret learning, learning, regret minimization, Blackwell approachability
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Abernethy et al. (2011) showed that Blackwell approachability and no-regret learning are equivalent, in the sense that any algorithm that solves a specific Blackwell approachability instance can be converted to a sublinear regret algorithm for a specific no-regret learning instance, and vice versa. In this paper, we study a more fine-grained form of such reductions, and ask when this translation between problems preserves not only a sublinear rate of convergence, but also preserves the optimal rate of convergence. That is, in which cases does it suffice to find the optimal regret bound for a no-regret learning instance in order to find the optimal rate of convergence for a corresponding approachability instance? We show that the reduction of Abernethy et al. (2011) does not preserve rates: their reduction may reduce a d -dimensional approachability instance I_1 with optimal convergence rate R_1 to a no-regret learning instance I_2 with optimal regret-per-round of R_2 , with R_2/R_1 arbitrarily large (in particular, it is possible that R_1 = 0 and R_2 0 ). On the other hand, we show that it is possible to tightly reduce any approachability instance to an instance of a generalized form of regret minimization we call improper \phi -regret minimization (a variant of the \phi -regret minimization of Gordon et al. (2008) where the transformation functions may map actions outside of the action set). Finally, we characterize when linear transformations suffice to reduce improper \phi -regret minimization problems to standard classes of regret minimization problems in a rate preserving manner. We prove that some improper \phi -regret minimization instances cannot be reduced to either subclass of instance in this way, suggesting that approachability can capture some problems that cannot be phrased in the language of online learning. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2406.07585 [stat.ML] (or arXiv:2406.07585v1 [stat.ML] for this version)

[LG-159] owards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models

链接: https://arxiv.org/abs/2406.07576
作者: Malo Maisonneuve,Corinne Fredouille,Muriel Lalain,Alain Ghio,Virginie Woisard
关键词: Head and Neck, Neck Cancers, significantly impact patients’, impact patients’ ability, ability to speak
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Head and Neck Cancers (HNC) significantly impact patients’ ability to speak, affecting their quality of life. Commonly used metrics for assessing pathological speech are subjective, prompting the need for automated and unbiased evaluation methods. This study proposes a self-supervised Wav2Vec2-based model for phone classification with HNC patients, to enhance accuracy and improve the discrimination of phonetic features for subsequent interpretability purpose. The impact of pre-training datasets, model size, and fine-tuning datasets and parameters are explored. Evaluation on diverse corpora reveals the effectiveness of the Wav2Vec2 architecture, outperforming a CNN-based approach, used in previous work. Correlation with perceptual measures also affirms the model relevance for impaired speech analysis. This work paves the way for better understanding of pathological speech with interpretable approaches for clinicians, by leveraging complex self-learnt speech representations.

[LG-160] Optimizing Sales Forecasts through Automated Integration of Market Indicators

链接: https://arxiv.org/abs/2406.07564
作者: Lina Döring,Felix Grumbach,Pascal Reusch
关键词: customer demand predictions, improving customer demand, integrate market indicators, Recognizing that traditional, demand predictions
类目: Econometrics (econ.EM); Machine Learning (cs.LG); General Finance (q-fin.GN)
*备注:

点击查看摘要

Abstract:Recognizing that traditional forecasting models often rely solely on historical demand, this work investigates the potential of data-driven techniques to automatically select and integrate market indicators for improving customer demand predictions. By adopting an exploratory methodology, we integrate macroeconomic time series, such as national GDP growth, from the \textitEurostat database into \textitNeural Prophet and \textitSARIMAX forecasting models. Suitable time series are automatically identified through different state-of-the-art feature selection methods and applied to sales data from our industrial partner. It could be shown that forecasts can be significantly enhanced by incorporating external information. Notably, the potential of feature selection methods stands out, especially due to their capability for automation without expert knowledge and manual selection effort. In particular, the Forward Feature Selection technique consistently yielded superior forecasting accuracy for both SARIMAX and Neural Prophet across different company sales datasets. In the comparative analysis of the errors of the selected forecasting models, namely Neural Prophet and SARIMAX, it is observed that neither model demonstrates a significant superiority over the other.

信息检索

[IR-0] Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens

链接: https://arxiv.org/abs/2406.08477
作者: Ting-Ji Huang,Jia-Qi Yang,Chunxu Shen,Kai-Qi Liu,De-Chuan Zhan,Han-Jia Ye
关键词: Large Language Models, Characterizing users, OOV tokens, OOV, apply Large Language
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Characterizing users and items through vector representations is crucial for various tasks in recommender systems. Recent approaches attempt to apply Large Language Models (LLMs) in recommendation through a question and answer format, where real users and items (e.g., Item No.2024) are represented with in-vocabulary tokens (e.g., “item”, “20”, “24”). However, since LLMs are typically pretrained on natural language tasks, these in-vocabulary tokens lack the expressive power for distinctive users and items, thereby weakening the recommendation ability even after fine-tuning on recommendation tasks. In this paper, we explore how to effectively tokenize users and items in LLM-based recommender systems. We emphasize the role of out-of-vocabulary (OOV) tokens in addition to the in-vocabulary ones and claim the memorization of OOV tokens that capture correlations of users/items as well as diversity of OOV tokens. By clustering the learned representations from historical user-item interactions, we make the representations of user/item combinations share the same OOV tokens if they have similar properties. Furthermore, integrating these OOV tokens into the LLM’s vocabulary allows for better distinction between users and items and enhanced capture of user-item relationships during fine-tuning on downstream tasks. Our proposed framework outperforms existing state-of-the-art methods across various downstream recommendation tasks.

[IR-1] Bridging the Gap: Unravelling Local Government Data Sharing Barriers in Estonia and Beyond

链接: https://arxiv.org/abs/2406.08461
作者: Katrin Rajamäe Soosaar,Anastasija Nikiforova
关键词: Estonia digital government, encounter persistent challenges, received global acclaim, Estonia digital, digital government success
类目: Computers and Society (cs.CY); Databases (cs.DB); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Estonia’s digital government success has received global acclaim, yet its Open Government Data (OGD) initiatives, especially at the local level, encounter persistent challenges. Despite significant progress of national OGD initiative in OGD rankings, local governments lag in OGD provision. This study aims to examine barriers hindering municipalities from openly sharing OGD. Employing a qualitative approach through interviews with Estonian municipalities and drawing on the OGD-adapted Innovation Resistance Theory model, the study sheds light on barriers impeding OGD sharing. Practical recommendations are proposed to bridge the gap between national policies and local implementation, including enhancing awareness, improving data governance frameworks, and fostering collaboration be-tween local and national authorities. By addressing overlooked weaknesses in the Estonian open data ecosystem and providing actionable recommendations, this research contributes to a more resilient and sustainable open data ecosystem. Additionally, by validating the OGD-adapted Innovation Resistance Theory model and proposing a revised version tailored for local government contexts, the study advances theoretical frameworks for understanding data sharing resistance. Ultimately, this study serves as a call to action for policymakers and practitioners to prioritize local OGD initiatives.

[IR-2] Wiki Entity Summarization Benchmark

链接: https://arxiv.org/abs/2406.08435
作者: Saeedeh Javadi,Atefeh Moradan,Mohammad Sorkhpar,Klim Zaporojets,Davide Mottin,Ira Assent
关键词: compute concise summaries, aims to compute, compute concise, Entity summarization aims, Entity summarization
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Entity summarization aims to compute concise summaries for entities in knowledge graphs. Existing datasets and benchmarks are often limited to a few hundred entities and discard graph structure in source knowledge graphs. This limitation is particularly pronounced when it comes to ground-truth summaries, where there exist only a few labeled summaries for evaluation and training. We propose WikES, a comprehensive benchmark comprising of entities, their summaries, and their connections. Additionally, WikES features a dataset generator to test entity summarization algorithms in different areas of the knowledge graph. Importantly, our approach combines graph algorithms and NLP models as well as different data sources such that WikES does not require human annotation, rendering the approach cost-effective and generalizable to multiple domains. Finally, WikES is scalable and capable of capturing the complexities of knowledge graphs in terms of topology and semantics. WikES features existing datasets for comparison. Empirical studies of entity summarization methods confirm the usefulness of our benchmark. Data, code, and models are available at: this https URL.

[IR-3] Boosting Multimedia Recommendation via Separate Generic and Unique Awareness

链接: https://arxiv.org/abs/2406.08270
作者: Zhuangzhuang He,Zihan Wang,Yonghui Yang,Haoyue Bai,Le Wu
关键词: received widespread attention, improve recommendation quality, widespread attention, received widespread, higher quality representation
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimedia recommendation, which incorporates various modalities (e.g., images, texts, etc.) into user or item representation to improve recommendation quality, has received widespread attention. Recent methods mainly focus on cross-modal alignment with self-supervised learning to obtain higher quality representation. Despite remarkable performance, we argue that there is still a limitation: completely aligning representation undermines modality-unique information. We consider that cross-modal alignment is right, but it should not be the entirety, as different modalities contain generic information between them, and each modality also contains unique information. Simply aligning each modality may ignore modality-unique features, thus degrading the performance of multimedia recommendation. To tackle the above limitation, we propose a Separate Alignment aNd Distancing framework (SAND) for multimedia recommendation, which concurrently learns both modal-unique and -generic representation to achieve more comprehensive items representation. First, we split each modal feature into generic and unique part. Then, in the alignment module, for better integration of semantic information between different modalities , we design a SoloSimLoss to align generic modalities. Furthermore, in the distancing module, we aim to distance the unique modalities from the modal-generic so that each modality retains its unique and complementary information. In the light of the flexibility of our framework, we give two technical solutions, the more capable mutual information minimization and the simple negative l2 distance. Finally, extensive experimental results on three popular datasets demonstrate the effectiveness and generalization of our proposed framework.

[IR-4] GPT4Rec: Graph Prompt Tuning for Streaming Recommendation

链接: https://arxiv.org/abs/2406.08229
作者: Peiyan Zhang,Yuchen Yan,Xi Zhang,Liying Kang,Chaozhuo Li,Feiran Huang,Senzhang Wang,Sunghun Kim
关键词: personalized recommender systems, evolving user preferences, user preferences, recommender systems, items is paramount
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by SIGIR 2024. arXiv admin note: text overlap with arXiv:2303.11700 by other authors

点击查看摘要

Abstract:In the realm of personalized recommender systems, the challenge of adapting to evolving user preferences and the continuous influx of new users and items is paramount. Conventional models, typically reliant on a static training-test approach, struggle to keep pace with these dynamic demands. Streaming recommendation, particularly through continual graph learning, has emerged as a novel solution. However, existing methods in this area either rely on historical data replay, which is increasingly impractical due to stringent data privacy regulations; or are inability to effectively address the over-stability issue; or depend on model-isolation and expansion strategies. To tackle these difficulties, we present GPT4Rec, a Graph Prompt Tuning method for streaming Recommendation. Given the evolving user-item interaction graph, GPT4Rec first disentangles the graph patterns into multiple views. After isolating specific interaction patterns and relationships in different views, GPT4Rec utilizes lightweight graph prompts to efficiently guide the model across varying interaction patterns within the user-item graph. Firstly, node-level prompts are employed to instruct the model to adapt to changes in the attributes or properties of individual nodes within the graph. Secondly, structure-level prompts guide the model in adapting to broader patterns of connectivity and relationships within the graph. Finally, view-level prompts are innovatively designed to facilitate the aggregation of information from multiple disentangled views. These prompt designs allow GPT4Rec to synthesize a comprehensive understanding of the graph, ensuring that all vital aspects of the user-item interactions are considered and effectively integrated. Experiments on four diverse real-world datasets demonstrate the effectiveness and efficiency of our proposal.

[IR-5] Graph Bottlenecked Social Recommendation

链接: https://arxiv.org/abs/2406.08214
作者: Yonghui Yang,Le Wu,Zihan Wang,Zhuangzhuang He,Richang Hong,Meng Wang
关键词: social, graph-based social recommendations, redundant social relations, social recommendations, social networks
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD 2024

点击查看摘要

Abstract:With the emergence of social networks, social recommendation has become an essential technique for personalized services. Recently, graph-based social recommendations have shown promising results by capturing the high-order social influence. Most empirical studies of graph-based social recommendations directly take the observed social networks into formulation, and produce user preferences based on social homogeneity. Despite the effectiveness, we argue that social networks in the real-world are inevitably noisy~(existing redundant social relations), which may obstruct precise user preference characterization. Nevertheless, identifying and removing redundant social relations is challenging due to a lack of labels. In this paper, we focus on learning the denoised social structure to facilitate recommendation tasks from an information bottleneck perspective. Specifically, we propose a novel Graph Bottlenecked Social Recommendation (GBSR) framework to tackle the social noise issue.GBSR is a model-agnostic social denoising framework, that aims to maximize the mutual information between the denoised social graph and recommendation labels, meanwhile minimizing it between the denoised social graph and the original one. This enables GBSR to learn the minimal yet sufficient social structure, effectively reducing redundant social relations and enhancing social recommendations. Technically, GBSR consists of two elaborate components, preference-guided social graph refinement, and HSIC-based bottleneck learning. Extensive experimental results demonstrate the superiority of the proposed GBSR, including high performances and good generality combined with various backbones. Our code is available at: this https URL.

[IR-6] Prediction of the Realisation of an Information Need: An EEG Study

链接: https://arxiv.org/abs/2406.08105
作者: Niall McGuire,Dr Yashar Moshfeghi
关键词: satisfy searchers’ Information, foundational goals, satisfy searchers’, searchers’ Information, Information Retrieval
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:One of the foundational goals of Information Retrieval (IR) is to satisfy searchers’ Information Needs (IN). Understanding how INs physically manifest has long been a complex and elusive process. However, recent studies utilising Electroencephalography (EEG) data have provided real-time insights into the neural processes associated with INs. Unfortunately, they have yet to demonstrate how this insight can practically benefit the search experience. As such, within this study, we explore the ability to predict the realisation of IN within EEG data across 14 subjects whilst partaking in a Question-Answering (Q/A) task. Furthermore, we investigate the combinations of EEG features that yield optimal predictive performance, as well as identify regions within the Q/A queries where a subject’s realisation of IN is more pronounced. The findings from this work demonstrate that EEG data is sufficient for the real-time prediction of the realisation of an IN across all subjects with an accuracy of 73.5% (SD 2.6%) and on a per-subject basis with an accuracy of 90.1% (SD 22.1%). This work helps to close the gap by bridging theoretical neuroscientific advancements with tangible improvements in information retrieval practices, paving the way for real-time prediction of the realisation of IN.

[IR-7] A Self-boosted Framework for Calibrated Ranking

链接: https://arxiv.org/abs/2406.08010
作者: Shunyu Zhang,Hu Liu,Wentian Bao,Enyun Yu,Yang Song
关键词: Scale-calibrated ranking systems, real-world applications nowadays, pursue accurate ranking, accurate ranking quality, Scale-calibrated ranking
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: KDD 2024

点击查看摘要

Abstract:Scale-calibrated ranking systems are ubiquitous in real-world applications nowadays, which pursue accurate ranking quality and calibrated probabilistic predictions simultaneously. For instance, in the advertising ranking system, the predicted click-through rate (CTR) is utilized for ranking and required to be calibrated for the downstream cost-per-click ads bidding. Recently, multi-objective based methods have been wildly adopted as a standard approach for Calibrated Ranking, which incorporates the combination of two loss functions: a pointwise loss that focuses on calibrated absolute values and a ranking loss that emphasizes relative orderings. However, when applied to industrial online applications, existing multi-objective CR approaches still suffer from two crucial limitations. First, previous methods need to aggregate the full candidate list within a single mini-batch to compute the ranking loss. Such aggregation strategy violates extensive data shuffling which has long been proven beneficial for preventing overfitting, and thus degrades the training effectiveness. Second, existing multi-objective methods apply the two inherently conflicting loss functions on a single probabilistic prediction, which results in a sub-optimal trade-off between calibration and ranking. To tackle the two limitations, we propose a Self-Boosted framework for Calibrated Ranking (SBCR).

[IR-8] Counteracting Duration Bias in Video Recommendation via Counterfactual Watch Time

链接: https://arxiv.org/abs/2406.07932
作者: Haiyuan Zhao,Guohao Cai,Jieming Zhu,Zhenhua Dong,Jun Xu,Ji-Rong Wen
关键词: satisfy users’ personalized, users’ personalized information, watch time, logged watch time, watch
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD 2024

点击查看摘要

Abstract:In video recommendation, an ongoing effort is to satisfy users’ personalized information needs by leveraging their logged watch time. However, watch time prediction suffers from duration bias, hindering its ability to reflect users’ interests accurately. Existing label-correction approaches attempt to uncover user interests through grouping and normalizing observed watch time according to video duration. Although effective to some extent, we found that these approaches regard completely played records (i.e., a user watches the entire video) as equally high interest, which deviates from what we observed on real datasets: users have varied explicit feedback proportion when completely playing videos. In this paper, we introduce the counterfactual watch time(CWT), the potential watch time a user would spend on the video if its duration is sufficiently long. Analysis shows that the duration bias is caused by the truncation of CWT due to the video duration limitation, which usually occurs on those completely played records. Besides, a Counterfactual Watch Model (CWM) is proposed, revealing that CWT equals the time users get the maximum benefit from video recommender systems. Moreover, a cost-based transform function is defined to transform the CWT into the estimation of user interest, and the model can be learned by optimizing a counterfactual likelihood function defined over observed user watch times. Extensive experiments on three real video recommendation datasets and online A/B testing demonstrated that CWM effectively enhanced video recommendation accuracy and counteracted the duration bias.

[IR-9] DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning

链接: https://arxiv.org/abs/2406.07913
作者: Yuxi Feng,Raymond Li,Zhenan Fan,Giuseppe Carenini,Mohammadreza Pourreza,Weiwei Zhang,Yong Zhang
关键词: Structured Query Language, Large Language Models, natural language questions, translating natural language, open research problem
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:While in-context Learning (ICL) has proven to be an effective technique to improve the performance of Large Language Models (LLMs) in a variety of complex tasks, notably in translating natural language questions into Structured Query Language (NL2SQL), the question of how to select the most beneficial demonstration examples remains an open research problem. While prior works often adapted off-the-shelf encoders to retrieve examples dynamically, an inherent discrepancy exists in the representational capacities between the external retrievers and the LLMs. Further, optimizing the selection of examples is a non-trivial task, since there are no straightforward methods to assess the relative benefits of examples without performing pairwise inference. To address these shortcomings, we propose DeTriever, a novel demonstration retrieval framework that learns a weighted combination of LLM hidden states, where rich semantic information is encoded. To train the model, we propose a proxy score that estimates the relative benefits of examples based on the similarities between output queries. Experiments on two popular NL2SQL benchmarks demonstrate that our method significantly outperforms the state-of-the-art baselines on one-shot NL2SQL tasks.

[IR-10] “It answers questions that I didnt know I had”: Ph.D. Students Evaluation of an Information Sharing Knowledge Graph

链接: https://arxiv.org/abs/2406.07730
作者: Stanislava Gardasevic,Manika Lamba
关键词: vital information needed, knowledge graph, university websites, interacting with people, knowledge
类目: Human-Computer Interaction (cs.HC); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Interdisciplinary PhD programs can be challenging as the vital information needed by students may not be readily available, it is scattered across university’s websites, while tacit knowledge can be obtained only by interacting with people. Hence, there is a need to develop a knowledge management model to create, query, and maintain a knowledge repository for interdisciplinary students. We propose a knowledge graph containing information on critical categories and their relationships, extracted from multiple sources, essential for interdisciplinary PhD students. This study evaluates the usability of a participatory designed knowledge graph intended to facilitate information exchange and decision-making. The usability findings demonstrate that interaction with this knowledge graph benefits PhD students by notably reducing uncertainty and academic stress, particularly among newcomers. Knowledge graph supported them in decision making, especially when choosing collaborators in an interdisciplinary setting. Key helpful features are related to exploring student faculty networks, milestones tracking, rapid access to aggregated data, and insights into crowdsourced fellow students’ activities. The knowledge graph provides a solution to meet the personalized needs of doctoral researchers and has the potential to improve the information discovery and decision-making process substantially. It also includes the tacit knowledge exchange support missing from most current approaches, which is critical for this population and establishing interdisciplinary collaborations. This approach can be applied to other interdisciplinary programs and domains globally.

人工智能

[AI-0] ICE-G: Image Conditional Editing of 3D Gaussian Splats

链接: https://arxiv.org/abs/2406.08488
作者: Vishnu Jaganathan,Hannah Hanyun Huang,Muhammad Zubair Irshad,Varun Jampani,Amit Raj,Zsolt Kira
关键词: create high quality, emerged to create, create high, Recently, Recently many techniques
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to CVPR AI4CC Workshop 2024. Project page: this https URL

点击查看摘要

Abstract:Recently many techniques have emerged to create high quality 3D assets and scenes. When it comes to editing of these objects, however, existing approaches are either slow, compromise on quality, or do not provide enough customization. We introduce a novel approach to quickly edit a 3D model from a single reference view. Our technique first segments the edit image, and then matches semantically corresponding regions across chosen segmented dataset views using DINO features. A color or texture change from a particular region of the edit image can then be applied to other views automatically in a semantically sensible manner. These edited views act as an updated dataset to further train and re-style the 3D scene. The end-result is therefore an edited 3D model. Our framework enables a wide variety of editing tasks such as manual local edits, correspondence based style transfer from any example image, and a combination of different styles from multiple example images. We use Gaussian Splats as our primary 3D representation due to their speed and ease of local editing, but our technique works for other methods such as NeRFs as well. We show through multiple examples that our method produces higher quality results while offering fine-grained control of editing. Project page: this http URL

[AI-1] RMem: Restricted Memory Banks Improve Video Object Segmentation

链接: https://arxiv.org/abs/2406.08476
作者: Junbao Zhou,Ziqi Pang,Yu-Xiong Wang
关键词: memory banks, expanding memory banks, benchmarks evolving, memory, VOS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: CVPR 2024, Project Page: this https URL

点击查看摘要

Abstract:With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed “memory deciphering” study offers a pivotal insight underpinning such a strategy: expanding memory banks, while seemingly beneficial, actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally, restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked “temporal positional embedding.” Finally, our insights are embodied in “RMem” (“R” for restricted), a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes new state of the art for object state changes (on the VOST dataset) and long videos (on the Long Videos dataset). Our code and demo are available at this https URL.

[AI-2] Real2Code: Reconstruct Articulated Objects via Code Generation

链接: https://arxiv.org/abs/2406.08474
作者: Zhao Mandi,Yijia Weng,Dominik Bauer,Shuran Song
关键词: code generation, reconstructing articulated objects, real world objects, reconstructing articulated, objects
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Real2Code, a novel approach to reconstructing articulated objects via code generation. Given visual observations of an object, we first reconstruct its part geometry using an image segmentation model and a shape completion model. We then represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) to predict joint articulation as code. By leveraging pre-trained vision and language models, our approach scales elegantly with the number of articulated parts, and generalizes from synthetic training data to real world objects in unstructured environments. Experimental results demonstrate that Real2Code significantly outperforms previous state-of-the-art in reconstruction accuracy, and is the first approach to extrapolate beyond objects’ structural complexity in the training set, and reconstructs objects with up to 10 articulated parts. When incorporated with a stereo reconstruction model, Real2Code also generalizes to real world objects from a handful of multi-view RGB images, without the need for depth or camera information.

[AI-3] RILe: Reinforced Imitation Learning

链接: https://arxiv.org/abs/2406.08472
作者: Mert Albaba,Sammy Christen,Christoph Gebhardt,Thomas Langarek,Michael J. Black,Otmar Hilliges
关键词: Inverse Reinforcement Learning, achieved significant success, generating complex behavior, Reinforcement Learning, Reinforcement Learning offer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement Learning has achieved significant success in generating complex behavior but often requires extensive reward function engineering. Adversarial variants of Imitation Learning and Inverse Reinforcement Learning offer an alternative by learning policies from expert demonstrations via a discriminator. Employing discriminators increases their data- and computational efficiency over the standard approaches; however, results in sensitivity to imperfections in expert data. We propose RILe, a teacher-student system that achieves both robustness to imperfect data and efficiency. In RILe, the student learns an action policy while the teacher dynamically adjusts a reward function based on the student’s performance and its alignment with expert demonstrations. By tailoring the reward function to both performance of the student and expert similarity, our system reduces dependence on the discriminator and, hence, increases robustness against data imperfections. Experiments show that RILe outperforms existing methods by 2x in settings with limited or noisy expert data.

[AI-4] Surprise! Using Physiological Stress for Allostatic Regulation Under the Active Inference Framework [Pre-Print]

链接: https://arxiv.org/abs/2406.08471
作者: Imran Khan,Robert Lowe
关键词: prediction errors, minimizes long-term prediction, achieved through anticipatory, anticipatory adjustments, Allostasis proposes
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Neurons and Cognition (q-bio.NC)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:Allostasis proposes that long-term viability of a living system is achieved through anticipatory adjustments of its physiology and behaviour: emphasising physiological and affective stress as an adaptive state of adaptation that minimizes long-term prediction errors. More recently, the active inference framework (AIF) has also sought to explain action and long-term adaptation through the minimization of future errors (free energy), through the learning of statistical contingencies of the world, offering a formalism for allostatic regulation. We suggest that framing prediction errors through the lens of biological hormonal dynamics proposed by allostasis offers a way to integrate these two models together in a biologically-plausible manner. In this paper, we describe our initial work in developing a model that grounds prediction errors (surprisal) into the secretion of a physiological stress hormone (cortisol) acting as an adaptive, allostatic mediator on a homeostatically-controlled physiology. We evaluate this using a computational model in simulations using an active inference agent endowed with an artificial physiology, regulated through homeostatic and allostatic control in a stochastic environment. Our results find that allostatic functions of cortisol (stress), secreted as a function of prediction errors, provide adaptive advantages to the agent’s long-term physiological regulation. We argue that the coupling of information-theoretic prediction errors to low-level, biological hormonal dynamics of stress can provide a computationally efficient model to long-term regulation for embodied intelligent systems.

[AI-5] DafnyBench: A Benchmark for Formal Software Verification

链接: https://arxiv.org/abs/2406.08467
作者: Chloe Loughridge,Qinyi Sun,Seth Ahrenbach,Federico Cassano,Chuyue Sun,Ying Sheng,Anish Mudide,Md Rakib Hossain Misu,Nada Amin,Max Tegmark
关键词: evaluating machine learning, machine learning systems, formal software verification, Dafny formal verification, largest benchmark
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: Code dataset available at: this https URL

点击查看摘要

Abstract:We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.

[AI-6] Scaling Laws in Linear Regression: Compute Parameters and Data

链接: https://arxiv.org/abs/2406.08466
作者: Licong Lin,Jingfeng Wu,Sham M. Kakade,Peter L. Bartlett,Jason D. Lee
关键词: large-scale deep learning, neural scaling laws, deep learning models, model improves polynomially, model size
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with M parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using N data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree a1 , we show that the reducible part of the test error is \Theta(M^-(a-1) + N^-(a-1)/a) . The variance error, which increases with M , is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML) Cite as: arXiv:2406.08466 [cs.LG] (or arXiv:2406.08466v1 [cs.LG] for this version)

[AI-7] Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

链接: https://arxiv.org/abs/2406.08464
作者: Zhangchen Xu,Fengqing Jiang,Luyao Niu,Yuntian Deng,Radha Poovendran,Yejin Choi,Bill Yuchen Lin
关键词: aligning large language, large language models, critical for aligning, aligning large, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Link: this https URL

点击查看摘要

Abstract:High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.

[AI-8] he Impact of Initialization on LoRA Finetuning Dynamics

链接: https://arxiv.org/abs/2406.08447
作者: Soufiane Hayou,Nikhil Ghosh,Bin Yu
关键词: Low Rank Adaptation, Rank Adaptation, Low Rank, study the role, originally introduced
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: TDLR: Different Initializations lead to completely different finetuning dynamics. One initialization (set A random and B zero) is generally better than the natural opposite initialization. arXiv admin note: text overlap with arXiv:2402.12354

点击查看摘要

Abstract:In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in-principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance compared to the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs.

[AI-9] OLMES: A Standard for Language Model Evaluations

链接: https://arxiv.org/abs/2406.08446
作者: Yuling Gu,Oyvind Tafjord,Bailey Kuehl,Dany Haddad,Jesse Dodge,Hannaneh Hajishirzi
关键词: claiming improved performance, measuring model capabilities, models claiming improved, claiming improved, tasks measuring model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models in particular is challenging, as small changes to how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to claims about which models perform best not being reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community - such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require the unnatural “cloze” formulation of multiple-choice questions against larger models that can utilize the original formulation. OLMES includes well-considered recommendations guided by results from existing literature as well as new experiments investigating open questions.

[AI-10] asTe: Teaching Large Language Models to Translate through Self-Reflection

链接: https://arxiv.org/abs/2406.08434
作者: Yutong Wang,Jiali Zeng,Xuebo Liu,Fandong Meng,Jie Zhou,Min Zhang
关键词: Large language models, exhibited remarkable performance, natural language processing, language processing tasks, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted to the ACL 2024 main conference

点击查看摘要

Abstract:Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks. Techniques like instruction tuning have effectively enhanced the proficiency of LLMs in the downstream task of machine translation. However, the existing approaches fail to yield satisfactory translation outputs that match the quality of supervised neural machine translation (NMT) systems. One plausible explanation for this discrepancy is that the straightforward prompts employed in these methodologies are unable to fully exploit the acquired instruction-following capabilities. To this end, we propose the TasTe framework, which stands for translating through self-reflection. The self-reflection process includes two stages of inference. In the first stage, LLMs are instructed to generate preliminary translations and conduct self-assessments on these translations simultaneously. In the second stage, LLMs are tasked to refine these preliminary translations according to the evaluation results. The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods. Our work presents a promising approach to unleash the potential of LLMs and enhance their capabilities in MT. The codes and datasets are open-sourced at this https URL.

[AI-11] Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2406.08431
作者: Benjamin Biggs,Arjun Seshadri,Yang Zou,Achin Jain,Aditya Golatkar,Yusheng Xie,Alessandro Achille,Ashwin Swaminathan,Stefano Soatto
关键词: present Diffusion Soup, Diffusion Soup, compartmentalization method, Diffusion Soup samples, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that Diffusion Soup samples from a point in weight space that approximates the geometric mean of the distributions of constituent datasets, which offers anti-memorization guarantees and enables zero-shot style mixing. Empirically, Diffusion Soup outperforms a paragon model trained on the union of all data shards and achieves a 30% improvement in Image Reward (.34 \to .44) on domain sharded data, and a 59% improvement in IR (.37 \to .59) on aesthetic data. In both cases, souping also prevails in TIFA score (respectively, 85.5 \to 86.5 and 85.6 \to 86.8). We demonstrate robust unlearning – removing any individual domain shard only lowers performance by 1% in IR (.45 \to .44) – and validate our theoretical insights on anti-memorization using real data. Finally, we showcase Diffusion Soup’s ability to blend the distinct styles of models finetuned on different shards, resulting in the zero-shot generation of hybrid styles.

[AI-12] Improving Noise Robustness through Abstractions and its Impact on Machine Learning

链接: https://arxiv.org/abs/2406.08428
作者: Alfredo Ibias(1),Karol Capala(1),Varun Ravi Varma(1),Anna Drozdz(1),Jose Sousa(1) ((1) Personal Health Data Science, Sano - Centre for Computational Personalised Medicine)
关键词: Machine Learning, application of Machine, learning theory, world data tendency, real world data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Noise is a fundamental problem in learning theory with huge effects in the application of Machine Learning (ML) methods, due to real world data tendency to be noisy. Additionally, introduction of malicious noise can make ML methods fail critically, as is the case with adversarial attacks. Thus, finding and developing alternatives to improve robustness to noise is a fundamental problem in ML. In this paper, we propose a method to deal with noise: mitigating its effect through the use of data abstractions. The goal is to reduce the effect of noise over the model’s performance through the loss of information produced by the abstraction. However, this information loss comes with a cost: it can result in an accuracy reduction due to the missing information. First, we explored multiple methodologies to create abstractions, using the training dataset, for the specific case of numerical data and binary classification tasks. We also tested how these abstractions can affect robustness to noise with several experiments that explore the robustness of an Artificial Neural Network to noise when trained using raw data \emphvs when trained using abstracted data. The results clearly show that using abstractions is a viable approach for developing noise robust ML methods.

[AI-13] Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL

链接: https://arxiv.org/abs/2406.08426
作者: Zijin Hong,Zheng Yuan,Qinggang Zhang,Hao Chen,Junnan Dong,Feiran Huang,Xiao Huang
关键词: Generating accurate SQL, Generating accurate, SQL generation, accurate SQL, long-standing problem
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Generating accurate SQL according to natural language questions (text-to-SQL) is a long-standing problem since it is challenging in user question understanding, database schema comprehension, and SQL generation. Conventional text-to-SQL systems include human engineering and deep neural networks. Subsequently, pre-trained language models (PLMs) have been developed and utilized for text-to-SQL tasks, achieving promising performance. As modern databases become more complex and corresponding user questions more challenging, PLMs with limited comprehension capabilities can lead to incorrect SQL generation. This necessitates more sophisticated and tailored optimization methods, which, in turn, restricts the applications of PLM-based systems. Most recently, large language models (LLMs) have demonstrated significant abilities in natural language understanding as the model scale remains increasing. Therefore, integrating the LLM-based implementation can bring unique opportunities, challenges, and solutions to text-to-SQL research. In this survey, we present a comprehensive review of LLM-based text-to-SQL. Specifically, we propose a brief overview of the current challenges and the evolutionary process of text-to-SQL. Then, we provide a detailed introduction to the datasets and metrics designed to evaluate text-to-SQL systems. After that, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we discuss the remaining challenges in this field and propose expectations for future directions.

[AI-14] AWGUNET: Attention-Aided Wavelet Guided U-Net for Nuclei Segmentation in Histopathology Images

链接: https://arxiv.org/abs/2406.08425
作者: Ayush Roy,Payel Pramanik,Dmitrii Kaplun,Sergei Antonov,Ram Sarkar
关键词: Accurate nuclei segmentation, Accurate nuclei, automating nuclei segmentation, Accurate, nuclei segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate nuclei segmentation in histopathological images is crucial for cancer diagnosis. Automating this process offers valuable support to clinical experts, as manual annotation is time-consuming and prone to human errors. However, automating nuclei segmentation presents challenges due to uncertain cell boundaries, intricate staining, and diverse structures. In this paper, we present a segmentation approach that combines the U-Net architecture with a DenseNet-121 backbone, harnessing the strengths of both to capture comprehensive contextual and spatial information. Our model introduces the Wavelet-guided channel attention module to enhance cell boundary delineation, along with a learnable weighted global attention module for channel-specific attention. The decoder module, composed of an upsample block and convolution block, further refines segmentation in handling staining patterns. The experimental results conducted on two publicly accessible histopathology datasets, namely Monuseg and TNBC, underscore the superiority of our proposed model, demonstrating its potential to advance histopathological image analysis and cancer diagnosis. The code is made available at: this https URL.

[AI-15] State Soup: In-Context Skill Learning Retrieval and Mixing

链接: https://arxiv.org/abs/2406.08423
作者: Maciej Pióro,Maciej Wołczyk,Razvan Pascanu,Johannes von Oswald,João Sacramento
关键词: sequence modeling problems, gated-linear recurrent neural, recurrent neural networks, networks has reached, modeling problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.

[AI-16] OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

链接: https://arxiv.org/abs/2406.08418
作者: Qingyun Li,Zhe Chen,Weiyun Wang,Wenhai Wang,Shenglong Ye,Zhenjiang Jin,Guanzhou Chen,Yinan He,Zhangwei Gao,Erfei Cui,Jiashuo Yu,Hao Tian,Jiasheng Zhou,Chao Xu,Bin Wang,Xingjian Wei,Wei Li,Wenjian Zhang,Bo Zhang,Pinlong Cai,Licheng Wen,Xiangchao Yan,Pei Chu,Yi Wang,Min Dou,Changyao Tian,Xizhou Zhu,Lewei Lu,Yushi Chen,Junjun He,Tong Lu,Yali Wang,Limin Wang,Dahua Lin,Yu Qiao,Botian Shi,Conghui He,Jifeng Dai
关键词: human reading habits, closely resembles human, resembles human reading, Image-text interleaved, Image-text interleaved data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at this https URL.

[AI-17] ailoring Generative AI Chatbots for Multiethnic Communities in Disaster Preparedness Communication: Extending the CASA Paradigm

链接: https://arxiv.org/abs/2406.08411
作者: Xinyan Zhao,Yuan Sun,Wenlin Liu,Chau-Wai Wong
关键词: powered by GPT, develop different prototypes, prototypes of generative, Social Actors, communicate hurricane preparedness
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 21 pages

点击查看摘要

Abstract:This study is among the first to develop different prototypes of generative AI (GenAI) chatbots powered by GPT 4 to communicate hurricane preparedness information to diverse residents. Drawing from the Computers Are Social Actors (CASA) paradigm and the literature on disaster vulnerability and cultural tailoring, this study conducted a between-subjects experiment with 441 Black, Hispanic, and Caucasian residents of Florida. A computational analysis of chat logs (N = 7,848) shows that anthropomorphism and personalization are key communication topics in GenAI chatbot-user interactions. SEM results (N = 441) suggest that GenAI chatbots varying in tone formality and cultural tailoring significantly predict bot perceptions and, subsequently, hurricane preparedness outcomes. These results highlight the potential of using GenAI chatbots to improve diverse communities’ disaster preparedness.

[AI-18] MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

链接: https://arxiv.org/abs/2406.08407
作者: Xuehai He,Weixi Feng,Kaizhi Zheng,Yujie Lu,Wanrong Zhu,Jiachen Li,Yue Fan,Jianfeng Wang,Linjie Li,Zhengyuan Yang,Kevin Lin,William Yang Wang,Lijuan Wang,Xin Eric Wang
关键词: Multimodal Language Language, Language Language Models, Language Language, Multimodal Language, complex real-world dynamics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Multimodal Language Language Models (MLLMs) demonstrate the emerging abilities of “world models” – interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models’ different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

[AI-19] Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

链接: https://arxiv.org/abs/2406.08404
作者: Yuhui Wang,Qingyuan Wu,Weida Li,Dylan R. Ashley,Francesco Faccio,Chao Huang,Jürgen Schmidhuber
关键词: Iteration Network, performs value iteration, latent MDP, differentiable architecture, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a 100\times 100 maze – a task which typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module’s depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introducing an “adaptive highway loss” that constructs skip connections to improve gradient flow. We evaluate our method on both 2D maze navigation environments and the ViZDoom 3D navigation benchmark. We find that our new method, named Dynamic Transition VIN (DT-VIN), easily scales to 5000 layers and casually solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments.

[AI-20] cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers

链接: https://arxiv.org/abs/2406.08398
作者: Anirudh Sundar,Jin Xu,William Gay,Christopher Richardson,Larry Heck
关键词: multimodal interactive conversations, interactive conversations, emerging area, situated and multimodal, multimodal interactive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 1 figure

点击查看摘要

Abstract:An emerging area of research in situated and multimodal interactive conversations (SIMMC) includes interactions in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interactions required by research scientists. This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers grounded in these paper components and their associated references from scientific documents available on arXiv. We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from LaTeX source files. Additionally, we present a series of baseline approaches utilizing Large Language Models (LLMs) in both zero-shot and fine-tuned configurations to address the cPAPERS dataset.

[AI-21] Large Language Models Must Be Taught to Know What They Dont Know

链接: https://arxiv.org/abs/2406.08391
作者: Sanyam Kapoor,Nate Gruver,Manley Roberts,Katherine Collins,Arka Pal,Umang Bhatt,Adrian Weller,Samuel Dooley,Micah Goldblum,Andrew Gordon Wilson
关键词: high-stakes applications, trust their predictions, large language models, argue that prompting, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: Code available at: this https URL

点击查看摘要

Abstract:When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study.

[AI-22] Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

链接: https://arxiv.org/abs/2406.08384
作者: Javier Nistal,Marco Pasini,Cyran Aouameur,Maarten Grachten,Stefan Lattner
关键词: high computational demands, Recent advancements, limited audio quality, deep generative models, generative models present
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 8 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Recent advancements in deep generative models present new opportunities for music production but also pose challenges, such as high computational demands and limited audio quality. Moreover, current systems frequently rely solely on text input and typically focus on producing complete musical pieces, which is incompatible with existing workflows in music production. To address these issues, we introduce “Diff-A-Riff,” a Latent Diffusion Model designed to generate high-quality instrumental accompaniments adaptable to any musical context. This model offers control through either audio references, text prompts, or both, and produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage. We demonstrate the model’s capabilities through objective metrics and subjective listening tests, with extensive examples available on the accompanying website: this http URL

[AI-23] 2.5D Multi-view Averaging Diffusion Model for 3D Medical Image Translation: Application to Low-count PET Reconstruction with CT-less Attenuation Correction

链接: https://arxiv.org/abs/2406.08374
作者: Tianqi Chen,Jun Hou,Yinchi Zhou,Huidong Xie,Xiongchao Chen,Qiong Liu,Xueqi Guo,Menghua Xia,James S. Duncan,Chi Liu,Bo Zhou
关键词: Positron Emission Tomography, Positron Emission, Emission Tomography, important clinical imaging, clinical imaging tool
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Positron Emission Tomography (PET) is an important clinical imaging tool but inevitably introduces radiation hazards to patients and healthcare providers. Reducing the tracer injection dose and eliminating the CT acquisition for attenuation correction can reduce the overall radiation dose, but often results in PET with high noise and bias. Thus, it is desirable to develop 3D methods to translate the non-attenuation-corrected low-dose PET (NAC-LDPET) into attenuation-corrected standard-dose PET (AC-SDPET). Recently, diffusion models have emerged as a new state-of-the-art deep learning method for image-to-image translation, better than traditional CNN-based methods. However, due to the high computation cost and memory burden, it is largely limited to 2D applications. To address these challenges, we developed a novel 2.5D Multi-view Averaging Diffusion Model (MADM) for 3D image-to-image translation with application on NAC-LDPET to AC-SDPET translation. Specifically, MADM employs separate diffusion models for axial, coronal, and sagittal views, whose outputs are averaged in each sampling step to ensure the 3D generation quality from multiple views. To accelerate the 3D sampling process, we also proposed a strategy to use the CNN-based 3D generation as a prior for the diffusion model. Our experimental results on human patient studies suggested that MADM can generate high-quality 3D translation images, outperforming previous CNN-bas