This blog post presents the latest paper list fetched daily from the arXiv website, updated automatically every morning at around 11:30 and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email, please leave your email address in the comments.

Note: the daily paper data is retrieved from the arXiv website and refreshed automatically at around 11:30 every morning.

Friendly reminder: if you would like to receive the daily paper data by email, leave your email address in the comments; the email is likewise sent automatically at around 11:30 every day.

Table of Contents

Overview (2024-06-17)

A total of 420 papers were updated today, including:

  • 72 in Natural Language Processing (Computation and Language, cs.CL)
  • 118 in Computer Vision (Computer Vision and Pattern Recognition, cs.CV)
  • 113 in Artificial Intelligence (cs.AI)
  • 133 in Machine Learning (cs.LG)

Natural Language Processing

[NLP-0] VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Link: https://arxiv.org/abs/2406.10228
Authors: Chenyu Zhou,Mengdan Zhang,Peixian Chen,Chaoyou Fu,Yunhang Shen,Xiawu Zheng,Xing Sun,Rongrong Ji
Keywords: Multi-modal Large Models, Multi-modal Large, progress of Multi-modal, tackle tasks blending, tasks blending vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Remarks: Project Page: this https URL

Abstract:The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT4V, only achieved modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an 85.8% accuracy rate in image association and a 0.508 Rouge score. These results validate the effectiveness of our dataset in improving MLLMs capabilities for nuanced image-text comprehension.

[NLP-1] Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding

Link: https://arxiv.org/abs/2406.10221
Authors: Ridouane Ghermi,Xi Wang,Vicky Kalogeiton,Ivan Laptev
Keywords: Recent advances, propelled video understanding, advances in vision-language, Recent, video understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Remarks:

Abstract:Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos and frequently encounter data leakage given the use of movie forums and other resources in LLM training. To address the above limitations, we propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies, a wide variety of genres and minimal data leakage issues. SFD offers long-term story-oriented video tasks in the form of multiple-choice and open-ended question answering. Our extensive experiments emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find strong signals in movie transcripts leading to the on-par performance of people and LLMs. We also show significantly lower performance of current models compared to people when using vision data alone.

[NLP-2] Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Link: https://arxiv.org/abs/2406.10216
Authors: Rui Yang,Ruomeng Ding,Yong Lin,Huan Zhang,Tong Zhang
Keywords: aligning Large Language, Large Language Models, aligning Large, human preference data, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Remarks: 21 pages

Abstract:Reward models trained on human preference data have been proven to be effective for aligning Large Language Models (LLMs) with human intent within the reinforcement learning from human feedback (RLHF) framework. However, the generalization capabilities of current reward models to unseen prompts and responses are limited. This limitation can lead to an unexpected phenomenon known as reward over-optimization, where excessive optimization of rewards results in a decline in actual performance. While previous research has advocated for constraining policy optimization, our study proposes a novel approach to enhance the reward model’s generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model’s language model head and incorporate a suite of text-generation losses to preserve the hidden states’ text generation capabilities, while concurrently learning a reward head behind the same hidden states. Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviate the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.

[NLP-3] DevBench: A multimodal developmental benchmark for language learning

Link: https://arxiv.org/abs/2406.10215
Authors: Alvin Wei Ming Tan,Sunny Yu,Bria Long,Wanjing Anya Ma,Tonya Murray,Rebecca D. Silverman,Jason D. Yeatman,Michael C. Frank
Keywords: models, response patterns, data, trajectories of vision-language, language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Remarks:

Abstract:How (dis)similar are the learning trajectories of vision-language models and children? Recent modeling work has attempted to understand the gap between models’ and humans’ data efficiency by constructing models trained on less data, especially multimodal naturalistic data. However, such models are often evaluated on adult-level benchmarks, with limited breadth in language abilities tested, and without direct comparison to behavioral data. We introduce DevBench, a multimodal benchmark comprising seven language evaluation tasks spanning the domains of lexical, syntactic, and semantic ability, with behavioral data from both children and adults. We evaluate a set of vision-language models on these tasks, comparing models and humans not only on accuracy but on their response patterns. Across tasks, models exhibit variation in their closeness to human response patterns, and models that perform better on a task also more closely resemble human behavioral responses. We also examine the developmental trajectory of OpenCLIP over training, finding that greater training results in closer approximations to adult response patterns. DevBench thus provides a benchmark for comparing models to human language development. These comparisons highlight ways in which model and human language learning processes diverge, providing insight into entry points for improving language models.

[NLP-4] Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

Link: https://arxiv.org/abs/2406.10209
Authors: Abhimanyu Hans,Yuxin Wen,Neel Jain,John Kirchenbauer,Hamid Kazemi,Prajwal Singhania,Siddharth Singh,Gowthami Somepalli,Jonas Geiping,Abhinav Bhatele,Tom Goldstein
Keywords: Large language models, Large language, causing privacy, copyright risks, memorize and repeat
Categories: Computation and Language (cs.CL)
Remarks: 9.5 pages, 8 figures, and 1 table in the main body. Code available at this https URL

Abstract:Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
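The abstract gives enough detail to sketch the core idea: a random subset of token positions is excluded from the next-token loss, so the model is never fully supervised on a complete chain of tokens. Below is a minimal PyTorch sketch; the drop rate and the purely random masking are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def goldfish_style_loss(logits, labels, drop_prob=0.25, ignore_index=-100):
    """Next-token cross-entropy with a random subset of positions dropped
    from the loss, so no complete token chain is ever fully memorized."""
    labels = labels.clone()
    drop_mask = torch.rand(labels.shape, device=labels.device) < drop_prob
    labels[drop_mask] = ignore_index  # dropped tokens contribute nothing to the loss
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=ignore_index
    )
```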

[NLP-5] A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors

Link: https://arxiv.org/abs/2406.10203
Authors: Naaman Tan,Josef Valvoda,Anej Svete,Tianyu Liu,Yanxia Qin,Kan Min-Yen,Ryan Cotterell
Keywords: text generation systems, build good text, good text generation, language model, generation systems
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:The relationship between the quality of a string and its probability p(\boldsymbol{y}) under a language model has been influential in the development of techniques to build good text generation systems. For example, several decoding algorithms have been motivated to manipulate p(\boldsymbol{y}) to produce higher-quality text. In this work, we examine the probability–quality relationship in language models explicitly aligned to human preferences, e.g., through Reinforcement Learning through Human Feedback (RLHF). We find that, given a general language model and its aligned version, for corpora sampled from an aligned language model, there exists a trade-off between the average reward and average log-likelihood of the strings under the general language model. We provide a formal treatment of this issue and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.

[NLP-6] CHIRON: Rich Character Representations in Long-Form Narratives

Link: https://arxiv.org/abs/2406.10190
Authors: Alexander Gurung,Mirella Lapata
Keywords: existing story analysis, long-form narratives, integral to long-form, poorly understood, understood by existing
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Characters are integral to long-form narratives, but are poorly understood by existing story analysis and generation systems. While prior work has simplified characters via graph-based methods and brief character descriptions, we aim to better tackle the problem of representing complex characters by taking inspiration from advice given to professional writers. We propose CHIRON, a new `character sheet’ based representation that organizes and filters textual information about characters. We construct CHIRON sheets in two steps: a Generation Module that prompts an LLM for character information via question-answering and a Validation Module that uses automated reasoning and a domain-specific entailment model to eliminate false facts about a character. We validate CHIRON via the downstream task of masked-character prediction, where our experiments show CHIRON is better and more flexible than comparable summary-based baselines. We also show that metrics derived from CHIRON can be used to automatically infer character-centricity in stories, and that these metrics align with human judgments.

[NLP-7] Let the Poem Hit the Rhythm: Using a Byte-Based Transformer for Beat-Aligned Poetry Generation

Link: https://arxiv.org/abs/2406.10174
Authors: Mohamad Elzohbi,Richard Zhao
Keywords: computational creativity, remains relatively unexplored, interesting case, case for computational, poetry and music
Categories: Computation and Language (cs.CL)
Remarks: 5 pages, 3 figures, accepted for the 15th International Conference on Computational Creativity, ICCC’24

Abstract:The intersection between poetry and music provides an interesting case for computational creativity, yet remains relatively unexplored. This paper explores the integration of poetry and music through the lens of beat patterns, investigating whether a byte-based language model can generate words that fit specific beat patterns within the context of poetry. Drawing on earlier studies, we developed a method to train a byte-based transformer model, ByT5, to align poems with beat patterns. The results demonstrate a high level of beat alignment while maintaining semantic coherence. Future work will aim to improve the model’s ability to create complete beat-aligned poems.

[NLP-8] IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce

Link: https://arxiv.org/abs/2406.10173
Authors: Wenxuan Ding,Weiqi Wang,Sze Heng Douglas Kwok,Minghao Liu,Tianqing Fang,Jiaxin Bai,Junxian He,Yangqiu Song
Keywords: Enhancing Language Models’, Enhancing Language, Language Models’, ability to understand, downstream tasks
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Enhancing Language Models’ (LMs) ability to understand purchase intentions in E-commerce scenarios is crucial for their effective assistance in various downstream tasks. However, previous approaches that distill intentions from LMs often fail to generate meaningful and human-centric intentions applicable in real-world E-commerce contexts. This raises concerns about the true comprehension and utilization of purchase intentions by LMs. In this paper, we present IntentionQA, a double-task multiple-choice question answering benchmark to evaluate LMs’ comprehension of purchase intentions in E-commerce. Specifically, LMs are tasked to infer intentions based on purchased products and utilize them to predict additional purchases. IntentionQA consists of 4,360 carefully curated problems across three difficulty levels, constructed using an automated pipeline to ensure scalability on large E-commerce platforms. Human evaluations demonstrate the high quality and low false-negative rate of our benchmark. Extensive experiments across 19 language models show that they still struggle with certain scenarios, such as understanding products and intentions accurately, jointly reasoning with products and intentions, and more, in which they fall far behind human performances. Our code and data are publicly available at this https URL.

[NLP-9] Datasets for Multilingual Answer Sentence Selection

Link: https://arxiv.org/abs/2406.10172
Authors: Matteo Gabburo,Stefano Campese,Federico Agostini,Alessandro Moschitti
Keywords: Answer Sentence Selection, retrieval-based Question Answering, Answer Sentence, Sentence Selection, Question Answering
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Answer Sentence Selection (AS2) is a critical task for designing effective retrieval-based Question Answering (QA) systems. Most advancements in AS2 focus on English due to the scarcity of annotated datasets for other languages. This lack of resources prevents the training of effective AS2 models in different languages, creating a performance gap between QA systems in English and other locales. In this paper, we introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish), obtained through supervised Automatic Machine Translation (AMT) of existing English AS2 datasets such as ASNQ, WikiQA, and TREC-QA using a Large Language Model (LLM). We evaluated our approach and the quality of the translated datasets through multiple experiments with different Transformer architectures. The results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models, significantly contributing to closing the performance gap between English and other languages.

[NLP-10] Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Link: https://arxiv.org/abs/2406.10162
Authors: Carson Denison,Monte MacDiarmid,Fazl Barez,David Duvenaud,Shauna Kravec,Samuel Marks,Nicholas Schiefer,Ryan Soklaski,Alex Tamkin,Jared Kaplan,Buck Shlegeris,Samuel R. Bowman,Ethan Perez,Evan Hubinger
Keywords: systems learn undesired, highly rewarded due, learn undesired behaviors, specification gaming, misspecified training goals
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Remarks:

Abstract:In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.

[NLP-11] BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Link: https://arxiv.org/abs/2406.10149
Authors: Yuri Kuratov,Aydar Bulatov,Petr Anokhin,Ivan Rodkin,Dmitry Sorokin,Artyom Sorokin,Mikhail Burtsev
Keywords: input context sizes, large language models, recent years, sizes of large, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Remarks:

Abstract:In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models’ ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
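As a rough illustration of how such a benchmark can be assembled, the sketch below scatters task-relevant facts at random positions inside a long stretch of distractor text; the distractor source, sample length, and placement policy are assumptions, not the authors' pipeline.

```python
import random

def build_haystack_sample(facts, question, distractor_sentences, n_distractors=5000):
    """Hide a handful of facts inside a very long span of unrelated text."""
    haystack = random.choices(distractor_sentences, k=n_distractors)
    for fact in facts:
        haystack.insert(random.randrange(len(haystack) + 1), fact)
    return {"context": " ".join(haystack), "question": question}
```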

[NLP-12] Evaluation of Large Language Models: STEM education and Gender Stereotypes

Link: https://arxiv.org/abs/2406.10133
Authors: Smilla Due,Sneha Das,Marianne Andersen,Berta Plandolit López,Sniff Andersen Nexø,Line Clemmensen
Keywords: study support, coding support, writing assistance, increasing impact, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Remarks:

Abstract:Large Language Models (LLMs) have an increasing impact on our lives with use cases such as chatbots, study support, coding support, ideation, writing assistance, and more. Previous studies have revealed linguistic biases in pronouns used to describe professions or adjectives used to describe men vs women. These issues have to some degree been addressed in updated LLM versions, at least to pass existing tests. However, biases may still be present in the models, and repeated use of gender stereotypical language may reinforce the underlying assumptions and are therefore important to examine further. This paper investigates gender biases in LLMs in relation to educational choices through an open-ended, true to user-case experimental design and a quantitative analysis. We investigate the biases in the context of four different cultures, languages, and educational systems (English/US/UK, Danish/DK, Catalan/ES, and Hindi/IN) for ages ranging from 10 to 16 years, corresponding to important educational transition points in the different countries. We find that there are significant and large differences in the ratio of STEM to non-STEM suggested education paths provided by chatGPT when using typical girl vs boy names to prompt lists of suggested things to become. There are generally fewer STEM suggestions in the Danish, Spanish, and Indian context compared to the English. We also find subtle differences in the suggested professions, which we categorise and report.

[NLP-13] The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models

Link: https://arxiv.org/abs/2406.10130
Authors: Yan Liu,Yu Liu,Xiaokang Chen,Pin-Yu Chen,Daoguang Zan,Min-Yen Kan,Tsung-Yi Ho
Keywords: Pre-trained Language models, negative social impacts, Social Bias Neurons, bring catastrophic results, Pre-trained Language
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous works on this problem mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly finetune or even pre-train language models on newly constructed anti-stereotypical datasets, which are high-cost. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of \sc Social Bias Neurons. Specifically, we propose \sc Integrated Gap Gradients (IG ^2 ) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. Our IG ^2 thus attributes the uneven distribution for different demographics to specific Social Bias Neurons, which track the trail of unwanted behavior inside PLM units to achieve interoperability. Moreover, derived from our interpretable technique, \sc Bias Neuron Suppression (BNS) is further proposed to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from debiased FairBERTa, IG ^2 allows us to locate and suppress identified neurons, and further mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.

[NLP-14] SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Link: https://arxiv.org/abs/2406.10118
作者: Holy Lovenia,Rahmad Mahendra,Salsabil Maulana Akbar,Lester James V. Miranda,Jennifer Santoso,Elyanah Aco,Akhdan Fadhilah,Jonibek Mansurov,Joseph Marvin Imperial,Onno P. Kampman,Joel Ruben Antony Moniz,Muhammad Ravi Shulthan Habibi,Frederikus Hudi,Railey Montalan,Ryan Ignatius,Joanito Agili Lopo,William Nixon,Börje F. Karlsson,James Jaya,Ryandito Diandaru,Yuze Gao,Patrick Amadeus,Bin Wang,Jan Christian Blaise Cruz,Chenxi Whitehouse,Ivan Halim Parmonangan,Maria Khelli,Wenyu Zhang,Lucky Susanto,Reynard Adha Ryanda,Sonny Lazuardi Hermawan,Dan John Velasco,Muhammad Dehan Al Kautsar,Willy Fitra Hendria,Yasmin Moslem,Noah Flynn,Muhammad Farid Adilazuarda,Haochen Li,Johanes Lee,R. Damanhuri,Shuo Sun,Muhammad Reza Qorib,Amirbek Djanibekov,Wei Qi Leong,Quyet V. Do,Niklas Muennighoff,Tanrada Pansuwan,Ilham Firdausi Putra,Yan Xu,Ngee Chia Tai,Ayu Purwarianti,Sebastian Ruder,William Tjhi,Peerat Limkonchotiwat,Alham Fikri Aji,Sedrick Keh,Genta Indra Winata,Ruochen Zhang,Fajri Koto,Zheng-Xin Yong,Samuel Cahyawijaya
Keywords: Southeast Asia, SEA languages, SEA, million people, region rich
Categories: Computation and Language (cs.CL)
Remarks: this https URL

Abstract:Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.

[NLP-15] Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning

Link: https://arxiv.org/abs/2406.10099
Authors: Jiaqi Li,Yixuan Tang,Yi Yang
Keywords: demonstrated remarkable capabilities, demonstrated remarkable, remarkable capabilities, face challenges, Large language
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various tasks but still face challenges such as hallucinations. One potential reason for hallucinations is the lack of relevant knowledge or context. Thus, a promising solution to mitigate this issue involves instructing LLMs to respond with “I do not know” when a question falls outside their knowledge domain or the provided context. However, in this work, we observed that LLMs struggle to admit their lack of knowledge, primarily due to existing instruction datasets designed to encourage specific answers. To improve large language models’ capability to recognize the boundaries of their knowledge, we propose a novel approach called uncertainty-sensitive tuning. This method involves two-stage training designed for uncertainty recognition and prompt-sensitive activation. In the first stage, we guide the LLM to reject unknown questions. In the second stage, we recover the decreased performance in QA tasks by incorporating designed causal instructions. By leveraging this method, we aim to enhance the model’s ability to identify areas of uncertainty. The experimental results demonstrate that our proposed uncertainty-sensitive tuning method significantly improves the performance of the Llama2-chat-7B model. Specifically, it achieves a substantial 34.7% improvement in handling questions involving knowledge gaps compared to the original model. Moreover, our approach outperforms GPT-4, exhibiting a 9.4% increase in overall performance. We open-source the model and code on GitHub.

[NLP-16] Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation

Link: https://arxiv.org/abs/2406.10091
Authors: Xiaoman Wang,Claudio Fantinuoli
Keywords: Assessing the performance, spoken language translation, expectations of users, performance of interpreting, interpreting services
Categories: Computation and Language (cs.CL)
Remarks: Paper accepted at the European Association for Machine Translation conference 2024

Abstract:Assessing the performance of interpreting services is a complex task, given the nuanced nature of spoken language translation, the strategies that interpreters apply, and the diverse expectations of users. The complexity of this task becomes even more pronounced when automated evaluation methods are applied. This is particularly true because interpreted texts exhibit less linearity between the source and target languages due to the strategies employed by the interpreter. This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations. We focus on a particular feature of interpretation quality, namely translation accuracy or faithfulness. As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them. We quantify semantic similarity between the source and translated texts without relying on a reference translation. The results suggest GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts, even when evaluating short textual segments. Additionally, the study reveals that the size of the context window has a notable impact on this correlation.
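The sentence-embedding side of this evaluation can be sketched as a reference-free cosine similarity between a source segment and its interpretation; the multilingual encoder named below is only an example, not necessarily the one used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")  # example multilingual encoder

def semantic_similarity(source: str, interpretation: str) -> float:
    """Reference-free similarity between a source segment and its rendition."""
    embeddings = model.encode([source, interpretation], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

print(semantic_similarity("Die Lage bleibt angespannt.", "The situation remains tense."))
```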

[NLP-17] Discovering influential text using convolutional neural networks

Link: https://arxiv.org/abs/2406.10086
Authors: Megan Ayers,Luke Sanford,Margaret Roberts,Eddie Yang
Keywords: social sciences, estimating the impacts, text, text treatments, Experimental
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
Remarks: To be published in ACL 2024 Findings

Abstract:Experimental methods for estimating the impacts of text on human evaluation have been widely used in the social sciences. However, researchers in experimental settings are usually limited to testing a small number of pre-specified text treatments. While efforts to mine unstructured texts for features that causally affect outcomes have been ongoing in recent years, these models have primarily focused on the topics or specific words of text, which may not always be the mechanism of the effect. We connect these efforts with NLP interpretability techniques and present a method for flexibly discovering clusters of similar text phrases that are predictive of human reactions to texts using convolutional neural networks. When used in an experimental setting, this method can identify text treatments and their effects under certain assumptions. We apply the method to two datasets. The first enables direct validation of the model’s ability to detect phrases known to cause the outcome. The second demonstrates its ability to flexibly discover text treatments with varying textual structures. In both cases, the model learns a greater variety of text treatments compared to benchmark methods, and these text features quantitatively meet or exceed the ability of benchmark methods to predict the outcome.
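A toy version of the modeling idea is a 1-D convolutional classifier over token embeddings: each filter responds to short phrases, and the positions of the max-pooled activations indicate which n-grams drove the prediction. Everything below (dimensions, pooling, single output head) is a simplified assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PhraseCNN(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, n_filters=32, kernel_size=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        self.out = nn.Linear(n_filters, 1)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        acts = torch.relu(self.conv(x))            # (batch, n_filters, positions)
        pooled, where = acts.max(dim=2)            # `where`: n-gram position per filter
        return self.out(pooled).squeeze(-1), where

model = PhraseCNN()
scores, ngram_positions = model(torch.randint(0, 10000, (4, 128)))
```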

[NLP-18] Enhancing Question Answering on Charts Through Effective Pre-training Tasks

Link: https://arxiv.org/abs/2406.10085
Authors: Ashim Gupta,Vivek Gupta,Shuo Zhang,Yujie He,Ning Zhang,Shalin Shah
Keywords: completely understand, textual information, Abstract, Understanding, understand a document
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:To completely understand a document, the use of textual information is not enough. Understanding visual cues, such as layouts and charts, is also required. While the current state-of-the-art approaches for document understanding (both OCR-based and OCR-free) work well, a thorough analysis of their capabilities and limitations has not yet been performed. Therefore, in this work, we addresses the limitation of current VisualQA models when applied to charts and plots. To investigate shortcomings of the state-of-the-art models, we conduct a comprehensive behavioral analysis, using ChartQA as a case study. Our findings indicate that existing models particularly underperform in answering questions related to the chart’s structural and visual context, as well as numerical information. To address these issues, we propose three simple pre-training tasks that enforce the existing model in terms of both structural-visual knowledge, as well as its understanding of numerical questions. We evaluate our pre-trained model (called MatCha-v2) on three chart datasets - both extractive and abstractive question datasets - and observe that it achieves an average improvement of 1.7% over the baseline model.

[NLP-19] On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Link: https://arxiv.org/abs/2406.10083
Authors: Siddhant Arora,Ankita Pasad,Chung-Ming Chien,Jionghao Han,Roshan Sharma,Jee-weon Jung,Hira Dhamyal,William Chen,Suwon Shon,Hung-yi Lee,Karen Livescu,Shinji Watanabe
Keywords: Spoken Language Understanding, Spoken Language, complex spoken language, Language Understanding Evaluation, Language Understanding
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Remarks: Accepted at ACL Findings 2024

Abstract:The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.

[NLP-20] Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection

Link: https://arxiv.org/abs/2406.10052
Authors: Haoyu Wang,Guoqiang Hu,Guodong Lin,Wei-Qiang Zhang,Jian Li
Keywords: large-scale multilingual speech, multilingual speech recognition, demonstrated impressive results, streaming speech recognition, speech recognition model
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Remarks: Accepted by INTERSPEECH 2024

Abstract:As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper’s cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which significantly outperforms the current state-of-the-art baseline.

[NLP-21] FZI-WIM at SemEval-2024 Task 2: Self-Consistent CoT for Complex NLI in Biomedical Domain

Link: https://arxiv.org/abs/2406.10040
Authors: Jin Liu,Steffen Thoma
Keywords: Safe Biomedical Natural, Biomedical Natural Language, Natural Language Inference, Safe Biomedical, Clinical Trials
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Remarks:

Abstract:This paper describes the inference system of FZI-WIM at the SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our system utilizes the chain of thought (CoT) paradigm to tackle this complex reasoning problem and further improves the CoT performance with self-consistency. Instead of greedy decoding, we sample multiple reasoning chains with the same prompt and make the final verification with majority voting. The self-consistent CoT system achieves a baseline F1 score of 0.80 (1st), faithfulness score of 0.90 (3rd), and consistency score of 0.73 (12th). We release the code and data publicly this https URL.
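The decoding strategy described here is straightforward to sketch: sample several chain-of-thought completions for the same prompt and keep the majority label. The `sample_cot` callable below is a stand-in for whatever LLM call returns a reasoning chain and a final verdict; it is an assumption for illustration.

```python
from collections import Counter
from typing import Callable, Tuple

def self_consistent_label(prompt: str,
                          sample_cot: Callable[[str], Tuple[str, str]],
                          n_samples: int = 10) -> str:
    """Sample n reasoning chains with the same prompt and majority-vote the labels."""
    labels = [sample_cot(prompt)[1] for _ in range(n_samples)]
    return Counter(labels).most_common(1)[0][0]
```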

[NLP-22] Deep Bayesian Active Learning for Preference Modeling in Large Language Models

Link: https://arxiv.org/abs/2406.10023
Authors: Luckeciano C. Melo,Panagiotis Tigas,Alessandro Abate,Yarin Gal
Keywords: Large Language Models, Leveraging human preferences, Large Language, Leveraging human, Language Models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Remarks:

Abstract:Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the further development of LLMs. Bayesian Active Learning provides a principled framework for addressing this challenge and has demonstrated remarkable success in diverse settings. However, previous attempts to employ it for Preference Modeling did not meet such expectations. In this work, we identify that naive epistemic uncertainty estimation leads to the acquisition of redundant samples. We address this by proposing the Bayesian Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition policy that not only targets points of high epistemic uncertainty according to the preference model but also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM. Notably, our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies.

[NLP-23] Group and Shuffle: Efficient Structured Orthogonal Parametrization

Link: https://arxiv.org/abs/2406.10019
Authors: Mikhail Gorbunov,Nikolay Yudin,Vera Soboleva,Aibek Alanov,Alexey Naumov,Maxim Rakhuba
Keywords: increasing size, growing demand, orthogonal, efficient fine-tuning, fine-tuning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Remarks:

Abstract:The increasing size of neural networks has led to a growing demand for methods of efficient fine-tuning. Recently, an orthogonal fine-tuning paradigm was introduced that uses orthogonal matrices for adapting the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal fine-tuning framework, improving parameter and computational efficiency. We empirically validate our method on different domains, including adapting of text-to-image diffusion models and downstream task fine-tuning in language modeling. Additionally, we adapt our construction for orthogonal convolutions and conduct experiments with 1-Lipschitz neural networks.
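One plausible reading of a "group and shuffle" orthogonal parametrization is a block-diagonal matrix of small orthogonal factors (the groups) composed with a permutation (the shuffle), which stays orthogonal by construction. The sketch below only illustrates that general structure; the paper's exact parametrization may differ.

```python
import torch

def random_orthogonal(n: int) -> torch.Tensor:
    q, _ = torch.linalg.qr(torch.randn(n, n))
    return q

def group_and_shuffle_matrix(n_groups: int = 4, block: int = 8) -> torch.Tensor:
    """Block-diagonal orthogonal 'groups' followed by a coordinate 'shuffle'."""
    grouped = torch.block_diag(*[random_orthogonal(block) for _ in range(n_groups)])
    perm = torch.randperm(n_groups * block)
    return grouped[perm]  # a row permutation of an orthogonal matrix is orthogonal

Q = group_and_shuffle_matrix()
assert torch.allclose(Q @ Q.T, torch.eye(Q.shape[0]), atol=1e-5)
```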

[NLP-24] Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Link: https://arxiv.org/abs/2406.09994
Authors: Manas Jhalani,Annervaz K M,Pushpak Bhattacharyya
Keywords: Visual Question Answering, visual content, addressing natural language, Visual Question, Knowledge-Based Visual Question
Categories: Computation and Language (cs.CL)
Remarks: 16 pages, 12 figures

Abstract:In the realm of multimodal tasks, Visual Question Answering (VQA) plays a crucial role by addressing natural language questions grounded in visual content. Knowledge-Based Visual Question Answering (KBVQA) advances this concept by adding external knowledge along with images to respond to questions. We introduce an approach for KBVQA, augmenting the existing vision-language transformer encoder-decoder (OFA) model. Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method. We supply a flexible number of triples from the knowledge graph as context, tailored to meet the requirements for answering the question. Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets. Through experiments and analysis, we demonstrate that furnishing variable triples for each question improves the reasoning capabilities of the language model in contrast to supplying a fixed number of triples. This is illustrated even for recent large language models. Additionally, we highlight the model’s generalization capability by showcasing its SOTA-beating performance on a small dataset, achieved through straightforward fine-tuning.

[NLP-25] Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning

Link: https://arxiv.org/abs/2406.09988
Authors: Xiaowen Sun,Xufeng Zhao,Jae Hee Lee,Wenhao Lu,Matthias Kerzel,Stefan Wermter
Keywords: planning and manipulation, reflects its current, current status, status or condition, object
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Remarks:

Abstract:The state of an object reflects its current status or condition and is important for a robot’s task planning and manipulation. However, detecting an object’s state and generating a state-sensitive plan for robots is challenging. Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans. However, to the best of our knowledge, there is hardly any investigation on whether LLMs or VLMs can also generate object state-sensitive plans. To study this, we introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks. We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module (dense captioning model, DCM) and a natural language processing model (LLM), and (ii) a monolithic model consisting only of a VLM. To quantitatively evaluate the performances of the two methods, we use tabletop scenarios where the task is to clear the table. We contribute a multimodal benchmark dataset that takes object states into consideration. Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach. The code for OSSA is available at \urlthis https URL

[NLP-26] HIRO: Hierarchical Information Retrieval Optimization

Link: https://arxiv.org/abs/2406.09979
Authors: Krish Goel,Mahek Chandak
Keywords: Large Language Models, Large Language, natural language tasks, face limitations due, natural language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Remarks:

Abstract:Large Language Models (LLMs) excel in natural language tasks but face limitations due to static training datasets, resulting in outdated or contextually shallow responses. Retrieval-Augmented Generation (RAG) addresses this by integrating real-time external knowledge, enhancing model accuracy and credibility, especially for knowledge-intensive tasks. However, RAG-enhanced LLMs struggle with long contexts, causing them to “choke” on information overload, compromising response quality. Recent RAG applications use hierarchical data structures for storing documents, organized at various levels of summarization and information density. In this context, we introduce HIRO (Hierarchical Information Retrieval Optimization), a novel querying approach for RAG applications using hierarchical structures for storing documents. HIRO employs DFS-based recursive similarity score calculation and branch pruning to minimize the context returned to the LLM without informational loss. HIRO outperforms existing querying mechanisms on the NarrativeQA dataset by an absolute performance gain of 10.85%.
摘要:大语言模型在自然语言任务中表现优异,但由于静态训练数据集的限制,导致响应过时或上下文浅。检索增强生成(RAG)通过集成实时外部知识来解决这一问题,提高了模型的准确性和可信度,特别是对于知识密集型任务。然而,RAG增强的LLM在长上下文中苦苦挣扎,导致它们因信息过载而“窒息”,从而影响响应质量。最近的RAG应用程序使用分层数据结构来存储以不同级别的摘要和信息密度组织的文档。在此背景下,我们介绍了HIRO(层次信息检索优化),这是一种面向使用层次结构存储文档的RAG应用程序的新查询方法。HIRO使用基于DFS的递归相似度分数计算和分支剪枝,在不丢失信息的前提下最小化返回给LLM的上下文。HIRO在NarrativeQA数据集上的性能优于现有的查询机制,绝对性能提高了10.85%。
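下面给出一段示意性的Python代码,帮助理解摘要中“基于DFS的递归相似度分数计算与分支剪枝”的思路:在层次化文档树上递归地把查询与各节点比较,相似度低于阈值的分支整体剪掉,只把保留下来的叶子内容作为返回给LLM的上下文。其中的节点结构、余弦相似度与阈值都是为演示而做的假设,并非论文的官方实现。

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """层次化文档树节点:上层存摘要,叶子存原文片段。"""
    text: str
    embedding: List[float]
    children: List["Node"] = field(default_factory=list)

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-8)

def dfs_retrieve(node: Node, query_emb: List[float], threshold: float, out: List[str]) -> None:
    """递归相似度计算 + 分支剪枝(示意):相似度低于阈值的子树整体跳过。"""
    score = cosine(node.embedding, query_emb)
    if score < threshold:
        return  # 剪枝:该分支不再展开
    if not node.children:
        out.append(node.text)  # 叶子节点:加入返回给LLM的上下文
        return
    for child in node.children:
        dfs_retrieve(child, query_emb, threshold, out)

# 用法示意(embedding此处用玩具向量代替真实编码器输出)
leaf1 = Node("段落A的原文……", [0.9, 0.1])
leaf2 = Node("段落B的原文……", [0.1, 0.9])
root = Node("全文摘要……", [0.6, 0.4], children=[leaf1, leaf2])
context: List[str] = []
dfs_retrieve(root, query_emb=[1.0, 0.0], threshold=0.5, out=context)
print(context)  # 只保留与查询相似的分支
```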

[NLP-27] Disentangling Dialect from Social Bias via Multitask Learning to Improve Fairness
[NLP-27] 通过多任务学习将方言与社会偏见区分开来以提高公平性

链接: https://arxiv.org/abs/2406.09977
作者: Maximilian Spliethöver,Sai Nikhil Menon,Henning Wachsmuth
关键词: social groups, occur in regional, regional or social, biased language, Dialects introduce syntactic
中文关键词: 社会群体,发生在地区、地区或社会、有偏见的语言中,方言引入语法
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2024

点击查看摘要

Abstract:Dialects introduce syntactic and lexical variations in language that occur in regional or social groups. Most NLP methods are not sensitive to such variations. This may lead to unfair behavior of the methods, conveying negative bias towards dialect speakers. While previous work has studied dialect-related fairness for aspects like hate speech, other aspects of biased language, such as lewdness, remain fully unexplored. To fill this gap, we investigate performance disparities between dialects in the detection of five aspects of biased language and how to mitigate them. To alleviate bias, we present a multitask learning approach that models dialect language as an auxiliary task to incorporate syntactic and lexical variations. In our experiments with African-American English dialect, we provide empirical evidence that complementing common learning approaches with dialect modeling improves their fairness. Furthermore, the results suggest that multitask learning achieves state-of-the-art performance and helps to detect properties of biased language more reliably.
摘要:方言是指出现在地区或社会群体中的语言中的句法和词汇变化。大多数自然语言处理方法对这种变化不敏感。这可能会导致这些方法的不公平行为,传达出对说方言的人的负面偏见。虽然之前的工作已经从仇恨言论等方面研究了与方言相关的公平性,但偏见语言的其他方面,如淫秽,仍然完全没有被探索。为了填补这一差距,我们调查了不同方言在检测语言偏见的五个方面的表现差异,以及如何缓解这些差异。为了减轻偏见,我们提出了一种多任务学习方法,将方言语言作为辅助任务来结合句法和词汇变化。在我们对非裔美国人英语方言的实验中,我们提供了经验证据,证明用方言建模来补充常见的学习方法可以提高它们的公平性。此外,结果表明,多任务学习达到了最先进的表现,有助于更可靠地检测有偏见的语言的属性。

[NLP-28] A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
[NLP-28] 用于文本生成的更好的LLM评估器:提示输出排序和优化的影响

链接: https://arxiv.org/abs/2406.09972
作者: KuanChao Chu,Yi-Pei Chen,Hideki Nakayama
关键词: large language models, evaluating generated texts, investigates prompt designs, research investigates prompt, research investigates
中文关键词: 大型语言模型、评估生成的文本、调查提示设计、研究调查提示、研究调查
类目: Computation and Language (cs.CL)
备注: Presented in JSAI 2024. The first two authors contributed equally. arXiv admin note: substantial text overlap with arXiv:2406.02863

点击查看摘要

Abstract:This research investigates prompt designs of evaluating generated texts using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for open-ended text evaluation remains challenging due to model sensitivity and subjectivity in evaluation of text generation. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs’ scoring, with a different level of rule understanding in the prompt. An additional optimization may enhance scoring alignment if sufficient data is available. This insight is crucial for improving the accuracy and consistency of LLM-based evaluations.
摘要:本研究调查了使用大型语言模型(LLM)评估生成文本的提示设计。虽然LLM越来越多地用于对各种输入进行评分,但由于文本生成评估中的模型敏感性和主观性,为开放式文本评估创建有效的提示仍然具有挑战性。我们的研究实验了不同的提示结构,改变了输出指令的顺序并包括解释原因。我们发现,呈现原因和分数的顺序显着影响LLM的评分,提示中的规则理解程度不同。如果有足够的数据可用,额外的优化可能会增强评分一致性。这一见解对于提高基于LLM的评估的准确性和一致性至关重要。
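为直观说明“先给理由再给分数”与“先给分数再给理由”这两种输出顺序对LLM评分的影响,下面给出两个示意性的评测提示词模板。模板内容是演示用的假设示例,并非论文使用的原始提示词。

```python
# 两种输出顺序的评测提示词模板(示意)
REASON_FIRST = """你是文本质量评审员。请先给出解释理由,再给出1-5的整数分数。
待评文本:
{text}
输出格式:
理由: <你的分析>
分数: <1-5>"""

SCORE_FIRST = """你是文本质量评审员。请先给出1-5的整数分数,再给出解释理由。
待评文本:
{text}
输出格式:
分数: <1-5>
理由: <你的分析>"""

sample = "生成的故事开头……"
print(REASON_FIRST.format(text=sample))
print(SCORE_FIRST.format(text=sample))
```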

[NLP-29] Bag of Lies: Robustness in Continuous Pre-training BERT
[NLP-29] 谎言袋:连续预训练BERT的稳健性

链接: https://arxiv.org/abs/2406.09967
作者: Ine Gevers,Walter Daelemans
关键词: continuous pre-training phase, entity knowledge, BERT pre-training data, continuous pre-training, study aims
中文关键词: 持续预训练阶段、实体知识、BERT预训练数据、持续预训练、研究目标
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study aims to acquire more insights into the continuous pre-training phase of BERT regarding entity knowledge, using the COVID-19 pandemic as a case study. Since the pandemic emerged after the last update of BERT’s pre-training data, the model has little to no entity knowledge about COVID-19. Using continuous pre-training, we control what entity knowledge is available to the model. We compare the baseline BERT model with the further pre-trained variants on the fact-checking benchmark Check-COVID. To test the robustness of continuous pre-training, we experiment with several adversarial methods to manipulate the input data, such as training on misinformation and shuffling the word order until the input becomes nonsensical. Surprisingly, our findings reveal that these methods do not degrade, and sometimes even improve, the model’s downstream performance. This suggests that continuous pre-training of BERT is robust against misinformation. Furthermore, we are releasing a new dataset, consisting of original texts from academic publications in the LitCovid repository and their AI-generated false counterparts.
摘要:本研究旨在以新冠肺炎大流行为案例,更深入地了解BERT持续预训练阶段中的实体知识。由于疫情是在BERT预训练数据最后一次更新之后出现的,该模型对新冠肺炎几乎没有实体知识。通过持续预训练,我们控制模型可以获得哪些实体知识。我们在事实核查基准Check-COVID上比较了基线BERT模型与进一步预训练的变体。为了测试持续预训练的稳健性,我们实验了几种对抗性的输入数据处理方法,例如在错误信息上训练,以及打乱词序直到输入变得毫无意义。令人惊讶的是,我们发现这些方法不会降低模型的下游性能,有时甚至会带来改善。这表明BERT的持续预训练对错误信息是稳健的。此外,我们正在发布一个新的数据集,由LitCovid存储库中学术出版物的原始文本及其由AI生成的虚假对应文本组成。

[NLP-30] ChartMimic: Evaluating LMMs Cross-Modal Reasoning Capability via Chart-to-Code Generation
[NLP-30] ChartMimic:通过图表到代码生成评估LMM的跨模态推理能力

链接: https://arxiv.org/abs/2406.09961
作者: Chufan Shi,Cheng Yang,Yaxin Liu,Bo Shui,Junjie Wang,Mohan Jing,Linran Xu,Xinyu Zhu,Siheng Li,Yuxiang Zhang,Gongye Liu,Xiaomei Nie,Deng Cai,Yujiu Yang
关键词: large multimodal models, aimed at assessing, visually-grounded code generation, assessing the visually-grounded, large multimodal
中文关键词: 大型多模式模型,旨在评估基于视觉的代码生成,评估基于视觉的大型多模式
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Data and code are available at this https URL

点击查看摘要

Abstract:We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent the authentic chart use cases found in scientific papers across various domains(e.g., Physics, Computer Science, Economics, etc). These charts span 18 regular types and 4 advanced types, diversifying into 191 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs’ capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 11 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4V, Claude-3-opus only achieve an average score of 73.2 and 53.7, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.
摘要:我们引入了一个新的基准,ChartMimic,旨在评估大型多通道模型(LMM)的视觉接地代码生成能力。ChartMimic使用信息密集型可视图表和文本指令作为输入,需要LMM生成相应的代码来呈现图表。ChartMimic包括1,000个人工策划的(图形、指令、代码)三元组,它们代表了在不同领域(例如,物理、计算机科学、经济学等)的科学论文中找到的真实图表用例。这些图表涵盖18种常规类型和4种高级类型,分为191个子类别。此外,我们提出了多层次的评估指标,以提供对输出代码和呈现的图表的自动和彻底的评估。与现有的代码生成基准不同,ChartMimic侧重于评估LMM协调各种认知能力的能力,包括视觉理解、代码生成和跨模式推理。对3个专有模型和11个开放重量模型的评估突显了ChartMimic带来的重大挑战。即使是先进的GPT-4V、克劳德-3-OPUS的平均分也只有73.2分和53.7分,这表明还有很大的改进空间。我们期待ChartMimic将激励LMM的发展,推动对人工通用智能的追求。

[NLP-31] BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
[NLP-31] BiVLC:通过文本到图像检索扩展视觉-语言组合性评估

链接: https://arxiv.org/abs/2406.09952
作者: Imanol Miranda,Ander Salaberria,Eneko Agirre,Gorka Azkune
关键词: Existing Vision-Language Compositionality, Bidirectional Vision-Language Compositionality, correct textual description, Vision-Language Compositionality, Existing Vision-Language
中文关键词: 现有视觉-语言组合、双向视觉-语言组合、正确的文本描述、视觉-语言组合、现有视觉-语言
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained using synthetic images and texts improves the state of the art in SugarCrepe and in BiVLC for both retrieval directions. The gap to human performance in BiVLC confirms that Vision-Language Compositionality is still a challenging problem. BiVLC and code are available at this https URL.
摘要:现有的视觉-语言合成性(VLC)基准测试(如SugarCrepe)被描述为图像到文本的检索问题,在给定图像的情况下,模型需要在正确的文本描述和合成的硬否定文本之间进行选择。在这项工作中,我们提出了双向视觉-语言合成性(BiVLC)数据集。BiVLC的新奇之处在于添加了从合成文本生成的合成硬负片图像,产生了两个图像到文本的检索示例(每个图像一个),更重要的是,产生了两个文本到图像的检索示例(每个文本一个)。人工注释器过滤掉格式错误的示例,确保基准测试的有效性。在BiVLC上的实验揭示了当前多模式模型的一个弱点,因为它们在文本到图像的方向上表现不佳。事实上,当考虑这两个检索方向时,前人工作中得到的结论发生了很大的变化。除了基准测试外,我们还表明,使用合成图像和文本训练的对比模型改善了SugarCrepe和BiVLC中两个检索方向的最新水平。在BiVLC中与人类表现的差距证实了视觉-语言合成仍然是一个具有挑战性的问题。BiVLC和代码可在此HTTPS URL上找到。

[NLP-32] An efficient text augmentation approach for contextualized Mandarin speech recognition
[NLP-32] 一种有效的上下文化普通话语音识别文本增强方法

链接: https://arxiv.org/abs/2406.09950
作者: Naijun Zheng,Xucheng Wan,Kai Liu,Ziqing Du,Zhou Huan
关键词: contextualized automatic speech, automatic speech recognition, speech-text data availability, automatic speech, effectiveness is hindered
中文关键词: 上下文自动语音、自动语音识别、语音文本数据可用性、自动语音、有效性受到阻碍
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: accepted to interspeech2024

点击查看摘要

Abstract:Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. In particular, to contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data. By utilizing a simple codebook lookup process, we convert available text-only data into latent text embeddings. These embeddings then enhance the inputs for the contextualized ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance. The top-performing system shows relative CER improvements of up to 30% on rare words and 15% across all words in general.
摘要:尽管语境化的自动语音识别(ASR)系统通常被用来提高对不常见单词的识别,但其有效性受到语音-文本数据可获得性的内在限制。为了应对这一挑战,我们的研究建议利用大量的纯文本数据集,并使用简单的文本增强(TA)技术对预先训练的ASR模型进行上下文处理,同时保持最低的计算成本。特别是,为了使预先训练的基于CIF的ASR具有上下文,我们使用有限的语音-文本数据构建了一个码本。通过使用一个简单的码本查找过程,我们将可用的纯文本数据转换为潜在的文本嵌入。然后,这些嵌入增强了语境化ASR的输入。我们在不同的普通话测试集上的实验表明,我们的TA方法显著提高了识别性能。性能最好的系统在稀有单词上的相对CER提高了30%,在所有单词上的相对CER提高了15%。
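摘要只概述了“用少量语音-文本配对数据构建码本,再通过查表把纯文本转成潜在文本嵌入”的流程。下面是一个高度简化的示意(按字符取平均潜在向量作为码本项):码本的真实构建方式与嵌入空间以论文为准,这里的数据、维度与函数名都是随意假设的。

```python
import numpy as np

# 示意:用少量配对数据为每个字符建立"码本"(字符 -> 平均潜在向量),
# 再通过查表把纯文本数据转成潜在文本嵌入序列。码本构建方式是作者为演示所做的简化假设。
def build_codebook(paired_latents, paired_texts):
    """paired_latents[i] 是第i条语音对应的逐字符潜在向量序列,形状为 (len(text_i), dim)。"""
    sums, counts = {}, {}
    for latents, text in zip(paired_latents, paired_texts):
        for vec, ch in zip(latents, text):
            sums[ch] = sums.get(ch, 0) + vec
            counts[ch] = counts.get(ch, 0) + 1
    return {ch: sums[ch] / counts[ch] for ch in sums}

def text_to_latent(text, codebook, dim):
    """纯文本 -> 潜在嵌入序列;未登录字符退化为零向量。"""
    return np.stack([codebook.get(ch, np.zeros(dim)) for ch in text])

dim = 4
paired_latents = [np.random.randn(2, dim), np.random.randn(3, dim)]
paired_texts = ["你好", "早上好"]
codebook = build_codebook(paired_latents, paired_texts)
print(text_to_latent("好早", codebook, dim).shape)  # (2, 4)
```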

[NLP-33] BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
[NLP-33] BLEnD:大语言模型在不同文化和语言中日常知识的基准

链接: https://arxiv.org/abs/2406.09948
作者: Junho Myung,Nayeon Lee,Yi Zhou,Jiho Jin,Rifki Afina Putri,Dimosthenis Antypas,Hsuvas Borkakoty,Eunsu Kim,Carla Perez-Almendros,Abinew Ali Ayele,Víctor Gutiérrez-Basulto,Yazmín Ibáñez-García,Hwaran Lee,Shamsuddeen Hassan Muhammad,Kiwoong Park,Anar Sabuhi Rzayev,Nina White,Seid Muhie Yimam,Mohammad Taher Pilehvar,Nedjma Ousidhoum,Jose Camacho-Collados,Alice Oh
关键词: Large language models, lack culture-specific knowledge, Large language, daily life, lack culture-specific
中文关键词: 大型语言模型,缺乏特定文化知识,大型语言,日常生活,缺乏特定文化
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs’ cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs’ everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: this https URL.
摘要:大型语言模型通常缺乏日常生活的特定文化知识,特别是跨不同地区和非英语语言。现有的评估LLMS文化敏感性的基准仅限于一种语言,或从维基百科等在线资源收集,这些资源没有反映不同地区平凡的日常生活方式。也就是说,关于人们在生日庆祝时吃的食物、他们通常使用的香料、年轻人演奏的乐器或他们在学校练习的运动的信息是常见的文化知识,但在容易收集的在线资源中并不常见,特别是对于代表性不足的文化。为了解决这个问题,我们引入了Blend,这是一个手工制作的基准,旨在评估不同文化和语言的LLM的日常知识。Blend包括来自16个国家/地区的13种不同语言的52.6k个问答对,包括阿姆哈拉语、阿萨姆语、阿塞拜疆语、豪萨语和舜达语等资源较少的语言。我们构建的基准包括两种形式的问题:简答题和选择题。我们发现,LLM在在线高度代表的文化中表现得更好,在简答式中表现最好的GPT-4模型的最大差异为57.34%。对于以中高资源语言为代表的文化,LLM在当地语言中的表现更好,但对于以低资源语言为代表的文化,LLM在英语中的表现比当地语言更好。我们通过以下网址公开我们的数据集:This HTTPS URL。

[NLP-34] Experiments in News Bias Detection with Pre-Trained Neural Transformers
[NLP-34] 使用预训练的神经变形器进行新闻偏见检测实验

链接: https://arxiv.org/abs/2406.09938
作者: Tim Menzner,Jochen L. Leidner
关键词: World Wide Web, World Wide, Wide Web, Web provides unrivalled, including factual
中文关键词: 万维网,网络提供无与伦比的,包括事实
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The World Wide Web provides unrivalled access to information globally, including factual news reporting and commentary. However, state actors and commercial players increasingly spread biased (distorted) or fake (non-factual) information to promote their agendas. We compare several large, pre-trained language models on the task of sentence-level news bias detection and sub-type classification, providing quantitative and qualitative results.
摘要:万维网提供了无与伦比的全球信息访问,包括事实新闻报道和评论。然而,国家行为者和商业行为者越来越多地传播有偏见(扭曲)或虚假(非事实)的信息来宣传他们的议程。我们在句子级新闻偏见检测和子类型分类任务上比较了几个大型预训练语言模型,并给出定量和定性的结果。

[NLP-35] CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses Procedures Lab Tests Orders and Prescriptions
[NLP-35] CliBench:在诊断、治疗程序、化验医嘱和处方等临床决策上对大型语言模型的多方面评估

链接: https://arxiv.org/abs/2406.09923
作者: Mingyu Derek Ma,Chenchen Ye,Yu Yan,Xiaoxuan Wang,Peipei Ping,Timothy S Chang,Wei Wang
关键词: Large Language Models, Artificial Intelligence, Language Models, Large Language, process offers significant
中文关键词: 大型语言模型、人工智能、语言模型、大型语言、流程提供了重要的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs’ capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLM’s capability on diverse clinical tasks of desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.
摘要:人工智能(AI),特别是大语言模型(LLM)与临床诊断过程的结合,为提高医疗服务的效率和可及性提供了巨大的潜力。虽然LLM在医学领域显示出了一些希望,但它们在临床诊断中的应用仍未得到充分探索,特别是在现实世界的临床实践中,需要做出高度复杂的、针对患者的决策。目前在这一领域对LLM的评价往往范围狭窄,侧重于特定的疾病或专科,并采用简化的诊断任务。为了弥补这一差距,我们引入了基于MIMIC IV数据集开发的新型基准测试CliBench,为LLM在临床诊断中的能力提供全面而现实的评估。这一基准不仅涵盖不同专科的各种医疗病例的诊断,还包括具有临床意义的任务:治疗程序识别、化验医嘱开具和药物处方。在结构化输出本体的支持下,CliBench实现了精确和多粒度的评估,提供了对LLM在所需粒度的各种临床任务上的能力的深入了解。我们对领先的LLM进行了零样本评估,以考察它们的临床决策能力。我们的初步结果揭示了当前LLM在临床环境中的潜力和局限性,为LLM支持的医疗保健的未来发展提供了有价值的见解。

[NLP-36] Knowledge Editing in Language Models via Adapted Direct Preference Optimization
[NLP-36] 通过自适应直接偏好优化进行语言模型知识编辑

链接: https://arxiv.org/abs/2406.09920
作者: Amit Rozner,Barak Battash,Lior Wolf,Ofir Lindenbaum
关键词: Large Language Models, Large Language, Direct Preference Optimization, lack updated world, Language Models
中文关键词: 大型语言模型,大型语言,直接偏好优化,缺乏更新的世界,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) can become outdated over time as they may lack updated world knowledge, leading to factual knowledge errors and gaps. Knowledge Editing (KE) aims to overcome this challenge using weight updates that do not require expensive retraining. We propose treating KE as an LLM alignment problem. Toward this goal, we introduce Knowledge Direct Preference Optimization (KDPO), a variation of the Direct Preference Optimization (DPO) that is more effective for knowledge modifications. Our method is based on an online approach that continually updates the knowledge stored in the model. We use the current knowledge as a negative sample and the new knowledge we want to introduce as a positive sample in a process called DPO. We also use teacher-forcing for negative sample generation and optimize using the positive sample, which helps maintain localized changes. We tested our KE method on various datasets and models, comparing it to several cutting-edge methods, with 100 and 500 sequential edits. Additionally, we conducted an ablation study comparing our method to the standard DPO approach. Our experimental results show that our modified DPO method allows for more refined KE, achieving similar or better performance compared to previous methods.
摘要:大型语言模型(LLM)可能会随着时间的推移而过时,因为它们可能缺乏最新的世界知识,从而导致事实知识错误和差距。知识编辑(KE)旨在通过无需昂贵重新训练的权重更新来克服这一挑战。我们建议将KE视为一个LLM对齐问题。为了达到这一目标,我们引入了知识直接偏好优化(KDPO),它是直接偏好优化(DPO)的一种变体,对知识修改更有效。我们的方法基于在线方法,不断更新存储在模型中的知识。在称为DPO的过程中,我们将当前知识作为负样本,而将要引入的新知识作为正样本。我们还使用教师强迫来生成负样本,并使用正样本进行优化,这有助于保持局部更改。我们在不同的数据集和模型上测试了我们的KE方法,并将其与几种前沿方法进行了比较,分别进行了100次和500次连续编辑。此外,我们还进行了消融研究,将我们的方法与标准的DPO方法进行了比较。我们的实验结果表明,我们改进的DPO方法允许更精确的KE,与以前的方法相比,获得了相似或更好的性能。
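KDPO建立在DPO损失之上:把模型当前记住的旧知识当作负样本、要写入的新知识当作正样本。下面用PyTorch给出标准DPO损失的极简示意;KDPO的在线更新与teacher-forcing细节此处不做还原,输入的对数概率也只是假设数值。

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta: float = 0.1):
    """标准DPO损失:-log sigmoid(beta * (Δ_pos - Δ_neg))。
    在KDPO的设定里,pos=要写入的新知识,neg=模型当前的旧知识。"""
    pos_ratio = policy_logp_pos - ref_logp_pos   # log(pi/pi_ref),正样本
    neg_ratio = policy_logp_neg - ref_logp_neg   # log(pi/pi_ref),负样本
    return -F.logsigmoid(beta * (pos_ratio - neg_ratio)).mean()

# 假设的batch对数概率(实际中由策略模型与参考模型对完整回复逐token求和得到)
policy_pos = torch.tensor([-4.2, -3.8])
policy_neg = torch.tensor([-3.5, -3.9])
ref_pos = torch.tensor([-4.5, -4.0])
ref_neg = torch.tensor([-3.4, -3.7])
print(dpo_loss(policy_pos, policy_neg, ref_pos, ref_neg))
```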

[NLP-37] GEB-1.3B: Open Lightweight Large Language Model
[NLP-37] GEB-1.3B:开放轻量级大型语言模型

链接: https://arxiv.org/abs/2406.09900
作者: Jie Wu,Yufeng Zhu,Lei Shen,Xuqing Lu
关键词: Recently developed large, demonstrated impressive abilities, Recently developed, Llama have demonstrated, developed large language
中文关键词: 最近开发的大型语言,展示了令人印象深刻的能力,最近开发的大型语言
类目: Computation and Language (cs.CL)
备注: GEB-1.3B technical report

点击查看摘要

Abstract:Recently developed large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated impressive abilities, and even surpass human-level performance in several tasks. Despite their success, the resource-intensive demands of these models, requiring significant computational power for both training and inference, limit their deployment to high-performance servers. Additionally, the extensive calculation requirements of the models often lead to increased latency in response times. With the increasing need for LLMs to operate efficiently on CPUs, research about lightweight models that are optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a lightweight LLM trained on 550 billion tokens in both Chinese and English languages. We employ novel training techniques, including ROPE, Group-Query-Attention, and FlashAttention-2, to accelerate training while maintaining model performance. Additionally, we fine-tune the model using 10 million samples of instruction data to enhance alignment. GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparative models such as MindLLM-1.3B and TinyLLaMA-1.1B. Notably, the FP32 version of GEB-1.3B achieves commendable inference times on CPUs, with ongoing efforts to further enhance speed through advanced quantization techniques. The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs, promising to foster further research and innovation in the field.
摘要:最近发展起来的大型语言模型(LLM),如ChatGPT、Claude和Llama,已经显示出令人印象深刻的能力,甚至在一些任务中超过了人类的水平。尽管它们取得了成功,但这些模型的资源密集型需求,需要大量的计算能力来进行训练和推理,将它们的部署限制在高性能服务器上。此外,模型的大量计算要求通常会导致响应时间延迟增加。随着对LLMS在CPU上高效运行的需求的增加,针对CPU推理进行优化的轻量级模型的研究应运而生。在这项工作中,我们介绍了GEB-1.3B,一个轻量级的LLM,训练了5500亿个中英语言的标记。我们使用了新颖的训练技术,包括绳索、组查询-注意和FlashAttent-2,以在保持模型性能的同时加快训练速度。此外,我们使用1000万个指令数据样本对模型进行了微调,以增强比对。GEB-1.3B在MMLU、C-EVAL和CMMLU等通用基准测试中表现突出,表现优于MindLLM-1.3B和TinyLLaMA-1.1B等比较型号。值得注意的是,GEB-1.3B的FP32版本在CPU上实现了值得称赞的推理时间,并不断努力通过先进的量化技术进一步提高速度。GEB-1.3B作为开源模型的发布标志着对轻量级LLMS开发的重大贡献,有望促进该领域的进一步研究和创新。

[NLP-38] 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
[NLP-38] 3D-RPE:通过3D旋转位置编码增强长上下文建模

链接: https://arxiv.org/abs/2406.09897
作者: Xindian Ma,Wenyuan Liu,Peng Zhang,Nan Xu
关键词: Bloch Sphere representation, rotary position encoding, Bloch Sphere, rotary position, position encoding
中文关键词: 布洛赫球表示,旋转位置编码,布洛赫球,旋转位置,位置编码
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Inspired by the Bloch Sphere representation, we propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE), with two major advantages for modeling long contexts: controllable long-term decay and improved position resolution. For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size, ensuring the modeling of relative positional information between tokens at a distant relative position. For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE. We have conducted experiments on long-context Natural Language Understanding (NLU) and long-sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially in long-context NLU tasks.
摘要:受布洛赫球表示的启发,我们提出了一种新型的三维球面旋转位置编码,称为3D旋转位置编码(3D-RPE)。3D-RPE是广泛使用的2D旋转位置编码(RoPE)的升级版本,对于建模长上下文具有两个主要优势:可控的长期衰减和更高的位置分辨率。对于可控的长期衰减,3D-RPE允许在块大小内调节长期衰减,确保对相距较远的词元之间的相对位置信息进行建模。对于位置分辨率,3D-RPE可以减轻RoPE上位置插值引起的位置分辨率下降。我们在长上下文自然语言理解(NLU)和长序列语言建模(LM)任务上进行了实验。实验结果表明,3D-RPE相比RoPE实现了性能提升,尤其是在长上下文NLU任务中。
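摘要没有给出3D-RPE的具体公式,这里只给出它所扩展的标准旋转位置编码(RoPE)的最小实现,方便理解“旋转位置编码”这一基础;3D-RPE在布洛赫球上的三维构造请以论文为准。

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """标准RoPE最小实现:x形状为 (seq_len, dim),dim为偶数。
    每对相邻维度 (2i, 2i+1) 按与位置成正比的角度做二维旋转。"""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated_even = x1 * cos - x2 * sin
    rotated_odd = x1 * sin + x2 * cos
    out = torch.empty_like(x)
    out[:, 0::2], out[:, 1::2] = rotated_even, rotated_odd
    return out

q = torch.randn(8, 64)          # 8个位置、64维的查询向量
print(apply_rope(q).shape)      # torch.Size([8, 64])
```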

[NLP-39] A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation
[NLP-39] 用于低资源多域对话生成的统一数据增强框架

链接: https://arxiv.org/abs/2406.09881
作者: Yongkang Liu,Ercong Nie,Zheng Hua,Zifeng Ding,Daling Wang,Yifei Zhang,Hinrich Schütze
关键词: systems heavily rely, textbf, training, dialogue systems heavily, Current
中文关键词: 系统严重依赖,textBF,培训,对话系统严重,当前
类目: Computation and Language (cs.CL)
备注: 17pages,ECML-PKDD

点击查看摘要

Abstract:Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD^2G. The AMD^2G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a de-domaining data processing technique used to remove domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn common expression patterns in different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMD^2G achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMD^2G as a viable alternative solution for low-resource multi-domain dialogue generation. Code and data associated with our work are available on the GitHub repository.
摘要:当前最先进的对话系统严重依赖于大量的训练数据集。然而,在特定领域训练数据集不足或完全缺失的领域中,就会出现挑战。为了应对这一挑战,我们提出了一种新的多领域对话生成数据增强框架,称为AMD^2G。AMD^2G框架由数据增强过程和两阶段训练方法组成:领域无关训练和领域自适应训练。我们假设领域语料库是领域无关特征与领域特定特征的混合,不同领域之间共享某些表示模式。领域无关训练旨在使模型能够学习这些共同的表达模式。为了构建领域无关的对话语料库,我们使用了一种去领域化的数据处理技术来去除领域特定特征。通过减轻领域特定特征的影响,在去领域化语料库上训练的模型可以有效地学习不同领域的共同表达模式。随后,我们通过领域自适应训练将学习到的领域无关特征适配到目标领域。我们在五个不同领域的汉语对话数据集上进行了实验,结果表明,AMD^2G的性能优于在目标领域语料库上的直接训练以及在全部五个领域语料库上的联合训练。我们的工作表明AMD^2G是低资源多领域对话生成的一个可行的替代方案。与我们的工作相关的代码和数据可在GitHub存储库中获取。

[NLP-40] LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data
[NLP-40] LUMA:从不确定和多模式数据中学习的基准数据集

链接: https://arxiv.org/abs/2406.09864
作者: Grigor Bezirganyan,Sana Sellami,Laure Berti-Équille,Sébastien Fournier
关键词: diverse information sources, integrating diverse information, Learning enhances decision-making, Multimodal Deep Learning, Deep Learning enhances
中文关键词: 多样化的信息源,集成多样化的信息,学习增强决策,多模式深度学习,深度学习增强
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as texts, images, audio, and videos. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We introduce LUMA, a unique benchmark dataset, featuring audio, image, and textual data from 50 classes, for learning from uncertain and multimodal data. It extends the well-known CIFAR 10/100 dataset with audio samples extracted from three audio corpora, and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to achieve and tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package including the functions for generating multiple variants of the dataset with controlling the diversity of the data, the amount of noise for each modality, and adding out-of-distribution samples. A baseline pre-trained model is also provided alongside three uncertainty quantification methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning. This comprehensive dataset and its tools are intended to promote and support the development and benchmarking of trustworthy and robust multimodal deep learning approaches.
摘要:多模式深度学习通过整合文本、图像、音频和视频等多种信息源来提高决策能力。要开发可靠的多模式方法,必须了解不确定性如何影响这些模型。我们引入了LUMA,这是一个独特的基准数据集,具有来自50个类别的音频、图像和文本数据,用于从不确定和多模式数据中学习。它用从三个音频语料库提取的音频样本和使用Gema-7B大型语言模型(LLM)生成的文本数据扩展了著名的CIFAR 10/100数据集。LUMA数据集允许对不同类型和程度的不确定性进行受控注入,以实现并量身定制特定的实验和基准举措。Luma还可以作为一个Python包提供,其中包括用于生成数据集的多个变量的函数,以及控制数据的多样性、每个通道的噪声量以及添加分布外样本的函数。文中还给出了一个基线预训练模型和三种不确定性量化方法:蒙特卡罗丢弃、深度集成和可靠的冲突多视角学习。这一全面的数据集及其工具旨在促进和支持可靠和稳健的多模式深度学习方法的开发和基准确定。

[NLP-41] On the Encoding of Gender in Transformer-based ASR Representations
[NLP-41] 基于Transformer的ASR表示中的性别编码

链接: https://arxiv.org/abs/2406.09855
作者: Aravind Krishnan,Badr M. Abdullah,Dietrich Klakow
关键词: existing literature relies, uncover gender biases, ASR models, transformer-based ASR models, transcript generation
中文关键词: 现有文献依赖、揭示性别偏见、ASR模型、基于Transformer的ASR模型、转录本生成
类目: Computation and Language (cs.CL)
备注: Accepted at Interspeech 2024

点击查看摘要

Abstract:While existing literature relies on performance differences to uncover gender biases in ASR models, a deeper analysis is essential to understand how gender is encoded and utilized during transcript generation. This work investigates the encoding and utilization of gender in the latent representations of two transformer-based ASR models, Wav2Vec2 and HuBERT. Using linear erasure, we demonstrate the feasibility of removing gender information from each layer of an ASR model and show that such an intervention has minimal impacts on the ASR performance. Additionally, our analysis reveals a concentration of gender information within the first and last frames in the final layers, explaining the ease of erasing gender in these layers. Our findings suggest the prospect of creating gender-neutral embeddings that can be integrated into ASR frameworks without compromising their efficacy.
摘要:虽然现有文献依赖于性能差异来揭示ASR模型中的性别偏见,但要了解在转录生成过程中性别是如何被编码和利用的,还需要更深入的分析。这项工作研究了两个基于Transformer的ASR模型Wav2Vec2和HuBERT的潜在表示中性别的编码和利用。使用线性擦除,我们证明了从ASR模型的每一层中删除性别信息的可行性,并表明这种干预对ASR性能的影响最小。此外,我们的分析揭示了性别信息集中在最后几层的首尾帧中,这解释了在这些层中消除性别信息的容易性。我们的研究结果表明,可以创建不带性别信息的嵌入,并将其集成到ASR框架中而不损害其功效。
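“线性擦除”的一个概念性示意:假设已经用线性探针学到了表示空间中的“性别方向”,然后把每个隐藏表示在该方向上的分量投影掉。实际研究中常用INLP、LEACE等方法,论文逐层擦除的具体做法可能不同,以下仅作演示,数据均为随机生成。

```python
import numpy as np

def erase_direction(reps: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """把表示 reps (n, d) 中沿 direction (d,) 的线性分量投影掉。"""
    w = direction / np.linalg.norm(direction)
    return reps - np.outer(reps @ w, w)

rng = np.random.default_rng(0)
reps = rng.normal(size=(100, 16))            # 假设的某一层ASR隐藏表示
gender_dir = rng.normal(size=16)             # 假设已由线性探针学到的"性别方向"
cleaned = erase_direction(reps, gender_dir)
# 擦除后,表示在该方向上的投影应接近0
print(np.abs(cleaned @ (gender_dir / np.linalg.norm(gender_dir))).max())
```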

[NLP-42] Rapport-Driven Virtual Agent: Rapport Building Dialogue Strategy for Improving User Experience at First Meeting
[NLP-42] 融洽驱动的虚拟代理:改善第一次会议用户体验的融洽建立对话策略

链接: https://arxiv.org/abs/2406.09839
作者: Muhammad Yeza Baihaqi,Angel García Contreras,Seiya Kawano,Koichiro Yoshino
关键词: conversational aspect focusing, relationship building, collaborative tasks, conversational aspect, aspect focusing
中文关键词: 对话方面专注、关系建立、协作任务、对话方面、方面专注
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: will be presented at INTERSPEECH 2024

点击查看摘要

Abstract:Rapport is known as a conversational aspect focusing on relationship building, which influences outcomes in collaborative tasks. This study aims to establish human-agent rapport through small talk by using a rapport-building strategy. We implemented this strategy for the virtual agents based on dialogue strategies by prompting a large language model (LLM). In particular, we utilized two dialogue strategies-predefined sequence and free-form-to guide the dialogue generation framework. We conducted analyses based on human evaluations, examining correlations between total turn, utterance characters, rapport score, and user experience variables: naturalness, satisfaction, interest, engagement, and usability. We investigated correlations between rapport score and naturalness, satisfaction, engagement, and conversation flow. Our experimental results also indicated that using free-form to prompt the rapport-building strategy performed the best in subjective scores.
摘要:融洽关系被认为是一个注重关系建立的对话方面,它会影响协作任务的结果。这项研究旨在通过使用融洽关系建立策略,通过闲聊建立人际关系。我们通过提示大型语言模型(LLM),基于对话策略为虚拟代理实施了该策略。特别是,我们利用了两种对话策略–预定义序列和自由形式–来指导对话生成框架。我们基于人类评估进行了分析,检查了总转弯、话语特征、融洽得分和用户体验变量(自然性、满意度、兴趣、参与度和可用性)之间的相关性。我们调查了融洽评分与自然性、满意度、参与度和对话流量之间的相关性。我们的实验结果还表明,使用自由形式来促进融洽建立策略在主观得分方面表现最好。

[NLP-43] Federated Learning driven Large Language Models for Swarm Intelligence: A Survey
[NLP-43] 联邦学习驱动的群体智能大型语言模型:调查

链接: https://arxiv.org/abs/2406.09831
作者: Youyang Qu
关键词: training large language, large language models, large language, addressing data privacy, offers a compelling
中文关键词: 训练大型语言、大型语言模型、大型语言、解决数据隐私问题,提供了令人信服的
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Federated learning (FL) offers a compelling framework for training large language models (LLMs) while addressing data privacy and decentralization challenges. This paper surveys recent advancements in the federated learning of large language models, with a particular focus on machine unlearning, a crucial aspect for complying with privacy regulations like the Right to be Forgotten. Machine unlearning in the context of federated LLMs involves systematically and securely removing individual data contributions from the learned model without retraining from scratch. We explore various strategies that enable effective unlearning, such as perturbation techniques, model decomposition, and incremental learning, highlighting their implications for maintaining model performance and data privacy. Furthermore, we examine case studies and experimental results from recent literature to assess the effectiveness and efficiency of these approaches in real-world scenarios. Our survey reveals a growing interest in developing more robust and scalable federated unlearning methods, suggesting a vital area for future research in the intersection of AI ethics and distributed machine learning technologies.
摘要:联合学习(FL)为训练大型语言模型(LLM)提供了一个引人注目的框架,同时解决了数据隐私和分散化的挑战。本文综述了大型语言模型的联合学习的最新进展,特别关注机器遗忘,这是遵守隐私法规的关键方面,如被遗忘权。在联合LLMS的上下文中,机器遗忘涉及系统和安全地从学习的模型中移除个人数据贡献,而无需从头开始重新训练。我们探索了各种实现有效遗忘的策略,如扰动技术、模型分解和增量学习,强调了它们对维护模型性能和数据隐私的影响。此外,我们还检查了最近文献中的案例研究和实验结果,以评估这些方法在真实世界场景中的有效性和效率。我们的调查显示,人们对开发更健壮和可扩展的联合遗忘方法越来越感兴趣,这表明未来在人工智能伦理和分布式机器学习技术的交叉领域有一个重要的研究领域。

[NLP-44] HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning
[NLP-44] HiP注意力:具有分层注意力修剪的稀疏次二次注意力

链接: https://arxiv.org/abs/2406.09827
作者: Heejun Lee,Geon Park,Youngwan Lee,Jina Kim,Wonyoung Jeong,Myeongjae Jeon,Sung Ju Hwang
关键词: multi-modal question answering, increasing sequence lengths, handling complex tasks, modern large language, question answering
中文关键词: 多模式问答、增加序列长度、处理复杂任务、现代大型语言、问答
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 26 pages, 15 figures

点击查看摘要

Abstract:In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism’s quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from O(T^2) to O(T log T) and the space complexity from O(T^2) to O(T). To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-k most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility to many long-context LLM applications previously infeasible.
摘要:在现代大语言模型中,增加序列长度是提高它们在处理复杂任务(如多模式问答)时的理解力和一致性的关键挑战。然而,由于传统注意机制的二次时间和空间复杂性,使用LLMS处理长上下文序列的成本高得令人望而却步,并且上下文窗口大小受到GPU存储器的限制。尽管最近的研究提出了线性和稀疏注意机制来解决这个问题,但它们在现实世界中的适用性往往受到重新训练预先训练的模型的需要的限制。对此,我们提出了一种新的方法–分层剪枝注意力(HIP),它同时将训练和推理的时间复杂度从O(T^2)降低到O(T\logT),将空间复杂度从O(T^2)降低到O(T)。为此,我们设计了一种动态稀疏注意机制,该机制通过一种新颖的树形搜索算法为给定的查询动态生成注意掩码。HIP是无需训练的,因为它只利用预先训练的注意力分数来确定每个查询的前k个最重要元素的位置。此外,与基于滑动窗口的子二次注意方法(如StreamingLLM)不同,该方法确保了标记不会被忽略。在不同的真实世界基准测试上的广泛实验表明,HIP显著减少了提示(即预填充)和解码延迟以及内存使用,同时保持了高性能而几乎没有降级。由于HIP允许预先训练的LLM在商用GPU上扩展到数百万令牌,而不需要额外的工程,因为它易于即插即用部署,我们相信我们的工作将产生巨大的实际影响,为许多以前不可行的长上下文LLM应用打开了可能性。
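为帮助理解“每个查询只保留k个最重要位置”的稀疏注意力思想,下面给出一个朴素的top-k稀疏注意力示意。注意:HiP的核心贡献在于用树搜索式的层次剪枝以近似代价找到这k个位置,而这里为了演示仍做了全量打分,并非论文算法本身。

```python
import torch

def topk_sparse_attention(q, k, v, top_k: int):
    """朴素top-k稀疏注意力示意:q, k, v 形状均为 (T, d)。"""
    scores = q @ k.T / (q.shape[-1] ** 0.5)                     # (T, T) 全量打分,仅作演示
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)          # 每个查询保留k个键
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk_idx, topk_scores)                    # 其余位置置为-inf
    attn = torch.softmax(mask, dim=-1)
    return attn @ v

T, d = 16, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
print(topk_sparse_attention(q, k, v, top_k=4).shape)            # torch.Size([16, 8])
```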

[NLP-45] Retrieval Augmented Fact Verification by Synthesizing Contrastive Arguments
[NLP-45] 通过综合对比论点实现检索增强的事实验证

链接: https://arxiv.org/abs/2406.09815
作者: Zhenrui Yue,Huimin Zeng,Lanyu Shang,Yifan Liu,Yang Zhang,Dong Wang
关键词: poses substantial risks, misinformation poses substantial, public interest, rapid propagation, poses substantial
中文关键词: 构成重大风险,错误信息构成重大公共利益,快速传播,构成重大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024

点击查看摘要

Abstract:The rapid propagation of misinformation poses substantial risks to public interest. To combat misinformation, large language models (LLMs) are adapted to automatically verify claim credibility. Nevertheless, existing methods heavily rely on the embedded knowledge within LLMs and / or black-box APIs for evidence collection, leading to subpar performance with smaller LLMs or upon unreliable context. In this paper, we propose retrieval augmented fact verification through the synthesis of contrasting arguments (RAFTS). Upon input claims, RAFTS starts with evidence retrieval, where we design a retrieval pipeline to collect and re-rank relevant documents from verifiable sources. Then, RAFTS forms contrastive arguments (i.e., supporting or refuting) conditioned on the retrieved evidence. In addition, RAFTS leverages an embedding model to identify informative demonstrations, followed by in-context prompting to generate the prediction and explanation. Our method effectively retrieves relevant documents as evidence and evaluates arguments from varying perspectives, incorporating nuanced information for fine-grained decision-making. Combined with informative in-context examples as prior, RAFTS achieves significant improvements to supervised and LLM baselines without complex prompts. We demonstrate the effectiveness of our method through extensive experiments, where RAFTS can outperform GPT-based methods with a significantly smaller 7B LLM.
摘要:虚假信息的迅速传播给公共利益带来了巨大的风险。为了打击错误信息,人们采用大型语言模型(LLM)来自动验证声明的可信度。然而,现有的方法严重依赖于LLM内部的嵌入知识和/或黑盒API来收集证据,导致在使用较小的LLM或上下文不可靠时性能不佳。在本文中,我们提出了通过合成对比论点进行检索增强的事实验证方法(RAFTS)。对于输入的声明,RAFTS从证据检索开始,我们设计了一个检索管道,从可核实的来源收集相关文件并重新排序。然后,RAFTS根据检索到的证据形成对比论点(即支持或反驳)。此外,RAFTS利用嵌入模型来识别信息丰富的演示示例,再通过上下文提示生成预测和解释。我们的方法有效地检索相关文档作为证据,并从不同角度评估论点,结合细微差别的信息进行细粒度决策。结合信息丰富的上下文示例作为先验,RAFTS无需复杂提示即可显著改进有监督基线和LLM基线。我们通过广泛的实验证明了方法的有效性,其中RAFTS可以用小得多的7B LLM超越基于GPT的方法。

[NLP-46] Pcc-tuning: Breaking the Contrastive Learning Ceiling in Semantic Textual Similarity
[NLP-46] PCC调整:打破语义文本相似性的对比学习天花板

链接: https://arxiv.org/abs/2406.09790
作者: Bowen Zhang,Chunping Li
关键词: Semantic Textual Similarity, Semantic Textual, Textual Similarity, critical research direction, constitutes a critical
中文关键词: 语义文本相似性、语义文本、文本相似性、批判性研究方向,构成批判性
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Semantic Textual Similarity (STS) constitutes a critical research direction in computational linguistics and serves as a key indicator of the encoding capabilities of embedding models. Driven by advances in pre-trained language models and contrastive learning techniques, leading sentence representation methods have already achieved average Spearman’s correlation scores of approximately 86 across seven STS benchmarks in SentEval. However, further improvements have become increasingly marginal, with no existing method attaining an average score higher than 87 on these tasks. This paper conducts an in-depth analysis of this phenomenon and concludes that the upper limit for Spearman’s correlation scores using contrastive learning is 87.5. To transcend this ceiling, we propose an innovative approach termed Pcc-tuning, which employs Pearson’s correlation coefficient as a loss function to refine model performance beyond contrastive learning. Experimental results demonstrate that Pcc-tuning markedly surpasses previous state-of-the-art strategies, raising the Spearman’s correlation score to above 90.
摘要:语义文本相似度是计算语言学的一个重要研究方向,也是衡量嵌入模型编码能力的重要指标。在预先训练的语言模型和对比学习技术的推动下,领先的句子表示方法已经在SentEval的七个STS基准上获得了大约86分的平均Spearman相关性分数。然而,进一步的改进已经变得越来越微不足道,现有的方法在这些任务上的平均得分都没有达到87分以上。本文对这一现象进行了深入的分析,并得出结论:采用对比学习的Spearman相关分数的上限为87.5。为了突破这一限制,我们提出了一种名为PCC-Tuning的创新方法,该方法使用皮尔逊相关系数作为损失函数,以改进模型的性能,使其超越对比学习。实验结果表明,PCC-Tuning明显优于以往最先进的策略,将Spearman的相关性分数提高到90以上。
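Pcc-tuning的关键是直接把皮尔逊相关系数当作损失函数(最大化模型预测相似度与人工打分之间的相关性)。下面是该损失的一个示意实现;模型结构与完整训练流程以论文为准,示例数值为假设。

```python
import torch

def pearson_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """负皮尔逊相关系数作为损失:pred为模型预测的句对相似度,gold为人工标注分数。"""
    pred_c = pred - pred.mean()
    gold_c = gold - gold.mean()
    r = (pred_c * gold_c).sum() / (pred_c.norm() * gold_c.norm() + 1e-8)
    return -r  # 最小化该损失等价于最大化皮尔逊相关

pred = torch.tensor([0.82, 0.40, 0.91, 0.15], requires_grad=True)
gold = torch.tensor([4.5, 2.0, 5.0, 0.8])
loss = pearson_loss(pred, gold)
loss.backward()
print(float(loss))
```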

[NLP-47] OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst
[NLP-47] OSPC:以大型语言模型为催化剂检测有害模因

链接: https://arxiv.org/abs/2406.09779
作者: Jingtao Cao,Zheng Zhang,Hongru Wang,Bin Liang,Hao Wang,Kam-Fai Wong
关键词: rapidly disseminate personal, disseminate personal opinions, propagating social bias, pose significant challenges, bias and prejudice
中文关键词: 迅速传播个人、传播个人观点、传播社会偏见、构成重大挑战、偏见和偏见
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Memes, which rapidly disseminate personal opinions and positions across the internet, also pose significant challenges in propagating social bias and prejudice. This study presents a novel approach to detecting harmful memes, particularly within the multicultural and multilingual context of Singapore. Our methodology integrates image captioning, Optical Character Recognition (OCR), and Large Language Model (LLM) analysis to comprehensively understand and classify harmful memes. Utilizing the BLIP model for image captioning, PP-OCR and TrOCR for text recognition across multiple languages, and the Qwen LLM for nuanced language understanding, our system is capable of identifying harmful content in memes created in English, Chinese, Malay, and Tamil. To enhance the system’s performance, we fine-tuned our approach by leveraging additional data labeled using GPT-4V, aiming to distill the understanding capability of GPT-4V for harmful memes to our system. Our framework achieves top-1 at the public leaderboard of the Online Safety Prize Challenge hosted by AI Singapore, with the AUROC as 0.7749 and accuracy as 0.7087, significantly ahead of the other teams. Notably, our approach outperforms previous benchmarks, with FLAVA achieving an AUROC of 0.5695 and VisualBERT an AUROC of 0.5561.
摘要:模因在互联网上迅速传播个人观点和立场,在传播社会偏见和偏见方面也构成了巨大的挑战。这项研究提出了一种新的方法来检测有害的模因,特别是在新加坡的多文化和多语言背景下。我们的方法集成了图像字幕、光学字符识别(OCR)和大语言模型(LLM)分析,以全面理解和分类有害的模因。利用BLIP模型进行图像字幕识别,PP-OCR和TrOCR用于跨语言文本识别,Qwen LLM用于细微差别语言理解,我们的系统能够识别用英语、汉语、马来语和泰米尔语创建的Meme中的有害内容。为了增强系统的性能,我们通过利用额外的使用GPT-4V标记的数据来微调我们的方法,旨在提取GPT-4V对我们系统有害模因的理解能力。我们的框架在AI新加坡主办的在线安全奖挑战赛的公共排行榜上名列前茅,AUROC为0.7749,准确率为0.7087,明显领先于其他团队。值得注意的是,我们的方法超过了以前的基准,FLAVA实现了0.5695的AUROC,VisualBERT实现了0.5561的AUROC。
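摘要中的系统是“图像描述 + OCR + LLM判断”的流水线。下面用占位函数勾勒这一流程;caption_image、run_ocr、query_llm都是假设的接口,实际系统分别对应BLIP、PP-OCR/TrOCR和Qwen模型,此处不还原真实调用。

```python
def caption_image(image_path: str) -> str:
    """占位:实际应调用BLIP等图像描述模型。"""
    return "a cartoon of two people arguing"

def run_ocr(image_path: str) -> str:
    """占位:实际应调用PP-OCR/TrOCR做多语言文字识别。"""
    return "看看他们又在吵什么"

def query_llm(prompt: str) -> str:
    """占位:实际应调用Qwen等LLM。"""
    return "harmless"

def classify_meme(image_path: str) -> str:
    caption = caption_image(image_path)
    ocr_text = run_ocr(image_path)
    prompt = (
        "根据下面的图像描述和图中文字,判断该表情包是否有害,"
        "只回答 harmful 或 harmless。\n"
        f"图像描述: {caption}\n图中文字: {ocr_text}"
    )
    return query_llm(prompt)

print(classify_meme("meme_001.png"))
```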

[NLP-48] Bootstrapping Language Models with DPO Implicit Rewards
[NLP-48] 具有DPO隐性奖励的引导语言模型

链接: https://arxiv.org/abs/2406.09760
作者: Changyu Chen,Zichen Liu,Chao Du,Tianyu Pang,Qian Liu,Arunesh Sinha,Pradeep Varakantham,Min Lin
关键词: large language models, area of research, large language, active area, implicit reward model
中文关键词: 大型语言模型、研究领域、大型语言、活动领域、隐性奖励模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM model to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate refinements that debias the length of the responses and improve the quality of the preference dataset to further improve our approach. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance than Gemini Pro on AlpacaEval 2, reaching 27.55% length-controlled win rate against GPT-4 Turbo, but with only 8B parameters and no external feedback. Our code is available at this https URL.
摘要:大型语言模型中的人类对齐是一个活跃的研究领域。最近的一项开创性工作,直接偏好优化(DPO),通过绕过人类反馈强化学习(RLHF)的奖励学习阶段,大大简化了过去的工作过程。DPO经过训练后,提供了一种隐性奖励模型。在这项工作中,我们做了一个新的观察,这个隐式奖励模型本身可以用一种自举的方式来进一步对齐LLM。我们的方法是使用当前LLM模型的回报来构建偏好数据集,然后在后续的DPO回合中使用该数据集。我们结合了对响应长度进行去偏向的改进,并改进了偏好数据集的质量,以进一步改进我们的方法。我们的方法名为带有DPO隐含奖励的自对准(DICE),在对准方面有了很大改进,并在AlpacaEval 2上实现了比Gemini Pro更优越的性能,对GPT-4 Turbo的长度控制胜率达到27.55%,但只有8B参数,没有外部反馈。我们的代码可以在这个HTTPS URL上找到。
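DPO训练后的模型隐含一个奖励函数 r(x, y) = β·(log π(y|x) - log π_ref(y|x));DICE用它给当前模型的多个采样回复打分,从中挑出正负样本构成新的偏好对再做下一轮DPO。下面示意隐式奖励的计算与正负样本的选取,对数概率数值均为假设。

```python
import torch

def implicit_reward(policy_logp: torch.Tensor, ref_logp: torch.Tensor, beta: float = 0.1):
    """DPO隐式奖励:beta * (log pi(y|x) - log pi_ref(y|x)),输入为各候选回复的总对数概率。"""
    return beta * (policy_logp - ref_logp)

# 同一提示下4个采样回复的(假设)对数概率
policy_logp = torch.tensor([-35.2, -40.1, -33.7, -38.9])
ref_logp = torch.tensor([-36.0, -39.5, -36.2, -38.0])
rewards = implicit_reward(policy_logp, ref_logp)
chosen = int(rewards.argmax())    # 奖励最高者作为偏好对中的正样本
rejected = int(rewards.argmin())  # 奖励最低者作为负样本
print(rewards.tolist(), chosen, rejected)
```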

[NLP-49] Self-Knowledge Distillation for Learning Ambiguity
[NLP-49] 学习模糊的自我知识提炼

链接: https://arxiv.org/abs/2406.09719
作者: Hancheol Park,Soyeong Jeong,Sukmin Cho,Jong C. Park
关键词: Recent language models, natural language understanding, Recent language, shown remarkable performance, language understanding
中文关键词: 最近的语言模型,自然语言理解,最近的语言,表现出出色的表现,语言理解
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Recent language models have shown remarkable performance on natural language understanding (NLU) tasks. However, they are often sub-optimal when faced with ambiguous samples that can be interpreted in multiple ways, over-confidently predicting a single label without consideration for its correctness. To address this issue, we propose a novel self-knowledge distillation method that enables models to learn label distributions more accurately by leveraging knowledge distilled from their lower layers. This approach also includes a learning phase that re-calibrates the unnecessarily strengthened confidence for training samples judged as extremely ambiguous based on the distilled distribution knowledge. We validate our method on diverse NLU benchmark datasets and the experimental results demonstrate its effectiveness in producing better label distributions. Particularly, through the process of re-calibrating the confidence for highly ambiguous samples, the issue of over-confidence when predictions for unseen samples do not match with their ground-truth labels has been significantly alleviated. This has been shown to contribute to generating better distributions than the existing state-of-the-art method. Moreover, our method is more efficient in training the models compared to the existing method, as it does not involve additional training processes to refine label distributions.
摘要:最近的语言模型在自然语言理解(NLU)任务上表现出了显著的性能。然而,当面对可以用多种方式解释的模棱两可的样本时,它们往往是次优的,过度自信地预测单个标签,而不考虑其正确性。为了解决这个问题,我们提出了一种新的自知识蒸馏方法,该方法使模型能够利用从其较低层提取的知识来更准确地学习标签分布。该方法还包括学习阶段,该学习阶段基于提取的分布知识来重新校准对于被判断为极端模糊的训练样本的不必要的增强的置信度。我们在不同的NLU基准数据集上对我们的方法进行了验证,实验结果表明该方法在生成更好的标签分布方面是有效的。特别是,通过重新校准高度模糊样本的置信度,对看不见的样本的预测与其基本事实标签不符时的过度自信问题已得到显著缓解。这已被证明有助于生成比现有最先进的方法更好的分发。此外,与现有方法相比,我们的方法在训练模型方面更有效,因为它不需要额外的训练过程来细化标签分布。

[NLP-50] UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages
[NLP-50] UniBridge:低资源语言跨语言迁移学习的统一方法

链接: https://arxiv.org/abs/2406.09717
作者: Trinh Pham,Khoi M. Le,Luu Anh Tuan
关键词: Cross-Lingual Transfer Learning, Transfer Learning, Learning with Optimized, Cross-Lingual Transfer, comprehensive approach developed
中文关键词: 跨语言迁移学习、迁移学习、优化学习、跨语言迁移、开发的综合方法
类目: Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:In this paper, we introduce UniBridge (Cross-Lingual Transfer Learning with Optimized Embeddings and Vocabulary), a comprehensive approach developed to improve the effectiveness of Cross-Lingual Transfer Learning, particularly in languages with limited resources. Our approach tackles two essential elements of a language model: the initialization of embeddings and the optimal vocabulary size. Specifically, we propose a novel embedding initialization method that leverages both lexical and semantic alignment for a language. In addition, we present a method for systematically searching for the optimal vocabulary size, ensuring a balance between model complexity and linguistic coverage. Our experiments across multilingual datasets show that our approach greatly improves the F1-Score in several languages. UniBridge is a robust and adaptable solution for cross-lingual systems in various languages, highlighting the significance of initializing embeddings and choosing the right vocabulary size in cross-lingual environments.
摘要:在本文中,我们介绍了UniBridge(优化嵌入和词汇的跨语言迁移学习),这是一种综合的方法,旨在提高跨语言迁移学习的有效性,特别是在资源有限的语言中。我们的方法解决了语言模型的两个基本元素:嵌入的初始化和最佳词汇大小。具体地说,我们提出了一种新的嵌入初始化方法,该方法同时利用了语言的词汇和语义对齐。此外,我们还提出了一种系统地搜索最佳词汇量的方法,确保了模型复杂性和语言覆盖率之间的平衡。我们在多语言数据集上的实验表明,我们的方法极大地提高了几种语言的F1分数。UniBridge是一种适用于各种语言的跨语言系统的健壮且适应性强的解决方案,突出了在跨语言环境中初始化嵌入和选择正确词汇表大小的重要性。

[NLP-51] Detecting Response Generation Not Requiring Factual Judgment
[NLP-51] 检测不需要事实判断的响应生成

链接: https://arxiv.org/abs/2406.09702
作者: Ryohei Kamei,Daiki Shiono,Reina Akama,Jun Suzuki
关键词: large language models, remarkable development, development of large, large language, language models
中文关键词: 大型语言模型,显着的发展,大型语言的发展,语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the remarkable development of large language models (LLMs), ensuring the factuality of output has become a challenge. However, having all the contents of the response with given knowledge or facts is not necessarily a good thing in dialogues. This study aimed to achieve both attractiveness and factuality in a dialogue response for which a task was set to predict sentences that do not require factual correctness judgment such as agreeing, or personal opinions/feelings. We created a dataset, dialogue dataset annotated with fact-check-needed label (DDFC), for this task via crowdsourcing, and classification tasks were performed on several models using this dataset. The model with the highest classification accuracy could yield about 88% accurate classification results.
摘要:随着大型语言模型(LLM)的显着发展,确保输出的真实性已成为一项挑战。然而,在对话中,拥有特定知识或事实的回复的所有内容并不一定是一件好事。这项研究的目的是在对话反应中同时实现吸引力和真实性,其中一项任务被设置为预测不需要事实正确性判断(例如同意或个人观点/感受)的句子。我们通过众包为这项任务创建了一个数据集,即用事实检查需要标签(DDFC)注释的对话数据集,并使用该数据集对多个模型执行了分类任务。分类准确率最高的模型可以产生约88%的准确分类结果。

[NLP-52] FreeCtrl: Constructing Control Centers with Feedforward Layers for Learning-Free Controllable Text Generation
[NLP-52] FreeCtrl:用前馈层构建控制中心,实现免学习的可控文本生成

链接: https://arxiv.org/abs/2406.09688
作者: Zijian Feng,Hanzhang Zhou,Zixiao Zhu,Kezhi Mao
关键词: Controllable text generation, craft texts adhering, Controllable text, traditionally employing learning-based, employing learning-based techniques
中文关键词: 可控文本生成,工艺文本遵守,可控文本,传统上采用基于学习的,采用基于学习的技术
类目: Computation and Language (cs.CL)
备注: ACL 2024

点击查看摘要

Abstract:Controllable text generation (CTG) seeks to craft texts adhering to specific attributes, traditionally employing learning-based techniques such as training, fine-tuning, or prefix-tuning with attribute-specific datasets. These approaches, while effective, demand extensive computational and data resources. In contrast, some proposed learning-free alternatives circumvent learning but often yield inferior results, exemplifying the fundamental machine learning trade-off between computational expense and model efficacy. To overcome these limitations, we propose FreeCtrl, a learning-free approach that dynamically adjusts the weights of selected feedforward neural network (FFN) vectors to steer the outputs of large language models (LLMs). FreeCtrl hinges on the principle that the weights of different FFN vectors influence the likelihood of different tokens appearing in the output. By identifying and adaptively adjusting the weights of attribute-related FFN vectors, FreeCtrl can control the output likelihood of attribute keywords in the generated content. Extensive experiments on single- and multi-attribute control reveal that the learning-free FreeCtrl outperforms other learning-free and learning-based methods, successfully resolving the dilemma between learning costs and model performance.
摘要:可控文本生成(CTG)致力于生成符合特定属性的文本,传统上使用基于学习的技术,如利用特定属性的数据集进行训练、微调或前缀调整。这些方法虽然有效,但需要大量的计算和数据资源。相比之下,一些提出的无学习替代方案绕过了学习,但往往产生较差的结果,例证了计算成本和模型效率之间的基本机器学习权衡。为了克服这些限制,我们提出了一种无学习的方法–FreeCtrl,它动态地调整选定的前馈神经网络(FFN)向量的权重,以指导大型语言模型(LLM)的输出。FreeCtrl取决于不同FFN向量的权重影响不同标记出现在输出中的可能性的原理。通过识别和自适应调整属性相关FFN向量的权重,FreeCtrl可以控制生成内容中属性关键字的输出似然。在单属性和多属性控制上的大量实验表明,无学习的FreeCtrl方法优于其他无学习和基于学习的方法,成功地解决了学习代价和模型性能之间的矛盾。
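FreeCtrl的思路是调整与目标属性相关的FFN向量的权重,从而在不训练的情况下提高属性关键词的生成概率。下面用一个前向钩子演示“按固定系数放大选定FFN隐藏单元激活”的概念;如何识别属性相关单元、如何自适应调整系数才是论文的核心,这里只是一个假设性的玩具示例,模块与参数均为虚构。

```python
import torch
import torch.nn as nn

class ToyFFN(nn.Module):
    """极简FFN层,代替真实LLM中的前馈子层。"""
    def __init__(self, dim=16, hidden=64):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

ffn = ToyFFN()
attr_units = [3, 17, 42]   # 假设:已被识别为与目标属性相关的隐藏单元
scale = 3.0                # 假设的放大系数(FreeCtrl中为自适应调整)

def boost_attr_units(module, inputs, output):
    # 放大up投影输出中选定单元的激活,返回值会替换原输出
    boosted = output.clone()
    boosted[..., attr_units] *= scale
    return boosted

handle = ffn.up.register_forward_hook(boost_attr_units)
x = torch.randn(2, 16)
print(ffn(x).shape)        # torch.Size([2, 16])
handle.remove()
```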

[NLP-53] Evaluating ChatGPT-4 Vision on Brazils National Undergraduate Computer Science Exam
[NLP-53] 巴西国家本科计算机科学考试ChatGPT-4愿景评估

链接: https://arxiv.org/abs/2406.09671
作者: Nabor C. Mendonça
关键词: Large Language Models, Large Language, National Undergraduate Exam, Computer Science section, Language Models
中文关键词: 大型语言模型,大型语言,国家本科考试,计算机科学部分,语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for publication

点击查看摘要

Abstract:The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI’s most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil’s 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam’s open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model’s reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. The involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model’s accuracy and ensuring the fairness of high-stakes educational exams. The paper’s research materials are publicly available at this https URL.
摘要:最近大型语言模型(LLM)集成了视觉能力,这有可能在科学技术教育中发挥关键作用,因为图形、图表和表格等视觉元素常被用来改善学习体验。这项研究考察了研究开展时OpenAI最先进的视觉模型ChatGPT-4 Vision在巴西2021年国家本科考试(ENADE)计算机科学学士部分的表现。通过以原始图像格式向模型呈现考试的开放题和选择题,并允许其针对不同的标准答案重新作答,我们得以在涉及文本和视觉内容的大规模学术测评中评估该模型的推理与自我反思能力。ChatGPT-4 Vision的表现明显优于考生平均水平,位于最高分数前10%之列。虽然它在包含视觉元素的题目上表现出色,但在题意理解、逻辑推理和视觉辨识方面也遇到了挑战。由独立专家小组复核模型与标准答案不一致的案例后,发现一些题目设计欠佳、含有模糊或有歧义的表述,提示未来考试亟需改进题目设计。我们的发现表明,尽管ChatGPT-4 Vision在多模态学术测评中展现出潜力,但人工监督对于验证模型的准确性和保障高风险教育考试的公平性仍然至关重要。这篇论文的研究材料可以通过这个https URL公开获得。

[NLP-54] Learning Language Structures through Grounding
[NLP-54] 通过基础学习语言结构

链接: https://arxiv.org/abs/2406.09662
作者: Freda Shi
关键词: Language, highly structured, learn language structures, structures, propose
中文关键词: 语言,高度结构化,学习语言结构,结构,提出
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Ph.D. Thesis

点击查看摘要

Abstract:Language is highly structured, with syntactic and semantic structures, to some extent, agreed upon by speakers of the same language. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and generalize to sentences that contain unseen words. Motivated by human language learning, in this dissertation, we consider a family of machine learning tasks that aim to learn language structures through grounding. We seek distant supervision from other data sources (i.e., grounds), including but not limited to other modalities (e.g., vision), execution results of programs, and other languages. We demonstrate the potential of this task formulation and advocate for its adoption through three schemes. In Part I, we consider learning syntactic parses through visual grounding. We propose the task of visually grounded grammar induction, present the first models to induce syntactic structures from visually grounded text and speech, and find that the visual grounding signals can help improve the parsing quality over language-only models. As a side contribution, we propose a novel evaluation metric that enables the evaluation of speech parsing without text or automatic speech recognition systems involved. In Part II, we propose two execution-aware methods to map sentences into corresponding semantic structures (i.e., programs), significantly improving compositional generalization and few-shot program synthesis. In Part III, we propose methods that learn language structures from annotations in other languages. Specifically, we propose a method that sets a new state of the art on cross-lingual word alignment. We then leverage the learned word alignments to improve the performance of zero-shot cross-lingual dependency parsing, by proposing a novel substructure-based projection method that preserves structural knowledge learned from the source language.
摘要:语言是高度结构化的,其句法和语义结构在一定程度上得到同一语言使用者的共同认可。借助对这些结构的隐性或显性认识,人类可以高效地学习和使用语言,并能泛化到包含未见过单词的句子。受人类语言学习的启发,本论文考虑了一类旨在通过接地(grounding)学习语言结构的机器学习任务:我们从其他数据源(即接地依据)寻求远程监督,包括但不限于其他模态(例如视觉)、程序的执行结果以及其他语言。我们展示了这种任务形式的潜力,并通过三类方案倡导采用它。在第一部分,我们考虑通过视觉接地学习句法分析:提出了视觉接地语法归纳任务,给出了首批从视觉接地的文本和语音中归纳句法结构的模型,并发现视觉接地信号有助于在纯语言模型之上提升句法分析质量;作为附带贡献,我们提出了一种新的评价指标,使得无需文本或自动语音识别系统即可评估语音句法分析。在第二部分,我们提出了两种执行感知的方法,将句子映射到相应的语义结构(即程序),显著改善了组合泛化和少样本程序合成。在第三部分,我们提出了从其他语言的标注中学习语言结构的方法:具体而言,我们提出了一种在跨语言词对齐上达到新的最高水平的方法,并利用学到的词对齐,通过一种保留源语言结构知识的基于子结构的投影方法,提升零样本跨语言依存句法分析的性能。

[NLP-55] Multi-Modal Retrieval For Large Language Model Based Speech Recognition
[NLP-55] 基于大语言模型的语音识别的多模式检索

链接: https://arxiv.org/abs/2406.09618
作者: Jari Kolehmainen,Aditya Gourav,Prashanth Gurunath Shivakumar,Yile Gu,Ankur Gandhe,Ariya Rastrow,Grant Strimel,Ivan Bulyko
关键词: widely adopted approach, improving language models, language models leveraging, leveraging external information, models leveraging external
中文关键词: 广泛采用的方法、改进语言模型、利用语言模型、利用外部信息、利用外部信息的模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Retrieval is a widely adopted approach for improving language models leveraging external information. As the field moves towards multi-modal large language models, it is important to extend the pure text based methods to incorporate other modalities in retrieval as well for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text based retrieval, and yields up to 50% improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
摘要:检索是一种被广泛采用的利用外部信息改进语言模型的方法。随着该领域向多模态大语言模型发展,有必要扩展基于纯文本的方法,使检索也能纳入其他模态,从而服务于各种机器学习任务和数据类型的应用。在这项工作中,我们提出了两种多模态检索方法:kNN-LM和交叉注意力技术。我们将这些检索方法应用于可访问外部信息的自动语音识别任务,以实验验证了其有效性。在该设置下,基于语音的多模态检索优于基于文本的检索,相对于多模态语言模型基线,词错误率最多降低了50%。此外,我们在Spoken-Squad问答数据集上取得了最先进的识别结果。
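
作为参考,下面给出kNN-LM式插值的一个极简示意(非论文实现,数据存储、温度和λ等均为假设):检索分布由k个最近邻条目所存的下一token构成,再与基础LM分布线性混合。

```python
# 示意:kNN检索分布与基础LM分布的插值。keys/next_tokens 构成一个假设的数据存储。
import torch

def knn_lm_interpolate(lm_logits, query, keys, next_tokens, vocab_size,
                       k=8, lam=0.25, temp=1.0):
    """lm_logits: (V,) 基础LM的logits;query: (d,) 当前上下文表示;
    keys: (N, d) 数据存储的键;next_tokens: (N,) 每个键对应的下一token id。"""
    dists = torch.cdist(query[None], keys)[0]            # 到每个键的距离 (N,)
    knn = dists.topk(k, largest=False)                    # k个最近邻
    weights = torch.softmax(-knn.values / temp, dim=0)    # 距离越近权重越大
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, next_tokens[knn.indices], weights)
    p_lm = torch.softmax(lm_logits, dim=0)
    return lam * p_knn + (1.0 - lam) * p_lm               # 线性插值后的下一token分布
```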

[NLP-56] Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
[NLP-56] 用于设备定向语音检测的具有融合低秩自适应的多模态大语言模型

链接: https://arxiv.org/abs/2406.09617
作者: Shruti Palaskar,Oggi Rudovic,Sameer Dharur,Florian Pesce,Gautam Krishna,Aswin Sivaraman,Jack Berkowitz,Ahmed Hussen Abdelaziz,Saurabh Adya,Ahmed Tewfik
关键词: Large Language Models, Large Language, Low Rank Adaptation, Language Models, human-like conversations
中文关键词: 大型语言模型、大型语言、低秩适应、语言模型、类人对话
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2024

点击查看摘要

Abstract:Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT by 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
摘要:尽管大型语言模型(LLM)在类人对话方面已展现出潜力,但它们主要是在文本数据上预训练的。引入音频或视频可以提升性能,但收集大规模多模态数据并预训练多模态LLM具有挑战性。为此,我们提出了一种融合低秩自适应(FLoRA)技术,它通过低秩自适应高效地使预训练的单模态LLM能够消费此前未见过的新模态。在设备定向语音检测任务上,使用FLoRA的多模态LLM相比纯文本方法将等错误率(EER)相对降低了22%,并在仅需调整一小部分参数的情况下达到与全量微调(FFT)相当的性能。此外,借助新引入的适配器dropout,FLoRA对缺失数据具有鲁棒性,相比全量微调,EER降低20%,错误接受率降低56%。所提出的方法可以很好地扩展到从16M到3B参数的模型规模。
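
下面是LoRA式适配器加"适配器dropout"的一个极简示意(并非论文的FLoRA实现,秩和丢弃概率均为假设值):冻结的基础线性层外加一条低秩旁路,训练时以一定概率整体丢弃旁路,使模型在新模态缺失时也能退回基础行为。

```python
# 示意:带"适配器dropout"的低秩适配层。rank / adapter_drop_p 为假设的超参数。
import torch
import torch.nn as nn

class LoRALinearWithAdapterDropout(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, adapter_drop_p: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # 冻结预训练权重
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)          # 初始时旁路输出为零,不改变基础模型
        self.adapter_drop_p = adapter_drop_p

    def forward(self, x):
        out = self.base(x)
        if self.training and torch.rand(()) < self.adapter_drop_p:
            return out                         # 整条适配器旁路被丢弃:退化为原始层
        return out + self.B(self.A(x))         # 低秩更新叠加到冻结层的输出上

layer = LoRALinearWithAdapterDropout(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)        # torch.Size([2, 512])
```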

[NLP-57] Analyzing Gender Polarity in Short Social Media Texts with BERT: The Role of Emojis and Emoticons
[NLP-57] 使用BERT分析社交媒体短文本中的性别倾向:表情符号与颜文字的作用

链接: https://arxiv.org/abs/2406.09573
作者: Saba Yousefian Jazi,Amir Mirzaeinia,Sina Yousefian Jazi
关键词: based on BERT, BERT to detect, effort we fine, fine tuned, polarity of twitter
中文关键词: 基于BERT,BERT来检测,我们的努力进行微调,微调,Twitter的两极
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this effort, we fine-tuned different models based on BERT to detect the gender polarity of Twitter accounts. We focused in particular on analyzing the effect of using emojis and emoticons on the performance of our model in the classification task. We were able to demonstrate that the use of these non-word inputs, alongside the mention of other accounts in a short text format such as a tweet, has an impact on detecting the account holder's gender.
摘要:在这项工作中,我们基于BERT微调了不同的模型,以检测Twitter账号的性别倾向。我们特别分析了使用表情符号和颜文字对模型分类性能的影响。我们证明,在推文这类短文本中,使用这些非词语输入以及提及其他账号,会影响对账号持有者性别的判断。

[NLP-58] Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
[NLP-58] Speech ReaLLM -- 通过让模型学会时间流,使用多模态LLM进行实时流式语音识别

链接: https://arxiv.org/abs/2406.09569
作者: Frank Seide,Morrie Doulaty,Yangyang Shi,Yashesh Gaur,Junteng Jia,Chunyang Wu
关键词: make multimodal LLM, multimodal LLM architectures, LLM architectures capable, ASR architecture, ASR architecture designed
中文关键词: 制作多模式LLM、多模式LLM架构、LLM架构能力、ASB架构、ASB架构设计
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We introduce Speech ReaLLM, a new ASR architecture that marries “decoder-only” ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first “decoder-only” ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM (“real-time LLM”) approach, also introduced here for the first time. The idea is inspired by RNN-T: Instead of generating a response only at the end of a user prompt, generate after every input token received in real time (it is often empty). On Librispeech “test”, an 80M Speech ReaLLM achieves WERs of 3.0% and 7.4% in real time (without an external LM or auxiliary loss). This is only slightly above a 3x larger Attention-Encoder-Decoder baseline. We also show that this way, an LLM architecture can learn to represent and reproduce the flow of time; and that a pre-trained 7B LLM can be fine-tuned to do reasonably well on this task.
摘要:我们引入Speech ReaLLM,这是一种将"仅解码器"ASR与RNN-T相结合的新型ASR架构,使多模态LLM架构具备实时流式处理能力。这是第一个无需显式端点检测即可处理连续音频的"仅解码器"ASR架构。Speech ReaLLM是更通用的ReaLLM("实时LLM")方法的一个特例,后者也在此首次提出。其思想受RNN-T启发:不是只在用户提示结束时生成回复,而是在实时收到的每个输入token之后都进行生成(通常为空)。在Librispeech"test"上,一个80M参数的Speech ReaLLM实时取得了3.0%和7.4%的WER(不使用外部LM或辅助损失),仅略逊于一个规模为其3倍的Attention-Encoder-Decoder基线。我们还表明,通过这种方式,LLM架构可以学会表示并再现时间的流动;并且预训练的7B LLM经过微调也能在这项任务上取得相当不错的表现。
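
下面用伪实现勾勒这种"每收到一帧语音就允许模型输出若干token(可能为空)"的流式解码循环(仅为示意,model.decode_step、blank_id 等均为假设的接口,并非论文代码):

```python
# 示意:交错消费语音帧与文本token的流式解码循环。
def stream_decode(model, audio_frames, blank_id, eos_id, max_tokens_per_frame=8):
    history = []        # 按时间交错存放语音帧表示与已生成的文本token
    transcript = []
    for frame in audio_frames:
        history.append(frame)                      # 新到一帧语音
        for _ in range(max_tokens_per_frame):
            token = model.decode_step(history)     # 假设的单步下一token预测接口
            if token == blank_id:                  # "空"输出:等待更多语音再继续
                break
            history.append(token)
            transcript.append(token)
            if token == eos_id:
                return transcript
    return transcript
```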

[NLP-59] Decoding the Diversity: A Review of the Indic AI Research Landscape
[NLP-59] 解码多样性:印度人工智能研究格局回顾

链接: https://arxiv.org/abs/2406.09559
作者: Sankalp KJ,Vinija Jain,Sreyoshi Bhaduri,Tamoghna Roy,Aman Chadha
关键词: large language model, Indic languages, Indic, Sri Lanka, comprehensive overview
中文关键词: 大型语言模型,印度语言,印度,斯里兰卡,全面概述
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 27 pages, 1 figure

点击查看摘要

Abstract:This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. With the tremendous market potential and growing demand for natural language processing (NLP) based applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper deep dives into the recent advancements in Indic generative modeling, contributing with a taxonomy of research directions, tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning existing LLMs, development of corpora, benchmarking and evaluation, as well as publications around specific techniques, tools, and applications. We found that researchers across the publications emphasize the challenges associated with limited data availability, lack of standardization, and the peculiar linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and contributes to the development of more accurate and efficient LLM applications for these languages.
摘要:本文对印度语大语言模型的研究方向进行了全面综述。印度语是指印度次大陆所说的语言,包括印度、巴基斯坦、孟加拉国、斯里兰卡、尼泊尔和不丹等。这些语言有着丰富的文化和语言遗产,全世界有超过15亿人在使用这些语言。随着基于自然语言处理(NLP)的各种语言应用程序的巨大市场潜力和日益增长的需求,印度语的生成式应用程序提出了独特的挑战和研究机会。我们的论文深入探讨了印度生成模型的最新进展,对研究方向进行了分类,列出了84种最近的出版物。本文综述的研究方向包括LLM的开发、对现有LLM的微调、语料库的开发、基准测试和评估,以及关于特定技术、工具和应用的出版物。我们发现,所有出版物的研究人员都强调了与有限的数据可获得性、缺乏标准化以及印度语特有的语言复杂性相关的挑战。这项工作旨在为自然语言处理领域的研究人员和实践者提供宝贵的资源,特别是那些专注于印度语的研究人员和实践者,并有助于为这些语言开发更准确和更有效的LLM应用程序。

[NLP-60] Exploring Syntactic Patterns in Urdu: A Deep Dive into Dependency Analysis
[NLP-60] 探索乌尔都语的句法模式:深入研究依存分析

链接: https://arxiv.org/abs/2406.09549
作者: Nudrat Habib
关键词: process of breaking, components and identifying, Urdu, correct sentence structure, grammatical components
中文关键词: 打破过程、成分和识别、乌尔都语、正确的句子结构、语法成分
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Parsing is the process of breaking a sentence into its grammatical components and identifying the syntactic structure of the sentence. The syntactically correct sentence structure is achieved by assigning grammatical labels to its constituents using lexicon and syntactic rules. In linguistics, parser is extremely useful due to the number of different applications like name entity recognition, QA systems and information extraction, etc. The two most common techniques used for parsing are phrase structure and dependency Structure. Because Urdu is a low-resource language, there has been little progress in building an Urdu parser. A comparison of several parsers revealed that the dependency parsing approach is better suited for order-free languages such as Urdu. We have made significant progress in parsing Urdu, a South Asian language with a complex morphology. For Urdu dependency parsing, a basic feature model consisting of word location, word head, and dependency relation is employed as a starting point, followed by more complex feature models. The dependency tagset is designed after careful consideration of the complex morphological structure of the Urdu language, word order variation, and lexical ambiguity and it contains 22 tags. Our dataset comprises of sentences from news articles, and we tried to include sentences of different complexity (which is quite challenging), to get reliable results. All experiments are performed using MaltParser, exploring all 9 algorithms and classifiers. We have achieved a 70 percent overall best-labeled accuracy (LA), as well as an 84 percent overall best-unlabeled attachment score (UAS) using the Nivreeager algorithm. The comparison of output data with treebank test data that has been manually parsed is then used to carry out error assessment and to identify the errors produced by the parser.
摘要:句法分析是将句子分解为语法成分并识别句子句法结构的过程。借助词典和句法规则为句子成分赋予语法标签,即可得到句法上正确的句子结构。在语言学中,句法分析器非常有用,可应用于命名实体识别、问答系统和信息抽取等多种任务。最常用的两种句法分析技术是短语结构分析和依存结构分析。由于乌尔都语是一种低资源语言,构建乌尔都语句法分析器的进展很少。对几种句法分析器的比较表明,依存句法分析方法更适合乌尔都语这类语序自由的语言。我们在分析乌尔都语这门形态复杂的南亚语言方面取得了重要进展。对于乌尔都语依存句法分析,我们首先采用由词位置、中心词和依存关系组成的基本特征模型作为起点,随后使用更复杂的特征模型。依存标签集是在仔细考虑乌尔都语复杂的形态结构、语序变化和词汇歧义后设计的,共包含22个标签。我们的数据集由新闻文章中的句子组成,并尽量涵盖不同复杂度的句子(这相当有挑战性),以获得可靠的结果。所有实验均使用MaltParser完成,探索了全部9种算法和分类器。使用Nivreeager算法,我们取得了70%的总体最佳标记准确率(LA)以及84%的总体最佳未标记依存得分(UAS)。随后将输出数据与人工解析的树库测试数据进行比较,以进行错误评估并识别解析器产生的错误。
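
摘要中提到的LA与UAS是依存句法分析的常用指标,下面给出按常见定义计算这些指标的一个简单示意(与MaltEval的具体实现细节无关,示例数据为虚构):

```python
# 示意:按token统计UAS(中心词正确)、LA(依存标签正确)与LAS(两者皆正确)。
def attachment_scores(gold, pred):
    """gold / pred: 每个token一个 (head_index, label) 二元组的列表。"""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    la = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, la, las

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]   # 第三个token的中心词预测错误
print(attachment_scores(gold, pred))              # (0.667, 1.0, 0.667)
```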

[NLP-61] A Systematic Review of Generative AI for Teaching and Learning Practice
[NLP-61] 用于教学实践的生成性人工智能的系统回顾

链接: https://arxiv.org/abs/2406.09520
作者: Bayode Ogunleye,Kudirat Ibilola Zakariyyah,Oluwaseun Ajao,Olakunle Olayinka,Hemlata Sharma
关键词: generative artificial intelligence, hotly debated topic, artificial intelligence, debated topic, generative artificial
中文关键词: 生成人工智能,热门话题,人工智能,争议话题,生成人工
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 10 figures, article published in Education Sciences

点击查看摘要

Abstract:The use of generative artificial intelligence (GenAI) in academia is a subjective and hotly debated topic. Currently, there are no agreed guidelines towards the usage of GenAI systems in higher education (HE) and, thus, it is still unclear how to make effective use of the technology for teaching and learning practice. This paper provides an overview of the current state of research on GenAI for teaching and learning in HE. To this end, this study conducted a systematic review of relevant studies indexed by Scopus, using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines. The search criteria revealed a total of 625 research papers, of which 355 met the final inclusion criteria. The findings from the review showed the current state and the future trends in documents, citations, document sources/authors, keywords, and co-authorship. The research gaps identified suggest that while some authors have looked at understanding the detection of AI-generated text, it may be beneficial to understand how GenAI can be incorporated into supporting the educational curriculum for assessments, teaching, and learning delivery. Furthermore, there is a need for additional interdisciplinary, multidimensional studies in HE through collaboration. This will strengthen the awareness and understanding of students, tutors, and other stakeholders, which will be instrumental in formulating guidelines, frameworks, and policies for GenAI usage.
摘要:产生式人工智能(GenAI)在学术界的应用是一个主观而激烈的话题。目前,关于GenAI系统在高等教育(HE)中的使用还没有达成一致的指导方针,因此,如何有效地将该技术用于教学实践仍然不清楚。本文综述了GenAI在高等教育教学中的研究现状。为此,本研究使用系统审查和荟萃分析(PRISMA)指南的首选报告项目,对SCOPUS索引的相关研究进行了系统审查。检索标准共显示625篇研究论文,其中355篇符合最终纳入标准。审查结果显示了文献、引文、文献来源/作者、关键词和合著性方面的现状和未来趋势。发现的研究差距表明,虽然一些作者已经研究了理解人工智能生成的文本的检测,但了解GenAI如何被纳入支持评估、教学和学习交付的教育课程可能是有益的。此外,还需要通过合作在高等教育领域开展更多跨学科、多维度的研究。这将加强学生、教师和其他利益相关者的认识和理解,这将有助于制定GenAI使用的指导方针、框架和政策。

[NLP-62] Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
[NLP-62] Talking Heads:理解Transformer语言模型中的层间通信

链接: https://arxiv.org/abs/2406.09519
作者: Jack Merullo,Carsten Eickhoff,Ellie Pavlick
关键词: information is represented, represented and routed, transformer language models, pass features, model
中文关键词: 信息被表示、表示和路由、Transformer语言模型、传递特征、模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. By analyzing particular mechanism LMs use to accomplish this, we find that it is also used to recall items from a list, and show that this mechanism can explain an otherwise arbitrary-seeming sensitivity of the model to the order of items in the prompt. Specifically, we find that models write into low-rank subspaces of the residual stream to represent features which are then read out by specific later layers, forming low-rank communication channels between layers. By decomposing attention head weight matrices with the Singular Value Decomposition (SVD), we find that previously described interactions between heads separated by one or more layers can be predicted via analysis of their weight matrices. We show that it is possible to manipulate the internal model representations as well as edit model weights based on the mechanism we discover in order to significantly improve performance on our synthetic Laundry List task, which requires recall from a list, often improving task accuracy by over 20%. Our analysis reveals a surprisingly intricate interpretable structure learned from language model pretraining, and helps us understand why sophisticated LMs sometimes fail in simple domains, facilitating future analysis of more complex behaviors.
摘要:虽然已知Transformer语言模型(LM)会将特征从较早的层传递到较晚的层,但模型如何表示并路由这些信息尚未被充分理解。通过分析LM实现这一点所用的一种特定机制,我们发现该机制也被用于从列表中回忆条目,并表明它可以解释模型对提示中条目顺序的一种原本看似任意的敏感性。具体来说,我们发现模型会写入残差流的低秩子空间来表示特征,这些特征随后被特定的后续层读出,从而在层与层之间形成低秩通信通道。通过用奇异值分解(SVD)分解注意力头的权重矩阵,我们发现相隔一层或多层的注意力头之间此前描述过的交互,可以通过分析它们的权重矩阵来预测。我们表明,基于所发现的机制,可以操纵模型的内部表示并编辑模型权重,从而显著提升需要从列表中回忆条目的合成"清单"(Laundry List)任务的性能,任务准确率往往提升超过20%。我们的分析揭示了语言模型预训练学到的出人意料的精细且可解释的结构,有助于理解为什么复杂的LM有时会在简单领域失败,并为未来分析更复杂的行为提供便利。
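
下面的示意(使用随机矩阵代替真实权重,维度为假设值)展示摘要中描述的那类权重矩阵分析:对"写入"头的输出-值矩阵做SVD,再度量后层"读取"头的查询矩阵与其前k个奇异方向的重叠程度。

```python
# 示意:用SVD寻找低秩"通信通道",并计算读取头query矩阵落入该子空间的能量占比。
import numpy as np

d_model, d_head, k = 64, 16, 4
rng = np.random.default_rng(0)
W_OV_writer = rng.normal(size=(d_model, d_model))  # 写入头的 value->output 复合矩阵(随机替身)
W_Q_reader = rng.normal(size=(d_model, d_head))    # 读取头的 query 投影(随机替身)

U, S, Vt = np.linalg.svd(W_OV_writer)
channel = U[:, :k]                                  # 前k个奇异方向:假想的低秩通信通道

proj = channel @ (channel.T @ W_Q_reader)           # 把query矩阵投影到该子空间
overlap = np.linalg.norm(proj) ** 2 / np.linalg.norm(W_Q_reader) ** 2
print(f"query能量落入前{k}个奇异方向的比例: {overlap:.3f}")
```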

[NLP-63] Newswire: A Large-Scale Structured Database of a Century of Historical News
[NLP-63] Newswire:百年历史新闻的大型结构化数据库

链接: https://arxiv.org/abs/2406.09490
作者: Emily Silcock,Abhishek Arora,Luca D’Amico-Wong,Melissa Dell
关键词: local newspapers drew, content largely, newswire articles, Press, local newspapers
中文关键词: 当地报纸吸引了,内容主要是新闻网文章、新闻界、当地报纸
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注: arXiv admin note: text overlap with arXiv:2306.17810 , arXiv:2308.12477

点击查看摘要

Abstract:In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers. The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model. To construct the Newswire dataset, we first recognize newspaper layouts and transcribe around 138 millions structured article texts from raw image scans. We then use a customized neural bi-encoder model to de-duplicate reproduced articles, in the presence of considerable abridgement and noise, quantifying how widely each article was reproduced. A text classifier is used to ensure that we only include newswire articles, which historically are in the public domain. The structured data that accompany the texts provide rich information about the who (disambiguated individuals), what (topics), and where (georeferencing) of the news that millions of Americans read over the course of a century. We also include Library of Congress metadata information about the newspapers that ran the articles on their front pages. The Newswire dataset is useful both for large language modeling - expanding training data beyond what is available from modern web texts - and for studying a diversity of questions in computational linguistics, social science, and the digital humanities.
摘要:在美国历史上,地方报纸的内容主要来自美联社等新闻通讯社。历史学家认为,新闻通讯社在建立国家认同感和分享对世界的理解方面发挥了关键作用,但新闻通讯社发送的内容并没有全面的档案。我们通过将定制的深度学习管道应用于来自数千家当地报纸的数百TB原始图像扫描来重建这样的档案。最终得到的数据集包含270万篇独特的公共领域美国新闻通讯社文章,撰写于1878年至1977年之间。这些文章中的位置是地理参考的,主题使用定制的神经主题分类进行标记,命名实体被识别,个人使用新的实体消歧模型对维基百科进行消歧。为了构建Newswire数据集,我们首先识别报纸版面,并从原始图像扫描中转录约1.38亿篇结构化文章文本。然后,我们使用定制的神经双编码器模型,在存在相当大的删节和噪声的情况下,对复制的文章进行去重复,量化每一篇文章被复制的范围。文本分类器用于确保我们只包括新闻通讯社的文章,这些文章历史上属于公共领域。文本附带的结构化数据提供了关于一个世纪以来数百万美国人阅读的新闻的谁(消除歧义的个人)、什么(主题)和哪里(地理参考)的丰富信息。我们还包括国会图书馆关于在头版刊登文章的报纸的元数据信息。Newswire数据集对于大型语言建模–在现代网络文本之外扩展训练数据–以及研究计算语言学、社会科学和数字人文领域的各种问题都很有用。
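
摘要中"识别同一篇通讯稿被多家报纸转载"的去重思路,可以用下面这个极简示意来理解(并非论文定制的双编码器模型;嵌入与阈值均为假设):对文章嵌入做余弦相似度贪心聚类,把高于阈值的文章归为同一转载组。

```python
# 示意:基于余弦相似度阈值的近重复文章贪心分组。
import numpy as np

def group_reprints(embeddings, threshold=0.9):
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups, assigned = [], np.zeros(len(emb), dtype=bool)
    for i in range(len(emb)):
        if assigned[i]:
            continue
        sims = emb @ emb[i]                          # 与第i篇文章的余弦相似度
        members = np.where((sims >= threshold) & ~assigned)[0]
        assigned[members] = True
        groups.append(members.tolist())              # 同一篇通讯稿的所有转载
    return groups

rng = np.random.default_rng(0)
print(group_reprints(rng.normal(size=(6, 32))))
```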

[NLP-64] Updating CLIP to Prefer Descriptions Over Captions
[NLP-64] 更新CLIP以更喜欢描述而不是标题

链接: https://arxiv.org/abs/2406.09458
作者: Amir Zur,Elisa Kreiss,Karel D’Oosterlinck,Christopher Potts,Atticus Geiger
关键词: powerful generic metric, meant to complement, meant to replace, replace an image, powerful generic
中文关键词: 强大的通用指标,旨在补充,旨在取代,取代图像,强大的通用指标
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although CLIPScore is a powerful generic metric that captures the similarity between a text and an image, it fails to distinguish between a caption that is meant to complement the information in an image and a description that is meant to replace an image entirely, e.g., for accessibility. We address this shortcoming by updating the CLIP model with the Concadia dataset to assign higher scores to descriptions than captions using parameter efficient fine-tuning and a loss objective derived from work on causal interpretability. This model correlates with the judgements of blind and low-vision people while preserving transfer capabilities and has interpretable structure that sheds light on the caption–description distinction.
摘要:尽管CLIPScore是一个强大的通用指标,可以捕捉文本和图像之间的相似性,但它未能区分旨在补充图像中信息的标题和旨在完全替换图像的描述,例如,为了可访问性。我们通过使用Concadia数据集更新CLIP模型来解决这一缺点,使用参数高效微调和从因果可解释性工作中得出的损失目标,为描述赋予比字幕更高的分数。该模型与盲人和低视力者的判断相关,同时保留了转移能力,并且具有可解释的结构,可以揭示标题与描述的区别。
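
论文采用参数高效微调以及源自因果可解释性研究的损失目标;作为概念性参考,下面给出一个"让图像-描述相似度高于图像-标题相似度"的成对hinge损失极简示意(并非论文实际使用的损失,margin为假设值):

```python
# 示意:推动描述(description)的CLIP相似度高于标题(caption)的成对hinge损失。
import torch
import torch.nn.functional as F

def description_over_caption_loss(img_emb, desc_emb, cap_emb, margin=0.1):
    img = F.normalize(img_emb, dim=-1)
    sim_desc = (img * F.normalize(desc_emb, dim=-1)).sum(-1)   # 图像-描述相似度
    sim_cap = (img * F.normalize(cap_emb, dim=-1)).sum(-1)     # 图像-标题相似度
    return F.relu(margin - (sim_desc - sim_cap)).mean()

loss = description_over_caption_loss(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```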

[NLP-65] Pandora: Towards General World Model with Natural Language Actions and Video States
[NLP-65] 潘多拉:通过自然语言动作和视频状态走向通用世界模型

链接: https://arxiv.org/abs/2406.09455
作者: Jiannan Xiang,Guangyi Liu,Yi Gu,Qiyue Gao,Yuting Ning,Yuheng Zha,Zeyu Feng,Tianhua Tao,Shibo Hao,Yemin Shi,Zhengzhong Liu,Eric P. Xing,Zhiting Hu
关键词: World, general world, general world models, World models, simulate future states
中文关键词: 世界,一般世界,一般世界模型,世界模型,模拟未来状态
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Website: this https URL

点击查看摘要

Abstract:World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provides a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper makes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training-from-scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential of building stronger general world models with larger-scale training.
摘要:世界模型模拟世界的未来状态,以响应不同的动作。它们促进了交互式内容的创建,并为扎根的、长远的推理提供了基础。当前的基础模型并不完全满足一般世界模型的能力:大型语言模型(LLM)受到对语言情态的依赖和对物理世界的有限理解的限制,而视频模型缺乏对世界模拟的交互动作控制。本文通过引入混合自回归扩散模型Pandora,向建立通用的世界模型迈出了一步。Pandora通过生成视频来模拟世界状态,并允许使用自由文本动作进行实时控制。Pandora通过大规模的预培训和教学调整,实现了领域通用性、视频一致性和可控性。至关重要的是,Pandora通过集成预先训练的LLM(7B)和预先训练的视频模型,绕过了从头开始训练的成本,只需要额外的轻量级微调。我们展示了Pandora在不同领域(室内/室外、自然/城市、人类/机器人、2D/3D等)的广泛输出。结果表明,通过更大规模的训练,建立更强大的通用世界模型的潜力很大。

[NLP-66] Advancing High Resolution Vision-Language Models in Biomedicine
[NLP-66] 推进生物医学中的高分辨率视觉语言模型

链接: https://arxiv.org/abs/2406.09454
作者: Zekai Chen,Arda Pekis,Kevin Brown
关键词: significantly advanced generative, vision-language modeling, learning has significantly, significantly advanced, advanced generative
中文关键词: 显着先进的生成性、视觉语言建模、学习显着先进的生成性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 15 pages

点击查看摘要

Abstract:Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling. Innovations like GPT-4V and open-source projects such as LLaVA have enabled robust conversational agents capable of zero-shot task completions. However, applying these technologies in the biomedical field presents unique challenges. Recent initiatives like LLaVA-Med have started to adapt instruction-tuning for biomedical contexts using large datasets such as PMC-15M. Our research offers three key contributions: (i) we present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B, (ii) we propose a novel image encoding strategy using hierarchical representations to improve fine-grained biomedical visual comprehension, and (iii) we develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question answering benchmarks, with an average performance improvement of over 10% compared to previous methods. These advancements provide more accurate and reliable tools for medical professionals, bridging gaps in current multi-modal conversational assistants and promoting further innovations in medical AI.
摘要:多模态学习极大地推动了生成式人工智能的发展,尤其是在视觉-语言建模方面。GPT-4V等创新以及LLaVA等开源项目,使得能够零样本完成任务的强大对话智能体成为可能。然而,将这些技术应用于生物医学领域面临独特的挑战。最近像LLaVA-Med这样的工作已经开始利用PMC-15M等大型数据集,针对生物医学场景进行指令微调。我们的研究有三项主要贡献:(i)我们提出了一个新的指令数据集,其中包含来自Claude3-Opus和LLaMA3 70B的医学图文对;(ii)我们提出了一种使用层次化表示的新型图像编码策略,以提升细粒度的生物医学视觉理解;(iii)我们开发了Llama3-Med模型,它在生物医学视觉问答基准上取得了最先进的零样本性能,与此前方法相比平均性能提升超过10%。这些进展为医疗专业人员提供了更准确、更可靠的工具,弥补了现有多模态对话助手的不足,并推动了医疗人工智能的进一步创新。

[NLP-67] Exploring Traffic Crash Narratives in Jordan Using Text Mining Analytics
[NLP-67] 使用文本挖掘分析探索约旦的交通事故叙述

链接: https://arxiv.org/abs/2406.09438
作者: Shadi Jaradat,Taqwa I. Alhadidi,Huthaifa I. Ashqar,Ahmed Hossain,Mohammed Elhenawy
关键词: enhance effective traffic, study explores traffic, attempt to inform, inform and enhance, enhance effective
中文关键词: 增强有效交通,研究探索交通,尝试告知、告知和增强,增强有效
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This study explores traffic crash narratives in an attempt to inform and enhance effective traffic safety policies using text-mining analytics. Text mining techniques are employed to unravel key themes and trends within the narratives, aiming to provide a deeper understanding of the factors contributing to traffic crashes. This study collected crash data from five major freeways in Jordan that cover narratives of 7,587 records from 2018-2022. An unsupervised learning method was adopted to learn the pattern from crash data. Various text mining techniques, such as topic modeling, keyword extraction, and Word Co-Occurrence Network, were also used to reveal the co-occurrence of crash patterns. Results show that text mining analytics is a promising method and underscore the multifactorial nature of traffic crashes, including intertwining human decisions and vehicular conditions. The recurrent themes across all analyses highlight the need for a balanced approach to road safety, merging both proactive and reactive measures. Emphasis on driver education and awareness around animal-related incidents is paramount.
摘要:本研究探索交通事故叙事,试图利用文本挖掘分析为有效的交通安全政策提供信息并增强其有效性。文本挖掘技术被用来揭示叙事中的关键主题和趋势,旨在提供对导致交通事故的因素的更深层次的理解。这项研究收集了约旦五条主要高速公路的撞车数据,涵盖了2018-2022年7587条记录的叙述。采用一种无监督学习方法从碰撞数据中学习模式。利用主题建模、关键词提取、词共现网络等多种文本挖掘技术,揭示碰撞模式的共现规律。结果表明,文本挖掘分析是一种很有前途的方法,并强调了交通事故的多因素性质,包括交织在一起的人的决策和车辆状况。所有分析中反复出现的主题突出表明,需要对道路安全采取平衡的办法,将主动措施和被动措施结合起来。强调司机教育和对与动物有关的事件的认识是至关重要的。

[NLP-68] Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation
[NLP-68] 面向不流利言语的包容性ASR:结合针对性微调与数据增强的级联大规模自监督学习

链接: https://arxiv.org/abs/2406.10177
作者: Dena Mujtaba,Nihar R. Mahapatra,Megan Arney,J. Scott Yaruss,Caryn Herring,Jia Bin
关键词: Automatic speech recognition, yielding inaccurate transcripts, Automatic speech, systems often falter, yielding inaccurate
中文关键词: 自动语音识别,产生不准确的成绩单,自动语音,系统通常会动摇,产生不准确的成绩单
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:Automatic speech recognition (ASR) systems often falter while processing stuttering-related disfluencies – such as involuntary blocks and word repetitions – yielding inaccurate transcripts. A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets. Therefore, we present an inclusive ASR design approach, leveraging large-scale self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation on a smaller, curated dataset of disfluent speech. Our data augmentation technique enriches training datasets with various disfluencies, enhancing ASR processing of these speech patterns. Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech. Our approach not only advances ASR inclusivity for people who stutter, but also paves the way for ASRs that can accommodate wider speech variations.
摘要:自动语音识别(ASR)系统在处理与口吃相关的不流利现象(例如不自主的阻塞和词语重复)时常常表现不佳,产生不准确的转录文本。进展的一个关键障碍是大规模、带标注的不流利语音数据集十分稀缺。因此,我们提出了一种包容性的ASR设计方法:先在标准语音上进行大规模自监督学习,再在一个较小的、精心整理的不流利语音数据集上进行针对性微调和数据增强。我们的数据增强技术在训练数据中加入多种不流利现象,增强了ASR对这些语音模式的处理能力。结果表明,即便只用相对较小的标注数据集,对wav2vec 2.0进行微调并结合数据增强,也能显著降低不流利语音的词错误率。我们的方法不仅提升了ASR对口吃人群的包容性,也为能够适应更广泛语音变化的ASR铺平了道路。
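
论文的数据增强作用于训练语音数据;下面仅在文本侧给出一个"注入不流利现象"的玩具示意,帮助理解其思路(填充词表、重复与插入概率均为假设):

```python
# 玩具示意:向转录文本随机注入填充词与词语重复,模拟不流利现象。
import random

FILLERS = ["uh", "um"]   # 假设的填充词/阻塞标记

def add_disfluencies(words, rep_p=0.15, filler_p=0.1, seed=0):
    rng = random.Random(seed)
    out = []
    for w in words:
        if rng.random() < filler_p:
            out.append(rng.choice(FILLERS))   # 插入填充词
        out.append(w)
        if rng.random() < rep_p:
            out.append(w)                     # 词语重复
    return out

print(" ".join(add_disfluencies("please call stella and ask her to bring these things".split())))
```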

[NLP-69] Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content
[NLP-69] 检测法国电视和广播内容中口语交互的话轮边界终止性

链接: https://arxiv.org/abs/2406.10073
作者: Rémi Uro,Marie Tahon,David Doukhan,Antoine Laurent,Albert Rilliard
关键词: Transition Relevance Places, Transition Relevance, Relevance Places, floor without interrupting, interrupting the current
中文关键词: 过渡相关性地点、过渡相关性、相关性地点、楼层不中断、中断当前
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD)
备注: keywords : Spoken interaction, Media, TV, Radio, Transition-Relevance Places, Turn Taking, Interruption. Accepted to InterSpeech 2024, Kos Island, Greece

点击查看摘要

Abstract:Transition Relevance Places are defined as the end of an utterance where the interlocutor may take the floor without interrupting the current speaker --i.e., a place where the turn is terminal. Analyzing turn terminality is useful to study the dynamic of turn-taking in spontaneous conversations. This paper presents an automatic classification of spoken utterances as Terminal or Non-Terminal in multi-speaker settings. We compared audio, text, and fusions of both approaches on a French corpus of TV and Radio extracts annotated with turn-terminality information at each speaker change. Our models are based on pre-trained self-supervised representations. We report results for different fusion strategies and varying context sizes. This study also questions the problem of performance variability by analyzing the differences in results for multiple training runs with random initialization. The measured accuracy would allow the use of these models for large-scale analysis of turn-taking.
摘要:转换关联处被定义为话语结束时,对话者可以在不打断当前说话人发言的情况下发言的地方,即话轮结束的地方。话轮终止性分析有助于研究自发会话中话轮转换的动态。本文提出了一种多说话人环境下语音的终端或非终端的自动分类方法。我们在法国电视和广播语料库上比较了音频、文本和两种方法的融合,提取了每个说话人变化时的话轮终止性信息。我们的模型是基于预先训练的自我监督表示。我们报告了不同融合策略和不同上下文大小的结果。本研究还通过分析随机初始化的多次训练结果的差异,对性能变异性问题提出了质疑。测量的准确度将允许使用这些模型进行大规模的话轮转换分析。

[NLP-70] Application of Natural Language Processing in Financial Risk Detection
[NLP-70] 自然语言处理在金融风险检测中的应用

链接: https://arxiv.org/abs/2406.09765
作者: Liyang Wang,Yu Cheng,Ao Xiang,Jingyu Zhang,Haowei Yang
关键词: Natural Language Processing, Language Processing, Natural Language, financial risk detection, application of Natural
中文关键词: 自然语言处理、语言处理、自然语言、金融风险检测、自然应用
类目: Risk Management (q-fin.RM); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores the application of Natural Language Processing (NLP) in financial risk detection. By constructing an NLP-based financial risk detection model, this study aims to identify and predict potential risks in financial documents and communications. First, the fundamental concepts of NLP and its theoretical foundation, including text mining methods, NLP model design principles, and machine learning algorithms, are introduced. Second, the process of text data preprocessing and feature extraction is described. Finally, the effectiveness and predictive performance of the model are validated through empirical research. The results show that the NLP-based financial risk detection model performs excellently in risk identification and prediction, providing effective risk management tools for financial institutions. This study offers valuable references for the field of financial risk management, utilizing advanced NLP techniques to improve the accuracy and efficiency of financial risk detection.
摘要:探讨了自然语言处理在金融风险检测中的应用。通过构建基于自然语言处理的金融风险检测模型,本研究旨在识别和预测金融文件和通信中的潜在风险。首先,介绍了自然语言处理的基本概念及其理论基础,包括文本挖掘方法、自然语言处理模型设计原则和机器学习算法。其次,描述了文本数据的预处理和特征提取的过程。最后,通过实证研究验证了该模型的有效性和预测性能。结果表明,基于NLP的金融风险检测模型具有较好的风险识别和预测能力,为金融机构提供了有效的风险管理工具。本研究为金融风险管理领域利用先进的自然语言处理技术提高金融风险检测的准确性和效率提供了有价值的参考。

[NLP-71] Optimizing Byte-level Representation for End-to-end ASR
[NLP-71] 优化端到端ASR的字节级表示

链接: https://arxiv.org/abs/2406.09676
作者: Roger Hsiao,Liuhui Deng,Erik McDermott,Ruchir Travadi,Xiaodan Zhuang
关键词: automatic speech recognition, byte-level representation, automatic speech, speech recognition, ASR
中文关键词: 自动语音识别、字节级表示、自动语音、语音识别、ASR
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:We propose a novel approach to optimizing a byte-level representation for end-to-end automatic speech recognition (ASR). Byte-level representation is often used by large scale multilingual ASR systems when the character set of the supported languages is large. The compactness and universality of byte-level representation allow the ASR models to use smaller output vocabularies and therefore, provide more flexibility. UTF-8 is a commonly used byte-level representation for multilingual ASR, but it is not designed to optimize machine learning tasks directly. By using auto-encoder and vector quantization, we show that we can optimize a byte-level representation for ASR and achieve better accuracy. Our proposed framework can incorporate information from different modalities, and provides an error correction mechanism. In an English/Mandarin dictation task, we show that a bilingual ASR model built with this approach can outperform UTF-8 representation by 5% relative in error rate.
摘要:我们提出了一种优化端到端自动语音识别(ASR)字节级表示的新方法。当所支持语言的字符集很大时,大规模多语言ASR系统通常使用字节级表示。字节级表示的紧凑性和通用性使ASR模型可以使用更小的输出词表,从而提供更大的灵活性。UTF-8是多语言ASR常用的字节级表示,但它并不是为直接优化机器学习任务而设计的。通过使用自动编码器和向量量化,我们表明可以为ASR优化字节级表示并获得更高的准确率。我们提出的框架可以融合来自不同模态的信息,并提供了一种纠错机制。在一个英语/普通话听写任务中,使用该方法构建的双语ASR模型在错误率上相对UTF-8表示降低了5%。
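
下面的小例子说明UTF-8字节级表示为何紧凑且通用,以及为何解码端需要某种纠错机制(这只是UTF-8基线的示意,并非论文优化后的表示):

```python
# 示意:把文本映射为0-255的字节id序列,并演示非法字节序列需要容错解码。
def to_byte_ids(text):
    return list(text.encode("utf-8"))            # 输出词表恒为256个字节

def from_byte_ids(ids):
    # ASR解码器可能产生非法的字节序列;errors="replace" 给出可读的近似文本,
    # 这也是需要在其上叠加显式纠错机制的动机之一。
    return bytes(ids).decode("utf-8", errors="replace")

ids = to_byte_ids("语音 speech")
print(len(ids), from_byte_ids(ids))               # 中文每字占3个字节
print(from_byte_ids(ids[:-1] + [0xE8]))           # 出错的序列仍能解码出占位符
```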

计算机视觉

[CV-0] VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

链接: https://arxiv.org/abs/2406.10228
作者: Chenyu Zhou,Mengdan Zhang,Peixian Chen,Chaoyou Fu,Yunhang Shen,Xiawu Zheng,Xing Sun,Rongrong Ji
关键词: Multi-modal Large Models, Multi-modal Large, progress of Multi-modal, tackle tasks blending, tasks blending vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT4V, only achieved modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an 85.8% accuracy rate in image association and a 0.508 Rouge score. These results validate the effectiveness of our dataset in improving MLLMs capabilities for nuanced image-text comprehension.

[CV-1] VideoGUI: A Benchmark for GUI Automation from Instructional Videos

链接: https://arxiv.org/abs/2406.10227
作者: Kevin Qinghong Lin,Linjie Li,Difei Gao,Qinchen WU,Mingyi Yan,Zhengyuan Yang,Lijuan Wang,Mike Zheng Shou
关键词: Graphical User Interface, Graphical User, User Interface, automation holds significant, holds significant promise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 16 tables, 17 figures

点击查看摘要

Abstract:Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as “Insert a new slide.” In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks, especially for high-level planning.

[CV-2] SatDiffMoE: A Mixture of Estimation Method for Satellite Image Super-resolution with Latent Diffusion Models

链接: https://arxiv.org/abs/2406.10225
作者: Zhaoxu Luo,Bowen Song,Liyue Shen
关键词: satellite imaging systems, acquisition frequency, generally a trade-off, trade-off between spatial, onboard sensors
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:During the acquisition of satellite images, there is generally a trade-off between spatial resolution and temporal resolution (acquisition frequency) due to the onboard sensors of satellite imaging systems. High-resolution satellite images are very important for land crop monitoring, urban planning, wildfire management and a variety of applications. It is a significant yet challenging task to achieve high spatial-temporal resolution in satellite imaging. With the advent of diffusion models, we can now learn strong generative priors to generate realistic satellite images with high resolution, which can be utilized to promote the super-resolution task as well. In this work, we propose a novel diffusion-based fusion algorithm called SatDiffMoE that can take an arbitrary number of sequential low-resolution satellite images at the same location as inputs, and fuse them into one high-resolution reconstructed image with more fine details, by leveraging and fusing the complementary information from different time points. Our algorithm is highly flexible and allows training and inference on arbitrary number of low-resolution images. Experimental results show that our proposed SatDiffMoE method not only achieves superior performance for the satellite image super-resolution tasks on a variety of datasets, but also gets an improved computational efficiency with reduced model parameters, compared with previous methods.

[CV-3] EFM3D: A Benchmark for Measuring Progress Towards 3D Egocentric Foundation Models

链接: https://arxiv.org/abs/2406.10224
作者: Julian Straub,Daniel DeTone,Tianwei Shen,Nan Yang,Chris Sweeney,Richard Newcombe
关键词: wearable computers enables, egocentric sensor data, advent of wearable, wearable computers, computers enables
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The advent of wearable computers enables a new source of context for AI that is embedded in egocentric sensor data. This new egocentric data comes equipped with fine-grained 3D location information and thus presents the opportunity for a novel class of spatial foundation models that are rooted in 3D space. To measure progress on what we term Egocentric Foundation Models (EFMs) we establish EFM3D, a benchmark with two core 3D egocentric perception tasks. EFM3D is the first benchmark for 3D object detection and surface regression on high quality annotated egocentric data of Project Aria. We propose Egocentric Voxel Lifting (EVL), a baseline for 3D EFMs. EVL leverages all available egocentric modalities and inherits foundational capabilities from 2D foundation models. This model, trained on a large simulated dataset, outperforms existing methods on the EFM3D benchmark.

[CV-4] Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding

链接: https://arxiv.org/abs/2406.10221
作者: Ridouane Ghermi,Xi Wang,Vicky Kalogeiton,Ivan Laptev
关键词: Recent advances, propelled video understanding, advances in vision-language, Recent, video understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos and frequently encounter data leakage given the use of movie forums and other resources in LLM training. To address the above limitations, we propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies, a wide variety of genres and minimal data leakage issues. SFD offers long-term story-oriented video tasks in the form of multiple-choice and open-ended question answering. Our extensive experiments emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find strong signals in movie transcripts leading to the on-par performance of people and LLMs. We also show significantly lower performance of current models compared to people when using vision data alone.

[CV-5] PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting

链接: https://arxiv.org/abs/2406.10219
作者: Alex Hanson,Allen Tu,Vasu Singla,Mayuka Jayawardhana,Matthias Zwicker,Tom Goldstein
关键词: Recent advancements, high reconstruction accuracy, enabled real-time rendering, synthesis have enabled, enabled real-time
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Recent advancements in novel view synthesis have enabled real-time rendering speeds and high reconstruction accuracy. 3D Gaussian Splatting (3D-GS), a foundational point-based parametric 3D scene representation, models scenes as large sets of 3D Gaussians. Complex scenes can comprise of millions of Gaussians, amounting to large storage and memory requirements that limit the viability of 3D-GS on devices with limited resources. Current techniques for compressing these pretrained models by pruning Gaussians rely on combining heuristics to determine which ones to remove. In this paper, we propose a principled spatial sensitivity pruning score that outperforms these approaches. It is computed as a second-order approximation of the reconstruction error on the training views with respect to the spatial parameters of each Gaussian. Additionally, we propose a multi-round prune-refine pipeline that can be applied to any pretrained 3D-GS model without changing the training pipeline. After pruning 88.44% of the Gaussians, we observe that our PUP 3D-GS pipeline increases the average rendering speed of 3D-GS by 2.65× while retaining more salient foreground information and achieving higher image quality metrics than previous pruning techniques on scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets.
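
摘要中的"空间敏感度剪枝得分"可以用下面这个高度简化的示意来理解(并非论文的实现;这里用梯度平方的对角Gauss-Newton近似代替完整的二阶展开,render_fn 等均为假设的接口):

```python
# 示意:按重建误差对每个高斯空间参数的敏感度打分,并裁掉得分最低的大部分高斯。
import torch

def sensitivity_scores(render_fn, params, views, targets):
    """params: (N, P) N个高斯的空间参数(假设的布局);views/targets: 训练视角与对应图像。"""
    scores = torch.zeros(params.shape[0])
    for view, target in zip(views, targets):
        p = params.detach().requires_grad_(True)
        loss = torch.nn.functional.mse_loss(render_fn(p, view), target)
        (grad,) = torch.autograd.grad(loss, p)
        scores += (grad ** 2).sum(dim=1)          # 对角Gauss-Newton/Fisher式近似
    return scores

def prune_gaussians(params, scores, keep_ratio=0.12):
    keep = scores.argsort(descending=True)[: int(keep_ratio * len(scores))]
    return params[keep]                            # 只保留敏感度最高的一小部分高斯
```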

[CV-6] NeST: Neural Stress Tensor Tomography by leveraging 3D Photoelasticity

链接: https://arxiv.org/abs/2406.10212
作者: Akshat Dave,Tianyi Zhang,Aaron Young,Ramesh Raskar,Wolfgang Heidrich,Ashok Veeraraghavan
关键词: Photoelasticity enables full-field, enables full-field stress, stress-induced birefringence, enables full-field, Photoelasticity enables
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project webpage: this https URL

点击查看摘要

Abstract:Photoelasticity enables full-field stress analysis in transparent objects through stress-induced birefringence. Existing techniques are limited to 2D slices and require destructively slicing the object. Recovering the internal 3D stress distribution of the entire object is challenging as it involves solving a tensor tomography problem and handling phase wrapping ambiguities. We introduce NeST, an analysis-by-synthesis approach for reconstructing 3D stress tensor fields as neural implicit representations from polarization measurements. Our key insight is to jointly handle phase unwrapping and tensor tomography using a differentiable forward model based on Jones calculus. Our non-linear model faithfully matches real captures, unlike prior linear approximations. We develop an experimental multi-axis polariscope setup to capture 3D photoelasticity and experimentally demonstrate that NeST reconstructs the internal stress distribution for objects with varying shape and force conditions. Additionally, we showcase novel applications in stress analysis, such as visualizing photoelastic fringes by virtually slicing the object and viewing photoelastic fringes from unseen viewpoints. NeST paves the way for scalable non-destructive 3D photoelastic analysis.

[CV-7] DiffusionBlend: Learning 3D Image Prior through Position-aware Diffusion Score Blending for 3D Computed Tomography Reconstruction

链接: https://arxiv.org/abs/2406.10211
作者: Bowen Song,Jason Hu,Zhaoxu Luo,Jeffrey A. Fessler,Liyue Shen
关键词: Computed Tomography, face significant challenges, models face significant, Diffusion models face, face significant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models face significant challenges when employed for large-scale medical image reconstruction in real practice such as 3D Computed Tomography (CT). Due to the demanding memory, time, and data requirements, it is difficult to train a diffusion model directly on the entire volume of high-dimensional data to obtain an efficient 3D diffusion prior. Existing works utilizing diffusion priors on single 2D image slice with hand-crafted cross-slice regularization would sacrifice the z-axis consistency, which results in severe artifacts along the z-axis. In this work, we propose a novel framework that enables learning the 3D image prior through position-aware 3D-patch diffusion score blending for reconstructing large-scale 3D medical images. To the best of our knowledge, we are the first to utilize a 3D-patch diffusion prior for 3D medical image reconstruction. Extensive experiments on sparse view and limited angle CT reconstruction show that our DiffusionBlend method significantly outperforms previous methods and achieves state-of-the-art performance on real-world CT reconstruction problems with high-dimensional 3D image (i.e., 256×256×500). Our algorithm also comes with better or comparable computational efficiency than previous state-of-the-art methods.

[CV-8] Make It Count: Text-to-Image Generation with an Accurate Number of Objects

链接: https://arxiv.org/abs/2406.10210
作者: Lital Binyamin,Yoad Tewel,Hilit Segev,Eran Hirsch,Royi Rassin,Gal Chechik
关键词: controlling the number, surprisingly hard, unprecedented success, number of depicted, text is surprisingly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page is at this https URL

点击查看摘要

Abstract:Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This is important for various applications from technical documents, to children’s books to illustrating cooking recipes. Generating object-correct counts is fundamentally challenging because the generative model needs to keep a sense of separate identity for every instance of the object, even if several objects look identical or overlap, and then carry out a global computation implicitly during generation. It is still unknown if such representations exist. To address count-correct generation, we first identify features within the diffusion model that can carry the object identity information. We then use them to separate and count instances of objects during the denoising process and detect over-generation and under-generation. We fix the latter by training a model that predicts both the shape and location of a missing object, based on the layout of existing ones, and show how it can be used to guide denoising with correct object count. Our approach, CountGen, does not depend on external source to determine object layout, but rather uses the prior from the diffusion model itself, creating prompt-dependent and seed-dependent layouts. Evaluated on two benchmark datasets, we find that CountGen strongly outperforms the count-accuracy of existing baselines.

[CV-9] Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

链接: https://arxiv.org/abs/2406.10208
作者: Zeyu Liu,Weicong Liang,Yiming Zhao,Bohan Chen,Ji Li,Yuhui Yuan
关键词: achieved highly accurate, visual text rendering, text rendering performance, achieved highly, graphic design images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance the visual aesthetic quality. With the combination of these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages. We perceive our work as a significant advancement, considering that the latest DALL-E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.

[CV-10] SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation

链接: https://arxiv.org/abs/2406.10200
作者: Ziang Xu,Jens Rittscher,Sharib Ali
关键词: early cancer indicators, cancer indicators, removal is critical, early cancer, assessing occurrences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 12 pages

点击查看摘要

Abstract:Polyps are early cancer indicators, so assessing occurrences of polyps and their removal is critical. They are observed through a colonoscopy screening procedure that generates a stream of video frames. Segmenting polyps in their natural video screening procedure has several challenges, such as the co-existence of imaging artefacts, motion blur, and floating debris. Most existing polyp segmentation algorithms are developed on curated still image datasets that do not represent real-world colonoscopy. Their performance often degrades on video data. We propose a video polyp segmentation method that performs self-supervised learning as an auxiliary task and a spatial-temporal self-attention mechanism for improved representation learning. Our end-to-end configuration and joint optimisation of losses enable the network to learn more discriminative contextual features in videos. Our experimental results demonstrate an improvement with respect to several state-of-the-art (SOTA) methods. Our ablation study also confirms that the choice of the proposed joint end-to-end training improves network accuracy by over 3% and nearly 10% on both the Dice similarity coefficient and intersection-over-union compared to the recently proposed method PNS+ and Polyp-PVT, respectively. Results on previously unseen video data indicate that the proposed method generalises.

[CV-11] Crafting Parts for Expressive Object Composition

链接: https://arxiv.org/abs/2406.10197
作者: Harsh Rangwani,Aishwarya Agarwal,Kuldeep Kulkarni,R. Venkatesh Babu,Srikrishna Karanam
关键词: extensive knowledge bases, Stable Diffusion, large generative models, large generative, tasks due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project Page Will Be Here: this https URL

点击查看摘要

Abstract:Text-to-image generation from large generative models like Stable Diffusion, DALLE-2, etc., have become a common base for various tasks due to their superior quality and extensive knowledge bases. As image composition and generation are creative processes the artists need control over various parts of the images being generated. We find that just adding details about parts in the base text prompt either leads to an entirely different image (e.g., missing/incorrect identity) or the extra part details simply being ignored. To mitigate these issues, we introduce PartCraft, which enables image generation based on fine-grained part-level details specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartCraft first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right object region. After obtaining part masks, we run a localized diffusion process in each of the part regions based on fine-grained part descriptions and combine them to produce the final image. All the stages of PartCraft are based on repurposing a pre-trained diffusion model, which enables it to generalize across various domains without training. We demonstrate the effectiveness of part-level control provided by PartCraft qualitatively through visual examples and quantitatively in comparison to the contemporary baselines.

[CV-12] Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

链接: https://arxiv.org/abs/2406.10185
作者: Jiawei Chen,Dingkang Yang,Tong Wu,Yue Jiang,Xiaolu Hou,Mingcheng Li,Shunli Wang,Dongling Xiao,Ke Li,Lihua Zhang
关键词: Large Vision Language, Vision Language Models, Large Language Models, Vision Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts where the margin for error is minimal. However, currently, there are no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs’ hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.

[CV-13] MeshPose: Unifying DensePose and 3D Body Mesh reconstruction

链接: https://arxiv.org/abs/2406.10180
作者: Eric-Tuan Lê,Antonis Kakolyris,Petros Koutras,Himmy Tam,Efstratios Skordos,George Papandreou,Rıza Alp Güler,Iasonas Kokkinos
关键词: Human Mesh Reconstruction, DensePose localization metrics, reprojection error, Mesh Reconstruction, localization metrics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

点击查看摘要

Abstract:DensePose provides a pixel-accurate association of images with 3D mesh coordinates, but does not provide a 3D mesh, while Human Mesh Reconstruction (HMR) systems have high 2D reprojection error, as measured by DensePose localization metrics. In this work we introduce MeshPose to jointly tackle DensePose and HMR. For this we first introduce new losses that allow us to use weak DensePose supervision to accurately localize in 2D a subset of the mesh vertices (‘VertexPose’). We then lift these vertices to 3D, yielding a low-poly body mesh (‘MeshPose’). Our system is trained in an end-to-end manner and is the first HMR method to attain competitive DensePose accuracy, while also being lightweight and amenable to efficient inference, making it suitable for real-time AR applications.

[CV-14] Enhancing Incomplete Multi-modal Brain Tumor Segmentation with Intra-modal Asymmetry and Inter-modal Dependency

链接: https://arxiv.org/abs/2406.10175
作者: Weide Liu,Jingwen Hou,Xiaoyang Zhong,Huijing Zhan,Jun Cheng,Yuming Fang,Guanghui Yue
关键词: Deep learning-based brain, Deep learning-based, multi-modal MRI images, recent years, incomplete MRI modalities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning-based brain tumor segmentation (BTS) models for multi-modal MRI images have seen significant advancements in recent years. However, a common problem in practice is the unavailability of some modalities due to varying scanning protocols and patient conditions, making segmentation from incomplete MRI modalities a challenging issue. Previous methods have attempted to address this by fusing accessible multi-modal features, leveraging attention mechanisms, and synthesizing missing modalities using generative models. However, these methods ignore the intrinsic problems of medical image segmentation, such as the limited availability of training samples, particularly for cases with tumors. Furthermore, these methods require training and deploying a specific model for each subset of missing modalities. To address these issues, we propose a novel approach that enhances the BTS model from two perspectives. Firstly, we introduce a pre-training stage that generates a diverse pre-training dataset covering a wide range of different combinations of tumor shapes and brain anatomy. Secondly, we propose a post-training stage that enables the model to reconstruct missing modalities in the prediction results when only partial modalities are available. To achieve the pre-training stage, we conceptually decouple the MRI image into two parts: 'anatomy' and 'tumor'. We pre-train the BTS model using synthesized data generated from the anatomy and tumor parts across different training samples. … Extensive experiments demonstrate that our proposed method significantly improves the performance over the baseline and achieves new state-of-the-art results on three brain tumor segmentation datasets: BRATS2020, BRATS2018, and BRATS2015.

[CV-15] 4DRecons: 4D Neural Implicit Deformable Objects Reconstruction from a single RGB-D Camera with Geometrical and Topological Regularizations

链接: https://arxiv.org/abs/2406.10167
作者: Xiaoyan Cong,Haitao Yang,Liyan Chen,Kaifeng Zhang,Li Yi,Chandrajit Bajaj,Qixing Huang
关键词: single camera RGB-D, camera RGB-D sequence, implicit surface, single camera, camera RGB-D
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach 4DRecons that takes a single camera RGB-D sequence of a dynamic subject as input and outputs a complete textured deforming 3D model over time. 4DRecons encodes the output as a 4D neural implicit surface and presents an optimization procedure that combines a data term and two regularization terms. The data term fits the 4D implicit surface to the input partial observations. We address fundamental challenges in fitting a complete implicit surface to partial observations. The first regularization term enforces that the deformation among adjacent frames is as rigid as possible (ARAP). To this end, we introduce a novel approach to compute correspondences between adjacent textured implicit surfaces, which are used to define the ARAP regularization term. The second regularization term enforces that the topology of the underlying object remains fixed over time. This regularization is critical for avoiding self-intersections that are typical in implicit-based reconstructions. We have evaluated the performance of 4DRecons on a variety of datasets. Experimental results show that 4DRecons can handle large deformations and complex inter-part interactions and outperform state-of-the-art approaches considerably.

[CV-16] CarLLaVA: Vision language models for camera-only closed-loop driving

链接: https://arxiv.org/abs/2406.10165
作者: Katrin Renz,Long Chen,Ana-Maria Marcu,Jan Hünermann,Benoit Hanotte,Alice Karnsund,Jamie Shotton,Elahe Arani,Oleg Sinavski
关键词: Vision Language Model, Autonomous Driving Challenge, CARLA Autonomous Driving, Language Model, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Outstanding Champion Innovation Award @ CARLA Autonomous Driving Challenge 2024; Project video: this https URL

点击查看摘要

Abstract:In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0 outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
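
The semi-disentangled output described above (a path for lateral control plus waypoints for longitudinal control) can be illustrated with a small PyTorch sketch. The feature dimension, head sizes, and point counts below are made-up placeholders, not CarLLaVA's actual configuration; in the real system the input features would come from the VLM backbone.

```python
import torch
import torch.nn as nn

class SemiDisentangledDrivingHead(nn.Module):
    """Toy sketch: separate path and waypoint heads on a shared feature.

    Dimensions (feat_dim, n_path_points, n_waypoints) are illustrative
    assumptions only; they are not taken from the CarLLaVA paper.
    """
    def __init__(self, feat_dim=512, n_path_points=20, n_waypoints=8):
        super().__init__()
        # Path head: dense geometric points, used mainly for lateral control.
        self.path_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_path_points * 2),
        )
        # Waypoint head: time-indexed points, used mainly for longitudinal control.
        self.waypoint_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_waypoints * 2),
        )
        self.n_path_points = n_path_points
        self.n_waypoints = n_waypoints

    def forward(self, feat):
        path = self.path_head(feat).view(-1, self.n_path_points, 2)
        waypoints = self.waypoint_head(feat).view(-1, self.n_waypoints, 2)
        return path, waypoints

# Usage with dummy features; real features would come from the VLM backbone.
head = SemiDisentangledDrivingHead()
path, waypoints = head(torch.randn(4, 512))
print(path.shape, waypoints.shape)  # torch.Size([4, 20, 2]) torch.Size([4, 8, 2])
```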

[CV-17] MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

链接: https://arxiv.org/abs/2406.10163
作者: Yiwen Chen,Tong He,Di Huang,Weicai Ye,Sijin Chen,Jiaxiang Tang,Xin Chen,Zhongang Cai,Lei Yang,Gang Yu,Guosheng Lin,Chi Zhang
关键词: manually crafted assets, Recently, manually crafted, current mesh extraction, mesh extraction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Recently, 3D assets created via reconstruction and generation have matched the quality of manually crafted assets, highlighting their potential for replacement. However, this potential is largely unrealized because these assets always need to be converted to meshes for 3D industry applications, and the meshes produced by current mesh extraction methods are significantly inferior to Artist-Created Meshes (AMs), i.e., meshes created by human artists. Specifically, current mesh extraction methods rely on dense faces and ignore geometric features, leading to inefficiencies, complicated post-processing, and lower representation quality. To address these issues, we introduce MeshAnything, a model that treats mesh extraction as a generation problem, producing AMs aligned with specified shapes. By converting 3D assets in any 3D representation into AMs, MeshAnything can be integrated with various 3D asset production methods, thereby enhancing their application across the 3D industry. The architecture of MeshAnything comprises a VQ-VAE and a shape-conditioned decoder-only transformer. We first learn a mesh vocabulary using the VQ-VAE, then train the shape-conditioned decoder-only transformer on this vocabulary for shape-conditioned autoregressive mesh generation. Our extensive experiments show that our method generates AMs with hundreds of times fewer faces, significantly improving storage, rendering, and simulation efficiencies, while achieving precision comparable to previous methods.

[CV-18] YOLOv1 to YOLOv10: A comprehensive review of YOLO variants and their application in the agricultural domain

链接: https://arxiv.org/abs/2406.10139
作者: Mujadded Al Rabbani Alif,Muhammad Hussain
关键词: YOLO variants, investigates the transformative, YOLO incremental advancements, YOLO, sustainable agricultural practices
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 31 pages

点击查看摘要

Abstract:This survey investigates the transformative potential of various YOLO variants, from YOLOv1 to the state-of-the-art YOLOv10, in the context of agricultural advancements. The primary objective is to elucidate how these cutting-edge object detection models can re-energise and optimize diverse aspects of agriculture, ranging from crop monitoring to livestock management. It aims to achieve key objectives, including the identification of contemporary challenges in agriculture, a detailed assessment of YOLO’s incremental advancements, and an exploration of its specific applications in agriculture. This is one of the first surveys to include the latest YOLOv10, offering a fresh perspective on its implications for precision farming and sustainable agricultural practices in the era of Artificial Intelligence and automation. Further, the survey undertakes a critical analysis of YOLO’s performance, synthesizes existing research, and projects future trends. By scrutinizing the unique capabilities packed in YOLO variants and their real-world applications, this survey provides valuable insights into the evolving relationship between YOLO variants and agriculture. The findings contribute towards a nuanced understanding of the potential for precision farming and sustainable agricultural practices, marking a significant step forward in the integration of advanced object detection technologies within the agricultural sector.

[CV-19] SmartRSD: An Intelligent Multimodal Approach to Real-Time Road Surface Detection for Safe Driving

链接: https://arxiv.org/abs/2406.10128
作者: Adnan Md Tayeb,Mst Ayesha Khatun,Mohtasin Golam,Md Facklasur Rahaman,Ali Aouto,Oroceo Paul Angelo,Minseon Lee,Dong-Seong Kim,Jae-Min Lee,Jung-Hyeon Kim
关键词: traction control techniques, specific traction control, Precise and prompt, road surface conditions, conditions enables vehicles
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages

点击查看摘要

Abstract:Precise and prompt identification of road surface conditions enables vehicles to adjust their actions, like changing speed or using specific traction control techniques, to lower the chance of accidents and potential danger to drivers and pedestrians. However, most of the existing methods for detecting road surfaces solely rely on visual data, which may be insufficient in certain situations, such as when the roads are covered by debris, in low light conditions, or in the presence of fog. Therefore, we introduce a multimodal approach for the automated detection of road surface conditions by integrating audio and images. The robustness of the proposed method is tested on a diverse dataset collected under various environmental conditions and road surface types. Through extensive evaluation, we demonstrate the effectiveness and reliability of our multimodal approach in accurately identifying road surface conditions in real-time scenarios. Our findings highlight the potential of integrating auditory and visual cues for enhancing road safety and minimizing accident risks.

[CV-20] Training-free Camera Control for Video Generation

链接: https://arxiv.org/abs/2406.10126
作者: Chen Hou,Guoqiang Wei,Yan Zeng,Zhibo Chen
关键词: video diffusion models, solution to offer, camera, diffusion models, video diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a training-free and robust solution to offer camera movement control for off-the-shelf video diffusion models. Unlike previous work, our method does not require any supervised finetuning on camera-annotated datasets or self-supervised training via data augmentation. Instead, it can be plugged and played with most pretrained video diffusion models and generate camera controllable videos with a single image or text prompt as input. The inspiration of our work comes from the layout prior that intermediate latents hold towards generated results, thus rearranging noisy pixels in them will make output content reallocated as well. As camera move could also be seen as a kind of pixel rearrangement caused by perspective change, videos could be reorganized following specific camera motion if their noisy latents change accordingly. Established on this, we propose our method CamTrol, which enables robust camera control for video diffusion models. It is achieved by a two-stage process. First, we model image layout rearrangement through explicit camera movement in 3D point cloud space. Second, we generate videos with camera motion using layout prior of noisy latents formed by a series of rearranged images. Extensive experiments have demonstrated the robustness our method holds in controlling camera motion of generated videos. Furthermore, we show that our method can produce impressive results in generating 3D rotation videos with dynamic content. Project page at this https URL.

[CV-21] MapVision: CVPR 2024 Autonomous Grand Challenge Mapless Driving Tech Report

链接: https://arxiv.org/abs/2406.10125
作者: Zhongyu Yang,Mai Liu,Jinluo Xie,Yueming Zhang,Chen Shen,Wei Shao,Jichao Jiao,Tengfei Xing,Runbo Hu,Pengfei Xu
关键词: active scene understanding, Autonomous driving, Bird Eye View, driving without high-definition, level of active
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autonomous driving without high-definition (HD) maps demands a higher level of active scene understanding. In this competition, the organizers provided the multi-perspective camera images and standard-definition (SD) maps to explore the boundaries of scene reasoning capabilities. We found that most existing algorithms construct Bird’s Eye View (BEV) features from these multi-perspective images and use multi-task heads to delineate road centerlines, boundary lines, pedestrian crossings, and other areas. However, these algorithms perform poorly at the far end of roads and struggle when the primary subject in the image is occluded. Therefore, in this competition, we not only used multi-perspective images as input but also incorporated SD maps to address this issue. We employed map encoder pre-training to enhance the network’s geometric encoding capabilities and utilized YOLOX to improve traffic element detection precision. Additionally, for area detection, we innovatively introduced LDTR and auxiliary tasks to achieve higher precision. As a result, our final OLUS score is 0.58.

[CV-22] Shelf-Supervised Multi-Modal Pre-Training for 3D Object Detection

链接: https://arxiv.org/abs/2406.10115
作者: Mehar Khurana,Neehar Peri,Deva Ramanan,James Hays
关键词: massive labeled datasets, massive labeled, self-supervised, data, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best-practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We do note, however, that such data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (e.g. supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and WOD, significantly improving over prior work in limited data settings.
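
To make the pseudo-labelling idea more concrete, here is a minimal NumPy sketch of one plausible lifting step: keep the LiDAR points whose image projections fall inside a 2D box produced by an off-the-shelf image detector, then fit an axis-aligned 3D box to them. This only illustrates the frustum-style idea under simplified assumptions; the paper's actual pipeline may differ substantially.

```python
import numpy as np

def lift_2d_box_to_3d(points_xyz, points_uv, box_2d):
    """Toy frustum lifting: fit an axis-aligned 3D box to LiDAR points
    whose image projections fall inside a 2D detection box.

    points_xyz : (N, 3) LiDAR points in the ego frame.
    points_uv  : (N, 2) the same points projected into the image.
    box_2d     : (u_min, v_min, u_max, v_max) from an image detector.
    Returns (center, size) of an axis-aligned 3D box, or None if empty.
    """
    u_min, v_min, u_max, v_max = box_2d
    inside = (
        (points_uv[:, 0] >= u_min) & (points_uv[:, 0] <= u_max)
        & (points_uv[:, 1] >= v_min) & (points_uv[:, 1] <= v_max)
    )
    pts = points_xyz[inside]
    if len(pts) == 0:
        return None
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return (lo + hi) / 2.0, hi - lo

# Synthetic example with random points; real inputs would come from
# calibrated LiDAR-camera pairs and an off-the-shelf 2D detector.
xyz = np.random.rand(1000, 3) * 20
uv = np.random.rand(1000, 2) * [1600, 900]
print(lift_2d_box_to_3d(xyz, uv, (400, 200, 800, 600)))
```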

[CV-23] Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations

链接: https://arxiv.org/abs/2406.10114
作者: Daan de Geus,Gijs Dubbelman
关键词: Part-aware panoptic segmentation, panoptic segmentation, Part-aware panoptic, PPS, background region
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR 2024. Project page and code: this https URL

点击查看摘要

Abstract:Part-aware panoptic segmentation (PPS) requires (a) that each foreground object and background region in an image is segmented and classified, and (b) that all parts within foreground objects are segmented, classified and linked to their parent object. Existing methods approach PPS by separately conducting object-level and part-level segmentation. However, their part-level predictions are not linked to individual parent objects. Therefore, their learning objective is not aligned with the PPS task objective, which harms the PPS performance. To solve this, and make more accurate PPS predictions, we propose Task-Aligned Part-aware Panoptic Segmentation (TAPPS). This method uses a set of shared queries to jointly predict (a) object-level segments, and (b) the part-level segments within those same objects. As a result, TAPPS learns to predict part-level segments that are linked to individual parent objects, aligning the learning objective with the task objective, and allowing TAPPS to leverage joint object-part representations. With experiments, we show that TAPPS considerably outperforms methods that predict objects and parts separately, and achieves new state-of-the-art PPS results.

[CV-24] GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors

链接: https://arxiv.org/abs/2406.10111
作者: Xiqian Yu,Hanxin Zhu,Tianyu He,Zhibo Chen
关键词: low-resolution input views, Neural Radiance Field, high-resolution Neural Radiance, challenging task due, Achieving high-resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Achieving high-resolution novel view synthesis (HRNVS) from low-resolution input views is a challenging task due to the lack of high-resolution data. Previous methods optimize high-resolution Neural Radiance Field (NeRF) from low-resolution input views but suffer from slow rendering speed. In this work, we base our method on 3D Gaussian Splatting (3DGS) due to its capability of producing high-quality images at a faster rendering speed. To alleviate the shortage of data for higher-resolution synthesis, we propose to leverage off-the-shelf 2D diffusion priors by distilling the 2D knowledge into 3D with Score Distillation Sampling (SDS). Nevertheless, applying SDS directly to Gaussian-based 3D super-resolution leads to undesirable and redundant 3D Gaussian primitives, due to the randomness brought by generative priors. To mitigate this issue, we introduce two simple yet effective techniques to reduce stochastic disturbances introduced by SDS. Specifically, we 1) shrink the range of diffusion timestep in SDS with an annealing strategy; 2) randomly discard redundant Gaussian primitives during densification. Extensive experiments have demonstrated that our proposed GaussianSR can attain high-quality results for HRNVS with only low-resolution inputs on both synthetic and real-world datasets. Project page: this https URL
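
The two stabilisation techniques named above can be sketched in a few lines: an annealed bound on the SDS timestep range and a random discard of Gaussian primitives during densification. The schedule shape, bounds, and drop probability are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def annealed_timestep_bounds(step, total_steps, t_max=980, t_min_start=20, t_min_end=500):
    """Shrink the SDS timestep sampling range as optimisation progresses.
    The linear schedule and the bounds are illustrative assumptions."""
    frac = step / max(total_steps, 1)
    t_min = int(t_min_start + frac * (t_min_end - t_min_start))
    return t_min, t_max

def sample_sds_timestep(step, total_steps):
    t_min, t_max = annealed_timestep_bounds(step, total_steps)
    return np.random.randint(t_min, t_max)

def randomly_discard(gaussian_params, drop_prob=0.1):
    """Randomly drop a fraction of Gaussian primitives during densification
    to suppress redundant primitives introduced by the generative prior."""
    keep = np.random.rand(len(gaussian_params)) >= drop_prob
    return gaussian_params[keep]

# Tiny demo with dummy primitives (rows = Gaussians, cols = parameters).
gaussians = np.random.randn(10000, 14)
print(sample_sds_timestep(step=5000, total_steps=10000))
print(randomly_discard(gaussians).shape)
```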

[CV-25] Annotation Cost-Efficient Active Learning for Deep Metric Learning Driven Remote Sensing Image Retrieval

链接: https://arxiv.org/abs/2406.10107
作者: Genc Hoxha,Gencer Sumbul,Julia Henkel,Lars Möllenbrok,Begüm Demir
关键词: image pairs, DML driven CBIR, content-based image retrieval, image, dissimilar image pairs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE Transactions on Geoscience and Remote Sensing

点击查看摘要

Abstract:Deep metric learning (DML) has shown to be very effective for content-based image retrieval (CBIR) in remote sensing (RS). Most of DML methods for CBIR rely on many annotated images to accurately learn model parameters of deep neural networks. However, gathering many image annotations is time consuming and costly. To address this, we propose an annotation cost-efficient active learning (ANNEAL) method specifically designed for DML driven CBIR in RS. ANNEAL aims to create a small but informative training set made up of similar and dissimilar image pairs to be utilized for learning a deep metric space. The informativeness of the image pairs is assessed combining uncertainty and diversity criteria. To assess the uncertainty of image pairs, we introduce two algorithms: 1) metric-guided uncertainty estimation (MGUE); and 2) binary classifier guided uncertainty estimation (BCGUE). MGUE automatically estimates a threshold value that acts as a “boundary” between similar and dissimilar image pairs based on the distances in the metric space. The closer the similarity between image pairs to the estimated threshold value the higher their uncertainty. BCGUE estimates the uncertainty of the image pairs based on the confidence of the classifier in assigning the correct similarity label. The diversity criterion is assessed through a clustering-based strategy. ANNEAL selects the most informative image pairs by combining either MGUE or BCGUE with clustering-based strategy. The selected image pairs are sent to expert annotators to be labeled as similar or dissimilar. This way of annotating images significantly reduces the annotation cost compared to the cost of annotating images with LULC labels. Experimental results carried out on two RS benchmark datasets demonstrate the effectiveness of our method. The code of the proposed method will be publicly available upon the acceptance of the paper.
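
A rough sketch of the selection logic described above: treat image pairs whose metric-space distance lies closest to the estimated similar/dissimilar threshold as the most uncertain (the MGUE idea), then enforce diversity with a clustering step. Function names, the fixed threshold, and the use of k-means are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def pair_uncertainty(pair_distances, threshold):
    """Pairs whose distance is closest to the similar/dissimilar threshold
    are treated as the most uncertain (illustrative reading of MGUE)."""
    return -np.abs(pair_distances - threshold)

def select_informative_pairs(pair_embeddings, pair_distances, threshold,
                             n_select=16, n_candidates=128):
    """Combine uncertainty with a clustering-based diversity criterion."""
    # 1) Keep the most uncertain candidate pairs.
    order = np.argsort(-pair_uncertainty(pair_distances, threshold))
    candidates = order[:n_candidates]
    # 2) Cluster their embeddings and pick one pair per cluster for diversity.
    km = KMeans(n_clusters=n_select, n_init=10).fit(pair_embeddings[candidates])
    selected = []
    for c in range(n_select):
        members = candidates[km.labels_ == c]
        if len(members):
            selected.append(members[0])
    return np.array(selected)

# Demo with synthetic pair features/distances; in the actual method the
# threshold would be estimated automatically from the metric space.
emb = np.random.randn(1000, 64)
dist = np.random.rand(1000) * 2.0
print(select_informative_pairs(emb, dist, threshold=1.0))
```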

[CV-26] SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

链接: https://arxiv.org/abs/2406.10100
作者: Junwei Luo,Zhen Pang,Yongjun Zhang,Tingzhu Wang,Linlin Wang,Bo Dang,Jiangwei Lao,Jian Wang,Jingdong Chen,Yihua Tan,Yansheng Li
关键词: Large Multi-Modal Models, Sensing Large Multi-Modal, Remote Sensing Large, remote sensing imagery, Multi-Modal Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 30 pages, 5 figures, 19 tables, dataset and code see this https URL

点击查看摘要

Abstract:Remote Sensing Large Multi-Modal Models (RSLMMs) are developing rapidly and showcase significant capabilities in remote sensing imagery (RSI) comprehension. However, due to the limitations of existing datasets, RSLMMs have shortcomings in understanding the rich semantic relations among objects in complex remote sensing scenes. To unlock RSLMMs’ complex comprehension ability, we propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples. FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty, ranging from relation reasoning to image-level scene graph generation. Based on FIT-RS, we build the FIT-RSFG benchmark. Furthermore, we establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC. Based on combined instruction data, we propose SkySenseGPT, which achieves outstanding performance on both public datasets and FIT-RSFG, surpassing existing RSLMMs. We hope the FIT-RS dataset can enhance the relation comprehension capability of RSLMMs and provide a large-scale fine-grained data source for the remote sensing community. The dataset will be available at this https URL

[CV-27] Localizing Events in Videos with Multimodal Queries

链接: https://arxiv.org/abs/2406.10079
作者: Gengyuan Zhang,Mang Ling Ada Fok,Yan Xia,Yansong Tang,Daniel Cremers,Philip Torr,Volker Tresp,Jindong Gu
关键词: digital era, demanding to process, pivotal task, dynamic and multievent, multievent nature
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages

点击查看摘要

Abstract:Video understanding is a pivotal task in the digital era, yet the dynamic and multievent nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the images’ semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.

[CV-28] D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video

链接: https://arxiv.org/abs/2406.10078
作者: Moritz Kappel,Florian Hahlbohm,Timon Scholz,Susana Castillo,Christian Theobalt,Martin Eisemann,Vladislav Golyanik,Marcus Magnor
关键词: gained increased attention, recently gained increased, spatiotemporal novel-view synthesis, deforming scenes recently, scenes recently gained
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures, 10 tables. Project page: this https URL

点击查看摘要

Abstract:Dynamic reconstruction and spatiotemporal novel-view synthesis of non-rigidly deforming scenes recently gained increased attention. While existing work achieves impressive quality and performance on multi-view or teleporting camera setups, most methods fail to efficiently and faithfully recover motion and appearance from casual monocular captures. This paper contributes to the field by introducing a new method for dynamic novel view synthesis from monocular video, such as casual smartphone captures. Our approach represents the scene as a dynamic neural point cloud, an implicit time-conditioned point distribution that encodes local geometry and appearance in separate hash-encoded neural feature grids for static and dynamic regions. By sampling a discrete point cloud from our model, we can efficiently render high-quality novel views using a fast differentiable rasterizer and neural rendering network. Similar to recent work, we leverage advances in neural scene analysis by incorporating data-driven priors like monocular depth estimation and object segmentation to resolve motion and depth ambiguities originating from the monocular captures. In addition to guiding the optimization process, we show that these priors can be exploited to explicitly initialize our scene representation to drastically improve optimization speed and final image quality. As evidenced by our experimental evaluation, our dynamic point cloud model not only enables fast optimization and real-time frame rates for interactive applications, but also achieves competitive image quality on monocular benchmark sequences. Our project page is available at this https URL.

[CV-29] DurLAR: A High-fidelity 128-channel LiDAR Dataset with Panoramic Ambient and Reflectivity Imagery for Multi-modal Autonomous Driving Applications

链接: https://arxiv.org/abs/2406.10068
作者: Li Li,Khalid N. Ismail,Hubert P. H. Shum,Toby P. Breckon
关键词: autonomous driving applications, driving applications, reflectivity imagery, sample benchmark task, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted by 3DV 2021; 13 pages, 14 figures; Dataset at this https URL

点击查看摘要

Abstract:We present DurLAR, a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near infrared) and reflectivity imagery, as well as a sample benchmark task using depth estimation for autonomous driving applications. Our driving platform is equipped with a high resolution 128 channel LiDAR, a 2MPix stereo camera, a lux meter and a GNSS/INS system. Ambient and reflectivity images are made available along with the LiDAR point clouds to facilitate multi-modal use of concurrent ambient and reflectivity scene information. Leveraging DurLAR, with a resolution exceeding that of prior benchmarks, we consider the task of monocular depth estimation and use this increased availability of higher resolution, yet sparse ground truth scene depth information to propose a novel joint supervised/self-supervised loss formulation. We compare performance over both our new DurLAR dataset, the established KITTI benchmark and the Cityscapes dataset. Our evaluation shows that our joint use of supervised and self-supervised loss terms, enabled via the superior ground truth resolution and availability within DurLAR, improves the quantitative and qualitative performance of leading contemporary monocular depth estimation approaches (RMSE=3.639, Sq Rel=0.936).
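
The joint supervised/self-supervised formulation can be illustrated with a simple combined loss: an L1 term evaluated only where sparse LiDAR ground truth exists, plus a photometric term over all pixels. The weights and the specific photometric term below are assumptions, not the loss actually proposed in the paper.

```python
import torch
import torch.nn.functional as F

def joint_depth_loss(pred_depth, lidar_depth, target_img, warped_img,
                     w_sup=1.0, w_self=0.5):
    """Illustrative joint loss for monocular depth estimation.

    pred_depth  : (B, 1, H, W) predicted depth.
    lidar_depth : (B, 1, H, W) sparse ground truth, 0 where no LiDAR return.
    target_img  : (B, 3, H, W) target frame.
    warped_img  : (B, 3, H, W) source frame warped into the target view using
                  pred_depth and the known camera motion (computed elsewhere).
    """
    # Supervised term: only where sparse LiDAR ground truth exists.
    valid = (lidar_depth > 0).float()
    sup = (valid * (pred_depth - lidar_depth).abs()).sum() / valid.sum().clamp(min=1)
    # Self-supervised term: photometric error between target and warped source.
    photo = F.l1_loss(warped_img, target_img)
    return w_sup * sup + w_self * photo

# Dummy tensors just to show shapes; real inputs come from the dataset/network.
B, H, W = 2, 64, 128
loss = joint_depth_loss(
    torch.rand(B, 1, H, W) * 80,
    torch.rand(B, 1, H, W) * 80 * (torch.rand(B, 1, H, W) > 0.9),
    torch.rand(B, 3, H, W),
    torch.rand(B, 3, H, W),
)
print(loss.item())
```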

[CV-30] First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

链接: https://arxiv.org/abs/2406.10057
作者: Enming Zhang,Ruobing Yao,Huanyong Liu,Junhui Yu,Jiale Wang
关键词: increasingly powerful, general capabilities, capabilities are increasingly, flowcharts, MLLMs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the development of multimodal large language model (MLLM) technology, the general capabilities of these models are becoming increasingly powerful. To evaluate the various abilities of MLLMs, numerous evaluation systems have emerged. However, there is still no comprehensive method to evaluate MLLMs on tasks related to flowcharts, which are very important in daily life and work. We propose the first comprehensive method, FlowCE, to assess MLLMs across various dimensions for tasks related to flowcharts. It encompasses evaluating MLLMs’ abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. However, we find that even the GPT4o model achieves only a score of 56.63. Among open-source models, Phi-3-Vision obtained the highest score of 49.97. We hope that FlowCE can contribute to future research on multimodal large language models (MLLMs) for tasks based on flowcharts. We are open-sourcing this project: this https URL

[CV-31] Comparison of fine-tuning strategies for transfer learning in medical image classification

链接: https://arxiv.org/abs/2406.10050
作者: Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
关键词: specialized medical contexts, pre-trained models, medical imaging, medical, fine-tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at Image and Vision Computing

点击查看摘要

Abstract:In the context of medical imaging and machine learning, one of the most pressing challenges is the effective adaptation of pre-trained models to specialized medical contexts. Despite the availability of advanced pre-trained models, their direct application to the highly specialized and diverse field of medical imaging often falls short due to the unique characteristics of medical data. This study provides a comprehensive analysis on the performance of various fine-tuning methods applied to pre-trained models across a spectrum of medical imaging domains, including X-ray, MRI, Histology, Dermoscopy, and Endoscopic surgery. We evaluated eight fine-tuning strategies, including standard techniques such as fine-tuning all layers or fine-tuning only the classifier layers, alongside methods such as gradually unfreezing layers, regularization based fine-tuning and adaptive learning rates. We selected three well-established CNN architectures (ResNet-50, DenseNet-121, and VGG-19) to cover a range of learning and feature extraction scenarios. Although our results indicate that the efficacy of these fine-tuning methods significantly varies depending on both the architecture and the medical imaging type, strategies such as combining Linear Probing with Full Fine-tuning resulted in notable improvements in over 50% of the evaluated cases, demonstrating general effectiveness across medical domains. Moreover, Auto-RGN, which dynamically adjusts learning rates, led to performance enhancements of up to 11% for specific modalities. Additionally, the DenseNet architecture showed more pronounced benefits from alternative fine-tuning approaches compared to traditional full fine-tuning. This work not only provides valuable insights for optimizing pre-trained models in medical image analysis but also suggests the potential for future research into more advanced architectures and fine-tuning methods.
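
The "Linear Probing followed by Full Fine-tuning" strategy highlighted above can be sketched in standard PyTorch: freeze the backbone and train only the new classifier head, then unfreeze everything and continue with a smaller learning rate. The epoch counts, learning rates, and the train_one_epoch callback are placeholders, not the settings used in the study.

```python
import torch
import torch.nn as nn
from torchvision import models

def linear_probe_then_full_finetune(train_one_epoch, num_classes=5,
                                    probe_epochs=5, full_epochs=20):
    """Two-stage transfer learning sketch (linear probing -> full fine-tuning).

    `train_one_epoch` is assumed to be a user-provided function(model, optimizer)
    that runs one epoch over the medical imaging dataset.
    """
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Stage 1: linear probing - train only the new classifier head.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    for _ in range(probe_epochs):
        train_one_epoch(model, opt)

    # Stage 2: full fine-tuning - unfreeze everything, lower learning rate.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(full_epochs):
        train_one_epoch(model, opt)
    return model
```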

[CV-32] Unobtrusive Monitoring of Physical Weakness: A Simulated Approach

链接: https://arxiv.org/abs/2406.10045
作者: Chen Long-fei,Muhammad Ahmed Raza,Craig Innes,Subramanian Ramamoorthy,Robert B. Fisher
关键词: making early detection, affect older adults’, chronic conditions affect, conditions affect older, Aging and chronic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Aging and chronic conditions affect older adults’ daily lives, making early detection of developing health issues crucial. Weakness, common in many conditions, alters physical movements and daily activities subtly. However, detecting such changes can be challenging due to their subtle and gradual nature. To address this, we employ a non-intrusive camera sensor to monitor individuals’ daily sitting and relaxing activities for signs of weakness. We simulate weakness in healthy subjects by having them perform physical exercise and observing the behavioral changes in their daily activities before and after workouts. The proposed system captures fine-grained features related to body motion, inactivity, and environmental context in real-time while prioritizing privacy. A Bayesian Network is used to model the relationships between features, activities, and health conditions. We aim to identify specific features and activities that indicate such changes and determine the most suitable time scale for observing the change. Results show 0.97 accuracy in distinguishing simulated weakness at the daily level. Fine-grained behavioral features, including non-dominant upper body motion speed and scale, and inactivity distribution, along with a 300-second window, are found most effective. However, individual-specific models are recommended as no universal set of optimal features and activities was identified across all participants.

[CV-33] ProtoS-ViT: Visual foundation models for sparse self-explainable classifications

链接: https://arxiv.org/abs/2406.10025
作者: Hugues Turbé,Mina Bjelogrlic,Gianmarco Mengaldo,Christian Lovis
关键词: build intrinsically explainable, Prototypical networks aim, intrinsically explainable models, explainable models based, summation of concepts
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prototypical networks aim to build intrinsically explainable models based on the linear summation of concepts. However, important challenges remain in the transparency, compactness, and meaningfulness of the explanations provided by these models. This work demonstrates how frozen pre-trained ViT backbones can be effectively turned into prototypical models for both general and domain-specific tasks, in our case biomedical image classifiers. By leveraging strong spatial features combined with a novel prototypical head, ProtoS-ViT surpasses existing prototypical models showing strong performance in terms of accuracy, compactness, and explainability. Model explainability is evaluated through an extensive set of quantitative and qualitative metrics which serve as a general benchmark for the development of prototypical models. Code is available at this https URL.
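
As a rough illustration of a prototypical head on frozen backbone features, the sketch below scores each patch embedding against a set of learnable prototypes by cosine similarity, max-pools each prototype's activation over the image, and combines the activations linearly into class logits. Dimensions and the pooling choice are assumptions; this is not the paper's exact head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalHead(nn.Module):
    """Toy prototypical head on top of frozen patch features.

    patch_feats: (B, N, D) tokens from a frozen ViT backbone.
    The prototype count and pooling are illustrative assumptions.
    """
    def __init__(self, feat_dim=768, n_prototypes=64, n_classes=10):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, feat_dim))
        # Linear combination of prototype evidence into class scores.
        self.classifier = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, patch_feats):
        # Cosine similarity between every patch and every prototype.
        sim = F.normalize(patch_feats, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        # Max-pool over patches: how strongly each prototype fires anywhere in the image.
        proto_act, _ = sim.max(dim=1)        # (B, n_prototypes)
        logits = self.classifier(proto_act)  # (B, n_classes)
        return logits, proto_act, sim        # sim gives patch-level explanations

head = PrototypicalHead()
logits, acts, sim = head(torch.randn(2, 196, 768))
print(logits.shape, acts.shape, sim.shape)
```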

[CV-34] Group and Shuffle: Efficient Structured Orthogonal Parametrization

链接: https://arxiv.org/abs/2406.10019
作者: Mikhail Gorbunov,Nikolay Yudin,Vera Soboleva,Aibek Alanov,Alexey Naumov,Maxim Rakhuba
关键词: increasing size, growing demand, orthogonal, efficient fine-tuning, fine-tuning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The increasing size of neural networks has led to a growing demand for methods of efficient fine-tuning. Recently, an orthogonal fine-tuning paradigm was introduced that uses orthogonal matrices for adapting the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal fine-tuning framework, improving parameter and computational efficiency. We empirically validate our method on different domains, including adapting of text-to-image diffusion models and downstream task fine-tuning in language modeling. Additionally, we adapt our construction for orthogonal convolutions and conduct experiments with 1-Lipschitz neural networks.

[CV-35] Tilt and Average: Geometric Adjustment of the Last Layer for Recalibration

链接: https://arxiv.org/abs/2406.10017
作者: Gyusang Cho,Chan-Hyun Youn
关键词: produce overconfident predictions, gained significant importance, neural networks tend, overconfident predictions, significant importance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 20 pages, 11 figures, to appear in International Conference on Machine Learning (ICML2024)

点击查看摘要

Abstract:After the revelation that neural networks tend to produce overconfident predictions, the problem of calibration, which aims to align confidence with accuracy to enhance the reliability of predictions, has gained significant importance. Several solutions based on calibration maps have been proposed to address the problem of recalibrating a trained classifier using additional datasets. In this paper, we offer an algorithm that transforms the weights of the last layer of the classifier, distinct from the calibration-map-based approach. We concentrate on the geometry of the final linear layer, specifically its angular aspect, and adjust the weights of the corresponding layer. We name the method Tilt and Average (TNA), and validate the calibration effect empirically and theoretically. Through this, we demonstrate that our approach, in addition to the existing calibration-map-based techniques, can yield improved calibration performance. Code available: this https URL.

[CV-36] Real-time accurate and open source upper-limb musculoskeletal analysis using a single RGBD camera

链接: https://arxiv.org/abs/2406.10007
作者: Amedeo Ceglia,Kael Facon,Mickaël Begon,Lama Seoud
关键词: objective task evaluation, RGBD camera, task evaluation, enhance rehabilitation, rehabilitation and provide
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Biomechanical biofeedback may enhance rehabilitation and provide clinicians with more objective task evaluation. These feedbacks often rely on expensive motion capture systems, which restricts their widespread use, leading to the development of computer vision-based methods. These methods are subject to large joint angle errors, considering the upper limb, and exclude the scapula and clavicle motion in the analysis. Our open-source approach offers a user-friendly solution for high-fidelity upper-limb kinematics using a single low-cost RGBD camera and includes semi-automatic skin marker labeling. Real-time biomechanical analysis, ranging from kinematics to muscle force estimation, was conducted on eight participants performing a hand-cycling motion to demonstrate the applicability of our approach on the upper limb. Markers were recorded by the RGBD camera and an optoelectronic camera system, considered as a reference. Muscle activity and external load were recorded using eight EMG and instrumented hand pedals, respectively. Bland-Altman analysis revealed significant agreements in the 3D markers’ positions between the two motion capture methods, with errors averaging 3.3 ± 3.9 mm. For the biomechanical analysis, the level of agreement was sensitive to whether the same marker set was used. For example, joint angle differences averaged 2.3 ± 2.8° when using the same marker set, compared to 4.5 ± 2.9° otherwise. Biofeedback from the RGBD camera was provided at 63 Hz. Our study introduces a novel method for using an RGBD camera as a low-cost motion capture solution, emphasizing its potential for accurate kinematic reconstruction and comprehensive upper-limb biomechanical studies.
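
The Bland-Altman agreement analysis mentioned above is a standard computation: the bias is the mean difference between the two measurement systems, and the 95% limits of agreement are the bias ± 1.96 standard deviations of the differences. A minimal version (with synthetic data standing in for the marker coordinates):

```python
import numpy as np

def bland_altman(measure_a, measure_b):
    """Bias and 95% limits of agreement between two measurement methods
    (e.g., RGBD-based vs optoelectronic marker positions)."""
    diff = np.asarray(measure_a) - np.asarray(measure_b)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Synthetic example: two noisy measurements of the same marker coordinate (mm).
truth = np.random.rand(200) * 100
a = truth + np.random.normal(0, 2, 200)
b = truth + np.random.normal(1, 2, 200)
bias, (lo, hi) = bland_altman(a, b)
print(f"bias={bias:.2f} mm, limits of agreement=({lo:.2f}, {hi:.2f}) mm")
```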

[CV-37] OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control

链接: https://arxiv.org/abs/2406.10000
作者: Yuzhong Huang,Zhong Li,Zhang Chen,Zhiyuan Ren,Guosheng Lin,Fred Morstatter,Yi Xu
关键词: Score Distillation Sampling, utilizing Score Distillation, Distillation Sampling, utilizing Score, Score Distillation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the evolving landscape of text-to-3D technology, Dreamfusion has showcased its proficiency by utilizing Score Distillation Sampling (SDS) to optimize implicit representations such as NeRF. This process is achieved through the distillation of pretrained large-scale text-to-image diffusion models. However, Dreamfusion encounters fidelity and efficiency constraints: it faces the multi-head Janus issue and exhibits a relatively slow optimization process. To circumvent these challenges, we introduce OrientDream, a camera orientation conditioned framework designed for efficient and multi-view consistent 3D generation from textual prompts. Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module. This feature effectively utilizes data from MVImgNet, an extensive external multi-view dataset, to refine and bolster its functionality. Subsequently, we utilize the pre-conditioned 2D images as a basis for optimizing a randomly initialized implicit representation (NeRF). This process is significantly expedited by a decoupled back-propagation technique, allowing for multiple updates of implicit parameters per optimization cycle. Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods, as quantified by comparative metrics.

[CV-38] Challenges in explaining deep learning models for data with biological variation

链接: https://arxiv.org/abs/2406.09981
作者: Lenka Tětková,Erik Schou Dreier,Robin Malm,Lars Kai Hansen
关键词: learning research progress, research progress, progress is based, based on developing, machine learning research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Much machine learning research progress is based on developing models and evaluating them on a benchmark dataset (e.g., ImageNet for images). However, applying such benchmark-successful methods to real-world data often does not work as expected. This is particularly the case for biological data where we expect variability at multiple time and spatial scales. In this work, we are using grain data and the goal is to detect diseases and damages. Pink fusarium, skinned grains, and other diseases and damages are key factors in setting the price of grains or excluding dangerous grains from food production. Apart from challenges stemming from differences of the data from the standard toy datasets, we also present challenges that need to be overcome when explaining deep learning models. For example, explainability methods have many hyperparameters that can give different results, and the ones published in the papers do not work on dissimilar images. Other challenges are more general: problems with visualization of the explanations and their comparison since the magnitudes of their values differ from method to method. An open fundamental question also is: How to evaluate explanations? It is a non-trivial task because the “ground truth” is usually missing or ill-defined. Also, human annotators may create what they think is an explanation of the task at hand, yet the machine learning model might solve it in a different and perhaps counter-intuitive way. We discuss several of these challenges and evaluate various post-hoc explainability methods on grain data. We focus on robustness, quality of explanations, and similarity to particular “ground truth” annotations made by experts. The goal is to find the methods that overall perform well and could be used in this challenging task. We hope the proposed pipeline will be used as a framework for evaluating explainability methods in specific use cases.

[CV-39] InstructRL4Pix: Training Diffusion for Image Editing by Reinforcement Learning

链接: https://arxiv.org/abs/2406.09973
作者: Tiancheng Li,Jinxiu Liu,Huajun Chen,Qi Liu
关键词: Instruction-based image editing, Instruction-based image, image editing, Guided Image Editing, Image Editing Method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Instruction-based image editing has made great progress in using natural human language to manipulate the visual content of images. However, existing models are limited by the quality of the dataset and cannot accurately localize editing regions in images with complex object relationships. In this paper, we propose Reinforcement Learning Guided Image Editing Method (InstructRL4Pix) to train a diffusion model to generate images that are guided by the attention maps of the target object. Our method maximizes the output of the reward model by calculating the distance between attention maps as a reward function and fine-tuning the diffusion model using proximal policy optimization (PPO). We evaluate our model in object insertion, removal, replacement, and transformation. Experimental results show that InstructRL4Pix breaks through the limitations of traditional datasets and uses unsupervised learning to optimize editing goals and achieve accurate image editing based on natural human commands.
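
The reward signal described above, a distance between attention maps that is maximised via PPO, can be sketched as a simple function: the closer the attention map of the edited result is to the attention map of the target object, the higher the reward. Normalising the maps and using an L1 distance are illustrative choices, not the paper's exact formulation.

```python
import torch

def attention_map_reward(edit_attn, target_attn, eps=1e-8):
    """Reward = negative distance between (normalised) attention maps.

    edit_attn, target_attn : (B, H, W) cross-attention maps for the edited
    region and the target object. Normalising to probability maps and using
    an L1 distance are illustrative assumptions, not the paper's recipe.
    """
    def normalise(a):
        a = a.flatten(1)
        return a / (a.sum(dim=1, keepdim=True) + eps)
    dist = (normalise(edit_attn) - normalise(target_attn)).abs().sum(dim=1)
    return -dist  # higher reward when the maps agree

# Dummy maps; in the real pipeline these come from the diffusion model's
# cross-attention layers, and the reward feeds a PPO update.
r = attention_map_reward(torch.rand(4, 64, 64), torch.rand(4, 64, 64))
print(r.shape, r.mean().item())
```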

[CV-40] ChartMimic: Evaluating LMMs Cross-Modal Reasoning Capability via Chart-to-Code Generation

链接: https://arxiv.org/abs/2406.09961
作者: Chufan Shi,Cheng Yang,Yaxin Liu,Bo Shui,Junjie Wang,Mohan Jing,Linran Xu,Xinyu Zhu,Siheng Li,Yuxiang Zhang,Gongye Liu,Xiaomei Nie,Deng Cai,Yujiu Yang
关键词: large multimodal models, aimed at assessing, visually-grounded code generation, assessing the visually-grounded, large multimodal
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Data and code are available at this https URL

点击查看摘要

Abstract:We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent the authentic chart use cases found in scientific papers across various domains(e.g., Physics, Computer Science, Economics, etc). These charts span 18 regular types and 4 advanced types, diversifying into 191 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs’ capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 11 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4V, Claude-3-opus only achieve an average score of 73.2 and 53.7, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.

[CV-41] BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

链接: https://arxiv.org/abs/2406.09952
作者: Imanol Miranda,Ander Salaberria,Eneko Agirre,Gorka Azkune
关键词: Existing Vision-Language Compositionality, Bidirectional Vision-Language Compositionality, correct textual description, Vision-Language Compositionality, Existing Vision-Language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained using synthetic images and texts improves the state of the art in SugarCrepe and in BiVLC for both retrieval directions. The gap to human performance in BiVLC confirms that Vision-Language Compositionality is still a challenging problem. BiVLC and code are available at this https URL.
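
The bidirectional evaluation can be made concrete with a tiny scoring helper: each BiVLC instance pairs two images (original and hard negative) with two texts (original and hard negative), and a model is scored both on picking the right text for each image and the right image for each text. A sketch over a 2×2 similarity matrix (the diagonal is assumed to be the correct pairing):

```python
import numpy as np

def bivlc_scores(sim):
    """sim is a 2x2 similarity matrix: rows = (positive image, negative image),
    cols = (positive text, negative text). The correct pairing is assumed to
    be the diagonal (image i matches text i)."""
    i2t = [int(np.argmax(sim[i]) == i) for i in range(2)]     # image-to-text
    t2i = [int(np.argmax(sim[:, j]) == j) for j in range(2)]  # text-to-image
    return np.mean(i2t), np.mean(t2i)

# Example: a model that handles image-to-text but stumbles on text-to-image.
sim = np.array([[0.90, 0.20],
                [0.95, 0.96]])
print(bivlc_scores(sim))  # (1.0, 0.5)
```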

[CV-42] SemanticSpray: A Multimodal Dataset for Autonomous Driving in Wet Surface Conditions

链接: https://arxiv.org/abs/2406.09945
作者: Aldi Piroli,Vinzenz Dallabetta,Johannes Kopp,Marc Walessa,Daniel Meissner,Klaus Dietmayer
关键词: Autonomous vehicles rely, Autonomous vehicles, navigate the environment, Autonomous, wet surface conditions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at IEEE Intelligent Vehicles Symposium (IV 2024)

点击查看摘要

Abstract:Autonomous vehicles rely on camera, LiDAR, and radar sensors to navigate the environment. Adverse weather conditions like snow, rain, and fog are known to be problematic for both camera and LiDAR-based perception systems. Currently, it is difficult to evaluate the performance of these methods due to the lack of publicly available datasets containing multimodal labeled data. To address this limitation, we propose the SemanticSpray++ dataset, which provides labels for camera, LiDAR, and radar data of highway-like scenarios in wet surface conditions. In particular, we provide 2D bounding boxes for the camera image, 3D bounding boxes for the LiDAR point cloud, and semantic labels for the radar targets. By labeling all three sensor modalities, the SemanticSpray++ dataset offers a comprehensive test bed for analyzing the performance of different perception methods when vehicles travel on wet surface conditions. Together with comprehensive label statistics, we also evaluate multiple baseline methods across different tasks and analyze their performances. The dataset will be available at this https URL .

[CV-43] ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

链接: https://arxiv.org/abs/2406.09936
作者: Narges Norouzi,Svetlana Orlova,Daan de Geus,Gijs Dubbelman
关键词: plain Vision Transformers, Vision Transformers, plain Vision, work presents Adaptive, merges similar tokens
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR 2024. Project page and code: this https URL

点击查看摘要

Abstract:This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) In the first network layer, it merges similar tokens within a small local window and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in those situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only significantly improves the throughput by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at this https URL.
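
The first stage described above, merging tokens whose cosine similarity within a small local window is high, can be illustrated with a toy function; the window size, threshold, and averaging rule here are placeholders rather than ALGM's actual settings, and the real method additionally merges similar tokens across the whole image halfway through the network.

```python
import numpy as np

def local_merge(tokens, window=4, sim_threshold=0.95):
    """Toy version of local token merging: within each window of consecutive
    tokens, average together tokens whose cosine similarity to the window's
    first token exceeds a threshold. Hyperparameters are illustrative only.

    tokens: (N, D) array of token features.
    Returns a shorter (M, D) array of (partially merged) tokens.
    """
    merged = []
    for start in range(0, len(tokens), window):
        win = tokens[start:start + window]
        anchor = win[0]
        norms = np.linalg.norm(win, axis=1) * np.linalg.norm(anchor) + 1e-8
        cos = win @ anchor / norms
        similar = cos >= sim_threshold
        # Merge similar tokens into one averaged token, keep the rest as-is.
        merged.append(win[similar].mean(axis=0))
        merged.extend(win[~similar])
    return np.stack(merged)

tokens = np.random.default_rng(0).normal(size=(16, 8)).astype(np.float32)
print(tokens.shape, "->", local_merge(tokens).shape)
```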

[CV-44] Robust compressive tracking via online weighted multiple instance learning

链接: https://arxiv.org/abs/2406.09914
作者: Sandeep Singh Sengar
关键词: motion blur, fast motion, illumination variations, low resolution, sparse representation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Developing a robust object tracker is a challenging task due to factors such as occlusion, motion blur, fast motion, illumination variations, rotation, background clutter, low resolution and deformation across the frames. In the literature, lots of good approaches based on sparse representation have already been presented to tackle the above problems. However, most of the algorithms do not focus on the learning of sparse representation. They only consider the modeling of target appearance and therefore drift away from the target with the imprecise training samples. By considering all the above factors in mind, we have proposed a visual object tracking algorithm by integrating a coarse-to-fine search strategy based on sparse representation and the weighted multiple instance learning (WMIL) algorithm. Compared with the other trackers, our approach has more information of the original signal with less complexity due to the coarse-to-fine search method, and also has weights for important samples. Thus, it can easily discriminate the background features from the foreground. Furthermore, we have also selected the samples from the un-occluded sub-regions to efficiently develop the strong classifier. As a consequence, a stable and robust object tracker is achieved to tackle all the aforementioned problems. Experimental results with quantitative as well as qualitative analysis on challenging benchmark datasets show the accuracy and efficiency of our method.

[CV-45] OpenECAD: An Efficient Visual Language Model for Computer-Aided Design

链接: https://arxiv.org/abs/2406.09913
作者: Zhe Yuan,Jianqi Shi
关键词: Computer-aided design, tools are utilized, cups to spacecraft, CAD, Computer-aided
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Computer-aided design (CAD) tools are utilized in the manufacturing industry for modeling everything from cups to spacecraft. These programs are complex to use and typically require years of training and experience to master. Structured and well-constrained 2D sketches and 3D constructions are crucial components of CAD modeling. A well-executed CAD model can be seamlessly integrated into the manufacturing process, thereby enhancing production efficiency. Deep generative models of 3D shapes and 3D object reconstruction models have garnered significant research interest. However, most of these models are represented in discrete forms. Moreover, the few models based on CAD operations often have substantial input restrictions. In this work, we fine-tuned pre-trained models to create OpenECAD (0.55B, 0.89B, and 4.2B), leveraging the visual, logical, coding, and general capabilities of visual language models. OpenECAD can process images of 3D designs as input and generate highly structured 2D sketches and 3D construction commands. These outputs can be directly used with existing CAD tools’ APIs to generate project files. To train our network, we created a new CAD dataset. This dataset is based on existing public CAD datasets, with adjustments and augmentations to meet the requirements of VLM training.

[CV-46] What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?

链接: https://arxiv.org/abs/2406.09908
作者: Weijie Tu,Weijian Deng,Liang Zheng,Tom Gedeon
关键词: work aims, aims to develop, OOD, Softmax prediction probability, OOD contexts
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: TMLR 2024 ( this https URL )

点击查看摘要

Abstract:This work aims to develop a measure that can accurately rank the performance of various classifiers when they are tested on unlabeled data from out-of-distribution (OOD) distributions. We commence by demonstrating that conventional uncertainty metrics, notably the maximum Softmax prediction probability, possess inherent utility in forecasting model generalization across certain OOD contexts. Building on this insight, we introduce a new measure called Softmax Correlation (SoftmaxCorr). It calculates the cosine similarity between a class-class correlation matrix, constructed from Softmax output vectors across an unlabeled test dataset, and a predefined reference matrix that embodies ideal class correlations. A high resemblance of predictions to the reference matrix signals that the model delivers confident and uniform predictions across all categories, reflecting minimal uncertainty and confusion. Through rigorous evaluation across a suite of datasets, including ImageNet, CIFAR-10, and WILDS, we affirm the predictive validity of SoftmaxCorr in accurately forecasting model performance within both in-distribution (ID) and OOD settings. Furthermore, we discuss the limitations of our proposed measure and suggest avenues for future research.
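
As described, SoftmaxCorr compares a class-class correlation matrix built from Softmax outputs against a reference matrix of ideal class correlations. The sketch below assumes an identity reference matrix for illustration; the paper's predefined reference may be constructed differently.

```python
import numpy as np

def softmax_corr(probs, reference=None):
    """Sketch of a SoftmaxCorr-style score.

    probs: (N, C) Softmax outputs on an unlabeled test set.
    reference: (C, C) matrix of 'ideal' class correlations; using the
    identity here is an assumption for illustration (confident, well
    separated predictions would make class correlations near-diagonal).
    """
    corr = probs.T @ probs / len(probs)          # (C, C) class-class correlation
    if reference is None:
        reference = np.eye(probs.shape[1])
    # Cosine similarity between the two matrices, treated as flat vectors.
    num = np.sum(corr * reference)
    den = np.linalg.norm(corr) * np.linalg.norm(reference) + 1e-12
    return num / den

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(round(float(softmax_corr(probs)), 4))
```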

[CV-47] Label-Efficient Semantic Segmentation of LiDAR Point Clouds in Adverse Weather Conditions

链接: https://arxiv.org/abs/2406.09906
作者: Aldi Piroli,Vinzenz Dallabetta,Johannes Kopp,Marc Walessa,Daniel Meissner,Klaus Dietmayer
关键词: introducing unwanted noise, Adverse weather, severely affect, introducing unwanted, weather
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

点击查看摘要

Abstract:Adverse weather conditions can severely affect the performance of LiDAR sensors by introducing unwanted noise in the measurements. Therefore, differentiating between noise and valid points is crucial for the reliable use of these sensors. Current approaches for detecting adverse weather points require large amounts of labeled data, which can be difficult and expensive to obtain. This paper proposes a label-efficient approach to segment LiDAR point clouds in adverse weather. We develop a framework that uses few-shot semantic segmentation to learn to segment adverse weather points from only a few labeled examples. Then, we use a semi-supervised learning approach to generate pseudo-labels for unlabelled point clouds, significantly increasing the amount of training data without requiring any additional labeling. We also integrate good weather data in our training pipeline, allowing for high performance in both good and adverse weather conditions. Results on real and synthetic datasets show that our method performs well in detecting snow, fog, and spray. Furthermore, we achieve competitive performance against fully supervised methods while using only a fraction of labeled data.

[CV-48] Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

链接: https://arxiv.org/abs/2406.09905
作者: Lingni Ma,Yuting Ye,Fangzhou Hong,Vladimir Guzov,Yifeng Jiang,Rowan Postyeni,Luis Pesqueira,Alexander Gamino,Vijay Baiyya,Hyo Jin Kim,Kevin Bailey,David Soriano Fosas,C. Karen Liu,Ziwei Liu,Jakob Engel,Renzo De Nardi,Richard Newcombe
关键词: richly annotated human, Project Aria devices, multiple multimodal egocentric, richly annotated, motion dataset collected
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body 3D motion ground truth; b) egocentric multimodal recordings from Project Aria devices with RGB, grayscale, eye-tracking cameras, IMUs, magnetometer, barometer, and microphones; and c) an additional “observer” device providing a third-person viewpoint. We compute world-aligned 6DoF transformations for all sensors, across devices and capture sessions. The dataset also provides 3D scene point clouds and calibrated gaze estimation. We derive a protocol to annotate hierarchical language descriptions of in-context human motion, from fine-grain pose narrations, to atomic actions and activity summarization. To the best of our knowledge, the Nymeria dataset is the world's largest in-the-wild collection of human motion with natural and diverse activities; the first of its kind to provide synchronized and localized multi-device multimodal egocentric data; and the world's largest dataset with motion-language descriptions. It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km. The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545. To demonstrate the potential of the dataset, we define key research tasks for egocentric body tracking, motion synthesis, and action recognition and evaluate several state-of-the-art baseline algorithms. Data and code will be open-sourced.

[CV-49] Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation

链接: https://arxiv.org/abs/2406.09896
作者: Brunó B. Englert,Fabrizio J. Piva,Tommie Kerssies,Daan de Geus,Gijs Dubbelman
关键词: Achieving robust generalization, diverse data domains, data domains remains, Unsupervised Domain Adaptation, Achieving robust
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR 2024 Workshop Proceedings for the Second Workshop on Foundation Models

点击查看摘要

Abstract:Achieving robust generalization across diverse data domains remains a significant challenge in computer vision. This challenge is important in safety-critical applications, where deep-neural-network-based systems must perform reliably under various environmental conditions not seen during training. Our study investigates whether the generalization capabilities of Vision Foundation Models (VFMs) and Unsupervised Domain Adaptation (UDA) methods for the semantic segmentation task are complementary. Results show that combining VFMs with UDA has two main benefits: (a) it allows for better UDA performance while maintaining the out-of-distribution performance of VFMs, and (b) it makes certain time-consuming UDA components redundant, thus enabling significant inference speedups. Specifically, with equivalent model sizes, the resulting VFM-UDA method achieves an 8.4× speed increase over the prior non-VFM state of the art, while also improving performance by +1.2 mIoU in the UDA setting and by +6.1 mIoU in terms of out-of-distribution generalization. Moreover, when we use a VFM with 3.6× more parameters, the VFM-UDA approach maintains a 3.3× speed-up, while improving the UDA performance by +3.1 mIoU and the out-of-distribution performance by +10.3 mIoU. These results underscore the significant benefits of combining VFMs with UDA, setting new standards and baselines for Unsupervised Domain Adaptation in semantic segmentation.

[CV-50] Rethinking the Evaluation of Out-of-Distribution Detection: A Sorites Paradox

链接: https://arxiv.org/abs/2406.09867
作者: Xingming Long,Jie Zhang,Shiguang Shan,Xilin Chen
关键词: OOD, Incremental Shift OOD, OOD detection methods, OOD detection, marginal OOD samples
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: v1

点击查看摘要

Abstract:Most existing out-of-distribution (OOD) detection benchmarks classify samples with novel labels as the OOD data. However, some marginal OOD samples actually have close semantic contents to the in-distribution (ID) sample, which makes determining the OOD sample a Sorites Paradox. In this paper, we construct a benchmark named Incremental Shift OOD (IS-OOD) to address the issue, in which we divide the test samples into subsets with different semantic and covariate shift degrees relative to the ID dataset. The data division is achieved through a shift measuring method based on our proposed Language Aligned Image feature Decomposition (LAID). Moreover, we construct a Synthetic Incremental Shift (Syn-IS) dataset that contains high-quality generated images with more diverse covariate contents to complement the IS-OOD benchmark. We evaluate current OOD detection methods on our benchmark and find several important insights: (1) The performance of most OOD detection methods significantly improves as the semantic shift increases; (2) Some methods like GradNorm may have different OOD detection mechanisms as they rely less on semantic shifts to make decisions; (3) Excessive covariate shifts in the image are also likely to be considered as OOD for some methods. Our code and data are released in this https URL.

[CV-51] LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data

链接: https://arxiv.org/abs/2406.09864
作者: Grigor Bezirganyan,Sana Sellami,Laure Berti-Équille,Sébastien Fournier
关键词: diverse information sources, integrating diverse information, Learning enhances decision-making, Multimodal Deep Learning, Deep Learning enhances
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as texts, images, audio, and videos. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We introduce LUMA, a unique benchmark dataset, featuring audio, image, and textual data from 50 classes, for learning from uncertain and multimodal data. It extends the well-known CIFAR 10/100 dataset with audio samples extracted from three audio corpora, and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to achieve and tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package including functions for generating multiple variants of the dataset, with control over the diversity of the data, the amount of noise for each modality, and the addition of out-of-distribution samples. A baseline pre-trained model is also provided alongside three uncertainty quantification methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning. This comprehensive dataset and its tools are intended to promote and support the development and benchmarking of trustworthy and robust multimodal deep learning approaches.

[CV-52] Dataset Condensation with Latent Quantile Matching

链接: https://arxiv.org/abs/2406.09860
作者: Wei Wei,Tom De Schepper,Kevin Mets
关键词: informative data records, smaller synthesized dataset, machine learning models, synthesized dataset, informative data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CVPR Workshop 2024: 1st Workshop on Dataset Distillation for Computer Vision

点击查看摘要

Abstract:Dataset condensation (DC) methods aim to learn a smaller synthesized dataset with informative data records to accelerate the training of machine learning models. Current distribution matching (DM) based DC methods learn a synthesized dataset by matching the mean of the latent embeddings between the synthetic and the real dataset. However, two distributions with the same mean can still be vastly different. In this work, we demonstrate the shortcomings of using Maximum Mean Discrepancy to match latent distributions, i.e., its weak matching power and lack of outlier regularization. To alleviate these shortcomings, we propose a new method, Latent Quantile Matching (LQM), which matches the quantiles of the latent embeddings to minimize the goodness-of-fit test statistic between two distributions. Empirical experiments on both image and graph-structured datasets show that LQM matches or outperforms the previous state of the art in distribution matching based DC. Moreover, we show that LQM improves performance in the continual graph learning (CGL) setting, where memory efficiency and privacy can be important. Our work sheds light on the application of DM based DC for CGL.
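
The contrast with mean matching can be illustrated with a toy per-dimension quantile comparison; the exact goodness-of-fit statistic optimized by LQM may differ from this squared-difference placeholder.

```python
import numpy as np

def quantile_matching_loss(real_latents, syn_latents, n_quantiles=16):
    """Per-dimension quantile matching between two latent sets (a sketch).

    real_latents: (N, D), syn_latents: (M, D).
    Returns the mean squared difference between matched empirical quantiles,
    which is zero only when the per-dimension distributions agree at those
    quantile levels (a stronger condition than matching the mean alone).
    """
    qs = np.linspace(0.0, 1.0, n_quantiles)
    q_real = np.quantile(real_latents, qs, axis=0)   # (n_quantiles, D)
    q_syn = np.quantile(syn_latents, qs, axis=0)
    return float(np.mean((q_real - q_syn) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(512, 8))
shifted = rng.normal(0.0, 2.0, size=(64, 8))   # same mean, different spread
print(quantile_matching_loss(real, real[:64]), quantile_matching_loss(real, shifted))
```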

[CV-53] Vision Language Modeling of Content Distortion and Appearance for Image Quality Assessment

链接: https://arxiv.org/abs/2406.09858
作者: Fei Zhou,Zhicong Huang,Tianhao Gu,Guoping Qiu
关键词: intertwined factors including, Image Quality Assessment, distortion characteristics, appearance properties, images semantic contents
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The visual quality of an image is confounded by a number of intertwined factors including its semantic content, distortion characteristics and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high level knowledge about all these quality-bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality-related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high level knowledge about the images' semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first-of-its-kind large image database annotated with all three categories of quality-relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE has superior performance over the state of the art, demonstrating the soundness of its design principle and the effectiveness of its implementation.

[CV-54] GradeADreamer: Enhanced Text-to-3D Generation Using Gaussian Splatting and Multi-View Diffusion

链接: https://arxiv.org/abs/2406.09850
作者: Trapoom Ukarapol,Kevin Pruvost
关键词: Multi-face Janus problem, shown promising results, Multi-face Janus, extended generation time, shown promising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:Text-to-3D generation has shown promising results, yet common challenges remain, such as the Multi-face Janus problem and extended generation times for high-quality assets. In this paper, we address these issues by introducing a novel three-stage training pipeline called GradeADreamer. This pipeline is capable of producing high-quality assets with a total generation time of under 30 minutes using only a single RTX 3090 GPU. Our proposed method employs a Multi-view Diffusion Model, MVDream, to generate Gaussian Splats as a prior, followed by refining geometry and texture using StableDiffusion. Experimental results demonstrate that our approach significantly mitigates the Multi-face Janus problem and achieves the highest average user preference ranking compared to previous state-of-the-art methods. The project code is available at this https URL.

[CV-55] Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps

链接: https://arxiv.org/abs/2406.09838
作者: Jian Chen,Peilin Zhou,Yining Hua,Dading Chong,Meng Cao,Yaowei Li,Zixuan Yuan,Bing Zhu,Junwei Liang
关键词: protect human lives, weather protect human, extreme weather protect, Extreme Weather Events, Real-time detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time detection and prediction of extreme weather protect human lives and infrastructure. Traditional methods rely on numerical threshold setting and manual interpretation of weather heatmaps with Geographic Information Systems (GIS), which can be slow and error-prone. Our research redefines Extreme Weather Events Detection (EWED) by framing it as a Visual Question Answering (VQA) problem, thereby introducing a more precise and automated solution. Leveraging Vision-Language Models (VLM) to simultaneously process visual and textual data, we offer an effective aid to enhance the analysis process of weather heatmaps. Our initial assessment of general-purpose VLMs (e.g., GPT-4-Vision) on EWED revealed poor performance, characterized by low accuracy and frequent hallucinations due to inadequate color differentiation and insufficient meteorological knowledge. To address these challenges, we introduce ClimateIQA, the first meteorological VQA dataset, which includes 8,760 wind gust heatmaps and 254,040 question-answer pairs covering four question types, both generated from the latest climate reanalysis data. We also propose Sparse Position and Outline Tracking (SPOT), an innovative technique that leverages OpenCV and K-Means clustering to capture and depict color contours in heatmaps, providing ClimateIQA with more accurate color spatial location information. Finally, we present Climate-Zoo, the first meteorological VLM collection, which adapts VLMs to meteorological applications using the ClimateIQA dataset. Experiment results demonstrate that models from Climate-Zoo substantially outperform state-of-the-art general VLMs, achieving an accuracy increase from 0% to over 90% in EWED verification. The datasets and models in this study are publicly available for future climate science research: this https URL.
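
As a loose illustration of the building blocks named for SPOT (K-Means clustering of heatmap colors and OpenCV contour tracing), here is a minimal sketch; the cluster count, the use of scikit-learn's KMeans instead of cv2.kmeans, and the BGR color space are all assumptions rather than the paper's configuration.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def color_contours(heatmap_bgr, n_clusters=5):
    """Cluster pixel colors with K-Means, then trace the contours of each
    color cluster with OpenCV. Cluster count and color space (BGR) are
    illustrative choices, not the paper's exact configuration."""
    h, w, _ = heatmap_bgr.shape
    pixels = heatmap_bgr.reshape(-1, 3).astype(np.float32)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(pixels)
    labels = labels.reshape(h, w)
    contours_per_cluster = {}
    for k in range(n_clusters):
        mask = (labels == k).astype(np.uint8) * 255
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        contours_per_cluster[k] = contours
    return contours_per_cluster

# Toy synthetic "heatmap" with two colored blobs.
img = np.zeros((64, 64, 3), np.uint8)
cv2.circle(img, (20, 20), 10, (0, 0, 255), -1)
cv2.circle(img, (44, 44), 12, (255, 0, 0), -1)
print({k: len(v) for k, v in color_contours(img, n_clusters=3).items()})
```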

[CV-56] Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

链接: https://arxiv.org/abs/2406.09829
作者: Xiangheng Shan,Dongyue Wu,Guilin Zhu,Yuanjie Shao,Nong Sang,Changxin Gao
关键词: Open-vocabulary semantic segmentation, output semantic masks, Open-vocabulary semantic, Adaptively Balanced Decoder, Semantic Structure Consistency
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR2024

点击查看摘要

Abstract:Open-vocabulary semantic segmentation is a challenging task, which requires the model to output semantic masks of an image beyond a closed-set vocabulary. Although many efforts have been made to utilize powerful CLIP models to accomplish this task, they still easily overfit to training classes due to the natural gaps in semantic information between training and new classes. To overcome this challenge, we propose a novel framework for open-vocabulary semantic segmentation called EBSeg, incorporating an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss). The AdaB Decoder is designed to generate different image embeddings for both training and new classes. Subsequently, these two types of embeddings are adaptively balanced to fully exploit their ability to recognize training classes and generalization ability for new classes. To learn a consistent semantic structure from CLIP, the SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP, thereby improving the generalization ability of our model. Furthermore, we employ a frozen SAM image encoder to complement the spatial information that CLIP features lack due to the low training image resolution and image-level supervision inherent in CLIP. Extensive experiments conducted across various benchmarks demonstrate that the proposed EBSeg outperforms the state-of-the-art methods. Our code and trained models will be available at this https URL.
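
One plausible way to write an alignment term between inter-class affinities in the image and CLIP text feature spaces, in the spirit of the SSC Loss, is sketched below; the normalization and the L1 distance are assumptions, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def ssc_loss(image_class_feats, text_class_feats):
    """Semantic-structure-consistency style loss (sketch).

    image_class_feats, text_class_feats: (C, D) per-class embeddings from the
    image branch and from CLIP's text encoder. We align the two C x C
    inter-class cosine-affinity matrices with an L1 penalty (an assumption;
    the paper defines its own formulation).
    """
    img = F.normalize(image_class_feats, dim=-1)
    txt = F.normalize(text_class_feats, dim=-1)
    affinity_img = img @ img.t()   # (C, C) inter-class affinity, image space
    affinity_txt = txt @ txt.t()   # (C, C) inter-class affinity, text space
    return (affinity_img - affinity_txt).abs().mean()

torch.manual_seed(0)
print(ssc_loss(torch.randn(20, 512), torch.randn(20, 512)).item())
```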

[CV-57] HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning

链接: https://arxiv.org/abs/2406.09827
作者: Heejun Lee,Geon Park,Youngwan Lee,Jina Kim,Wonyoung Jeong,Myeongjae Jeon,Sung Ju Hwang
关键词: multi-modal question answering, increasing sequence lengths, handling complex tasks, modern large language, question answering
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 26 pages, 15 figures

点击查看摘要

Abstract:In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism’s quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from O(T^2) to O(T log T) and the space complexity from O(T^2) to O(T). To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-k most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility to many long-context LLM applications previously infeasible.
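
To make the top-k idea concrete, the toy function below keeps only the k highest-scoring key positions per query before the softmax. Note that, unlike HiP, it still materializes the dense score matrix and is therefore O(T^2); HiP's contribution is locating those positions hierarchically in roughly O(T log T) without retraining.

```python
import torch

def topk_sparse_attention(q, k, v, top_k=16):
    """Toy top-k sparse attention (illustration only).

    Computes the full (T, T) score matrix and keeps the top-k entries per
    query, so it does not have HiP's sub-quadratic cost; it only shows the
    masking that a hierarchical search would produce.
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale             # (T, T)
    top_k = min(top_k, scores.shape[-1])
    kept = torch.topk(scores, top_k, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, kept, 0.0)                            # 0 where kept, -inf elsewhere
    attn = torch.softmax(scores + mask, dim=-1)
    return attn @ v

torch.manual_seed(0)
q = k = v = torch.randn(128, 64)
print(topk_sparse_attention(q, k, v, top_k=8).shape)
```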

[CV-58] An I2I Inpainting Approach for Efficient Channel Knowledge Map Construction

链接: https://arxiv.org/abs/2406.09822
作者: Zhenzhou Jin,Li You,Jue Wang,Xiang-Gen Xia,Xiqi Gao
关键词: emerging enabling technology, Channel knowledge map, Channel knowledge, received widespread attention, location-specific channel knowledge
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:Channel knowledge map (CKM) has received widespread attention as an emerging enabling technology for environment-aware wireless communications. It involves the construction of databases containing location-specific channel knowledge, which are then leveraged to facilitate channel state information (CSI) acquisition and transceiver design. In this context, a fundamental challenge lies in efficiently constructing the CKM based on a given wireless propagation environment. Most existing methods are based on stochastic modeling and sequence prediction, which do not fully exploit the inherent physical characteristics of the propagation environment, resulting in low accuracy and high computational complexity. To address these limitations, we propose a Laplacian pyramid (LP)-based CKM construction scheme to predict the channel knowledge at arbitrary locations in a targeted area. Specifically, we first view the channel knowledge as a 2-D image and transform the CKM construction problem into an image-to-image (I2I) inpainting task, which predicts the channel knowledge at a specific location by recovering the corresponding pixel value in the image matrix. Then, inspired by the reversible and closed-form structure of the LP, we show its natural suitability for our task in designing a fast I2I mapping network. For different frequency components of LP decomposition, we design tailored networks accordingly. Besides, to encode the global structural information of the propagation environment, we introduce self-attention and cross-covariance attention mechanisms in different layers, respectively. Finally, experimental results show that the proposed scheme outperforms the benchmark, achieving higher reconstruction accuracy while with lower computational complexity. Moreover, the proposed approach has a strong generalization ability and can be implemented in different wireless communication scenarios.
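
The reversible, closed-form structure referred to here is the standard Laplacian pyramid, in which each level stores the difference between an image and the upsampled version of its downsampled copy, so the original can be reconstructed exactly. A short OpenCV sketch of that decomposition (independent of the paper's specific network design):

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Decompose an image into a Laplacian pyramid (detail bands + residual)."""
    img = img.astype(np.float32)
    gaussians = [img]
    for _ in range(levels):
        gaussians.append(cv2.pyrDown(gaussians[-1]))
    pyramid = []
    for i in range(levels):
        up = cv2.pyrUp(gaussians[i + 1], dstsize=gaussians[i].shape[1::-1])
        pyramid.append(gaussians[i] - up)        # band-pass detail at level i
    pyramid.append(gaussians[-1])                 # low-frequency residual
    return pyramid

def reconstruct(pyramid):
    img = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        img = cv2.pyrUp(img, dstsize=detail.shape[1::-1]) + detail
    return img

x = np.random.default_rng(0).random((64, 64)).astype(np.float32)
print(np.allclose(reconstruct(laplacian_pyramid(x)), x, atol=1e-4))
```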

[CV-59] RaNeuS: Ray-adaptive Neural Surface Reconstruction

链接: https://arxiv.org/abs/2406.09801
作者: Yida Wang,David Joseph Tan,Nassir Navab,Federico Tombari
关键词: differentiable radiance field, leverage a differentiable, addition to producing, producing the standard, signed distance field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 3DV 2024, oral. In: Proceedings of the IEEE/CVF International Conference on 3D Vision (2023)

点击查看摘要

Abstract:Our objective is to leverage a differentiable radiance field, e.g., NeRF, to reconstruct detailed 3D surfaces in addition to producing the standard novel view renderings. There have been related methods that perform such tasks, usually by utilizing a signed distance field (SDF). However, the state-of-the-art approaches still fail to correctly reconstruct the small-scale details, such as the leaves, ropes, and textile surfaces. Considering that different methods formulate and optimize the projection from SDF to radiance field with a globally constant Eikonal regularization, we improve with a ray-wise weighting factor to prioritize the rendering and zero-crossing surface fitting on top of establishing a perfect SDF. We propose to adaptively adjust the regularization on the signed distance field so that unsatisfying rendering rays won’t enforce strong Eikonal regularization, which is ineffective, and to allow the gradients from regions with well-learned radiance to be effectively back-propagated to the SDF. Consequently, the two objectives are balanced in order to generate accurate and detailed surfaces. Additionally, concerning whether there is a geometric bias between the zero-crossing surface in SDF and rendering points in the radiance field, the projection becomes adjustable as well depending on different 3D locations during optimization. Our proposed RaNeuS is extensively evaluated on both synthetic and real datasets, achieving state-of-the-art results on both novel view synthesis and geometric reconstruction.

[CV-60] Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation

链接: https://arxiv.org/abs/2406.09798
作者: Zihan Wang,Xiangyang Li,Jiahao Yang,Yeqi,Shuqiang Jiang
关键词: natural language instruction, VLN, SOTA monocular VLN, language instruction, monocular robots
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to CoRL 2024. The code is available at this https URL

点击查看摘要

Abstract:Vision-and-language navigation (VLN) enables the agent to navigate to a remote location in 3D environments following the natural language instruction. In this field, the agent is usually trained and evaluated in the navigation simulators, lacking effective approaches for sim-to-real transfer. The VLN agents with only a monocular camera exhibit extremely limited performance, while the mainstream VLN models trained with panoramic observation, perform better but are difficult to deploy on most monocular robots. For this case, we propose a sim-to-real transfer approach to endow the monocular robots with panoramic traversability perception and panoramic semantic understanding, thus smoothly transferring the high-performance panoramic VLN models to the common monocular robots. In this work, the semantic traversable map is proposed to predict agent-centric navigable waypoints, and the novel view representations of these navigable waypoints are predicted through the 3D feature fields. These methods broaden the limited field of view of the monocular robots and significantly improve navigation performance in the real world. Our VLN system outperforms previous SOTA monocular VLN methods in R2R-CE and RxR-CE benchmarks within the simulation environments and is also validated in real-world environments, providing a practical and high-performance solution for real-world VLN.

[CV-61] SuperSVG: Superpixel-based Scalable Vector Graphics Synthesis

链接: https://arxiv.org/abs/2406.09794
作者: Teng Hu,Ran Yi,Baihong Qian,Jiangning Zhang,Paul L. Rosin,Yu-Kun Lai
关键词: Scalable Vector Graphics, Scalable Vector, possesses excellent scalability, Vector Graphics, scalability and editability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR 2024

点击查看摘要

Abstract:SVG (Scalable Vector Graphics) is a widely used graphics format that possesses excellent scalability and editability. Image vectorization, which aims to convert raster images to SVGs, is an important yet challenging problem in computer vision and graphics. Existing image vectorization methods either suffer from low reconstruction accuracy for complex images or require long computation time. To address this issue, we propose SuperSVG, a superpixel-based vectorization model that achieves fast and high-precision image vectorization. Specifically, we decompose the input image into superpixels to help the model focus on areas with similar colors and textures. Then, we propose a two-stage self-training framework, where a coarse-stage model is employed to reconstruct the main structure and a refinement-stage model is used for enriching the details. Moreover, we propose a novel dynamic path warping loss to help the refinement-stage model to inherit knowledge from the coarse-stage model. Extensive qualitative and quantitative experiments demonstrate the superior performance of our method in terms of reconstruction accuracy and inference time compared to state-of-the-art approaches. The code is available at this https URL.

[CV-62] A Two-Stage Masked Autoencoder Based Network for Indoor Depth Completion

链接: https://arxiv.org/abs/2406.09792
作者: Kailai Sun,Zhou Yang,Qianchuan Zhao
关键词: autonomous driving, augmented reality, robot navigation, range of applications, scene understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop

点击查看摘要

Abstract:Depth images have a wide range of applications, such as 3D reconstruction, autonomous driving, augmented reality, robot navigation, and scene understanding. Commodity-grade depth cameras struggle to sense depth on bright, glossy, transparent, and distant surfaces. Although existing depth completion methods have achieved remarkable progress, their performance is limited when applied to complex indoor scenarios. To address these problems, we propose a two-step Transformer-based network for indoor depth completion. Unlike existing depth completion approaches, we adopt a self-supervised pre-training encoder based on the masked autoencoder to learn an effective latent representation for the missing depth values; then we propose a decoder based on a token fusion mechanism to complete (i.e., reconstruct) the full depth jointly from the RGB image and the incomplete depth image. Compared to existing methods, our proposed network achieves state-of-the-art performance on the Matterport3D dataset. In addition, to validate the importance of the depth completion task, we apply our methods to indoor 3D reconstruction. The code, dataset, and demo are available at this https URL.

[CV-63] OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics

链接: https://arxiv.org/abs/2406.09788
作者: Yoni Gozlan,Antoine Falisse,Scott Uhlrich,Anthony Gatti,Michael Black,Akshay Chaudhari
关键词: Pose estimation, current pose estimation, Pose, promised to impact, impact healthcare
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pose estimation has promised to impact healthcare by enabling more practical methods to quantify nuances of human movement and biomechanics. However, despite the inherent connection between pose estimation and biomechanics, these disciplines have largely remained disparate. For example, most current pose estimation benchmarks use metrics such as Mean Per Joint Position Error, Percentage of Correct Keypoints, or mean Average Precision to assess performance, without quantifying kinematic and physiological correctness - key aspects for biomechanics. To alleviate this challenge, we develop OpenCapBench to offer an easy-to-use unified benchmark to assess common tasks in human pose estimation, evaluated under physiological constraints. OpenCapBench computes consistent kinematic metrics through joint angles provided by an open-source musculoskeletal modeling software (OpenSim). Through OpenCapBench, we demonstrate that current pose estimation models use keypoints that are too sparse for accurate biomechanics analysis. To mitigate this challenge, we introduce SynthPose, a new approach that enables finetuning of pre-trained 2D human pose models to predict an arbitrarily denser set of keypoints for accurate kinematic analysis through the use of synthetic data. Finetuning prior models on such synthetic data leads to a twofold reduction in joint angle errors. Moreover, OpenCapBench allows users to benchmark their own developed models on our clinically relevant cohort. Overall, OpenCapBench bridges the computer vision and biomechanics communities, aiming to drive simultaneous advances in both areas.

[CV-64] Unsupervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion

链接: https://arxiv.org/abs/2406.09782
作者: Runze Liu,Dongchen Zhu,Guanghui Zhang,Yue Xu,Wenjun Shi,Xiaolin Zhang,Lei Wang,Jiamao Li
关键词: received widespread attention, Unsupervised monocular depth, ground truth, monocular depth estimation, received widespread
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unsupervised monocular depth estimation has received widespread attention because of its capability to train without ground truth. In real-world scenarios, the images may be blurry or noisy due to the influence of weather conditions and inherent limitations of the camera. Therefore, it is particularly important to develop a robust depth estimation model. Benefiting from the training strategies of generative networks, generative-based methods often exhibit enhanced robustness. In light of this, we employ a well-converging diffusion model among generative networks for unsupervised monocular depth estimation. Additionally, we propose a hierarchical feature-guided denoising module. This model significantly enriches the model’s capacity for learning and interpreting depth distribution by fully leveraging image features to guide the denoising process. Furthermore, we explore the implicit depth within reprojection and design an implicit depth consistency loss. This loss function serves to enhance the performance of the model and ensure the scale consistency of depth within a video sequence. We conduct experiments on the KITTI, Make3D, and our self-collected SIMIT datasets. The results indicate that our approach stands out among generative-based models, while also showcasing remarkable robustness.

[CV-65] GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

链接: https://arxiv.org/abs/2406.09781
作者: Yiqi Wu,Xiaodan Hu,Ziming Fu,Siling Zhou,Jiangong Li
关键词: Animal, animal behavior, multimodal, crucial aspect, foundation for studying
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Animal ethology is a crucial aspect of animal research, and animal behavior labeling is the foundation for studying animal behavior. This process typically involves labeling video clips with behavioral semantic tags, a task that is complex, subjective, and multimodal. With the rapid development of multimodal large language models (LLMs), new applications have emerged for animal behavior understanding tasks in livestock scenarios. This study evaluates the visual perception capabilities of multimodal LLMs in animal activity recognition. To achieve this, we created piglet test data comprising close-up video clips of individual piglets and annotated full-shot video clips. These data were used to assess the performance of four multimodal LLMs, namely Video-LLaMA, MiniGPT4-Video, Video-Chat2, and GPT-4 omni (GPT-4o), in piglet activity understanding. Through comprehensive evaluation across five dimensions, including counting, actor referring, semantic correspondence, time perception, and robustness, we found that while current multimodal LLMs require improvement in semantic correspondence and time perception, they have initially demonstrated visual perception capabilities for animal activity recognition. Notably, GPT-4o showed outstanding performance, with Video-Chat2 and GPT-4o exhibiting significantly better semantic correspondence and time perception in close-up video clips compared to full-shot clips. The initial evaluation experiments in this study validate the potential of multimodal large language models in livestock scene video understanding and provide new directions and references for future research on animal behavior video understanding. Furthermore, by deeply exploring the influence of visual prompts on multimodal large language models, we expect to enhance the accuracy and efficiency of animal behavior recognition in livestock scenarios through human visual processing methods.

[CV-66] OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst

链接: https://arxiv.org/abs/2406.09779
作者: Jingtao Cao,Zheng Zhang,Hongru Wang,Bin Liang,Hao Wang,Kam-Fai Wong
关键词: rapidly disseminate personal, disseminate personal opinions, propagating social bias, pose significant challenges, bias and prejudice
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Memes, which rapidly disseminate personal opinions and positions across the internet, also pose significant challenges in propagating social bias and prejudice. This study presents a novel approach to detecting harmful memes, particularly within the multicultural and multilingual context of Singapore. Our methodology integrates image captioning, Optical Character Recognition (OCR), and Large Language Model (LLM) analysis to comprehensively understand and classify harmful memes. Utilizing the BLIP model for image captioning, PP-OCR and TrOCR for text recognition across multiple languages, and the Qwen LLM for nuanced language understanding, our system is capable of identifying harmful content in memes created in English, Chinese, Malay, and Tamil. To enhance the system’s performance, we fine-tuned our approach by leveraging additional data labeled using GPT-4V, aiming to distill the understanding capability of GPT-4V for harmful memes to our system. Our framework achieves top-1 at the public leaderboard of the Online Safety Prize Challenge hosted by AI Singapore, with the AUROC as 0.7749 and accuracy as 0.7087, significantly ahead of the other teams. Notably, our approach outperforms previous benchmarks, with FLAVA achieving an AUROC of 0.5695 and VisualBERT an AUROC of 0.5561.

[CV-67] A lightweight residual network for unsupervised deformable image registration

链接: https://arxiv.org/abs/2406.09774
作者: Ahsan Raza Siyal,Astrid Ellen Grams,Markus Haltmeier
关键词: Accurate volumetric image, Accurate volumetric, computer-aided medical diagnosis, volumetric image registration, highly relevant
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Accurate volumetric image registration is highly relevant for clinical routines and computer-aided medical diagnosis. Recently, researchers have begun to use transformers in learning-based methods for medical image registration, and have achieved remarkable success. Due to the strong global modeling capability, Transformers are considered a better option than convolutional neural networks (CNNs) for registration. However, they use bulky models with huge parameter sets, which require high-computation edge devices for deployment as portable devices or in hospitals. Transformers also need a large amount of training data to produce significant results, and it is often challenging to collect suitable annotated data. Although existing CNN-based image registration methods can offer rich local information, their global modeling capability is poor for handling long-distance information interaction, which limits registration performance. In this work, we propose a CNN-based registration method with an enhanced receptive field, a low number of parameters, and significant results on a limited training dataset. For this, we propose a residual U-Net with embedded parallel dilated-convolutional blocks to enhance the receptive field. The proposed method is evaluated on inter-patient and atlas-based datasets. We show that the performance of the proposed method is comparable to, and slightly better than, transformer-based methods while using only 1.5% of their number of parameters.
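
A minimal PyTorch sketch of a parallel dilated-convolution block of the kind described, for volumetric (3D) inputs; the channel count, dilation rates, activation, and residual fusion by summation are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ParallelDilatedBlock(nn.Module):
    """Several 3D convolutions with different dilation rates applied in
    parallel and fused, enlarging the receptive field with few parameters.
    Dilation rates and fusion rule are illustrative assumptions."""

    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # Residual fusion: sum of the dilated branches plus the input.
        return self.act(x + sum(branch(x) for branch in self.branches))

block = ParallelDilatedBlock(8)
print(block(torch.randn(1, 8, 16, 16, 16)).shape)   # torch.Size([1, 8, 16, 16, 16])
```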

[CV-68] Research on Edge Detection of LiDAR Images Based on Artificial Intelligence Technology

链接: https://arxiv.org/abs/2406.09773
作者: Haowei Yang,Liyang Wang,Jingyu Zhang,Yu Cheng,Ao Xiang
关键词: edge detection, Light Detection, robot navigation, Detection, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the widespread application of Light Detection and Ranging (LiDAR) technology in fields such as autonomous driving, robot navigation, and terrain mapping, the importance of edge detection in LiDAR images has become increasingly prominent. Traditional edge detection methods often face challenges in accuracy and computational complexity when processing LiDAR images. To address these issues, this study proposes an edge detection method for LiDAR images based on artificial intelligence technology. This paper first reviews the current state of research on LiDAR technology and image edge detection, introducing common edge detection algorithms and their applications in LiDAR image processing. Subsequently, a deep learning-based edge detection model is designed and implemented, optimizing the model training process through preprocessing and enhancement of the LiDAR image dataset. Experimental results indicate that the proposed method outperforms traditional methods in terms of detection accuracy and computational efficiency, showing significant practical application value. Finally, improvement strategies are proposed for the current method’s shortcomings, and the improvements are validated through experiments.

[CV-69] Bayesian Conditioned Diffusion Models for Inverse Problems

链接: https://arxiv.org/abs/2406.09768
作者: Alper Güngör,Bahri Batuhan Bilecen,Tolga Çukur
关键词: forward measurement operator, measurement operator, recently been shown, shown to excel, forward measurement
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Diffusion models have recently been shown to excel in many image reconstruction tasks that involve inverse problems based on a forward measurement operator. A common framework uses task-agnostic unconditional models that are later post-conditioned for reconstruction, an approach that typically suffers from suboptimal task performance. While task-specific conditional models have also been proposed, current methods heuristically inject measured data as a naive input channel that elicits sampling inaccuracies. Here, we address the optimal conditioning of diffusion models for solving challenging inverse problems that arise during image reconstruction. Specifically, we propose a novel Bayesian conditioning technique for diffusion models, BCDM, based on score-functions associated with the conditional distribution of desired images given measured data. We rigorously derive the theory to express and train the conditional score-function. Finally, we show state-of-the-art performance in image dealiasing, deblurring, super-resolution, and inpainting with the proposed technique.

[CV-70] Full-reference Point Cloud Quality Assessment Using Spectral Graph Wavelets

链接: https://arxiv.org/abs/2406.09762
作者: Ryosuke Watanabe,Keisuke Nonaka,Eduardo Pavez,Tatsuya Kobayashi,Antonio Ortega
关键词: applications frequently experience, frequently experience quality, experience quality degradation, applications frequently, degradation during processing
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Point clouds in 3D applications frequently experience quality degradation during processing, e.g., scanning and compression. Reliable point cloud quality assessment (PCQA) is important for developing compression algorithms with good bitrate-quality trade-offs and techniques for quality improvement (e.g., denoising). This paper introduces a full-reference (FR) PCQA method utilizing spectral graph wavelets (SGWs). First, we propose novel SGW-based PCQA metrics that compare SGW coefficients of coordinate and color signals between reference and distorted point clouds. Second, we achieve accurate PCQA by integrating several conventional FR metrics and our SGW-based metrics using support vector regression. To our knowledge, this is the first study to introduce SGWs for PCQA. Experimental results demonstrate the proposed PCQA metric is more accurately correlated with subjective quality scores compared to conventional PCQA metrics.
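
The final fusion step, combining several objective metrics with support vector regression, can be sketched with scikit-learn; the feature values and target scores below are synthetic placeholders standing in for the conventional and SGW-based metrics and the subjective quality scores.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Each row holds several objective metrics for one distorted point cloud
# (e.g., conventional FR metrics plus SGW-based ones); the target is the
# subjective mean opinion score. All values here are synthetic placeholders.
rng = np.random.default_rng(0)
metrics = rng.random((200, 6))
mos = metrics @ rng.random(6) + 0.1 * rng.normal(size=200)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
scores = cross_val_score(model, metrics, mos, cv=5, scoring="r2")
print("cross-validated R^2:", round(float(scores.mean()), 3))
```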

[CV-71] Grounding Image Matching in 3D with MASt3R

链接: https://arxiv.org/abs/2406.09756
作者: Vincent Leroy,Yohann Cabon,Jérôme Revaud
关键词: Image Matching, core component, best-performing algorithms, algorithms and pipelines, Matching
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image Matching is a core component of all best-performing algorithms and pipelines in 3D vision. Yet despite matching being fundamentally a 3D problem, intrinsically linked to camera pose and scene geometry, it is typically treated as a 2D problem. This makes sense as the goal of matching is to establish correspondences between 2D pixel fields, but also seems like a potentially hazardous choice. In this work, we take a different stance and propose to cast matching as a 3D task with DUSt3R, a recent and powerful 3D reconstruction framework based on Transformers. Based on pointmaps regression, this method displayed impressive robustness in matching views with extreme viewpoint changes, yet with limited accuracy. We aim here to improve the matching capabilities of such an approach while preserving its robustness. We thus propose to augment the DUSt3R network with a new head that outputs dense local features, trained with an additional matching loss. We further address the issue of quadratic complexity of dense matching, which becomes prohibitively slow for downstream applications if not carefully treated. We introduce a fast reciprocal matching scheme that not only accelerates matching by orders of magnitude, but also comes with theoretical guarantees and, lastly, yields improved results. Extensive experiments show that our approach, coined MASt3R, significantly outperforms the state of the art on multiple matching tasks. In particular, it beats the best published methods by 30% (absolute improvement) in VCRE AUC on the extremely challenging Map-free localization dataset.
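
The primitive being accelerated here, reciprocal (mutual nearest-neighbor) matching between two sets of dense local features, is easy to state in its brute-force form; the sketch below is that O(Na x Nb) baseline, not MASt3R's fast scheme.

```python
import torch
import torch.nn.functional as F

def reciprocal_matches(feat_a, feat_b):
    """Brute-force mutual nearest-neighbor matching between two feature sets.

    feat_a: (Na, D) and feat_b: (Nb, D) dense local descriptors. Returns the
    index pairs (i, j) such that j is the nearest neighbor of i and i is the
    nearest neighbor of j. This quadratic version is the baseline that a
    fast reciprocal matching scheme is designed to avoid.
    """
    sim = F.normalize(feat_a, dim=-1) @ F.normalize(feat_b, dim=-1).t()
    nn_ab = sim.argmax(dim=1)                  # best b for each a
    nn_ba = sim.argmax(dim=0)                  # best a for each b
    a_idx = torch.arange(feat_a.shape[0])
    mutual = nn_ba[nn_ab] == a_idx             # a -> b -> back to the same a
    return torch.stack([a_idx[mutual], nn_ab[mutual]], dim=1)

torch.manual_seed(0)
matches = reciprocal_matches(torch.randn(500, 64), torch.randn(600, 64))
print(matches.shape)                           # (num_mutual_matches, 2)
```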

[CV-72] LAVIB: A Large-scale Video Interpolation Benchmark

链接: https://arxiv.org/abs/2406.09754
作者: Alexandros Stergiou
关键词: LArge-scale Video Interpolation, Video Interpolation Benchmark, video frame interpolation, Interpolation Benchmark, frame interpolation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Website: this https URL

点击查看摘要

Abstract:This paper introduces a LArge-scale Video Interpolation Benchmark (LAVIB) for the low-level video task of video frame interpolation (VFI). LAVIB comprises a large collection of high-resolution videos sourced from the web through an automated pipeline with minimal requirements for human verification. Metrics are computed for each video’s motion magnitudes, luminance conditions, frame sharpness, and contrast. The collection of videos and the creation of quantitative challenges based on these metrics are under-explored by current low-level video task datasets. In total, LAVIB includes 283K clips from 17K ultra-HD videos, covering 77.6 hours. Benchmark train, val, and test sets maintain similar video metric distributions. Further splits are also created for out-of-distribution (OOD) challenges, with train and test splits including videos of dissimilar attributes.
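
Two of the per-video statistics mentioned, frame sharpness and contrast (plus luminance), are commonly estimated with simple image operators; the proxies below (variance of the Laplacian, intensity standard deviation, and mean intensity) are assumptions about plausible definitions, not necessarily the ones used by LAVIB.

```python
import cv2
import numpy as np

def frame_statistics(frame_bgr):
    """Simple proxies for per-frame statistics: sharpness as the variance of
    the Laplacian, contrast as the standard deviation of intensity, and
    luminance as the mean intensity. These definitions are illustrative."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    return {
        "sharpness": float(cv2.Laplacian(gray, cv2.CV_64F).var()),
        "contrast": float(gray.std()),
        "luminance": float(gray.mean()),
    }

frame = (np.random.default_rng(0).random((128, 128, 3)) * 255).astype(np.uint8)
print(frame_statistics(frame))
```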

[CV-73] ControlVAR: Exploring Controllable Visual Autoregressive Modeling

链接: https://arxiv.org/abs/2406.09750
作者: Xiang Li,Kai Qiu,Hao Chen,Jason Kuen,Zhe Lin,Rita Singh,Bhiksha Raj
关键词: witnessed remarkable progress, witnessed remarkable, remarkable progress, advent of diffusion, Conditional
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 24 pages, 19 figures, 4 tables

点击查看摘要

Abstract:Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive (VAR) modeling for flexible and efficient conditional generation. In contrast to traditional conditional models that learn the conditional distribution, ControlVAR jointly models the distribution of image and pixel-level conditions during training and imposes conditional controls during testing. To enhance the joint modeling, we adopt the next-scale AR prediction paradigm and unify control and image representations. A teacher-forcing guidance strategy is proposed to further facilitate controllable generation with joint modeling. Extensive experiments demonstrate the superior efficacy and flexibility of ControlVAR across various conditional generation tasks against popular conditional DMs, e.g., ControlNet and T2I-Adaptor.

[CV-74] Decoupling Forgery Semantics for Generalizable Deepfake Detection

链接: https://arxiv.org/abs/2406.09739
作者: Wei Ye,Xinan He,Feng Ding
关键词: forgery semantics, unique forgery semantics, forgery, common forgery semantics, semantics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel method for detecting DeepFakes, enhancing the generalization of detection through semantic decoupling. There are now multiple DeepFake forgery technologies that not only possess unique forgery semantics but may also share common forgery semantics. The unique forgery semantics and irrelevant content semantics may promote over-fitting and hamper generalization for DeepFake detectors. For our proposed method, after decoupling, the common forgery semantics could be extracted from DeepFakes, and subsequently be employed for developing the generalizability of DeepFake detectors. Also, to pursue additional generalizability, we designed an adaptive high-pass module and a two-stage training strategy to improve the independence of decoupled semantics. Evaluation on FF++, Celeb-DF, DFD, and DFDC datasets showcases our method’s excellent detection and generalization performance. Code is available at: https://anonymous.4open.science/r/DFS-GDD-0F42.
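
摘要提到一个"自适应高通模块"用于提升解耦语义的独立性。下面是一个与原文实现无关的 PyTorch 草图(结构、卷积核与门控形式均为假设):用固定的拉普拉斯高通卷积提取高频残差,再用可学习的逐通道门控自适应加权。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveHighPass(nn.Module):
    """Sketch of an adaptive high-pass block: a fixed Laplacian filter extracts
    high-frequency residuals; a learnable per-channel gate decides how much of
    them to re-inject. Not the paper's actual architecture."""
    def __init__(self, channels: int):
        super().__init__()
        lap = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        self.register_buffer("kernel", lap.expand(channels, 1, 3, 3).clone())
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())
        self.channels = channels

    def forward(self, x):
        high = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        return x + self.gate(x) * high   # adaptively re-weighted high frequencies

x = torch.randn(2, 3, 64, 64)
print(AdaptiveHighPass(3)(x).shape)  # torch.Size([2, 3, 64, 64])
```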

[CV-75] Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation

链接: https://arxiv.org/abs/2406.09738
作者: Teli Ma,Jiaming Zhou,Zifan Wang,Ronghe Qiu,Junwei Liang
关键词: Developing robots capable, natural language instructions, Developing robots, intricate real-world environments, guided by natural
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present Sigma-Agent, an end-to-end imitation learning agent for multi-task robotic manipulation. Sigma-Agent incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations. An effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information is introduced. Sigma-Agent shows substantial improvement over state-of-the-art methods under diverse settings in 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% in 10 and 100 demonstration training, respectively. Sigma-Agent also achieves 62% success rate with a single policy in 5 real-world manipulation tasks. The code will be released upon acceptance.
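
摘要的核心之一是用对比式模仿学习对齐视觉与语言表征。下面给出一个通用的对称 InfoNCE 损失示意(与 Sigma-Agent 的实现无关,温度等超参数均为假设),适用于成对的视觉/语言嵌入:

```python
import torch
import torch.nn.functional as F

def info_nce(vision_emb, lang_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired vision/language embeddings.
    Generic contrastive-alignment sketch, not the Sigma-Agent code."""
    v = F.normalize(vision_emb, dim=-1)
    l = F.normalize(lang_emb, dim=-1)
    logits = v @ l.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```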

[CV-76] Automated GIS-Based Framework for Detecting Crosswalk Changes from Bi-Temporal High-Resolution Aerial Images

链接: https://arxiv.org/abs/2406.09731
作者: Richard Boadu Antwi,Samuel Takyi,Alican Karaer,Eren Erman Ozguven,Michael Kimollo,Ren Moses,Maxim A. Dulebenets,Thobias Sando
关键词: Orange County, infrastructure monitoring, pavement markings, crucial for infrastructure, crosswalk
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Identification of changes in pavement markings has become crucial for infrastructure monitoring, maintenance, development, traffic management, and safety. Automated extraction of roadway geometry is critical in helping with this, given the increasing availability of high-resolution images and advancements in computer vision and object detection. Specifically, due to the substantial volume of satellite and high-resolution aerial images captured at different time instances, change detection has become a viable solution. In this study, an automated framework is developed to detect changes in crosswalks of Orange, Osceola, and Seminole counties in Florida, utilizing data extracted from high-resolution images obtained at various time intervals. Specifically, for Orange County, crosswalk changes between 2019 and 2021 were manually extracted, verified, and categorized as either new or modified crosswalks. For Seminole County, the developed model was used to automatically extract crosswalk changes between 2018 and 2021, while for Osceola County, changes between 2019 and 2020 were extracted. Findings indicate that Orange County witnessed approximately 2,094 crosswalk changes, with 312 occurring on state roads. In Seminole and Osceola counties, on the other hand, 1,040 and 1,402 crosswalk changes were observed on both local and state roads, respectively. Among these, 340 and 344 were identified on state roads in Seminole and Osceola, respectively. Spatiotemporal changes observed in crosswalks can be utilized to regularly update the existing crosswalk inventories, which is essential for agencies engaged in traffic and safety studies. Data extracted from these crosswalk changes can be combined with traffic and crash data to provide valuable insights to policymakers.
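
摘要描述了在两个时相的检测结果之间区分"新增"与"修改"的人行横道。下面用 shapely 给出一个极度简化的示意(IoU 阈值与分类规则均为本文假设,并非论文的判定标准):

```python
from shapely.geometry import Polygon

def classify_changes(old_polys, new_polys, iou_new=0.1, iou_same=0.8):
    """Toy bi-temporal comparison of crosswalk polygons (illustrative only).
    Almost no overlap with the earlier year -> 'new'; partial overlap ->
    'modified'; high overlap -> 'unchanged'."""
    def iou(a, b):
        inter = a.intersection(b).area
        union = a.union(b).area
        return inter / union if union else 0.0
    labels = []
    for p in new_polys:
        best = max((iou(p, q) for q in old_polys), default=0.0)
        labels.append("new" if best < iou_new
                      else "modified" if best < iou_same
                      else "unchanged")
    return labels

old = [Polygon([(0, 0), (4, 0), (4, 1), (0, 1)])]
new = [Polygon([(0, 0), (6, 0), (6, 1), (0, 1)]),      # widened -> modified
       Polygon([(10, 0), (12, 0), (12, 1), (10, 1)])]  # no old match -> new
print(classify_changes(old, new))  # ['modified', 'new']
```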

[CV-77] Neural Pose Representation Learning for Generating and Transferring Non-Rigid Object Poses

链接: https://arxiv.org/abs/2406.09728
作者: Seungwoo Yoo,Juil Koo,Kyeongmin Yeo,Minhyuk Sung
关键词: pose, disentangling pose information, transferring pose information, pose information, deformable objects
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:We propose a novel method for learning representations of poses for 3D deformable objects, which specializes in 1) disentangling pose information from the object’s identity, 2) facilitating the learning of pose variations, and 3) transferring pose information to other object identities. Based on these properties, our method enables the generation of 3D deformable objects with diversity in both identities and poses, using variations of a single object. It does not require explicit shape parameterization such as skeletons or joints, point-level or shape-level correspondence supervision, or variations of the target object for pose transfer. To achieve pose disentanglement, compactness for generative models, and transferability, we first design the pose extractor to represent the pose as a keypoint-based hybrid representation and the pose applier to learn an implicit deformation field. To better distill pose information from the object’s geometry, we propose the implicit pose applier to output an intrinsic mesh property, the face Jacobian. Once the extracted pose information is transferred to the target object, the pose applier is fine-tuned in a self-supervised manner to better describe the target object’s shapes with pose variations. The extracted poses are also used to train a cascaded diffusion model to enable the generation of novel poses. Our experiments with the DeformThings4D and Human datasets demonstrate state-of-the-art performance in pose transfer and the ability to generate diverse deformed shapes with various objects and poses.

[CV-78] PixRO: Pixel-Distributed Rotational Odometry with Gaussian Belief Propagation

链接: https://arxiv.org/abs/2406.09726
作者: Ignacio Alzugaray,Riku Murai,Andrew Davison
关键词: capturing high-quality images, Visual sensors, capturing high-quality, steadily increased, increased their capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Visual sensors are not only becoming better at capturing high-quality images but also they have steadily increased their capabilities in processing data on their own on-chip. Yet the majority of VO pipelines rely on the transmission and processing of full images in a centralized unit (e.g. CPU or GPU), which often contain much redundant and low-quality information for the task. In this paper, we address the task of frame-to-frame rotational estimation but, instead of reasoning about relative motion between frames using the full images, distribute the estimation at pixel-level. In this paradigm, each pixel produces an estimate of the global motion by only relying on local information and local message-passing with neighbouring pixels. The resulting per-pixel estimates can then be communicated to downstream tasks, yielding higher-level, informative cues instead of the original raw pixel-readings. We evaluate the proposed approach on real public datasets, where we offer detailed insights about this novel technique and open-source our implementation for the future benefit of the community.

[CV-79] Cross-view geo-localization: a survey

链接: https://arxiv.org/abs/2406.09722
作者: Abhilash Durgam,Sidike Paheding,Vikas Dhiman,Vijay Devabhaktuni
关键词: garnered notable attention, machine learning techniques, copious geotagged datasets, computer vision, garnered notable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cross-view geo-localization has garnered notable attention in the realm of computer vision, spurred by the widespread availability of copious geotagged datasets and the advancements in machine learning techniques. This paper provides a thorough survey of cutting-edge methodologies, techniques, and associated challenges that are integral to this domain, with a focus on feature-based and deep learning strategies. Feature-based methods capitalize on unique features to establish correspondences across disparate viewpoints, whereas deep learning-based methodologies deploy convolutional neural networks to embed view-invariant attributes. This work also delineates the multifaceted challenges encountered in cross-view geo-localization, such as variations in viewpoints and illumination, the occurrence of occlusions, and it elucidates innovative solutions that have been formulated to tackle these issues. Furthermore, we delineate benchmark datasets and relevant evaluation metrics, and also perform a comparative analysis of state-of-the-art techniques. Finally, we conclude the paper with a discussion on prospective avenues for future research and the burgeoning applications of cross-view geo-localization in an intricately interconnected global landscape.

[CV-80] AnimalFormer: Multimodal Vision Framework for Behavior-based Precision Livestock Farming

链接: https://arxiv.org/abs/2406.09711
作者: Ahmed Qazi,Taha Razzaq,Asim Iqbal
关键词: precision livestock farming, multimodal vision framework, harnessing the power, introduce a multimodal, multimodal vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024

点击查看摘要

Abstract:We introduce a multimodal vision framework for precision livestock farming, harnessing the power of GroundingDINO, HQSAM, and ViTPose models. This integrated suite enables comprehensive behavioral analytics from video data without invasive animal tagging. GroundingDINO generates accurate bounding boxes around livestock, while HQSAM segments individual animals within these boxes. ViTPose estimates key body points, facilitating posture and movement analysis. Demonstrated on a sheep dataset with grazing, running, sitting, standing, and walking activities, our framework extracts invaluable insights: activity and grazing patterns, interaction dynamics, and detailed postural evaluations. Applicable across species and video resolutions, this framework revolutionizes non-invasive livestock monitoring for activity detection, counting, health assessments, and posture analyses. It empowers data-driven farm management, optimizing animal welfare and productivity through AI-powered behavioral understanding.

[CV-81] Fine-Grained Urban Flow Inference with Multi-scale Representation Learning

链接: https://arxiv.org/abs/2406.09710
作者: Shilu Yuan,Dongfeng Li,Wei Liu,Xinxin Zhang,Meng Chen,Junjie Zhang,Yongshun Gong
关键词: crucial transportation service, transportation service aimed, improving traffic efficiency, Fine-grained urban flow, urban flow inference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-grained urban flow inference (FUFI) is a crucial transportation service aimed at improving traffic efficiency and safety. FUFI can infer fine-grained urban traffic flows based solely on observed coarse-grained data. However, most of existing methods focus on the influence of single-scale static geographic information on FUFI, neglecting the interactions and dynamic information between different-scale regions within the city. Different-scale geographical features can capture redundant information from the same spatial areas. In order to effectively learn multi-scale information across time and space, we propose an effective fine-grained urban flow inference model called UrbanMSR, which uses self-supervised contrastive learning to obtain dynamic multi-scale representations of neighborhood-level and city-level geographic information, and fuses multi-scale representations to improve fine-grained accuracy. The fusion of multi-scale representations enhances fine-grained. We validate the performance through extensive experiments on three real-world datasets. The resutls compared with state-of-the-art methods demonstrate the superiority of the proposed model.
Abstract:Fine-grained urban flow inference (FUFI) is a crucial transportation service aimed at improving traffic efficiency and safety. FUFI can infer fine-grained urban traffic flows based solely on observed coarse-grained data. However, most existing methods focus on the influence of single-scale static geographic information on FUFI, neglecting the interactions and dynamic information between different-scale regions within the city. Different-scale geographical features can capture redundant information from the same spatial areas. In order to effectively learn multi-scale information across time and space, we propose an effective fine-grained urban flow inference model called UrbanMSR, which uses self-supervised contrastive learning to obtain dynamic multi-scale representations of neighborhood-level and city-level geographic information, and fuses these multi-scale representations to improve fine-grained accuracy. We validate the performance through extensive experiments on three real-world datasets. The results, compared with state-of-the-art methods, demonstrate the superiority of the proposed model.

[CV-82] Compressed Video Quality Enhancement with Temporal Group Alignment and Fusion

链接: https://arxiv.org/abs/2406.09693
作者: Qiang Zhu,Yajun Qiu,Yu Liu,Shuyuan Zhu,Bing Zeng
关键词: long-short term correlations, temporal group alignment, quality of compressed, long-short term, term correlations
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a temporal group alignment and fusion network to enhance the quality of compressed videos by using the long-short term correlations between frames. The proposed model consists of the intra-group feature alignment (IntraGFA) module, the inter-group feature fusion (InterGFF) module, and the feature enhancement (FE) module. We form the group of pictures (GoP) by selecting frames from the video according to their temporal distances to the target enhanced frame. With this grouping, the composed GoP can contain either long- or short-term correlated information of neighboring frames. We design the IntraGFA module to align the features of frames of each GoP to eliminate the motion existing between frames. We construct the InterGFF module to fuse features belonging to different GoPs and finally enhance the fused features with the FE module to generate high-quality video frames. The experimental results show that our proposed method achieves up to 0.05dB gain and lower complexity compared to the state-of-the-art method.
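
摘要提到按与目标帧的时间距离把邻近帧分组为 GoP,分别携带长期/短期相关信息。下面是一个纯 Python 的极简示意(分组偏移量为本文假设,并非论文原设定):

```python
def build_gops(num_frames, target, offsets_short=(1, 2), offsets_long=(4, 8)):
    """Group neighbouring frame indices around a target frame into a short-term
    and a long-term GoP (illustrative grouping rule only)."""
    def pick(offsets):
        group = []
        for d in offsets:
            for idx in (target - d, target + d):
                if 0 <= idx < num_frames:
                    group.append(idx)
        return sorted(group)
    return {"short": pick(offsets_short), "long": pick(offsets_long)}

print(build_gops(num_frames=30, target=10))
# {'short': [8, 9, 11, 12], 'long': [2, 6, 14, 18]}
```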

[CV-83] Asymmetrical Siamese Network for Point Clouds Normal Estimation

链接: https://arxiv.org/abs/2406.09681
作者: Wei Jin,Jun Zhou
关键词: made great progress, deep learning-based point, recent years, deep learning-based, great progress
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, deep learning-based point cloud normal estimation has made great progress. However, existing methods mainly rely on the PCPNet dataset, leading to overfitting. In addition, the correlation between point clouds with different noise scales remains unexplored, resulting in poor performance in cross-domain scenarios. In this paper, we explore the consistency of intrinsic features learned from clean and noisy point clouds using an Asymmetric Siamese Network architecture. By applying reasonable constraints between features extracted from different branches, we enhance the quality of normal estimation. Moreover, we introduce a novel multi-view normal estimation dataset that includes a larger variety of shapes with different noise levels. Evaluation of existing methods on this new dataset reveals their inability to adapt to different types of shapes, indicating a degree of overfitting. Extensive experiments show that the proposed dataset poses significant challenges for point cloud normal estimation and that our feature constraint mechanism effectively improves upon existing methods and reduces overfitting in current architectures.
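
摘要的关键在于约束"干净点云"与"含噪点云"两条分支提取的内在特征保持一致。下面是一个通用的特征一致性损失示意(PyTorch,非论文实现,停止梯度等设计均为假设):让含噪分支的特征向干净分支看齐。

```python
import torch
import torch.nn.functional as F

def consistency_loss(feat_noisy, feat_clean):
    """Pull features of the noisy branch towards those of the clean branch.
    The clean branch is detached so it acts as a stop-gradient target.
    Generic sketch; the paper's exact constraint may differ."""
    f_n = F.normalize(feat_noisy, dim=-1)
    f_c = F.normalize(feat_clean.detach(), dim=-1)
    return (1.0 - (f_n * f_c).sum(dim=-1)).mean()   # 1 - cosine similarity

loss = consistency_loss(torch.randn(16, 128), torch.randn(16, 128))
```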

[CV-84] Exploring Training on Heterogeneous Data with Mixture of Low-rank Adapters

链接: https://arxiv.org/abs/2406.09679
作者: Yuhang Zhou,Zihua Zhao,Haolin Li,Siyuan Du,Jiangchao Yao,Ya Zhang,Yanfeng Wang
关键词: artificial general intelligence, general intelligence, unified model, targets into account, trend towards artificial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICML2024

点击查看摘要

Abstract:Training a unified model to take multiple targets into account is a trend towards artificial general intelligence. However, how to efficiently mitigate the training conflicts among heterogeneous data collected from different domains or tasks remains under-explored. In this study, we explore leveraging Mixture of Low-rank Adapters (MoLA) to mitigate conflicts in heterogeneous data training, which requires jointly training the multiple low-rank adapters and their shared backbone. Specifically, we introduce two variants of MoLA, namely, MoLA-Grad and MoLA-Router, to respectively handle the target-aware and target-agnostic scenarios during inference. The former uses task identifiers to assign personalized low-rank adapters to each task, disentangling task-specific knowledge towards their adapters, thereby mitigating heterogeneity conflicts. The latter uses a novel Task-wise Decorrelation (TwD) loss to intervene on the router so that it learns oriented weight combinations of adapters for homogeneous tasks, achieving similar effects. We conduct comprehensive experiments to verify the superiority of MoLA over previous state-of-the-art methods and present in-depth analysis on its working mechanism. Source code is available at: this https URL
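
摘要提到 MoLA-Grad 用任务标识为每个任务分配各自的低秩适配器。下面是一个"共享线性层 + 按任务选择 LoRA 增量"的极简 PyTorch 草图(与原实现无关,类名、秩等均为示意用假设):

```python
import torch
import torch.nn as nn

class LinearWithTaskLoRA(nn.Module):
    """Shared frozen linear layer plus one low-rank adapter per task.
    A task id selects which adapter's delta is added (MoLA-Grad-style idea,
    illustrative only)."""
    def __init__(self, d_in, d_out, num_tasks, rank=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)        # shared backbone weight kept frozen
        self.A = nn.Parameter(torch.randn(num_tasks, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_tasks, d_out, rank))

    def forward(self, x, task_id: int):
        delta = self.B[task_id] @ self.A[task_id]     # (d_out, d_in) low-rank update
        return self.base(x) + x @ delta.t()

layer = LinearWithTaskLoRA(64, 32, num_tasks=3)
y = layer(torch.randn(8, 64), task_id=1)              # per-task adapted output
```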

[CV-85] Learning Language Structures through Grounding

链接: https://arxiv.org/abs/2406.09662
作者: Freda Shi
关键词: Language, highly structured, learn language structures, structures, propose
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Ph.D. Thesis

点击查看摘要

Abstract:Language is highly structured, with syntactic and semantic structures, to some extent, agreed upon by speakers of the same language. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and generalize to sentences that contain unseen words. Motivated by human language learning, in this dissertation, we consider a family of machine learning tasks that aim to learn language structures through grounding. We seek distant supervision from other data sources (i.e., grounds), including but not limited to other modalities (e.g., vision), execution results of programs, and other languages. We demonstrate the potential of this task formulation and advocate for its adoption through three schemes. In Part I, we consider learning syntactic parses through visual grounding. We propose the task of visually grounded grammar induction, present the first models to induce syntactic structures from visually grounded text and speech, and find that the visual grounding signals can help improve the parsing quality over language-only models. As a side contribution, we propose a novel evaluation metric that enables the evaluation of speech parsing without text or automatic speech recognition systems involved. In Part II, we propose two execution-aware methods to map sentences into corresponding semantic structures (i.e., programs), significantly improving compositional generalization and few-shot program synthesis. In Part III, we propose methods that learn language structures from annotations in other languages. Specifically, we propose a method that sets a new state of the art on cross-lingual word alignment. We then leverage the learned word alignments to improve the performance of zero-shot cross-lingual dependency parsing, by proposing a novel substructure-based projection method that preserves structural knowledge learned from the source language.

[CV-86] RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement

链接: https://arxiv.org/abs/2406.09656
作者: Jingcheng Li,Ye Qiao,Haocheng Xu,Sitao Huang
关键词: low quality, scenarios often suffer, suffer from low, Retinex theory, Images captured
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often involve using Retinex theory. Nevertheless, most of them cannot perform well on more complicated datasets like LOL-v2 while also consuming excessive computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even more time-consuming and tedious. In this paper, we propose a more accurate, concise, and one-stage Retinex-theory-based framework, RSEND. RSEND first divides the low-light image into the illumination map and reflectance map, then captures the important details in the illumination map and performs light enhancement. After this step, it refines the enhanced gray-scale image and does element-wise matrix multiplication with the reflectance map. By denoising the output of the previous step, it obtains the final result. In all the steps, RSEND utilizes a Squeeze-and-Excitation network to better capture the details. Comprehensive quantitative and qualitative experiments show that our Efficient Retinex model significantly outperforms other CNN-based models, achieving a PSNR improvement ranging from 0.44 dB to 4.2 dB in different datasets, and even outperforms transformer-based models in the LOL-v2-real dataset.
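
摘要多次提到用 Squeeze-and-Excitation(SE)来更好地捕捉细节。下面给出一个标准 SE 通道注意力块的 PyTorch 示意(这是通用构件,而非 RSEND 的具体网络,reduction 等超参数为假设):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel attention
    (generic building block, not the exact RSEND module)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                          # squeeze: global average pool
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1), nn.Sigmoid())     # excitation: channel weights

    def forward(self, x):
        return x * self.fc(x)                                 # reweight channels

x = torch.randn(2, 32, 64, 64)
print(SEBlock(32)(x).shape)  # torch.Size([2, 32, 64, 64])
```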

[CV-87] An Intrinsic Vector Heat Network

链接: https://arxiv.org/abs/2406.09648
作者: Alexander Gao,Maurice Chu,Mubbasir Kapadia,Ming C. Lin,Hsueh-Ti Derek Liu
关键词: Vector fields, represent and model, model flows, science and engineering, Vector
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vector fields are widely used to represent and model flows for many science and engineering applications. This paper introduces a novel neural network architecture for learning tangent vector fields that are intrinsically defined on manifold surfaces embedded in 3D. Previous approaches to learning vector fields on surfaces treat vectors as multi-dimensional scalar fields, using traditional scalar-valued architectures to process channels individually, thus fail to preserve fundamental intrinsic properties of the vector field. The core idea of this work is to introduce a trainable vector heat diffusion module to spatially propagate vector-valued feature data across the surface, which we incorporate into our proposed architecture that consists of vector-valued neurons. Our architecture is invariant to rigid motion of the input, isometric deformation, and choice of local tangent bases, and is robust to discretizations of the surface. We evaluate our Vector Heat Network on triangle meshes, and empirically validate its invariant properties. We also demonstrate the effectiveness of our method on the useful industrial application of quadrilateral mesh generation.

[CV-88] OpenAnimalTracks: A Dataset for Animal Track Recognition

链接: https://arxiv.org/abs/2406.09647
作者: Risa Shinoda,Kaede Shiohara
关键词: habitat surveys play, Animal habitat surveys, Animal, surveys play, play a critical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICIP 2024. Dataset and code: this https URL

点击查看摘要

Abstract:Animal habitat surveys play a critical role in preserving the biodiversity of the land. One of the effective ways to gain insights into animal habitats involves identifying animal footprints, which offers valuable information about species distribution, abundance, and behavior. However, due to the scarcity of animal footprint images, there are no well-maintained public datasets, preventing recent advanced techniques in computer vision from being applied to animal tracking. In this paper, we introduce OpenAnimalTracks dataset, the first publicly available labeled dataset designed to facilitate the automated classification and detection of animal footprints. It contains various footprints from 18 wild animal species. Moreover, we build benchmarks for species classification and detection and show the potential of automated footprint identification with representative classifiers and detection models. We find SwinTransformer achieves a promising classification result, reaching 69.41% in terms of the averaged accuracy. Faster-RCNN achieves mAP of 0.295. We hope our dataset paves the way for automated animal tracking techniques, enhancing our ability to protect and manage biodiversity. Our dataset and code are available at this https URL.

[CV-89] A Survey of Video Datasets for Grounded Event Understanding

链接: https://arxiv.org/abs/2406.09646
作者: Kate Sanders,Benjamin Van Durme
关键词: well-rounded common-sense reasoning, common-sense reasoning akin, specialized downstream tasks, retrieval or question-answering, contemporary multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While existing video benchmarks largely consider specialized downstream tasks like retrieval or question-answering (QA), contemporary multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding. A critical component of human temporal-visual perception is our ability to identify and cognitively model “things happening”, or events. Historically, video benchmark tasks have implicitly tested for this ability (e.g., video captioning, in which models describe visual events with natural language), but they do not consider video event understanding as a task in itself. Recent work has begun to explore video analogues to textual event extraction but consists of competing task definitions and datasets limited to highly specific event types. Therefore, while there is a rich domain of event-centric video research spanning the past 10+ years, it is unclear how video event understanding should be framed and what resources we have to study it. In this paper, we survey 105 video datasets that require event understanding capability, consider how they contribute to the study of robust event understanding in video, and assess proposed video event extraction tasks in the context of this body of research. We propose suggestions informed by this survey for dataset curation and task framing, with an emphasis on the uniquely temporal nature of video events and ambiguity in visual content.

[CV-90] Industrial Language-Image Dataset (ILID): Adapting Vision Foundation Models for Industrial Settings

链接: https://arxiv.org/abs/2406.09637
作者: Keno Moenck,Duc Trung Thieu,Julian Koch,Thorsten Schüppstuhl
关键词: Large Language Models, Contrastive Language-Image Pre-training, computer vision community, Large Language, Vision Foundation Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Dataset at this https URL training- and evaluation-related code at this https URL

点击查看摘要

Abstract:In recent years, the rise of Large Language Models (LLMs) has also encouraged the computer vision community to work on substantial multimodal datasets and train models at scale in a self-/semi-supervised manner, resulting in Vision Foundation Models (VFM), e.g., Contrastive Language-Image Pre-training (CLIP). The models generalize well and perform outstandingly on everyday objects or scenes, even on downstream tasks the model has not been trained on, while the application in specialized domains, as in an industrial context, is still an open research question. Here, fine-tuning the models or transfer learning on domain-specific data is unavoidable when aiming for adequate performance. In this work, we, on the one hand, introduce a pipeline to generate the Industrial Language-Image Dataset (ILID) based on web-crawled data; on the other hand, we demonstrate effective self-supervised transfer learning and discuss downstream tasks after training on the cheaply acquired ILID, which does not necessitate human labeling or intervention. With the proposed approach, we contribute by transferring approaches from state-of-the-art research around foundation models, transfer learning strategies, and applications to the industrial domain.

[CV-91] Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

链接: https://arxiv.org/abs/2406.09630
作者: Mehreen Saeed,Adrian Chan,Anupam Mijar,Joseph Moukarzel,Georges Habchi,Carlos Younes,Amin Elias,Chau-Wai Wong,Akram Khater
关键词: historic handwritten page, page images transcribed, machine learning dataset, learning dataset consisting, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.

[CV-92] RobustSAM: Segment Anything Robustly on Degraded Images

链接: https://arxiv.org/abs/2406.09627
作者: Wei-Ting Chen,Yu-Jiet Vong,Sy-Yen Kuo,Sizhuo Ma,Jian Wang
关键词: flexible prompting system, prompting system, transformative approach, capabilities and flexible, flexible prompting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: Accepted by CVPR2024 (Highlight); Project Page: this https URL

点击查看摘要

Abstract:Segment Anything Model (SAM) has emerged as a transformative approach in image segmentation, acclaimed for its robust zero-shot segmentation capabilities and flexible prompting system. Nonetheless, its performance is challenged by images with degraded quality. Addressing this limitation, we propose the Robust Segment Anything Model (RobustSAM), which enhances SAM’s performance on low-quality images while preserving its promptability and zero-shot generalization. Our method leverages the pre-trained SAM model with only marginal parameter increments and computational requirements. The additional parameters of RobustSAM can be optimized within 30 hours on eight GPUs, demonstrating its feasibility and practicality for typical research laboratories. We also introduce the Robust-Seg dataset, a collection of 688K image-mask pairs with different degradations designed to train and evaluate our model optimally. Extensive experiments across various segmentation tasks and datasets confirm RobustSAM’s superior performance, especially under zero-shot conditions, underscoring its potential for extensive real-world application. Additionally, our method has been shown to effectively improve the performance of SAM-based downstream tasks such as single image dehazing and deblurring.

[CV-93] DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer

链接: https://arxiv.org/abs/2406.09622
作者: Wei-Ting Chen,Gurunandan Krishnan,Qiang Gao,Sy-Yen Kuo,Sizhuo Ma,Jian Wang
关键词: Image Quality Assessment, selecting high-quality face, Quality Assessment, improving image restoration, image restoration algorithms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: Accepted by CVPR 2024, Project Page: this https URL

点击查看摘要

Abstract:Generic Face Image Quality Assessment (GFIQA) evaluates the perceptual quality of facial images, which is crucial in improving image restoration algorithms and selecting high-quality face images for downstream tasks. We present a novel transformer-based method for GFIQA, which is aided by two unique mechanisms. First, a Dual-Set Degradation Representation Learning (DSL) mechanism uses facial images with both synthetic and real degradations to decouple degradation from content, ensuring generalizability to real-world scenarios. This self-supervised method learns degradation features on a global scale, providing a robust alternative to conventional methods that use local patch information in degradation learning. Second, our transformer leverages facial landmarks to emphasize visually salient parts of a face image in evaluating its perceptual quality. We also introduce a balanced and diverse Comprehensive Generic Face IQA (CGFIQA-40k) dataset of 40K images carefully designed to overcome the biases, in particular the imbalances in skin tone and gender representation, in existing datasets. Extensive analysis and evaluation demonstrate the robustness of our method, marking a significant improvement over prior methods.

[CV-94] ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

链接: https://arxiv.org/abs/2406.09613
作者: Wufei Ma,Guanning Zeng,Guofeng Zhang,Qihao Liu,Letian Zhang,Adam Kortylewski,Yaoyao Liu,Alan Yuille
关键词: arbitrary rigid objects, rigid objects, object-level, general-purpose object-level, arbitrary rigid
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and most importantly, generalizing to rigid objects from unseen categories. However, existing datasets with object-level 3D annotations are often limited by the number of categories or the quality of annotations. Models developed on these datasets become specialists for certain categories or domains, and fail to generalize. In this work, we present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box, 3D pose, 3D location annotations, and image captions interleaved with 3D information. With the new annotations available in ImageNet3D, we could (i) analyze the object-level 3D awareness of visual foundation models, and (ii) study and develop general-purpose models that infer both 2D and 3D information for arbitrary rigid objects in natural images, and (iii) integrate unified 3D models with large language models for 3D-related reasoning… We consider two new tasks, probing of object-level 3D awareness and open vocabulary pose estimation, besides standard classification and pose estimation. Experimental results on ImageNet3D demonstrate the potential of our dataset in building vision models with stronger general-purpose object-level 3D understanding.

[CV-95] Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos

链接: https://arxiv.org/abs/2406.09601
作者: Qingyuan Liu,Pengyuan Shi,Yun-Yun Tsai,Chengzhi Mao,Junfeng Yang
关键词: creating high-quality videos, privacy vulnerabilities, videos, impressive achievements, creating high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Recent works to combat Deepfakes videos have developed detectors that are highly accurate at identifying GAN-generated samples. However, the robustness of these detectors on diffusion-generated videos generated from video creation tools (e.g., SORA by OpenAI, Runway Gen-2, and Pika, etc.) is still unexplored. In this paper, we propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models, such as Stable Video Diffusion. We find that the SOTA methods for detecting diffusion-generated images lack robustness in identifying diffusion-generated videos. Our analysis reveals that the effectiveness of these detectors diminishes when applied to out-of-domain videos, primarily because they struggle to track the temporal features and dynamic variations between frames. To address the above-mentioned challenge, we collect a new benchmark video dataset for diffusion-generated videos using SOTA video creation tools. We extract representation within explicit knowledge from the diffusion model for video frames and train our detector with a CNN + LSTM architecture. The evaluation shows that our framework can well capture the temporal features between frames, achieves 93.7% detection accuracy for in-domain videos, and improves the accuracy of out-domain videos by up to 16 points.
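
摘要提到用 CNN + LSTM 架构捕捉帧间的时序特征。下面是一个通用的"逐帧 CNN 特征 + LSTM 时序聚合 + 二分类头"示意(PyTorch/torchvision,并非论文的检测器,骨干与维度均为假设):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CnnLstmVideoDetector(nn.Module):
    """Per-frame ResNet features aggregated by an LSTM, then a binary head.
    Generic CNN+LSTM sketch for real-vs-generated video classification."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()                    # 512-d per-frame features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, clips):                          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                      # one logit per clip

logits = CnnLstmVideoDetector()(torch.randn(2, 8, 3, 224, 224))
```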

[CV-96] Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking

链接: https://arxiv.org/abs/2406.09598
作者: Prithviraj Banerjee,Sindi Shkodrani,Pierre Moulon,Shreyas Hampali,Fan Zhang,Jade Fountain,Edward Miller,Selen Basol,Richard Newcombe,Robert Wang,Jakob Julian Engel,Tomas Hodan
关键词: dataset, object tracking, objects, multi-view RGB, diverse rigid objects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. We aim to accelerate research on egocentric hand-object interaction by making the HOT3D dataset publicly available and by co-organizing public challenges on the dataset at ECCV 2024. The dataset can be downloaded from the project website: this https URL.

[CV-97] Color Equivariant Network

链接: https://arxiv.org/abs/2406.09588
作者: Felix O’Mahony,Yulong Yang,Christine Allen-Blanchette
关键词: convolutional neural networks, Group equivariant convolutional, variety of geometric, equivariant convolutional neural, group equivariant networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Group equivariant convolutional neural networks have been designed for a variety of geometric transformations from 2D and 3D rotation groups, to semi-groups such as scale. Despite the improved interpretability, accuracy and generalizability afforded by these architectures, group equivariant networks have seen limited application in the context of perceptual quantities such as hue and saturation, even though their variation can lead to significant reductions in classification performance. In this paper, we introduce convolutional neural networks equivariant to variations in hue and saturation by design. To achieve this, we leverage the observation that hue and saturation transformations can be identified with the 2D rotation and 1D translation groups respectively. Our hue-, saturation-, and fully color-equivariant networks achieve equivariance to these perceptual transformations without an increase in network parameters. We demonstrate the utility of our networks on synthetic and real world datasets where color and lighting variations are commonplace.
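
摘要利用了"色相变换对应二维旋转、饱和度变换对应一维平移"的观察。下面用标准库 colorsys 简单演示"色相平移等价于在色相圆上旋转"这一对应关系(仅为直观说明,与论文的等变网络结构无关):

```python
import colorsys

def shift_hue(rgb, degrees):
    """Rotate the hue of an RGB colour by the given angle.
    Illustrates that a hue change is a rotation on the hue circle."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    h = (h + degrees / 360.0) % 1.0       # rotation on the unit (hue) circle
    return colorsys.hsv_to_rgb(h, s, v)

print(shift_hue((1.0, 0.0, 0.0), 120))    # red rotated by 120° -> green (0.0, 1.0, 0.0)
```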

[CV-98] CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement

链接: https://arxiv.org/abs/2406.09575
作者: Carlos Plou,Lorenzo Mur-Labadia,Ruben Martinez-Cantin,Ana C.Murillo
关键词: Step Grounding task, natural language descriptions, Step Grounding, locate temporal boundaries, Grounding task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The goal of the Step Grounding task is to locate temporal boundaries of activities based on natural language descriptions. This technical report introduces a Bayesian-VSLNet to address the challenge of identifying such temporal segments in lengthy, untrimmed egocentric videos. Our model significantly improves upon traditional models by incorporating a novel Bayesian temporal-order prior during inference, enhancing the accuracy of moment predictions. This prior adjusts for cyclic and repetitive actions within videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset with a 35.18 Recall Top-1 at 0.3 IoU and 20.48 Recall Top-1 at 0.5 IoU on the test set.
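
摘要提到在推理阶段加入"时间顺序先验"来修正候选片段得分。下面用 NumPy 给出一个非常简化的示意(先验形式、σ 等均为本文假设,并非 Bayesian-VSLNet 的实际做法):让第 k 个步骤的候选片段得分乘上一个偏向"该步骤应出现在视频相对位置 k/K 附近"的高斯先验,再归一化。

```python
import numpy as np

def rescore_with_order_prior(scores, step_idx, num_steps, sigma=0.15):
    """scores: model confidence for candidate segments whose centres are
    spread uniformly over [0, 1] of the video. The prior prefers segments
    whose position matches the step's expected relative order.
    Toy illustration of a temporal-order prior only."""
    centres = np.linspace(0.0, 1.0, len(scores))
    expected = (step_idx + 0.5) / num_steps            # where step k "should" happen
    prior = np.exp(-0.5 * ((centres - expected) / sigma) ** 2)
    post = scores * prior
    return post / post.sum()                           # renormalised scores

scores = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
print(rescore_with_order_prior(scores, step_idx=3, num_steps=4))
```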

[CV-99] Improving Consistency Models with Generator-Induced Coupling

链接: https://arxiv.org/abs/2406.09570
作者: Thibaut Issenhuth,Ludovic Dos Santos,Jean-Yves Franceschi,Alain Rakotomamonjy
关键词: promising generative models, single forward pass, neural network, promising generative, distill the multi-step
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Consistency models are promising generative models as they distill the multi-step sampling of score-based diffusion in a single forward pass of a neural network. Without access to sampling trajectories of a pre-trained diffusion model, consistency training relies on proxy trajectories built on an independent coupling between the noise and data distributions. Refining this coupling is a key area of improvement to make it more adapted to the task and reduce the resulting randomness in the training process. In this work, we introduce a novel coupling associating the input noisy data with their generated output from the consistency model itself, as a proxy to the inaccessible diffusion flow output. Our affordable approach exploits the inherent capacity of consistency models to compute the transport map in a single step. We provide intuition and empirical evidence of the relevance of our generator-induced coupling (GC), which brings consistency training closer to score distillation. Consequently, our method not only accelerates consistency training convergence by significant amounts but also enhances the resulting performance. The code is available at: this https URL.

[CV-100] Towards Domain Adaptive Neural Contextual Bandits

链接: https://arxiv.org/abs/2406.09564
作者: Ziyan Wang,Hao Wang
关键词: decision making problems, Contextual bandit, solving real-world decision, real-world decision making, Contextual bandit algorithms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Contextual bandit algorithms are essential for solving real-world decision making problems. In practice, collecting a contextual bandit’s feedback from different domains may involve different costs. For example, measuring drug reaction from mice (as a source domain) and humans (as a target domain). Unfortunately, adapting a contextual bandit algorithm from a source domain to a target domain with distribution shift still remains a major challenge and largely unexplored. In this paper, we introduce the first general domain adaptation method for contextual bandits. Our approach learns a bandit model for the target domain by collecting feedback from the source domain. Our theoretical analysis shows that our algorithm maintains a sub-linear regret bound even adapting across domains. Empirical results show that our approach outperforms the state-of-the-art contextual bandit algorithms on real-world datasets.

[CV-101] My Body My Choice: Human-Centric Full-Body Anonymization

链接: https://arxiv.org/abs/2406.09553
作者: Umur Aybars Ciftci,Ali Kemal Tanriverdi,Ilke Demir
关键词: increasing privacy concerns, online presence, era of increasing, increasing privacy, piece of content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: AI for Content Creation Workshop @ CVPR 2024

点击查看摘要

Abstract:In an era of increasing privacy concerns for our online presence, we propose that the decision to appear in a piece of content should only belong to the owner of the body. Although some automatic approaches for full-body anonymization have been proposed, human-guided anonymization can adapt to various contexts, such as cultural norms, personal relations, esthetic concerns, and security issues. "My Body My Choice" (MBMC) enables physical and adversarial anonymization by removal and swapping approaches targeting four tasks, designed with single or multiple ControlNet or GAN modules and combining several diffusion models. We evaluate anonymization on seven datasets; compare with SOTA inpainting and anonymization methods; evaluate by image, adversarial, and generative metrics; and conduct reidentification experiments.

[CV-102] Q-Mamba: On First Exploration of Vision Mamba for Image Quality Assessment

链接: https://arxiv.org/abs/2406.09546
作者: Fengbin Guan,Xin Li,Zihao Yu,Yiting Lu,Zhibo Chen
关键词: State Space Model, State Space, image quality assessment, recently popular foundation, popular foundation model
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 17 pages,3 figures

点击查看摘要

Abstract:In this work, we take the first exploration of the recently popular foundation model, i.e., State Space Model/Mamba, in image quality assessment, aiming at observing and excavating the perception potential in vision Mamba. A series of works on Mamba has shown its significant potential in various fields, e.g., segmentation and classification. However, the perception capability of Mamba has been under-explored. Consequently, we propose Q-Mamba by revisiting and adapting the Mamba model for three crucial IQA tasks, i.e., task-specific, universal, and transferable IQA, which reveals that the Mamba model has obvious advantages compared with existing foundational models, e.g., Swin Transformer, ViT, and CNNs, in terms of perception and computational cost for IQA. To increase the transferability of Q-Mamba, we propose the StylePrompt tuning paradigm, where the basic lightweight mean and variance prompts are injected to assist the task-adaptive transfer learning of pre-trained Q-Mamba for different downstream IQA tasks. Compared with existing prompt tuning strategies, our proposed StylePrompt enables better perception transfer capability with less computational cost. Extensive experiments on multiple synthetic, authentic IQA datasets, and cross IQA datasets have demonstrated the effectiveness of our proposed Q-Mamba.

[CV-103] Language-driven Grasp Detection

链接: https://arxiv.org/abs/2406.09489
作者: An Dinh Vuong,Minh Nhat Vu,Baoru Huang,Nghia Nguyen,Hieu Le,Thieu Vo,Anh Nguyen
关键词: Grasp detection, language-driven grasp detection, Grasp, industrial applications, persistent and intricate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages. Accepted to CVPR24

点击查看摘要

Abstract:Grasp detection is a persistent and intricate challenge with various industrial applications. Recently, many methods and datasets have been proposed to tackle the grasp detection problem. However, most of them do not consider using natural language as a condition to detect the grasp poses. In this paper, we introduce Grasp-Anything++, a new language-driven grasp detection dataset featuring 1M samples, over 3M objects, and upwards of 10M grasping instructions. We utilize foundation models to create a large-scale scene corpus with corresponding images and grasp prompts. We approach the language-driven grasp detection task as a conditional generation problem. Drawing on the success of diffusion models in generative tasks and given that language plays a vital role in this task, we propose a new language-driven grasp detection method based on diffusion models. Our key contribution is the contrastive training objective, which explicitly contributes to the denoising process to detect the grasp pose given the language instructions. We show that our approach is theoretically supported. Extensive experiments show that our method outperforms state-of-the-art approaches and allows real-world robotic grasping. Finally, we demonstrate that our large-scale dataset enables zero-shot grasp detection and is a challenging benchmark for future work. Project website: this https URL

[CV-104] SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets

链接: https://arxiv.org/abs/2406.09486
作者: Shenghua Wan,Ziyuan Chen,Le Gan,Shuai Feng,De-Chuan Zhan
关键词: offline reinforcement Learning, involving high-dimensional inputs, Model-based offline reinforcement, reinforcement Learning, Separated Model-based Offline
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 23 pages, 10 figures

点击查看摘要

Abstract:Model-based offline reinforcement learning (RL) is a promising approach that leverages existing data effectively in many real-world applications, especially those involving high-dimensional inputs like images and videos. To alleviate the distribution shift issue in offline RL, existing model-based methods heavily rely on the uncertainty of learned dynamics. However, the model uncertainty estimation becomes significantly biased when observations contain complex distractors with non-trivial dynamics. To address this challenge, we propose a new approach - Separated Model-based Offline Policy Optimization (SeMOPO) - decomposing latent states into endogenous and exogenous parts via conservative sampling and estimating model uncertainty on the endogenous states only. We provide a theoretical guarantee of model uncertainty and performance bound of SeMOPO. To assess the efficacy, we construct the Low-Quality Vision Deep Data-Driven Datasets for RL (LQV-D4RL), where the data are collected by a non-expert policy and the observations include moving distractors. Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method. The project website is this https URL.

[CV-105] Is Diffusion Model Safe? Severe Data Leakage via Gradient-Guided Diffusion Model

链接: https://arxiv.org/abs/2406.09484
作者: Jiayang Meng,Tao Huang,Hong Chen,Cuiping Li
关键词: image processing systems, processing systems, modern image processing, image processing, potential source
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Gradient leakage has been identified as a potential source of privacy breaches in modern image processing systems, where the adversary can completely reconstruct the training images from leaked gradients. However, existing methods are restricted to reconstructing low-resolution images, where the data leakage risks of image processing systems are not sufficiently explored. In this paper, by exploiting diffusion models, we propose an innovative gradient-guided fine-tuning method and introduce a new reconstruction attack that is capable of stealing private, high-resolution images from image processing systems through leaked gradients, where severe data leakage occurs. Our attack method is easy to implement and requires little prior knowledge. The experimental results indicate that current reconstruction attacks can steal images only up to a resolution of 128×128 pixels, while our attack method can successfully recover and steal images with resolutions up to 512×512 pixels. Our attack method significantly outperforms the SOTA attack baselines in terms of both pixel-wise accuracy and time efficiency of image reconstruction. Furthermore, our attack can render differential privacy ineffective to some extent.
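
摘要针对的是"由泄露梯度重建训练图像"这一类攻击。下面给出经典梯度反演思路(DLG 风格)的极简 PyTorch 示意,仅用于说明基本原理;它与论文基于扩散模型的方法不同,其中的模型、尺寸与迭代次数均为假设,"泄露的梯度"在此处也是模拟生成的:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()

# "Leaked" gradients from one private example (simulated here for illustration).
x_true = torch.rand(1, 3, 32, 32)
y_true = torch.tensor([3])
true_grads = torch.autograd.grad(criterion(model(x_true), y_true),
                                 model.parameters())

# Optimise a dummy image so that its gradients match the leaked ones
# (basic DLG-style inversion; the paper instead guides a diffusion model).
x_dummy = torch.rand(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    dummy_grads = torch.autograd.grad(criterion(model(x_dummy), y_true),
                                      model.parameters(), create_graph=True)
    loss = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    loss.backward()
    opt.step()
```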

[CV-106] ELF-UA: Efficient Label-Free User Adaptation in Gaze Estimation

链接: https://arxiv.org/abs/2406.09481
作者: Yong Wu,Yang Wang,Sanqing Qu,Zhijun Li,Guang Chen
关键词: gaze estimation, gaze, data, estimation, user-adaptive gaze estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted by IJCAI’24

点击查看摘要

Abstract:We consider the problem of user-adaptive 3D gaze estimation. The performance of person-independent gaze estimation is limited due to interpersonal anatomical differences. Our goal is to provide a personalized gaze estimation model specifically adapted to a target user. Previous work on user-adaptive gaze estimation requires some labeled images of the target person data to fine-tune the model at test time. However, this can be unrealistic in real-world applications, since it is cumbersome for an end-user to provide labeled images. In addition, previous work requires the training data to have both gaze labels and person IDs. This data requirement makes it infeasible to use some of the available data. To tackle these challenges, this paper proposes a new problem called efficient label-free user adaptation in gaze estimation. Our model only needs a few unlabeled images of a target user for the model adaptation. During offline training, we have some labeled source data without person IDs and some unlabeled person-specific data. Our proposed method uses a meta-learning approach to learn how to adapt to a new user with only a few unlabeled images. Our key technical innovation is to use a generalization bound from domain adaptation to define the loss function in meta-learning, so that our method can effectively make use of both the labeled source data and the unlabeled person-specific data during training. Extensive experiments validate the effectiveness of our method on several challenging benchmarks.

[CV-107] SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video

链接: https://arxiv.org/abs/2406.09462
作者: Hector A. Valdez,Kyle Min,Subarna Tripathi
关键词: improving downstream egocentric, Pretraining egocentric vision-language, egocentric video-text tasks, downstream egocentric video-text, essential to improving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture. The memory footprint of these models during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE, instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, yet pretrainable on memory-limited devices.

[CV-108] Updating CLIP to Prefer Descriptions Over Captions

链接: https://arxiv.org/abs/2406.09458
作者: Amir Zur,Elisa Kreiss,Karel D’Oosterlinck,Christopher Potts,Atticus Geiger
关键词: powerful generic metric, meant to complement, meant to replace, replace an image, powerful generic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Although CLIPScore is a powerful generic metric that captures the similarity between a text and an image, it fails to distinguish between a caption that is meant to complement the information in an image and a description that is meant to replace an image entirely, e.g., for accessibility. We address this shortcoming by updating the CLIP model with the Concadia dataset to assign higher scores to descriptions than captions using parameter efficient fine-tuning and a loss objective derived from work on causal interpretability. This model correlates with the judgements of blind and low-vision people while preserving transfer capabilities and has interpretable structure that sheds light on the caption–description distinction.
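To illustrate the basic training signal, here is a minimal sketch of nudging CLIP to score a description above a caption for the same image with a margin ranking loss, using the Hugging Face CLIP implementation. The image, the two texts, the margin and the choice to update only the text projection are illustrative assumptions; this is not the paper's Concadia-based, parameter-efficient, causally motivated training recipe.

```python
# Sketch: push CLIP similarity of (image, description) above (image, caption)
# with a margin loss. Texts, image and trained parameters are placeholders.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="gray")      # stand-in image
description = "A plain gray square with no visible objects."  # replaces the image
caption = "Calm before the storm."                       # complements the image

inputs = processor(text=[description, caption], images=image,
                   return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.text_projection.parameters(), lr=1e-5)

for _ in range(3):  # a few illustrative steps
    sims = model(**inputs).logits_per_image[0]   # [score(description), score(caption)]
    loss = F.relu(1.0 - (sims[0] - sims[1]))     # want description to win by a margin
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(float(loss))
```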

[CV-109] Pandora: Towards General World Model with Natural Language Actions and Video States

链接: https://arxiv.org/abs/2406.09455
作者: Jiannan Xiang,Guangyi Liu,Yi Gu,Qiyue Gao,Yuting Ning,Yuheng Zha,Zeyu Feng,Tianhua Tao,Shibo Hao,Yemin Shi,Zhengzhong Liu,Eric P. Xing,Zhiting Hu
关键词: World, general world, general world models, World models, simulate future states
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Website: this https URL

点击查看摘要

Abstract:World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provides a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper makes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training-from-scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential of building stronger general world models with larger-scale training.

[CV-110] Advancing High Resolution Vision-Language Models in Biomedicine

链接: https://arxiv.org/abs/2406.09454
作者: Zekai Chen,Arda Pekis,Kevin Brown
关键词: significantly advanced generative, vision-language modeling, learning has significantly, significantly advanced, advanced generative
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注: 15 pages

点击查看摘要

Abstract:Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling. Innovations like GPT-4V and open-source projects such as LLaVA have enabled robust conversational agents capable of zero-shot task completions. However, applying these technologies in the biomedical field presents unique challenges. Recent initiatives like LLaVA-Med have started to adapt instruction-tuning for biomedical contexts using large datasets such as PMC-15M. Our research offers three key contributions: (i) we present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B, (ii) we propose a novel image encoding strategy using hierarchical representations to improve fine-grained biomedical visual comprehension, and (iii) we develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question answering benchmarks, with an average performance improvement of over 10% compared to previous methods. These advancements provide more accurate and reliable tools for medical professionals, bridging gaps in current multi-modal conversational assistants and promoting further innovations in medical AI.

[CV-111] Advancing Roadway Sign Detection with YOLO Models and Transfer Learning

链接: https://arxiv.org/abs/2406.09437
作者: Selvia Nafaa,Hafsa Essam,Karim Ashour,Doaa Emad,Rana Mohamed,Mohammed Elhenawy,Huthaifa I. Ashqar,Abdallah A. Hassan,Taqwa I. Alhadidi
关键词: Driving Assistant Systems, Advanced Driving Assistant, Assistant Systems, Advanced Driving, Driving Assistant
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Roadway sign detection and recognition is an essential element in Advanced Driving Assistant Systems (ADAS). Several artificial intelligence methods have been widely used for this task, among them YOLOv5 and YOLOv8. In this paper, we used modified YOLOv5 and YOLOv8 models to detect and classify different roadway signs under different illumination conditions. Experimental results indicated that for the YOLOv8 model, varying the number of epochs and batch size yields consistent MAP50 scores, ranging from 94.6% to 97.1% on the testing set. The YOLOv5 model demonstrates competitive performance, with MAP50 scores ranging from 92.4% to 96.9%. These results suggest that both models perform well across different training setups, with YOLOv8 generally achieving slightly higher MAP50 scores, offering valuable insights for practitioners seeking reliable and adaptable solutions in object detection applications.
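For orientation, the snippet below shows how a YOLOv8 fine-tuning run of this kind is typically launched with the ultralytics package. The dataset config "signs.yaml" and the epoch/batch values are placeholders standing in for the sweeps described in the abstract, not the authors' exact settings.

```python
# Illustrative YOLOv8 fine-tuning run with the ultralytics package.
# "signs.yaml" is a hypothetical dataset config for a roadway-sign dataset.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                                  # pretrained checkpoint
model.train(data="signs.yaml", epochs=100, batch=16, imgsz=640)
metrics = model.val()                                       # evaluates on the val split
print("mAP50:", metrics.box.map50)
```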

[CV-112] Modified Risk Formulation for Improving the Prediction of Knee Osteoarthritis Progression

链接: https://arxiv.org/abs/2406.10119
作者: Haresh Rengaraj Rajamohan,Richard Kijowski,Kyunghyun Cho,Cem M. Deniz
关键词: incorporate disease specific, disease specific prior, specific prior knowledge, AUPRC, AUROC
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Current methods for predicting osteoarthritis (OA) outcomes do not incorporate disease-specific prior knowledge to improve the outcome prediction models. We developed a novel approach that effectively uses consecutive imaging studies to improve OA outcome predictions by incorporating an OA severity constraint. This constraint ensures that the risk of OA for a knee should either increase or remain the same over time. DL models were trained to predict total knee replacement (TKR) within multiple time periods (1 year, 2 years, and 4 years) using knee radiographs and MRI scans. Models with and without the risk constraint were evaluated using the area under the receiver operator curve (AUROC) and the area under the precision-recall curve (AUPRC) analysis. The novel RiskFORM2 method, leveraging a dual model risk constraint architecture, demonstrated superior performance, yielding an AUROC of 0.87 and AUPRC of 0.47 for 1 year TKR prediction on the OAI radiograph test set, a marked improvement over the 0.79 AUROC and 0.34 AUPRC of the baseline approach. The performance advantage extended to longer follow-up periods, with RiskFORM2 maintaining a high AUROC of 0.86 and AUPRC of 0.75 in predicting TKR within 4 years. Additionally, when generalizing to the external MOST radiograph test set, RiskFORM2 generalized better with an AUROC of 0.77 and AUPRC of 0.25 for 1 year predictions, which was higher than the 0.71 AUROC and 0.19 AUPRC of the baseline approach. In the MRI test sets, similar patterns emerged, with RiskFORM2 outperforming the baseline approach consistently. However, RiskFORM1 exhibited the highest AUROC of 0.86 and AUPRC of 0.72 for 4 year predictions on the OAI set.
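One simple way to express the "risk should not decrease over time" idea is a hinge penalty on predictions from consecutive studies, sketched below. The base loss, the penalty weight and the function shape are illustrative assumptions; the paper's actual RiskFORM1/RiskFORM2 dual-model formulations are not reproduced here.

```python
# Sketch of a monotone-risk penalty for consecutive imaging studies: the
# predicted TKR risk at the later visit should be >= the earlier one.
# Weighting and base loss are placeholders, not the RiskFORM design.
import torch
import torch.nn.functional as F

def risk_constrained_loss(risk_prev, risk_curr, target_curr, lam=1.0):
    """risk_prev / risk_curr: sigmoid outputs in [0, 1] for visits t-1 and t."""
    bce = F.binary_cross_entropy(risk_curr, target_curr)
    violation = F.relu(risk_prev - risk_curr)     # penalise decreasing risk
    return bce + lam * violation.mean()

risk_prev = torch.tensor([0.30, 0.70])
risk_curr = torch.tensor([0.40, 0.55])            # second case violates the constraint
target = torch.tensor([0.0, 1.0])
print(risk_constrained_loss(risk_prev, risk_curr, target))
```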

[CV-113] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

链接: https://arxiv.org/abs/2406.10082
作者: Andrew Rouditchenko,Yuan Gong,Samuel Thomas,Leonid Karlinsky,Hilde Kuehne,Rogerio Feris,James Glass
关键词: performance in noise, improve performance, Speech Recognition, AVSR, Whisper speech recognition
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
*备注: Interspeech 2024. Code this https URL

点击查看摘要

Abstract:Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data difference motivates us to adapt Whisper to handle video inputs. Inspired by Flamingo which injects visual features into language models, we propose Whisper-Flamingo which integrates visual features into the Whisper speech recognition and translation model with gated cross attention. Our audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy conditions. Moreover, Whisper-Flamingo is a versatile model and conducts all of these tasks using one set of parameters, while prior methods are trained separately on each language.
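The gated cross-attention mechanism referenced here follows the Flamingo idea of letting text/speech tokens attend to visual features through a tanh gate initialised at zero, so the pretrained pathway is untouched at the start of training. The sketch below shows that generic block; dimensions are illustrative and this is not the released Whisper-Flamingo code.

```python
# Minimal Flamingo-style gated cross-attention block: decoder states attend to
# visual features, and a zero-initialised tanh gate blends them in gradually.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # starts as an identity mapping

    def forward(self, hidden, visual):
        attended, _ = self.attn(query=hidden, key=visual, value=visual)
        return hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttention()
speech_tokens = torch.randn(2, 100, 512)   # e.g. speech-decoder states
visual_tokens = torch.randn(2, 50, 512)    # e.g. lip-video features
print(block(speech_tokens, visual_tokens).shape)   # torch.Size([2, 100, 512])
```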

[CV-114] Deep Learning Models to Automate the Scoring of Hand Radiographs for Rheumatoid Arthritis

链接: https://arxiv.org/abs/2406.09980
作者: Zhiyan Bo,Laura C. Coates,Bartlomiej W. Papiez
关键词: der Heijde modification, van der Heijde, Rheumatoid Arthritis, der Heijde, Heijde modification
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 5 figures, accepted by MIUA 2024

点击查看摘要

Abstract:The van der Heijde modification of the Sharp (SvdH) score is a widely used radiographic scoring method to quantify damage in Rheumatoid Arthritis (RA) in clinical trials. However, its complexity, the need to score each individual joint, and the expertise required limit its application in clinical practice, especially in disease progression measurement. In this work, we addressed this limitation by developing a bespoke, automated pipeline that is capable of predicting the SvdH score and RA severity from hand radiographs without the need to localise the joints first. Using hand radiographs from RA and suspected RA patients, we first investigated the performance of state-of-the-art architectures in predicting the total SvdH score for hands and wrists and its corresponding severity class. Secondly, we leveraged publicly available data sets to perform transfer learning with different finetuning schemes and ensemble learning, which resulted in substantial improvement in model performance, bringing it on par with an experienced human reader. The best model for RA scoring achieved a Pearson's correlation coefficient (PCC) of 0.925 and root mean squared error (RMSE) of 18.02, while the best model for RA severity classification achieved an accuracy of 0.358 and PCC of 0.859. Our score prediction model attained almost comparable accuracy with experienced radiologists (PCC = 0.97, RMSE = 18.75). Finally, using Grad-CAM, we showed that our models could focus on the anatomical structures in hands and wrists which clinicians deemed as relevant to RA progression in the majority of cases.

[CV-115] SCKansformer: Fine-Grained Classification of Bone Marrow Cells via Kansformer Backbone and Hierarchical Attention Mechanisms

链接: https://arxiv.org/abs/2406.09931
作者: Yifei Chen,Zhu Zhu,Shenghao Zhu,Linwei Qiu,Binfeng Zou,Fan Jia,Yunpeng Zhu,Chenyan Zhang,Zhaojie Fang,Feiwei Qin,Jin Fan,Changmiao Wang,Yu Gao,Gang Yu
关键词: bone marrow blood, Global-Local Attention Encoder, malignant tumors, diagnose malignant tumors, bone marrow
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redundant feature extraction when processing high-dimensional microimage data. We propose a novel fine-grained classification model, SCKansformer, for bone marrow blood cells, which addresses these challenges and enhances classification accuracy and efficiency. The model integrates the Kansformer Encoder, SCConv Encoder, and Global-Local Attention Encoder. The Kansformer Encoder replaces the traditional MLP layer with the KAN, improving nonlinear feature representation and interpretability. The SCConv Encoder, with its Spatial and Channel Reconstruction Units, enhances feature representation and reduces redundancy. The Global-Local Attention Encoder combines Multi-head Self-Attention with a Local Part module to capture both global and local features. We validated our model using the Bone Marrow Blood Cell Fine-Grained Classification Dataset (BMCD-FGCD), comprising over 10,000 samples and nearly 40 classifications, developed with a partner hospital. Comparative experiments on our private dataset, as well as the publicly available PBC and ALL-IDB datasets, demonstrate that SCKansformer outperforms both typical and advanced microcell classification methods across all datasets. Our source code and private BMCD-FGCD dataset are available at this https URL.

[CV-116] Towards Full Integration of Artificial Intelligence in Colon Capsule Endoscopy's Pathway

链接: https://arxiv.org/abs/2406.09761
作者: Esmaeil S. Nadimi,Jan-Matthias Braun,Benedicte Schelde-Olesen,Emile Prudhomme,Victoria Blanes-Vidal,Gunnar Baatrup
关键词: colon capsule endoscopy, counterpart optical colonoscopy, deploying colon capsule, current state, CCE pathway
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite a recent surge of interest in deploying colon capsule endoscopy (CCE) for the early diagnosis of colorectal diseases, there remains a large gap between the current state of CCE in clinical practice and the state of its counterpart, optical colonoscopy (OC). Our study aims to close this gap by focusing on the full integration of AI in CCE's pathway, where image processing steps linked to the detection, localization and characterisation of important findings are carried out autonomously using various AI algorithms. We developed a recognition network that, with an impressive sensitivity of 99.9%, a specificity of 99.4%, and a negative predictive value (NPV) of 99.8%, detected colorectal polyps. After recognising a polyp within a sequence of images, only those images containing polyps were fed into two parallel independent networks for characterisation and size estimation of those important findings. The characterisation network reached a sensitivity of 82% and a specificity of 80% in classifying polyps into two groups, namely neoplastic vs. non-neoplastic. The size estimation network reached an accuracy of 88% in correctly segmenting the polyps. By automatically incorporating this crucial information into CCE's pathway, we moved a step closer towards the full integration of AI into CCE's routine clinical practice.

[CV-117] MoME: Mixture of Multimodal Experts for Cancer Survival Prediction

链接: https://arxiv.org/abs/2406.09696
作者: Conghao Xiong,Hao Chen,Hao Zheng,Dong Wei,Yefeng Zheng,Joseph J. Y. Sung,Irwin King
关键词: Slide Images, integrating Whole Slide, requires integrating, comprehensive decision-making, data for comprehensive
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 + 1/2 pages, early accepted to MICCAI2024

点击查看摘要

Abstract:Survival analysis, as a challenging task, requires integrating Whole Slide Images (WSIs) and genomic data for comprehensive decision-making. There are two main challenges in this task: significant heterogeneity and complex inter- and intra-modal interactions between the two modalities. Previous approaches utilize co-attention methods, which fuse features from both modalities only once after separate encoding. However, these approaches are insufficient for modeling the complex task due to the heterogeneous nature between the modalities. To address these issues, we propose a Biased Progressive Encoding (BPE) paradigm, performing encoding and fusion simultaneously. This paradigm uses one modality as a reference when encoding the other. It enables deep fusion of the modalities through multiple alternating iterations, progressively reducing the cross-modal disparities and facilitating complementary interactions. Besides modality heterogeneity, survival analysis involves various biomarkers from WSIs, genomics, and their combinations. The critical biomarkers may exist in different modalities under individual variations, necessitating flexible adaptation of the models to specific scenarios. Therefore, we further propose a Mixture of Multimodal Experts (MoME) layer to dynamically select tailored experts in each stage of the BPE paradigm. Experts incorporate reference information from another modality to varying degrees, enabling a balanced or biased focus on different modalities during the encoding process. Extensive experimental results demonstrate the superior performance of our method on various datasets, including TCGA-BLCA, TCGA-UCEC and TCGA-LUAD. Codes are available at this https URL.

机器学习

[LG-0] Quantifying Variance in Evaluation Benchmarks

链接: https://arxiv.org/abs/2406.10229
作者: Lovish Madaan,Aaditya K. Singh,Rylan Schaeffer,Andrew Poulton,Sanmi Koyejo,Pontus Stenetorp,Sharan Narang,Dieuwke Hupkes
关键词: Evaluation benchmarks, driving progress, Evaluation, variance, benchmarks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models, both openly available and pretrained from scratch, we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale (~7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.
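As a toy illustration of the kind of statistics involved, the snippet below computes the spread of a benchmark score across random seeds and a simple monotonicity measure across training checkpoints. All numbers are synthetic placeholders, not results from the paper.

```python
# Toy variance metrics for a benchmark: seed spread and training monotonicity.
import numpy as np
from scipy.stats import spearmanr

acc_by_seed = np.array([0.612, 0.641, 0.598, 0.655, 0.627])   # same model, 5 seeds
print("mean:", acc_by_seed.mean(), "seed std:", acc_by_seed.std(ddof=1))

# Monotonicity during training: does the benchmark improve with more steps?
steps = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
acc_by_step = np.array([0.31, 0.35, 0.34, 0.41, 0.44])
rho, _ = spearmanr(steps, acc_by_step)
print("monotonicity (Spearman rho):", rho)
```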

[LG-1] Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

链接: https://arxiv.org/abs/2406.10223
作者: Nameer Hirschkind,Xiao Yu,Mahesh Kumar Nandwana,Joseph Liu,Eloi DuBois,Dao Le,Nicolas Thiebaut,Colin Sinclair,Kyle Spence,Charles Shang,Zoe Abrams,Morgan McGuire
关键词: translation system capable, multiple source languages, input speaker voice, speaker voice zero-shot, languages into English
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Published in Interspeech 2024

点击查看摘要

Abstract:We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real time.

[LG-2] Semantic Membership Inference Attack against Large Language Models

链接: https://arxiv.org/abs/2406.10218
作者: Hamid Mozaffari,Virendra J. Marathe
关键词: Membership Inference Attacks, Semantic Membership Inference, specific data point, Membership Inference, Inference Attacks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Membership Inference Attacks (MIAs) determine whether a specific data point was included in the training set of a target model. In this paper, we introduce the Semantic Membership Inference Attack (SMIA), a novel approach that enhances MIA performance by leveraging the semantic content of inputs and their perturbations. SMIA trains a neural network to analyze the target model’s behavior on perturbed inputs, effectively capturing variations in output probability distributions between members and non-members. We conduct comprehensive evaluations on the Pythia and GPT-Neo model families using the Wikipedia dataset. Our results show that SMIA significantly outperforms existing MIAs; for instance, SMIA achieves an AUC-ROC of 67.39% on Pythia-12B, compared to 58.90% by the second-best attack.
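To make concrete what an MIA measures, the sketch below runs the classic loss-threshold baseline on a toy classifier: members tend to have lower loss under the target model, and the AUC of that score quantifies attack strength. This is explicitly not SMIA, which additionally trains a neural network over the target model's behaviour on semantically perturbed inputs; the data and model here are synthetic stand-ins.

```python
# Classic loss-threshold membership inference baseline (NOT the SMIA method).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_mem, y_mem = X[:1000], y[:1000]          # members: used for training
X_non, y_non = X[1000:], y[1000:]          # non-members: held out

target_model = LogisticRegression(max_iter=1000).fit(X_mem, y_mem)

def per_sample_loss(model, X, y):
    p = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, 1.0))

# Lower loss -> more likely to be a member, so score = -loss.
scores = np.concatenate([-per_sample_loss(target_model, X_mem, y_mem),
                         -per_sample_loss(target_model, X_non, y_non)])
membership = np.concatenate([np.ones(1000), np.zeros(1000)])
print("attack AUC-ROC:", roc_auc_score(membership, scores))
```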

[LG-3] DevBench: A multimodal developmental benchmark for language learning

链接: https://arxiv.org/abs/2406.10215
作者: Alvin Wei Ming Tan,Sunny Yu,Bria Long,Wanjing Anya Ma,Tonya Murray,Rebecca D. Silverman,Jason D. Yeatman,Michael C. Frank
关键词: models, response patterns, data, trajectories of vision-language, language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:How (dis)similar are the learning trajectories of vision-language models and children? Recent modeling work has attempted to understand the gap between models’ and humans’ data efficiency by constructing models trained on less data, especially multimodal naturalistic data. However, such models are often evaluated on adult-level benchmarks, with limited breadth in language abilities tested, and without direct comparison to behavioral data. We introduce DevBench, a multimodal benchmark comprising seven language evaluation tasks spanning the domains of lexical, syntactic, and semantic ability, with behavioral data from both children and adults. We evaluate a set of vision-language models on these tasks, comparing models and humans not only on accuracy but on their response patterns. Across tasks, models exhibit variation in their closeness to human response patterns, and models that perform better on a task also more closely resemble human behavioral responses. We also examine the developmental trajectory of OpenCLIP over training, finding that greater training results in closer approximations to adult response patterns. DevBench thus provides a benchmark for comparing models to human language development. These comparisons highlight ways in which model and human language learning processes diverge, providing insight into entry points for improving language models.

[LG-4] Universal randomised signatures for generative time series modelling

链接: https://arxiv.org/abs/2406.10214
作者: Francesca Biagini,Lukas Gonon,Niklas Walter
关键词: easily implementable alternative, well-established path signature, flexible and easily, easily implementable, implementable alternative
类目: Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Machine Learning (stat.ML)
*备注: 33 pages

点击查看摘要

Abstract:Randomised signature has been proposed as a flexible and easily implementable alternative to the well-established path signature. In this article, we employ randomised signature to introduce a generative model for financial time series data in the spirit of reservoir computing. Specifically, we propose a novel Wasserstein-type distance based on discrete-time randomised signatures. This metric on the space of probability measures captures the distance between (conditional) distributions. Its use is justified by our novel universal approximation results for randomised signatures on the space of continuous functions taking the underlying path as an input. We then use our metric as the loss function in a non-adversarial generator model for synthetic time series data based on a reservoir neural stochastic differential equation. We compare the results of our model to benchmarks from the existing literature.

[LG-5] Selecting Interpretability Techniques for Healthcare Machine Learning models

链接: https://arxiv.org/abs/2406.10213
作者: Daniel Sierra-Botero,Ana Molina-Taborda,Mario S. Valdés-Tresanco,Alejandro Hernández-Arango,Leonardo Espinosa-Leal,Alexander Karpenko,Olga Lopez-Acevedo
关键词: assist healthcare professionals, employing interpretable algorithms, decision scenarios, pursuit for employing, assist healthcare
类目: Machine Learning (cs.LG)
*备注: 26 pages, 5 figures

点击查看摘要

Abstract:In healthcare there is a pursuit for employing interpretable algorithms to assist healthcare professionals in several decision scenarios. Following the Predictive, Descriptive and Relevant (PDR) framework, we adopt the definition of interpretable machine learning as a machine-learning model that explicitly, and in a simple frame, determines relationships that are either contained in data or learned by the model and that are relevant for its functioning, together with the categorization of models as post-hoc (acquiring interpretability after training) or model-based (interpretability intrinsically embedded in the algorithm design). We overview a selection of eight algorithms, both post-hoc and model-based, that can be used for such purposes.
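A small illustration of the post-hoc versus model-based distinction follows: permutation importance computed after training a random forest (post-hoc) versus coefficients read directly off a logistic regression (model-based). The data are synthetic and the two techniques are generic examples, not necessarily the eight algorithms surveyed in the paper.

```python
# Post-hoc vs. model-based interpretability on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)

# Post-hoc: the explanation is computed after the model is trained.
rf = RandomForestClassifier(random_state=0).fit(X, y)
post_hoc = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("post-hoc importances:", post_hoc.importances_mean.round(3))

# Model-based: interpretability is built into the model family itself.
lr = LogisticRegression(max_iter=1000).fit(X, y)
print("model-based coefficients:", lr.coef_.round(3))
```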

[LG-6] Crafting Parts for Expressive Object Composition

链接: https://arxiv.org/abs/2406.10197
作者: Harsh Rangwani,Aishwarya Agarwal,Kuldeep Kulkarni,R. Venkatesh Babu,Srikrishna Karanam
关键词: extensive knowledge bases, Stable Diffusion, large generative models, large generative, tasks due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project Page Will Be Here: this https URL

点击查看摘要

Abstract:Text-to-image generation from large generative models like Stable Diffusion, DALLE-2, etc., have become a common base for various tasks due to their superior quality and extensive knowledge bases. As image composition and generation are creative processes the artists need control over various parts of the images being generated. We find that just adding details about parts in the base text prompt either leads to an entirely different image (e.g., missing/incorrect identity) or the extra part details simply being ignored. To mitigate these issues, we introduce PartCraft, which enables image generation based on fine-grained part-level details specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartCraft first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right object region. After obtaining part masks, we run a localized diffusion process in each of the part regions based on fine-grained part descriptions and combine them to produce the final image. All the stages of PartCraft are based on repurposing a pre-trained diffusion model, which enables it to generalize across various domains without training. We demonstrate the effectiveness of part-level control provided by PartCraft qualitatively through visual examples and quantitatively in comparison to the contemporary baselines.

[LG-7] Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication

链接: https://arxiv.org/abs/2406.10166
作者: Sanjali Yadav,Bahar Asgari
关键词: including scientific computing, Sparse matrix-matrix multiplication, graph analytics, matrix-matrix multiplication, numerous fields
类目: Machine Learning (cs.LG)
*备注: Accepted to ISCA 2024 MLArchSys workshop this https URL

点击查看摘要

Abstract:Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning. These applications exploit the sparsity of matrices to reduce storage and computational demands. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Traditional hardware accelerators are tailored for specific sparsity patterns with fixed dataflow schemes (inner, outer, and row-wise) but often perform suboptimally when the actual sparsity deviates from these predetermined patterns. As the use of SpGEMM expands across various domains, each with distinct sparsity characteristics, the demand for hardware accelerators that can efficiently handle a range of sparsity patterns is increasing. This paper presents a machine learning based approach for adaptively selecting the most appropriate dataflow scheme for SpGEMM tasks with diverse sparsity patterns. By employing decision trees and deep reinforcement learning, we explore the potential of these techniques to surpass heuristic-based methods in identifying optimal dataflow schemes. We evaluate our models by comparing their performance with that of a heuristic, highlighting the strengths and weaknesses of each approach. Our findings suggest that using machine learning for dynamic dataflow selection in hardware accelerators can provide gains of up to 28 times.
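The decision-tree side of the idea can be sketched in a few lines: featurise each SpGEMM instance (density, dimensions, row-length irregularity) and predict which dataflow is fastest. The features, the labelling rule and the data below are synthetic stand-ins for profiled runs, not the paper's training setup.

```python
# Sketch: predict the best SpGEMM dataflow (inner / outer / row-wise) from
# simple sparsity features with a decision tree. Data are synthetic stand-ins.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
density = rng.uniform(1e-4, 1e-1, n)
rows = rng.integers(1_000, 100_000, n)
nnz_row_var = rng.uniform(0.0, 5.0, n)           # irregularity of row lengths
X = np.column_stack([density, rows, nnz_row_var])

# Toy labelling rule standing in for "measured-fastest dataflow".
y = np.where(density > 0.02, 0,                   # 0: inner product
     np.where(nnz_row_var > 2.5, 1, 2))           # 1: outer, 2: row-wise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```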

[LG-8] On the Computability of Robust PAC Learning

链接: https://arxiv.org/abs/2406.10161
作者: Pascale Gourdeau,Tosca Lechner,Ruth Urner
关键词: robust, robust CPAC, CPAC, initiate the study, robust CPAC learnability
类目: Machine Learning (cs.LG)
*备注: To appear in Conference on Learning Theory (COLT) 2024

点击查看摘要

Abstract:We initiate the study of computability requirements for adversarially robust learning. Adversarially robust PAC-type learnability is by now an established field of research. However, the effects of computability requirements in PAC-type frameworks are only just starting to emerge. We introduce the problem of robust computable PAC (robust CPAC) learning and provide some simple sufficient conditions for this. We then show that learnability in this setup is not implied by the combination of its components: classes that are both CPAC and robustly PAC learnable are not necessarily robustly CPAC learnable. Furthermore, we show that the novel framework exhibits some surprising effects: for robust CPAC learnability it is not required that the robust loss is computably evaluable! Towards understanding characterizing properties, we introduce a novel dimension, the computable robust shattering dimension. We prove that its finiteness is necessary, but not sufficient for robust CPAC learnability. This might yield novel insights for the corresponding phenomenon in the context of robust PAC learnability, where insufficiency of the robust shattering dimension for learnability has been conjectured, but so far a resolution has remained elusive.

[LG-9] Automated Design of Linear Bounding Functions for Sigmoidal Nonlinearities in Neural Networks

链接: https://arxiv.org/abs/2406.10154
作者: Matthias König,Xiyue Zhang,Holger H. Hoos,Marta Kwiatkowska,Jan N. van Rijn
关键词: small input perturbations, deep learning algorithms, adversarial attacks, ubiquity of deep, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:The ubiquity of deep learning algorithms in various applications has amplified the need for assuring their robustness against small input perturbations such as those occurring in adversarial attacks. Existing complete verification techniques offer provable guarantees for all robustness queries but struggle to scale beyond small neural networks. To overcome this computational intractability, incomplete verification methods often rely on convex relaxation to over-approximate the nonlinearities in neural networks. Progress in tighter approximations has been achieved for piecewise linear functions. However, robustness verification of neural networks for general activation functions (e.g., Sigmoid, Tanh) remains under-explored and poses new challenges. Typically, these networks are verified using convex relaxation techniques, which involve computing linear upper and lower bounds of the nonlinear activation functions. In this work, we propose a novel parameter search method to improve the quality of these linear approximations. Specifically, we show that using a simple search method, carefully adapted to the given verification problem through state-of-the-art algorithm configuration techniques, improves the average global lower bound by 25% on average over the current state of the art on several commonly used local robustness verification benchmarks.
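To ground what "linear upper and lower bounds of a sigmoidal activation" means, the snippet below constructs sound bounds on an interval lying entirely in the concave region of the sigmoid (l ≥ 0): there the chord is a valid lower bound and any tangent is a valid upper bound. The paper's contribution is a configured search for tighter bounds on general intervals, which this minimal construction does not reproduce.

```python
# Linear bounds for sigmoid on [l, u] with l >= 0 (concave region): the chord
# is a sound lower bound and the midpoint tangent a sound upper bound.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

l, u = 0.5, 3.0
xs = np.linspace(l, u, 1001)

# Lower bound: chord through (l, sigmoid(l)) and (u, sigmoid(u)).
lo_slope = (sigmoid(u) - sigmoid(l)) / (u - l)
lower = sigmoid(l) + lo_slope * (xs - l)

# Upper bound: tangent at the midpoint m (slope = sigmoid'(m)).
m = 0.5 * (l + u)
up_slope = sigmoid(m) * (1.0 - sigmoid(m))
upper = sigmoid(m) + up_slope * (xs - m)

assert np.all(lower <= sigmoid(xs) + 1e-12)
assert np.all(upper >= sigmoid(xs) - 1e-12)
print("max gap between bounds:", float(np.max(upper - lower)))
```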

[LG-10] Compressed Sensor Caching and Collaborative Sparse Data Recovery with Anchor Alignment

链接: https://arxiv.org/abs/2406.10137
作者: Yi-Jen Yang,Ming-Hsun Yang,Jwo-Yuh Wu,Y.-W. Peter Hong
关键词: compressed sensor caching, devises efficient distributed, distributed sparse data, sparse data recovery, sensor caching problem
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: v1 was submitted to IEEE Transactions on Signal Processing on Sept. 18, 2023

点击查看摘要

Abstract:This work examines the compressed sensor caching problem in wireless sensor networks and devises efficient distributed sparse data recovery algorithms to enable collaboration among multiple caches. In this problem, each cache is only allowed to access measurements from a small subset of sensors within its vicinity to reduce both cache size and data acquisition overhead. To enable reliable data recovery with limited access to measurements, we propose a distributed sparse data recovery method, called the collaborative sparse recovery by anchor alignment (CoSR-AA) algorithm, where collaboration among caches is enabled by aligning their locally recovered data at a few anchor nodes. The proposed algorithm is based on the consensus alternating direction method of multipliers (ADMM) algorithm but with message exchange that is reduced by considering the proposed anchor alignment strategy. Then, by the deep unfolding of the ADMM iterations, we further propose the Deep CoSR-AA algorithm that can be used to significantly reduce the number of iterations. We obtain a graph neural network architecture where message exchange is done more efficiently by an embedded autoencoder. Simulations are provided to demonstrate the effectiveness of the proposed collaborative recovery algorithms in terms of the improved reconstruction quality and the reduced communication overhead due to anchor alignment.

[LG-11] Linear Contextual Bandits with Hybrid Payoff: Revisited

链接: https://arxiv.org/abs/2406.10131
作者: Nirjhar Das,Gaurav Sinha
关键词: Linear Contextual Bandit, Contextual Bandit problem, Linear Contextual, Contextual Bandit, texttt
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ECML PKDD 2024 as a Research Track Paper

点击查看摘要

Abstract:We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting every arm's reward model contains arm-specific parameters in addition to parameters shared across the reward models of all the arms. We can reduce this setting to two closely related settings: (a) Shared (no arm-specific parameters) and (b) Disjoint (only arm-specific parameters), enabling the application of two popular state-of-the-art algorithms, LinUCB and DisLinUCB (Algorithm 1 in Li et al. 2010). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on the known regret guarantees of these algorithms. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm, HyLinUCB, that crucially modifies LinUCB (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that HyLinUCB also incurs only O(√T) regret over T rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of HyLinUCB. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that DisLinUCB incurs the lowest regret; in this case, the regret of HyLinUCB is the second best and extremely competitive with DisLinUCB. In all other situations, including our real-world dataset, HyLinUCB has significantly lower regret than LinUCB, DisLinUCB and other SOTA baselines we considered. We also empirically observe that the regret of HyLinUCB grows much more slowly with the number of arms compared to baselines, making it suitable even for very large action spaces.
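For readers unfamiliar with the baseline, below is a minimal implementation of standard disjoint LinUCB (Li et al., 2010), the building block the abstract refers to, run against a toy linear environment. This is not HyLinUCB; the paper's modified exploration coefficient for the hybrid setting is not reproduced here.

```python
# Standard disjoint LinUCB on a toy linear-reward environment.
import numpy as np

d, n_arms, alpha = 5, 3, 1.0
A = [np.eye(d) for _ in range(n_arms)]      # per-arm design matrices
b = [np.zeros(d) for _ in range(n_arms)]    # per-arm reward vectors
rng = np.random.default_rng(0)
theta_true = rng.normal(size=(n_arms, d))   # unknown reward parameters (toy env)

for t in range(2000):
    x = rng.normal(size=(n_arms, d))        # per-arm context features
    ucb = []
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]
        ucb.append(theta_hat @ x[a] + alpha * np.sqrt(x[a] @ A_inv @ x[a]))
    a = int(np.argmax(ucb))                 # play the arm with the highest UCB
    reward = theta_true[a] @ x[a] + rng.normal(scale=0.1)
    A[a] += np.outer(x[a], x[a])            # rank-one update of the design matrix
    b[a] += reward * x[a]

print("estimated theta for arm 0:", np.linalg.inv(A[0]) @ b[0])
print("true theta for arm 0:     ", theta_true[0])
```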

[LG-12] Trustworthy Artificial Intelligence in the Context of Metrology

链接: https://arxiv.org/abs/2406.10117
作者: Tameem Adel,Sam Bilson,Mark Levene,Andrew Thompson
关键词: National Physical Laboratory, Physical Laboratory, National Physical, trustworthy artificial intelligence, trustworthy machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We review research at the National Physical Laboratory (NPL) in the area of trustworthy artificial intelligence (TAI), and more specifically trustworthy machine learning (TML), in the context of metrology, the science of measurement. We describe three broad themes of TAI: technical, socio-technical and social, which play key roles in ensuring that the developed models are trustworthy and can be relied upon to make responsible decisions. From a metrology perspective we emphasise uncertainty quantification (UQ), and its importance within the framework of TAI to enhance transparency and trust in the outputs of AI systems. We then discuss three research areas within TAI that we are working on at NPL, and examine the certification of AI systems in terms of adherence to the characteristics of TAI.

[LG-13] Shelf-Supervised Multi-Modal Pre-Training for 3D Object Detection

链接: https://arxiv.org/abs/2406.10115
作者: Mehar Khurana,Neehar Peri,Deva Ramanan,James Hays
关键词: massive labeled datasets, massive labeled, self-supervised, data, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best-practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We do note, however, that such data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (e.g. supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and WOD, significantly improving over prior work in limited data settings.

[LG-14] Precipitation Nowcasting Using Physics Informed Discriminator Generative Models

链接: https://arxiv.org/abs/2406.10108
作者: Junzhe Yin,Cristian Meo,Ankush Roy,Zeineh Bou Cher,Yanbo Wang,Ruben Imhoff,Remko Uijlenhoet,Justin Dauwels
关键词: leverages real-time atmospheric, real-time atmospheric conditions, Nowcasting leverages real-time, short periods, Netherlands Meteorological Institute
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Nowcasting leverages real-time atmospheric conditions to forecast weather over short periods. State-of-the-art models, including PySTEPS, encounter difficulties in accurately forecasting extreme weather events because of their unpredictable distribution patterns. In this study, we design a physics-informed neural network to perform precipitation nowcasting using the precipitation and meteorological data from the Royal Netherlands Meteorological Institute (KNMI). This model draws inspiration from the novel Physics-Informed Discriminator GAN (PID-GAN) formulation, directly integrating physics-based supervision within the adversarial learning framework. The proposed model adopts a GAN structure, featuring a Vector Quantization Generative Adversarial Network (VQ-GAN) and a Transformer as the generator, with a temporal discriminator serving as the discriminator. Our findings demonstrate that the PID-GAN model outperforms numerical and SOTA deep generative models in terms of precipitation nowcasting downstream metrics.

[LG-15] ECGMamba: Towards Efficient ECG Classification with BiSSM

链接: https://arxiv.org/abs/2406.10098
作者: Yupeng Qiang,Xunde Dong,Xiuling Liu,Yang Yang,Yihai Fang,Jianhong Dou
关键词: represents a pivotal, ECG, signal analysis represents, Electrocardiogram, ECG classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures. arXiv admin note: text overlap with arXiv:2404.17858 by other authors

点击查看摘要

Abstract:Electrocardiogram (ECG) signal analysis represents a pivotal technique in the diagnosis of cardiovascular diseases. Although transformer-based models have made significant progress in ECG classification, they exhibit inefficiencies in the inference phase. The issue is primarily attributable to the quadratic computational complexity of the Transformer's self-attention mechanism, particularly when processing lengthy sequences. To address this issue, we propose a novel model, ECGMamba, which employs a bidirectional state-space model (BiSSM) to enhance classification efficiency. ECGMamba is based on an innovative Mamba-based block, which incorporates a range of time series modeling techniques to enhance performance while maintaining the efficiency of inference. The experimental results on two publicly available ECG datasets demonstrate that ECGMamba effectively balances the effectiveness and efficiency of classification, achieving competitive performance. This study not only contributes to the body of knowledge in the field of ECG classification but also provides a new research path for efficient and accurate ECG signal analysis, offering guidance for the development of diagnostic models for cardiovascular diseases.

[LG-16] BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation

链接: https://arxiv.org/abs/2406.10093
作者: Dongjie Yu,Hang Xu,Yizhou Chen,Yi Ren,Jia Pan
关键词: typically involve multiple, involve multiple stages, require efficient interactions, tasks typically involve, manipulation tasks typically
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bimanual manipulation tasks typically involve multiple stages which require efficient interactions between two arms, posing step-wise and stage-wise challenges for imitation learning systems. Specifically, failure and delay of one step will broadcast through time, hinder success and efficiency of each sub-stage task, and thereby overall task performance. Although recent works have made strides in addressing certain challenges, few approaches explicitly consider the multi-stage nature of bimanual tasks while simultaneously emphasizing the importance of inference speed. In this paper, we introduce a novel keypose-conditioned consistency policy tailored for bimanual manipulation. It is a hierarchical imitation learning framework that consists of a high-level keypose predictor and a low-level trajectory generator. The predicted keyposes provide guidance for trajectory generation and also mark the completion of one sub-stage task. The trajectory generator is designed as a consistency model trained from scratch without distillation, which generates action sequences conditioning on current observations and predicted keyposes with fast inference speed. Simulated and real-world experimental results demonstrate that the proposed approach surpasses baseline methods in terms of success rate and operational efficiency.

[LG-17] Over-parameterization and Adversarial Robustness in Neural Networks: An Overview and Empirical Analysis

链接: https://arxiv.org/abs/2406.10090
作者: Zhang Chen,Luca Demetrio,Srishti Gupta,Xiaoyi Feng,Zhaoqiang Xia,Antonio Emanuele Cinà,Maura Pintor,Luca Oneto,Ambra Demontis,Battista Biggio,Fabio Roli
关键词: exhibit superior predictive, superior predictive capabilities, networks exhibit superior, neural networks exhibit, extensive capacity
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Thanks to their extensive capacity, over-parameterized neural networks exhibit superior predictive capabilities and generalization. However, having a large parameter space is considered one of the main suspects of the neural networks' vulnerability to adversarial examples: input samples crafted ad hoc to induce a desired misclassification. Relevant literature has claimed contradictory remarks in support of and against the robustness of over-parameterized networks. These contradictory findings might be due to the failure of the attack employed to evaluate the networks' robustness. Previous research has demonstrated that depending on the considered model, the algorithm employed to generate adversarial examples may not function properly, leading to overestimating the model's robustness. In this work, we empirically study the robustness of over-parameterized networks against adversarial examples. However, unlike the previous works, we also evaluate the considered attack's reliability to support the results' veracity. Our results show that over-parameterized networks are robust against adversarial attacks as opposed to their under-parameterized counterparts.
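To make "adversarial example" concrete, the sketch below runs the simplest such attack, FGSM, on a toy classifier. The paper's point about attack reliability concerns stronger, properly tuned attacks (e.g., multi-step methods with restarts), which this sketch does not implement; the model and inputs are placeholders.

```python
# Minimal FGSM adversarial example on a toy classifier (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)
y = torch.tensor([7])
eps = 0.1

x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y)
loss.backward()
# One signed-gradient step, then clip back to the valid pixel range.
x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```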

[LG-18] Biomarker based Cancer Classification using an Ensemble with Pre-trained Models

链接: https://arxiv.org/abs/2406.10087
作者: Chongmin Lee,Jihie Kim
关键词: identify cancer efficiently, early stage, sparking the importance, difficult to detect, importance of discovering
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted to the AIAA Workshop at IJCAI 2024

点击查看摘要

Abstract:Certain cancer types, notably pancreatic cancer, are difficult to detect at an early stage, which underscores the importance of discovering the causal relationship between biomarkers and cancer in order to identify cancer efficiently. By allowing for the detection and monitoring of specific biomarkers through a non-invasive method, liquid biopsies enhance the precision and efficacy of medical interventions, advocating the move towards personalized healthcare. Several machine learning algorithms, such as Random Forest and SVM, are utilized for classification, yet they can be inefficient due to the need for hyperparameter tuning. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 and simultaneously achieving robustness, especially on highly imbalanced datasets, compared to other ML algorithms in several binary classification tasks (e.g. breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining the pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while using only 500 PCA features, in contrast to previous studies that used more than 2,000 features for similar results.
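For illustration, the snippet below builds a soft-voting ensemble of XGBoost and LightGBM on PCA-reduced features, with a logistic regression standing in for the pre-trained Hyperfast model. The data, the number of PCA components and the stand-in estimator are assumptions; this is not the authors' exact pipeline or datasets.

```python
# Soft-voting ensemble sketch on PCA features: XGBoost + LightGBM + a
# logistic-regression stand-in for the pre-trained Hyperfast model.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=200, n_informative=30,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = make_pipeline(
    PCA(n_components=50),
    VotingClassifier(
        estimators=[("xgb", XGBClassifier(eval_metric="mlogloss")),
                    ("lgbm", LGBMClassifier()),
                    ("stand_in", LogisticRegression(max_iter=1000))],
        voting="soft"),
)
ensemble.fit(X_tr, y_tr)
print("test accuracy:", ensemble.score(X_te, y_te))
```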

[LG-19] Discovering influential text using convolutional neural networks

链接: https://arxiv.org/abs/2406.10086
作者: Megan Ayers,Luke Sanford,Margaret Roberts,Eddie Yang
关键词: social sciences, estimating the impacts, text, text treatments, Experimental
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: To be published in ACL 2024 Findings

点击查看摘要

Abstract:Experimental methods for estimating the impacts of text on human evaluation have been widely used in the social sciences. However, researchers in experimental settings are usually limited to testing a small number of pre-specified text treatments. While efforts to mine unstructured texts for features that causally affect outcomes have been ongoing in recent years, these models have primarily focused on the topics or specific words of text, which may not always be the mechanism of the effect. We connect these efforts with NLP interpretability techniques and present a method for flexibly discovering clusters of similar text phrases that are predictive of human reactions to texts using convolutional neural networks. When used in an experimental setting, this method can identify text treatments and their effects under certain assumptions. We apply the method to two datasets. The first enables direct validation of the model’s ability to detect phrases known to cause the outcome. The second demonstrates its ability to flexibly discover text treatments with varying textual structures. In both cases, the model learns a greater variety of text treatments compared to benchmark methods, and these text features quantitatively meet or exceed the ability of benchmark methods to predict the outcome.

[LG-20] D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video

链接: https://arxiv.org/abs/2406.10078
作者: Moritz Kappel,Florian Hahlbohm,Timon Scholz,Susana Castillo,Christian Theobalt,Martin Eisemann,Vladislav Golyanik,Marcus Magnor
关键词: gained increased attention, recently gained increased, spatiotemporal novel-view synthesis, deforming scenes recently, scenes recently gained
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures, 10 tables. Project page: this https URL

点击查看摘要

Abstract:Dynamic reconstruction and spatiotemporal novel-view synthesis of non-rigidly deforming scenes recently gained increased attention. While existing work achieves impressive quality and performance on multi-view or teleporting camera setups, most methods fail to efficiently and faithfully recover motion and appearance from casual monocular captures. This paper contributes to the field by introducing a new method for dynamic novel view synthesis from monocular video, such as casual smartphone captures. Our approach represents the scene as a dynamic neural point cloud, an implicit time-conditioned point distribution that encodes local geometry and appearance in separate hash-encoded neural feature grids for static and dynamic regions. By sampling a discrete point cloud from our model, we can efficiently render high-quality novel views using a fast differentiable rasterizer and neural rendering network. Similar to recent work, we leverage advances in neural scene analysis by incorporating data-driven priors like monocular depth estimation and object segmentation to resolve motion and depth ambiguities originating from the monocular captures. In addition to guiding the optimization process, we show that these priors can be exploited to explicitly initialize our scene representation to drastically improve optimization speed and final image quality. As evidenced by our experimental evaluation, our dynamic point cloud model not only enables fast optimization and real-time frame rates for interactive applications, but also achieves competitive image quality on monocular benchmark sequences. Our project page is available at this https URL.

[LG-21] TACCO: Task-guided Co-clustering of Clinical Concepts and Patient Visits for Disease Subtyping based on EHR Data

链接: https://arxiv.org/abs/2406.10061
作者: Ziyang Zhang,Hejie Cui,Ran Xu,Yuzhang Xie,Joyce C. Ho,Carl Yang
关键词: Electronic Health Records, well-organized Electronic Health, Health Records, Electronic Health, well-organized Electronic
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures, to be published in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

点击查看摘要

Abstract:The growing availability of well-organized Electronic Health Records (EHR) data has enabled the development of various machine learning models towards disease risk prediction. However, existing risk prediction methods overlook the heterogeneity of complex diseases, failing to model the potential disease subtypes regarding their corresponding patient visits and clinical concept subgroups. In this work, we introduce TACCO, a novel framework that jointly discovers clusters of clinical concepts and patient visits based on a hypergraph modeling of EHR data. Specifically, we develop a novel self-supervised co-clustering framework that can be guided by the risk prediction task of specific diseases. Furthermore, we enhance the hypergraph model of EHR data with textual embeddings and enforce the alignment between the clusters of clinical concepts and patient visits through a contrastive objective. Comprehensive experiments conducted on the public MIMIC-III dataset and Emory internal CRADLE dataset over the downstream clinical tasks of phenotype classification and cardiovascular risk prediction demonstrate an average 31.25% performance improvement compared to traditional ML baselines and a 5.26% improvement on top of the vanilla hypergraph model without our co-clustering mechanism. In-depth model analysis, clustering results analysis, and clinical case studies further validate the improved utilities and insightful interpretations delivered by TACCO. Code is available at this https URL.

[LG-22] PRIMER: Perception-Aware Robust Learning-based Multiagent Trajectory Planner

链接: https://arxiv.org/abs/2406.10060
作者: Kota Kondo,Claudius T. Tewari,Andrea Tagliabue,Jesus Tordesillas,Parker C. Lusk,Jonathan P. How
关键词: generate collision-free trajectories, multiagent trajectory planners, decentralized multiagent trajectory, communicate and exchange, exchange their positions
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:In decentralized multiagent trajectory planners, agents need to communicate and exchange their positions to generate collision-free trajectories. However, due to localization errors/uncertainties, trajectory deconfliction can fail even if trajectories are perfectly shared between agents. To address this issue, we first present PARM and PARM*, perception-aware, decentralized, asynchronous multiagent trajectory planners that enable a team of agents to navigate uncertain environments while deconflicting trajectories and avoiding obstacles using perception information. PARM* differs from PARM as it is less conservative, using more computation to find closer-to-optimal solutions. While these methods achieve state-of-the-art performance, they suffer from high computational costs as they need to solve large optimization problems onboard, making it difficult for agents to replan at high rates. To overcome this challenge, we present our second key contribution, PRIMER, a learning-based planner trained with imitation learning (IL) using PARM* as the expert demonstrator. PRIMER leverages the low computational requirements at deployment of neural networks and achieves a computation speed up to 5500 times faster than optimization-based approaches.

[LG-23] Comparison of fine-tuning strategies for transfer learning in medical image classification

链接: https://arxiv.org/abs/2406.10050
作者: Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
关键词: specialized medical contexts, pre-trained models, medical imaging, medical, fine-tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at Image and Vision Computing

点击查看摘要

Abstract:In the context of medical imaging and machine learning, one of the most pressing challenges is the effective adaptation of pre-trained models to specialized medical contexts. Despite the availability of advanced pre-trained models, their direct application to the highly specialized and diverse field of medical imaging often falls short due to the unique characteristics of medical data. This study provides a comprehensive analysis on the performance of various fine-tuning methods applied to pre-trained models across a spectrum of medical imaging domains, including X-ray, MRI, Histology, Dermoscopy, and Endoscopic surgery. We evaluated eight fine-tuning strategies, including standard techniques such as fine-tuning all layers or fine-tuning only the classifier layers, alongside methods such as gradually unfreezing layers, regularization based fine-tuning and adaptive learning rates. We selected three well-established CNN architectures (ResNet-50, DenseNet-121, and VGG-19) to cover a range of learning and feature extraction scenarios. Although our results indicate that the efficacy of these fine-tuning methods significantly varies depending on both the architecture and the medical imaging type, strategies such as combining Linear Probing with Full Fine-tuning resulted in notable improvements in over 50% of the evaluated cases, demonstrating general effectiveness across medical domains. Moreover, Auto-RGN, which dynamically adjusts learning rates, led to performance enhancements of up to 11% for specific modalities. Additionally, the DenseNet architecture showed more pronounced benefits from alternative fine-tuning approaches compared to traditional full fine-tuning. This work not only provides valuable insights for optimizing pre-trained models in medical image analysis but also suggests the potential for future research into more advanced architectures and fine-tuning methods.

[LG-24] Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation

链接: https://arxiv.org/abs/2406.10043
作者: Federico Tavella,Aphrodite Galata,Angelo Cangelosi
关键词: people using cameras, physical presence, sign language, Artificial agents, American Sign Language
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Artificial agents, particularly humanoid robots, interact with their environment, objects, and people using cameras, actuators, and physical presence. Their communication methods are often pre-programmed, limiting their actions and interactions. Our research explores acquiring non-verbal communication skills through learning from demonstrations, with potential applications in sign language comprehension and expression. In particular, we focus on imitation learning for artificial agents, exemplified by teaching a simulated humanoid American Sign Language. We use computer vision and deep learning to extract information from videos, and reinforcement learning to enable the agent to replicate observed actions. Compared to other methods, our approach eliminates the need for additional hardware to acquire information. We demonstrate how the combination of these different techniques offers a viable way to learn sign language. Our methodology successfully teaches 5 different signs involving the upper body (i.e., arms and hands). This research paves the way for advanced communication skills in artificial agents.

[LG-25] Interpretative Deep Learning using Domain Adaptation for Fluorescence Spectroscopy

链接: https://arxiv.org/abs/2406.10031
作者: Umberto Michelucci,Francesca Venturini
关键词: food quality control, sciences and chemistry, environmental monitoring, biomedical diagnostics, life sciences
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Fluorescence spectroscopy is a fundamental tool in life sciences and chemistry, widely used for applications such as environmental monitoring, food quality control, and biomedical diagnostics. However, analysis of spectroscopic data with deep learning, in particular of fluorescence excitation-emission matrices (EEMs), presents significant challenges due mainly to the typically small and sparse datasets available. Furthermore, the analysis of EEMs is difficult due to their high dimensionality and overlapping spectral features. This study proposes a new approach that exploits domain adaptation with pretrained vision models, alongside a novel interpretability algorithm to address these challenges. Thanks to specialised feature engineering of the neural networks described in this work, we are now able to provide deeper and meaningful insights into the physico-chemical processes underlying the data. The proposed approach is demonstrated through the analysis of the oxidation process in extra virgin olive oil (EVOO), showing its effectiveness in predicting quality indicators and identifying relevant spectral bands. This work describes significantly innovative results in the use of deep learning for spectroscopy, transforming it from a black box into a tool for understanding complex biological and chemical processes.

[LG-26] Off-Policy Evaluation from Logged Human Feedback

链接: https://arxiv.org/abs/2406.10030
作者: Aniruddha Bhargava,Lalit Jain,Branislav Kveton,Ge Liu,Subhojyoti Mukherjee
关键词: machine learning, human feedback, central to recent, recent advances, advances in artificial
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learning from human feedback has been central to recent advances in artificial intelligence and machine learning. Since the collection of human feedback is costly, a natural question to ask is whether new feedback always needs to be collected. Or could we evaluate a new model with the human feedback on responses of another model? This motivates us to study off-policy evaluation from logged human feedback. We formalize the problem, propose both model-based and model-free estimators for policy values, and show how to optimize them. We analyze the unbiasedness of our estimators and evaluate them empirically. Our estimators can predict the absolute values of evaluated policies, rank them, and be optimized.
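
To make the "model-free estimator for policy values" idea concrete, here is a minimal sketch using inverse propensity scoring (IPS), a standard off-policy estimator; it is not the paper's exact estimator, and the data layout and names below are illustrative assumptions.

```python
# A minimal IPS sketch for off-policy evaluation from logged preference feedback.
# This is a generic illustration, not the estimators proposed in the paper.
import numpy as np

def ips_policy_value(logged, target_probs):
    """logged: list of (prompt_id, response_id, reward, logging_prob) tuples.
    target_probs: dict mapping (prompt_id, response_id) -> probability under
    the policy being evaluated."""
    estimates = []
    for prompt_id, response_id, reward, logging_prob in logged:
        weight = target_probs[(prompt_id, response_id)] / logging_prob
        estimates.append(weight * reward)
    return float(np.mean(estimates))

# Toy usage: two logged responses with binary human preference as the reward.
logged = [("p0", "a", 1.0, 0.5), ("p0", "b", 0.0, 0.5)]
target = {("p0", "a"): 0.8, ("p0", "b"): 0.2}
print(ips_policy_value(logged, target))  # -> 0.8
```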

[LG-27] ProtoS-ViT: Visual foundation models for sparse self-explainable classifications

链接: https://arxiv.org/abs/2406.10025
作者: Hugues Turbé,Mina Bjelogrlic,Gianmarco Mengaldo,Christian Lovis
关键词: build intrinsically explainable, Prototypical networks aim, intrinsically explainable models, explainable models based, summation of concepts
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prototypical networks aim to build intrinsically explainable models based on the linear summation of concepts. However, important challenges remain in the transparency, compactness, and meaningfulness of the explanations provided by these models. This work demonstrates how frozen pre-trained ViT backbones can be effectively turned into prototypical models for both general and domain-specific tasks, in our case biomedical image classifiers. By leveraging strong spatial features combined with a novel prototypical head, ProtoS-ViT surpasses existing prototypical models showing strong performance in terms of accuracy, compactness, and explainability. Model explainability is evaluated through an extensive set of quantitative and qualitative metrics which serve as a general benchmark for the development of prototypical models. Code is available at this https URL.

[LG-28] Deep Bayesian Active Learning for Preference Modeling in Large Language Models

链接: https://arxiv.org/abs/2406.10023
作者: Luckeciano C. Melo,Panagiotis Tigas,Alessandro Abate,Yarin Gal
关键词: Large Language Models, Leveraging human preferences, Large Language, Leveraging human, Language Models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the further development of LLMs. Bayesian Active Learning provides a principled framework for addressing this challenge and has demonstrated remarkable success in diverse settings. However, previous attempts to employ it for Preference Modeling did not meet such expectations. In this work, we identify that naive epistemic uncertainty estimation leads to the acquisition of redundant samples. We address this by proposing the Bayesian Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition policy that not only targets points of high epistemic uncertainty according to the preference model but also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM. Notably, our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies.

[LG-29] Group and Shuffle: Efficient Structured Orthogonal Parametrization

链接: https://arxiv.org/abs/2406.10019
作者: Mikhail Gorbunov,Nikolay Yudin,Vera Soboleva,Aibek Alanov,Alexey Naumov,Maxim Rakhuba
关键词: increasing size, growing demand, orthogonal, efficient fine-tuning, fine-tuning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The increasing size of neural networks has led to a growing demand for methods of efficient fine-tuning. Recently, an orthogonal fine-tuning paradigm was introduced that uses orthogonal matrices for adapting the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal fine-tuning framework, improving parameter and computational efficiency. We empirically validate our method on different domains, including adapting of text-to-image diffusion models and downstream task fine-tuning in language modeling. Additionally, we adapt our construction for orthogonal convolutions and conduct experiments with 1-Lipschitz neural networks.

[LG-30] Gradient-based Learning in State-based Potential Games for Self-Learning Production Systems

链接: https://arxiv.org/abs/2406.10015
作者: Steve Yuwono,Marlon Löppenberg,Dorothea Schwung,Andreas Schwung
关键词: state-based potential games, gradient-based optimization methods, potential games, optimization methods, methods for state-based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:In this paper, we introduce novel gradient-based optimization methods for state-based potential games (SbPGs) within self-learning distributed production systems. SbPGs are recognised for their efficacy in enabling self-optimizing distributed multi-agent systems and offer a proven convergence guarantee, which facilitates collaborative player efforts towards global objectives. Our study strives to replace conventional ad-hoc random exploration-based learning in SbPGs with contemporary gradient-based approaches, which aim for faster convergence and smoother exploration dynamics, thereby shortening training duration while upholding the efficacy of SbPGs. Moreover, we propose three distinct variants for estimating the objective function of gradient-based learning, each developed to suit the unique characteristics of the systems under consideration. To validate our methodology, we apply it to a laboratory testbed, namely Bulk Good Laboratory Plant, which represents a smart and flexible distributed multi-agent production system. The incorporation of gradient-based learning in SbPGs reduces training times and achieves more optimal policies than its baseline.

[LG-31] Beyond Slow Signs in High-fidelity Model Extraction

链接: https://arxiv.org/abs/2406.10011
作者: Hanna Foerster,Robert Mullins,Ilia Shumailov,Jamie Hayes
关键词: Deep neural networks, Deep neural, neural networks, costly to train, compromise their confidentiality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Deep neural networks, costly to train and rich in intellectual property value, are increasingly threatened by model extraction attacks that compromise their confidentiality. Previous attacks have succeeded in reverse-engineering model parameters up to a precision of float64 for models trained on random data with at most three hidden layers using cryptanalytical techniques. However, the process was identified to be very time consuming and not feasible for larger and deeper models trained on standard benchmarks. Our study evaluates the feasibility of parameter extraction methods of Carlini et al. [1] further enhanced by Canales-Martínez et al. [2] for models trained on standard benchmarks. We introduce a unified codebase that integrates previous methods and reveal that computational tools can significantly influence performance. We develop further optimisations to the end-to-end attack and improve the efficiency of extracting weight signs by up to 14.8 times compared to former methods through the identification of easier and harder to extract neurons. Contrary to prior assumptions, we identify extraction of weights, not extraction of weight signs, as the critical bottleneck. With our improvements, a 16,721 parameter model with 2 hidden layers trained on MNIST is extracted within only 98 minutes compared to at least 150 minutes previously. Finally, addressing methodological deficiencies observed in previous studies, we propose new ways of robust benchmarking for future model extraction attacks.

[LG-32] An elementary proof of a universal approximation theorem

链接: https://arxiv.org/abs/2406.10002
作者: Chris Monico
关键词: bounded activation function, universal approximation theorem, short note, layers and increasing, bounded activation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this short note, we give an elementary proof of a universal approximation theorem for neural networks with three hidden layers and increasing, continuous, bounded activation function. The result is weaker than the best known results, but the proof is elementary in the sense that no machinery beyond undergraduate analysis is used.
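
For orientation, a generic universal approximation statement of the kind proved here reads as follows; the note's exact hypotheses and its three-hidden-layer construction may differ, so treat this as a paraphrase rather than the paper's theorem.

```latex
% Generic form (illustrative): for sigma continuous, bounded, increasing and
% f continuous on [0,1]^d, a three-hidden-layer network approximates f uniformly.
\forall \varepsilon > 0 \;\; \exists\, N :\quad
\sup_{x \in [0,1]^d} \bigl| N(x) - f(x) \bigr| < \varepsilon,
\qquad
N(x) = W_4\,\sigma\!\bigl(W_3\,\sigma\!\bigl(W_2\,\sigma(W_1 x + b_1) + b_2\bigr) + b_3\bigr) + b_4 .
```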

[LG-33] Towards Scalable and Versatile Weight Space Learning

链接: https://arxiv.org/abs/2406.09997
作者: Konstantin Schürholt,Michael W. Mahoney,Damian Borth
关键词: well-trained neural network, holds the promise, promise to provide, provide an understanding, neural network
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2024

点击查看摘要

Abstract:Learning representations of well-trained neural network models holds the promise to provide an understanding of the inner workings of those models. However, previous work has either faced limitations when processing larger networks or was task-specific to either discriminative or generative tasks. This paper introduces the SANE approach to weight-space learning. SANE overcomes previous limitations by learning task-agnostic representations of neural networks that are scalable to larger models of varying architectures and that show capabilities beyond a single task. Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights, thus allowing one to embed larger neural networks as a set of tokens into the learned representation space. SANE reveals global model information from layer-wise embeddings, and it can sequentially generate unseen neural network models, which was unattainable with previous hyper-representation learning methods. Extensive empirical evaluation demonstrates that SANE matches or exceeds state-of-the-art performance on several weight representation learning benchmarks, particularly in initialization for new tasks and larger ResNet architectures.

[LG-34] Self-Supervised and Few-Shot Learning for Robust Bioaerosol Monitoring

链接: https://arxiv.org/abs/2406.09984
作者: Adrian Willi,Pascal Baumann,Sophie Erb,Fabian Gröger,Yanick Zeder,Simone Lionetti
关键词: affected by allergies, widespread adoption, improving the quality, quality of life, life for people
类目: Machine Learning (cs.LG)
*备注: Short communication, 8 pages, 2 figures, 1 table

点击查看摘要

Abstract:Real-time bioaerosol monitoring is improving the quality of life for people affected by allergies, but it often relies on deep-learning models which pose challenges for widespread adoption. These models are typically trained in a supervised fashion and require considerable effort to produce large amounts of annotated data, an effort that must be repeated for new particles, geographical regions, or measurement systems. In this work, we show that self-supervised learning and few-shot learning can be combined to classify holographic images of bioaerosol particles using a large collection of unlabelled data and only a few examples for each particle type. We first demonstrate that self-supervision on pictures of unidentified particles from ambient air measurements enhances identification even when labelled data is abundant. Most importantly, it greatly improves few-shot classification when only a handful of labelled images are available. Our findings suggest that real-time bioaerosol monitoring workflows can be substantially optimized, and the effort required to adapt models for different situations considerably reduced.
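
At a high level, combining a frozen self-supervised encoder with few-shot classification often reduces to nearest-prototype matching in embedding space. The sketch below illustrates that pattern only; the encoder, particle classes, and shot counts are placeholders, not the authors' actual pipeline.

```python
# Prototype-based few-shot classification on top of a frozen encoder (illustrative).
import torch

def few_shot_prototypes(encoder, support_images, support_labels):
    with torch.no_grad():
        emb = encoder(support_images)                        # (n_support, dim)
    classes = support_labels.unique()
    protos = torch.stack([emb[support_labels == c].mean(0) for c in classes])
    return protos, classes

def classify(encoder, query_images, protos, classes):
    with torch.no_grad():
        emb = encoder(query_images)                          # (n_query, dim)
    dists = torch.cdist(emb, protos)                         # (n_query, n_classes)
    return classes[dists.argmin(dim=1)]

# Toy usage with a dummy encoder: 5 particle classes, 2 labelled shots each.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 16))
support_x, support_y = torch.randn(10, 1, 32, 32), torch.arange(10) % 5
protos, classes = few_shot_prototypes(encoder, support_x, support_y)
print(classify(encoder, torch.randn(4, 1, 32, 32), protos, classes))
```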

[LG-35] Challenges in explaining deep learning models for data with biological variation

链接: https://arxiv.org/abs/2406.09981
作者: Lenka Tětková,Erik Schou Dreier,Robin Malm,Lars Kai Hansen
关键词: learning research progress, research progress, progress is based, based on developing, machine learning research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Much machine learning research progress is based on developing models and evaluating them on a benchmark dataset (e.g., ImageNet for images). However, applying such benchmark-successful methods to real-world data often does not work as expected. This is particularly the case for biological data where we expect variability at multiple time and spatial scales. In this work, we are using grain data and the goal is to detect diseases and damages. Pink fusarium, skinned grains, and other diseases and damages are key factors in setting the price of grains or excluding dangerous grains from food production. Apart from challenges stemming from differences of the data from the standard toy datasets, we also present challenges that need to be overcome when explaining deep learning models. For example, explainability methods have many hyperparameters that can give different results, and the ones published in the papers do not work on dissimilar images. Other challenges are more general: problems with visualization of the explanations and their comparison since the magnitudes of their values differ from method to method. An open fundamental question also is: How to evaluate explanations? It is a non-trivial task because the “ground truth” is usually missing or ill-defined. Also, human annotators may create what they think is an explanation of the task at hand, yet the machine learning model might solve it in a different and perhaps counter-intuitive way. We discuss several of these challenges and evaluate various post-hoc explainability methods on grain data. We focus on robustness, quality of explanations, and similarity to particular “ground truth” annotations made by experts. The goal is to find the methods that overall perform well and could be used in this challenging task. We hope the proposed pipeline will be used as a framework for evaluating explainability methods in specific use cases.

[LG-36] Robust Model-Based Reinforcement Learning with an Adversarial Auxiliary Model

链接: https://arxiv.org/abs/2406.09976
作者: Siemen Herremans,Ali Anwar,Siegfried Mercelis
关键词: classical arcade games, board games, demonstrated impressive performance, arcade games, Reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Will be presented at the RL Safety Workshop at RLC 2024

点击查看摘要

Abstract:Reinforcement learning has demonstrated impressive performance in various challenging problems such as robotics, board games, and classical arcade games. However, its real-world applications can be hindered by the absence of robustness and safety in the learned policies. More specifically, an RL agent that trains in a certain Markov decision process (MDP) often struggles to perform well in nearly identical MDPs. To address this issue, we employ the framework of Robust MDPs (RMDPs) in a model-based setting and introduce a novel learned transition model. Our method specifically incorporates an auxiliary pessimistic model, updated adversarially, to estimate the worst-case MDP within a Kullback-Leibler uncertainty set. In comparison to several existing works, our work does not impose any additional conditions on the training environment, such as the need for a parametric simulator. To test the effectiveness of the proposed pessimistic model in enhancing policy robustness, we integrate it into a practical RL algorithm, called Robust Model-Based Policy Optimization (RMBPO). Our experimental results indicate a notable improvement in policy robustness on high-dimensional MuJoCo control tasks, with the auxiliary model enhancing the performance of the learned policy in distorted MDPs. We further explore the learned deviation between the proposed auxiliary world model and the nominal model, to examine how pessimism is achieved. By learning a pessimistic world model and demonstrating its role in improving policy robustness, our research contributes towards making (model-based) RL more robust.

[LG-37] Impact of Speech Mode in Automatic Pathological Speech Detection

链接: https://arxiv.org/abs/2406.09968
作者: Shakeel A. Sheikh,Ina Kodrasi
关键词: speech, Automatic pathological speech, pathological speech detection, approaches yield promising, yield promising results
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted in EUSIPCO 2024

点击查看摘要

Abstract:Automatic pathological speech detection approaches yield promising results in identifying various pathologies. These approaches are typically designed and evaluated for phonetically-controlled speech scenarios, where speakers are prompted to articulate identical phonetic content. While gathering controlled speech recordings can be laborious, spontaneous speech can be conveniently acquired as potential patients navigate their daily routines. Further, spontaneous speech can be valuable in detecting subtle and abstract cues of pathological speech. Nonetheless, the efficacy of automatic pathological speech detection for spontaneous speech remains unexplored. This paper analyzes the influence of speech mode on pathological speech detection approaches, examining two distinct categories of approaches, i.e., classical machine learning and deep learning. Results indicate that classical approaches may struggle to capture pathology-discriminant cues in spontaneous speech. In contrast, deep learning approaches demonstrate superior performance, managing to extract additional cues that were previously inaccessible in non-spontaneous speech.

[LG-38] Outlier detection in maritime environments using AIS data and deep recurrent architectures

链接: https://arxiv.org/abs/2406.09966
作者: Constantine Maganaris,Eftychios Protopapadakis,Nikolaos Doulamis
关键词: Automatic Identification System, Identification System, Automatic Identification, Recurrent Neural Network, publicly available Automatic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Presented in PETRA '24 The PErvasive Technologies Related to Assistive Environments Conference June 26–28, 2024 Crete, Greece

点击查看摘要

Abstract:A methodology based on deep recurrent models for maritime surveillance, over publicly available Automatic Identification System (AIS) data, is presented in this paper. The setup employs a deep Recurrent Neural Network (RNN)-based model, for encoding and reconstructing the observed ships’ motion patterns. Our approach is based on a thresholding mechanism, over the calculated errors between observed and reconstructed motion patterns of maritime vessels. Specifically, a deep-learning framework, i.e. an encoder-decoder architecture, is trained using the observed motion patterns, enabling the models to learn and predict the expected trajectory, which will be compared to the effective ones. Our models, particularly the bidirectional GRU with recurrent dropouts, showcased superior performance in capturing the temporal dynamics of maritime data, illustrating the potential of deep learning to enhance maritime surveillance capabilities. Our work lays a solid foundation for future research in this domain, highlighting a path toward improved maritime safety through the innovative application of technology.
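
The core mechanism described above, reconstructing each track with a recurrent autoencoder and thresholding the reconstruction error, can be sketched as follows. The architecture details (layer sizes, bidirectionality, recurrent dropout) are simplified assumptions, not the authors' exact model.

```python
# Minimal GRU autoencoder + reconstruction-error thresholding for AIS tracks.
import torch
import torch.nn as nn

class GRUAutoencoder(nn.Module):
    def __init__(self, n_features=4, hidden=32):   # e.g. lat, lon, speed, course
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                           # x: (batch, seq_len, n_features)
        _, h = self.encoder(x)                      # h: (1, batch, hidden)
        ctx = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        dec, _ = self.decoder(ctx)
        return self.out(dec)                        # reconstructed trajectory

def flag_outliers(model, batch, threshold):
    with torch.no_grad():
        err = ((model(batch) - batch) ** 2).mean(dim=(1, 2))  # per-track MSE
    return err > threshold                          # boolean mask of anomalous tracks
```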

[LG-39] H-Fac: Memory-Efficient Optimization with Factorized Hamiltonian Descent

链接: https://arxiv.org/abs/2406.09958
作者: Son Nguyen,Lizhang Chen,Bo Liu,Qiang Liu
关键词: adaptive optimizer, scaling parameters, incorporates a factorized, factorized approach, approach to momentum
类目: Machine Learning (cs.LG)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:In this study, we introduce a novel adaptive optimizer, H-Fac, which incorporates a factorized approach to momentum and scaling parameters. Our algorithm demonstrates competitive performances on both ResNets and Vision Transformers, while achieving sublinear memory costs through the use of rank-1 parameterizations for moment estimators. We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings. These optimization algorithms are designed to be both straightforward and adaptable, facilitating easy implementation in diverse settings.

[LG-40] Rule Based Learning with Dynamic (Graph) Neural Networks

链接: https://arxiv.org/abs/2406.09954
作者: Florian Seiffarth
关键词: learning process, neural network architectures, graph neural networks, common problem, additional information
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A common problem of classical neural network architectures is that additional information or expert knowledge cannot be naturally integrated into the learning process. To overcome this limitation, we propose a two-step approach consisting of (1) generating rule functions from knowledge and (2) using these rules to define rule based layers – a new type of dynamic neural network layer. The focus of this work is on the second step, i.e., rule based layers that are designed to dynamically arrange learnable parameters in the weight matrices and bias vectors depending on the input samples. Indeed, we prove that our approach generalizes classical feed-forward layers such as fully connected and convolutional layers by choosing appropriate rules. As a concrete application we present rule based graph neural networks (RuleGNNs) that overcome some limitations of ordinary graph neural networks. Our experiments show that the predictive performance of RuleGNNs is comparable to state-of-the-art graph classifiers using simple rules based on Weisfeiler-Leman labeling and pattern counting. Moreover, we introduce new synthetic benchmark graph datasets to show how to integrate expert knowledge into RuleGNNs making them more powerful than ordinary graph neural networks.
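
One way to picture a "rule based layer" is a layer whose rule function inspects each input sample and decides which learnable parameters are applied to it. The dense toy below only conveys that mechanism; the actual RuleGNN construction (e.g. rules from Weisfeiler-Leman labels and pattern counting on graphs) is more involved.

```python
# Toy rule-based layer: a rule function routes each sample to its own parameters.
import torch
import torch.nn as nn

class RuleBasedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, n_rules, rule_fn):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_rules, out_dim, in_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_rules, out_dim))
        self.rule_fn = rule_fn                      # maps a sample to a rule index

    def forward(self, x):                           # x: (batch, in_dim)
        idx = self.rule_fn(x)                       # (batch,) long tensor of rule ids
        w, b = self.weights[idx], self.bias[idx]    # per-sample parameter selection
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1) + b

# Example rule: route samples by the sign of their mean feature value.
layer = RuleBasedLinear(8, 4, n_rules=2, rule_fn=lambda x: (x.mean(1) > 0).long())
print(layer(torch.randn(5, 8)).shape)               # torch.Size([5, 4])
```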

[LG-41] BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

链接: https://arxiv.org/abs/2406.09952
作者: Imanol Miranda,Ander Salaberria,Eneko Agirre,Gorka Azkune
关键词: Existing Vision-Language Compositionality, Bidirectional Vision-Language Compositionality, correct textual description, Vision-Language Compositionality, Existing Vision-Language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained using synthetic images and texts improves the state of the art in SugarCrepe and in BiVLC for both retrieval directions. The gap to human performance in BiVLC confirms that Vision-Language Compositionality is still a challenging problem. BiVLC and code are available at this https URL.

[LG-42] Neural Concept Binder

链接: https://arxiv.org/abs/2406.09949
作者: Wolfgang Stammer,Antonia Wüst,David Steinmann,Kristian Kersting
关键词: Neural Concept Binder, visual reasoning lies, object-based visual reasoning, distinct concept representations, object-based visual
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:The challenge in object-based visual reasoning lies in generating descriptive yet distinct concept representations. Moreover, doing this in an unsupervised fashion requires human users to understand a model’s learned concepts and potentially revise false concepts. In addressing this challenge, we introduce the Neural Concept Binder, a new framework for deriving discrete concept representations resulting in what we term “concept-slot encodings”. These encodings leverage both “soft binding” via object-centric block-slot encodings and “hard binding” via retrieval-based inference. The Neural Concept Binder facilitates straightforward concept inspection and direct integration of external knowledge, such as human input or insights from other AI models like GPT-4. Additionally, we demonstrate that incorporating the hard binding mechanism does not compromise performance; instead, it enables seamless integration into both neural and symbolic modules for intricate reasoning tasks, as evidenced by evaluations on our newly introduced CLEVR-Sudoku dataset.

[LG-43] Finite-Time Analysis of Simultaneous Double Q-learning

链接: https://arxiv.org/abs/2406.09946
作者: Hyunjun Na,Donghwan Lee
关键词: fundamental reinforcement learning, fundamental reinforcement, learning, double, SDQ
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 25 pages, 3 figures

点击查看摘要

Abstract:Q-learning is one of the most fundamental reinforcement learning (RL) algorithms. Despite its widespread success in various applications, it is prone to overestimation bias in the Q-learning update. To address this issue, double Q-learning employs two independent Q-estimators which are randomly selected and updated during the learning process. This paper proposes a modified double Q-learning, called simultaneous double Q-learning (SDQ), with its finite-time analysis. SDQ eliminates the need for random selection between the two Q-estimators, and this modification allows us to analyze double Q-learning through the lens of a novel switching system framework facilitating efficient finite-time analysis. Empirical studies demonstrate that SDQ converges faster than double Q-learning while retaining the ability to mitigate the maximization bias. Finally, we derive a finite-time expected error bound for SDQ.
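
To contrast the two schemes, here is a tabular sketch of standard double Q-learning (random choice of which estimator to update) against a simultaneous variant in which both estimators are updated at every step, which is one plausible reading of the abstract; the paper's exact SDQ update rule may differ.

```python
# Tabular double Q-learning vs. a simultaneous-update variant (illustrative).
import numpy as np

def double_q_step(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
    if rng.random() < 0.5:                       # randomly pick which table to update
        a_star = np.argmax(QA[s_next])
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:
        b_star = np.argmax(QB[s_next])
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])

def simultaneous_double_q_step(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    a_star = np.argmax(QA[s_next])               # each table bootstraps from the other,
    b_star = np.argmax(QB[s_next])               # and both are updated at every step
    QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])
```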

[LG-44] Forgetting Order of Continual Learning: Examples That are Learned First are Forgotten Last

链接: https://arxiv.org/abs/2406.09935
作者: Guy Hacohen,Tinne Tuytelaars
关键词: Catastrophic forgetting poses, forget previous tasks, Catastrophic forgetting, poses a significant, significant challenge
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Catastrophic forgetting poses a significant challenge in continual learning, where models often forget previous tasks when trained on new data. Our empirical analysis reveals a strong correlation between catastrophic forgetting and the learning speed of examples: examples learned early are rarely forgotten, while those learned later are more susceptible to forgetting. We demonstrate that replay-based continual learning methods can leverage this phenomenon by focusing on mid-learned examples for rehearsal. We introduce Goldilocks, a novel replay buffer sampling method that filters out examples learned too quickly or too slowly, keeping those learned at an intermediate speed. Goldilocks improves existing continual learning algorithms, leading to state-of-the-art performance across several image classification tasks.
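
The selection step described above can be sketched very simply: score each example by how early it was learned, then keep only the mid-learned ones for the replay buffer. The concrete learning-speed proxy and the filtering fractions below are assumptions for illustration, not the paper's exact choices.

```python
# Goldilocks-style replay buffer selection by learning speed (illustrative).
import numpy as np

def goldilocks_select(learning_epoch, drop_fast=0.2, drop_slow=0.2):
    """learning_epoch: per-example epoch at which the example was first consistently
    classified correctly (a proxy for learning speed)."""
    order = np.argsort(learning_epoch)            # fastest-learned first
    n = len(order)
    lo, hi = int(drop_fast * n), int(n - drop_slow * n)
    return order[lo:hi]                           # indices of mid-learned examples

epochs_learned = np.random.randint(1, 50, size=1000)
buffer_indices = goldilocks_select(epochs_learned)
print(len(buffer_indices))                        # ~600 mid-learned candidates
```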

[LG-45] What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

链接: https://arxiv.org/abs/2406.09933
作者: Adham Ibrahim,Shady Shehata,Ajinkya Kulkarni,Mukhtar Mohamed,Muhammad Abdul-Mageed
关键词: enhancing human-computer interaction, Speech emotion recognition, SER, speech-based applications, essential for enhancing
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: ACCEPTED AT INTERSPEECH 2024, GREECE

点击查看摘要

Abstract:Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements in specific emotional datasets, there is still a research gap in SER’s capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and illustrate a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols for adeptness in the generalization of SER. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.

[LG-46] Personalized Speech Enhancement Without a Separate Speaker Embedding Model

链接: https://arxiv.org/abs/2406.09928
作者: Tanel Pärnamaa,Ando Saabas
关键词: Personalized speech enhancement, Personalized speech, speaker embedding model, speech enhancement, speaker embedding
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Personalized speech enhancement (PSE) models can improve the audio quality of teleconferencing systems by adapting to the characteristics of a speaker’s voice. However, most existing methods require a separate speaker embedding model to extract a vector representation of the speaker from enrollment audio, which adds complexity to the training and deployment process. We propose to use the internal representation of the PSE model itself as the speaker embedding, thereby avoiding the need for a separate model. We show that our approach performs equally well or better than the standard method of using a pre-trained speaker embedding model on noise suppression and echo cancellation tasks. Moreover, our approach surpasses the ICASSP 2023 Deep Noise Suppression Challenge winner by 0.15 in Mean Opinion Score.

[LG-47] POWN: Prototypical Open-World Node Classification

链接: https://arxiv.org/abs/2406.09926
作者: Marcel Hoffmann,Lukas Galke,Ansgar Scherp
关键词: present during training, node classification, classes, classification, POWN
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of true open-world semi-supervised node classification, in which nodes in a graph either belong to known or new classes, with the latter not present during training. Existing methods detect and reject new classes but fail to distinguish between different new classes. We adapt existing methods and show they do not solve the problem sufficiently. We introduce a novel end-to-end approach for classification into known classes and new classes based on class prototypes, which we call Prototypical Open-World Learning for Node Classification (POWN). Our method combines graph semi-supervised learning, self-supervised learning, and pseudo-labeling to learn prototype representations of new classes in a zero-shot way. In contrast to existing solutions from the vision domain, POWN does not require data augmentation techniques for node classification. Experiments on benchmark datasets demonstrate the effectiveness of POWN, where it outperforms baselines by up to 20% accuracy on the small and up to 30% on the large datasets. Source code is available at this https URL.

[LG-48] CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions

链接: https://arxiv.org/abs/2406.09923
作者: Mingyu Derek Ma,Chenchen Ye,Yu Yan,Xiaoxuan Wang,Peipei Ping,Timothy S Chang,Wei Wang
关键词: Large Language Models, Artificial Intelligence, Language Models, Large Language, process offers significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs’ capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLM’s capability on diverse clinical tasks of desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.

[LG-49] What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?

链接: https://arxiv.org/abs/2406.09908
作者: Weijie Tu,Weijian Deng,Liang Zheng,Tom Gedeon
关键词: work aims, aims to develop, OOD, Softmax prediction probability, OOD contexts
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: TMLR 2024 ( this https URL )

点击查看摘要

Abstract:This work aims to develop a measure that can accurately rank the performance of various classifiers when they are tested on unlabeled data from out-of-distribution (OOD) distributions. We commence by demonstrating that conventional uncertainty metrics, notably the maximum Softmax prediction probability, possess inherent utility in forecasting model generalization across certain OOD contexts. Building on this insight, we introduce a new measure called Softmax Correlation (SoftmaxCorr). It calculates the cosine similarity between a class-class correlation matrix, constructed from Softmax output vectors across an unlabeled test dataset, and a predefined reference matrix that embodies ideal class correlations. A high resemblance of predictions to the reference matrix signals that the model delivers confident and uniform predictions across all categories, reflecting minimal uncertainty and confusion. Through rigorous evaluation across a suite of datasets, including ImageNet, CIFAR-10, and WILDS, we affirm the predictive validity of SoftmaxCorr in accurately forecasting model performance within both in-distribution (ID) and OOD settings. Furthermore, we discuss the limitations of our proposed measure and suggest avenues for future research.
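
The SoftmaxCorr measure can be sketched directly from the description above: build a class-class matrix from the softmax outputs on unlabeled test data and compare it by cosine similarity to a reference matrix of ideal class correlations. The exact matrix construction and reference used in the paper may differ; the identity-like reference below is an assumption for illustration.

```python
# Rough SoftmaxCorr-style sketch (illustrative, not the paper's exact definition).
import numpy as np

def softmax_corr(probs, reference=None):
    """probs: (n_samples, n_classes) softmax outputs on an unlabeled test set."""
    n, c = probs.shape
    corr = probs.T @ probs / n                 # class-class co-activation matrix
    if reference is None:
        reference = np.eye(c) / c              # ideal: confident, balanced predictions
    a, b = corr.ravel(), reference.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy usage: confident predictions score higher than uniform (uncertain) ones.
confident = np.eye(4)[np.array([0, 1, 2, 3] * 5)]          # one-hot-like softmax
uncertain = np.full((20, 4), 0.25)
print(softmax_corr(confident), softmax_corr(uncertain))    # -> 1.0  0.5
```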

[LG-50] QQQ: Quality Quattuor-Bit Quantization for Large Language Models

链接: https://arxiv.org/abs/2406.09904
作者: Ying Zhang,Peng Zhang,Mincong Huang,Jingyang Xiang,Yujie Wang,Chao Wang,Yineng Zhang,Lei Yu,Chuan Liu,Wei Lin
关键词: compressing large language, large language models, proven effective method, proven effective, compressing large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantization is a proven effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding stages of inference. W4A8 is a promising strategy to accelerate both of them but usually leads to significant performance degradation. To address these issues, we present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations. QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training. Furthermore, we meticulously engineer W4A8 GEMM kernels to increase inference speed. Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve impressive speed increases of 3.67× and 3.29× over FP16 GEMM. Our extensive experiments show that QQQ achieves performance on par with existing state-of-the-art LLM quantization methods while significantly accelerating inference, achieving speed boosts up to 2.24×, 2.10×, and 1.25× compared to FP16, W8A8, and W4A16, respectively.
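
For readers new to W4A8 schemes, the basic primitive is per-channel low-bit weight quantization, sketched below. QQQ additionally uses adaptive smoothing, Hessian-based compensation, and custom GEMM kernels, none of which are shown here; this only illustrates the symmetric int4 step.

```python
# Symmetric per-channel 4-bit weight quantization (illustrative primitive only).
import torch

def quantize_per_channel_int4(weight):
    """weight: (out_channels, in_channels) float tensor."""
    qmax = 7                                            # signed int4 range: [-8, 7]
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(weight / scale), -8, 7)
    return q.to(torch.int8), scale                      # int4 values stored in int8

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(128, 256)
q, s = quantize_per_channel_int4(w)
print((dequantize(q, s) - w).abs().max())               # per-channel rounding error
```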

[LG-51] Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment Problem

链接: https://arxiv.org/abs/2406.09899
作者: Zhentao Tan,Yadong Mu
关键词: Mixed Integer Linear, Integer Linear Programming, Linear Programming Problems, Mixed Integer, Integer Linear
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by ICML 2024

点击查看摘要

Abstract:Recently, various optimization problems, such as Mixed Integer Linear Programming Problems (MILPs), have undergone comprehensive investigation, leveraging the capabilities of machine learning. This work focuses on learning-based solutions for efficiently solving the Quadratic Assignment Problem (QAPs), which stands as a formidable challenge in combinatorial optimization. While many instances of simpler problems admit a fully polynomial-time approximation scheme (FPTAS), QAP is shown to be strongly NP-hard. Even finding an FPTAS for QAP is difficult, in the sense that the existence of an FPTAS implies P = NP. Current research on QAPs suffers from limited scale and computational inefficiency. To attack the aforementioned issues, we here propose the first solution of its kind for QAP in the learn-to-improve category. This work encodes facility and location nodes separately, instead of forming computationally intensive association graphs prevalent in current approaches. This design choice enables scalability to larger problem sizes. Furthermore, a Solution AWare Transformer (SAWT) architecture integrates the incumbent solution matrix with the attention score to effectively capture higher-order information of the QAPs. Our model's effectiveness is validated through extensive experiments on self-generated QAP instances of varying sizes and the QAPLIB benchmark.

[LG-52] Positive-Unlabelled Learning for Identifying New Candidate Dietary Restriction-related Genes among Ageing-related Genes

链接: https://arxiv.org/abs/2406.09898
作者: Jorge Paz-Ruza,Alex A. Freitas,Amparo Alonso-Betanzos,Bertha Guijarro-Berdiñas
关键词: Dietary Restriction, popular anti-ageing interventions, prompting exhaustive research, anti-ageing interventions, prompting exhaustive
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dietary Restriction (DR) is one of the most popular anti-ageing interventions, prompting exhaustive research into genes associated with its mechanisms. Recently, Machine Learning (ML) has been explored to identify potential DR-related genes among ageing-related genes, aiming to minimize costly wet lab experiments needed to expand our knowledge on DR. However, to train a model from positive (DR-related) and negative (non-DR-related) examples, existing ML methods naively label genes without known DR relation as negative examples, assuming that lack of DR-related annotation for a gene represents evidence of absence of DR-relatedness, rather than absence of evidence; this hinders the reliability of the negative examples (non-DR-related genes) and the method’s ability to identify novel DR-related genes. This work introduces a novel gene prioritization method based on the two-step Positive-Unlabelled (PU) Learning paradigm: using a similarity-based, KNN-inspired approach, our method first selects reliable negative examples among the genes without known DR associations. Then, these reliable negatives and all known positives are used to train a classifier that effectively differentiates DR-related and non-DR-related genes, which is finally employed to generate a more reliable ranking of promising genes for novel DR-relatedness. Our method significantly outperforms the existing state-of-the-art non-PU approach for DR-relatedness prediction in three relevant performance metrics. In addition, curation of existing literature finds support for the top-ranked candidate DR-related genes identified by our model.

[LG-53] Harm Mitigation in Recommender Systems under User Preference Dynamics

链接: https://arxiv.org/abs/2406.09882
作者: Jerry Chee,Shankar Kalyanaraman,Sindhu Kiranmai Ernala,Udi Weinsberg,Sarah Dean,Stratis Ioannidis
关键词: harmful content, consume harmful content, recommender system, account the interplay, user interests
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Recommender Systems; Harm Mitigation; Amplification; User Preference Modeling

点击查看摘要

Abstract:We consider a recommender system that takes into account the interplay between recommendations, the evolution of user interests, and harmful content. We model the impact of recommendations on user behavior, particularly the tendency to consume harmful content. We seek recommendation policies that establish a tradeoff between maximizing click-through rate (CTR) and mitigating harm. We establish conditions under which the user profile dynamics have a stationary point, and propose algorithms for finding an optimal recommendation policy at stationarity. We experiment on a semi-synthetic movie recommendation setting initialized with real data and observe that our policies outperform baselines at simultaneously maximizing CTR and mitigating harm.

[LG-54] Federated Learning with Flexible Architectures

链接: https://arxiv.org/abs/2406.09877
作者: Jong-Ik Park,Carlee Joe-Wong
关键词: Traditional federated learning, Traditional federated, federated learning, communication abilities, leading to inefficiencies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Traditional federated learning (FL) methods have limited support for clients with varying computational and communication abilities, leading to inefficiencies and potential inaccuracies in model training. This limitation hinders the widespread adoption of FL in diverse and resource-constrained environments, such as those with client devices ranging from powerful servers to mobile devices. To address this need, this paper introduces Federated Learning with Flexible Architectures (FedFA), an FL training algorithm that allows clients to train models of different widths and depths. Each client can select a network architecture suitable for its resources, with shallower and thinner networks requiring fewer computing resources for training. Unlike prior work in this area, FedFA incorporates the layer grafting technique to align clients’ local architectures with the largest network architecture in the FL system during model aggregation. Layer grafting ensures that all client contributions are uniformly integrated into the global model, thereby minimizing the risk of any individual client’s data skewing the model’s parameters disproportionately and introducing security benefits. Moreover, FedFA introduces the scalable aggregation method to manage scale variations in weights among different network architectures. Experimentally, FedFA outperforms previous width and depth flexible aggregation strategies. Furthermore, FedFA demonstrates increased robustness against performance degradation in backdoor attack scenarios compared to earlier strategies.

[LG-55] Sailing in high-dimensional spaces: Low-dimensional embeddings through angle preservation

链接: https://arxiv.org/abs/2406.09876
作者: Jonas Fischer,Rong Ma
关键词: Low-dimensional embeddings, science and engineering, ubiquitous in science, Low-dimensional, data
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in science and engineering. They allow us to quickly understand the main properties of the data, identify outliers and processing errors, and inform the next steps of data analysis. As such, LDEs have to be faithful to the original high-dimensional data, i.e., they should represent the relationships that are encoded in the data, both at a local as well as global scale. The current generation of LDE approaches focus on reconstructing local distances between any pair of samples correctly, often out-performing traditional approaches aiming at all distances. For these approaches, global relationships are, however, usually strongly distorted, often argued to be an inherent trade-off between local and global structure learning for embeddings. We suggest a new perspective on LDE learning, reconstructing angles between data points. We show that this approach, Mercat, yields good reconstruction across a diverse set of experiments and metrics, and preserve structures well across all scales. Compared to existing work, our approach also has a simple formulation, facilitating future theoretical analysis and algorithmic improvements.
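
An angle-reconstruction objective in the spirit described above can be sketched as follows: for sampled triplets (anchor i, j, k), the angle at the anchor should match between the high-dimensional data and the low-dimensional embedding. The actual Mercat formulation is likely different; this is only an illustration of the idea.

```python
# Illustrative angle-preservation loss for low-dimensional embeddings.
import torch

def angle_cos(x, i, j, k, eps=1e-9):
    u, v = x[j] - x[i], x[k] - x[i]
    return (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1) + eps)

def angle_loss(high, low, triplets):
    """high: (n, D) data, low: (n, d) embedding (learnable), triplets: (m, 3) indices."""
    i, j, k = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    return ((angle_cos(high, i, j, k) - angle_cos(low, i, j, k)) ** 2).mean()

high = torch.randn(100, 50)
low = torch.randn(100, 2, requires_grad=True)
triplets = torch.randint(0, 100, (512, 3))
loss = angle_loss(high, low, triplets)
loss.backward()                                   # gradients flow into the embedding
```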

[LG-56] IGL-Bench: Establishing the Comprehensive Benchmark for Imbalanced Graph Learning

链接: https://arxiv.org/abs/2406.09870
作者: Jiawen Qin,Haonan Yuan,Qingyun Sun,Lyujin Xu,Jiaqi Yuan,Pengfeng Huang,Zhaonan Wang,Xingcheng Fu,Hao Peng,Jianxin Li,Philip S. Yu
关键词: Deep graph learning, gained grand popularity, past years due, Imbalanced Graph Learning, graph learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Preprint, under review)

点击查看摘要

Abstract:Deep graph learning has gained grand popularity over the past years due to its versatility and success in representing graph data across a wide range of domains. However, the pervasive issue of imbalanced graph data distributions, where certain parts exhibit disproportionally abundant data while others remain sparse, undermines the efficacy of conventional graph learning algorithms, leading to biased outcomes. To address this challenge, Imbalanced Graph Learning (IGL) has garnered substantial attention, enabling more balanced data distributions and better task performance. Despite the proliferation of IGL algorithms, the absence of consistent experimental protocols and fair performance comparisons pose a significant barrier to comprehending advancements in this field. To bridge this gap, we introduce IGL-Bench, a foundational comprehensive benchmark for imbalanced graph learning, embarking on 16 diverse graph datasets and 24 distinct IGL algorithms with uniform data processing and splitting strategies. Specifically, IGL-Bench systematically investigates state-of-the-art IGL algorithms in terms of effectiveness, robustness, and efficiency on node-level and graph-level tasks, with the scope of class-imbalance and topology-imbalance. Extensive experiments demonstrate the potential benefits of IGL algorithms on various imbalanced conditions, offering insights and opportunities in the IGL field. Further, we have developed an open-sourced and unified package to facilitate reproducible evaluation and inspire further innovative research, which is available at this https URL.

[LG-57] LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data

链接: https://arxiv.org/abs/2406.09864
作者: Grigor Bezirganyan,Sana Sellami,Laure Berti-Équille,Sébastien Fournier
关键词: diverse information sources, integrating diverse information, Learning enhances decision-making, Multimodal Deep Learning, Deep Learning enhances
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as texts, images, audio, and videos. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We introduce LUMA, a unique benchmark dataset, featuring audio, image, and textual data from 50 classes, for learning from uncertain and multimodal data. It extends the well-known CIFAR 10/100 dataset with audio samples extracted from three audio corpora, and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to achieve and tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package including the functions for generating multiple variants of the dataset with controlling the diversity of the data, the amount of noise for each modality, and adding out-of-distribution samples. A baseline pre-trained model is also provided alongside three uncertainty quantification methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning. This comprehensive dataset and its tools are intended to promote and support the development and benchmarking of trustworthy and robust multimodal deep learning approaches.

[LG-58] Dataset Condensation with Latent Quantile Matching

链接: https://arxiv.org/abs/2406.09860
作者: Wei Wei,Tom De Schepper,Kevin Mets
关键词: informative data records, smaller synthesized dataset, machine learning models, synthesized dataset, informative data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CVPR Workshop 2024: 1st Workshop on Dataset Distillation for Computer Vision

点击查看摘要

Abstract:Dataset condensation (DC) methods aim to learn a smaller synthesized dataset with informative data records to accelerate the training of machine learning models. Current distribution matching (DM) based DC methods learn a synthesized dataset by matching the mean of the latent embeddings between the synthetic and the real dataset. However, two distributions with the same mean can still be vastly different. In this work, we demonstrate the shortcomings of using Maximum Mean Discrepancy to match latent distributions, i.e., the weak matching power and the lack of outlier regularization. To alleviate these shortcomings, we propose our new method, Latent Quantile Matching (LQM), which matches the quantiles of the latent embeddings to minimize the goodness-of-fit test statistic between two distributions. Empirical experiments on both image and graph-structured datasets show that LQM matches or outperforms the previous state of the art in distribution matching based DC. Moreover, we show that LQM improves performance in the continual graph learning (CGL) setting, where memory efficiency and privacy can be important. Our work sheds light on the application of DM based DC for CGL.
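
The quantile-matching idea can be sketched directly: instead of matching only the mean of latent embeddings, match several quantiles per latent dimension between real and synthetic data. The specific goodness-of-fit statistic used in the paper may differ; this is an illustrative assumption.

```python
# Illustrative latent quantile-matching loss for distribution-matching condensation.
import torch

def latent_quantile_loss(real_emb, syn_emb, n_quantiles=8):
    """real_emb, syn_emb: (n_samples, dim) latent embeddings from the same encoder."""
    qs = torch.linspace(0.1, 0.9, n_quantiles, device=real_emb.device)
    real_q = torch.quantile(real_emb, qs, dim=0)    # (n_quantiles, dim)
    syn_q = torch.quantile(syn_emb, qs, dim=0)
    return ((real_q - syn_q) ** 2).mean()

# Mean matching alone would treat these as identical; quantile matching does not.
real = torch.randn(1000, 16)
syn = 3.0 * torch.randn(1000, 16)                   # same mean, different spread
print(latent_quantile_loss(real, syn))
```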

[LG-59] Learning Multi-view Molecular Representations with Structured and Unstructured Knowledge

链接: https://arxiv.org/abs/2406.09841
作者: Yizhen Luo,Kai Yang,Massimo Hong,Xing Yi Liu,Zikun Nie,Hao Zhou,Zaiqing Nie
关键词: approaches holds significant, holds significant potential, vast scientific fields, Capturing molecular knowledge, learning approaches holds
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:Capturing molecular knowledge with representation learning approaches holds significant potential in vast scientific fields such as chemistry and life science. An effective and generalizable molecular representation is expected to capture the consensus and complementary molecular expertise from diverse views and perspectives. However, existing works fall short in learning multi-view molecular representations, due to challenges in explicitly incorporating view information and handling molecular knowledge from heterogeneous sources. To address these issues, we present MV-Mol, a molecular representation learning model that harvests multi-view molecular expertise from chemical structures, unstructured knowledge from biomedical texts, and structured knowledge from knowledge graphs. We utilize text prompts to model view information and design a fusion architecture to extract view-based molecular representations. We develop a two-stage pre-training procedure, exploiting heterogeneous data of varying quality and quantity. Through extensive experiments, we show that MV-Mol provides improved representations that substantially benefit molecular property prediction. Additionally, MV-Mol exhibits state-of-the-art performance in multi-modal comprehension of molecular structures and texts. Code and data are available at this https URL.

[LG-60] TabularFM: An Open Framework For Tabular Foundational Models

链接: https://arxiv.org/abs/2406.09837
作者: Quan M. Tran,Suong N. Hoang,Lam M. Nguyen,Dzung Phan,Hoang Thanh Lam
关键词: learning generalized patterns, Foundational models, self-supervised techniques, generalized patterns, patterns from large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has primarily focused on unstructured data, such as text and images, or semi-structured data, like time-series. However, there has been limited attention to structured data, such as tabular data, which, despite its prevalence, remains under-studied due to a lack of clean datasets and insufficient research on the transferability of FMs for various tabular data tasks. In response to this gap, we introduce a framework called TabularFM (this https URL), which incorporates state-of-the-art methods for developing FMs specifically for tabular data. This includes variations of neural architectures such as GANs, VAEs, and Transformers. We have curated a million tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models along with leaderboards for future comparative studies. Our fully open-sourced system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to enhance the validity and usability of tabular FMs in the near future.

[LG-61] Robustness-Inspired Defense Against Backdoor Attacks on Graph Neural Networks

链接: https://arxiv.org/abs/2406.09836
作者: Zhiwei Zhang,Minhua Lin,Junjie Xu,Zongyu Wu,Enyan Dai,Suhang Wang
关键词: Graph Neural Networks, Neural Networks, achieved promising results, Graph Neural, achieved promising
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have achieved promising results in tasks such as node classification and graph classification. However, recent studies reveal that GNNs are vulnerable to backdoor attacks, posing a significant threat to their real-world adoption. Despite initial efforts to defend against specific graph backdoor attacks, there is no work on defending against various types of backdoor attacks where generated triggers have different properties. Hence, we first empirically verify that prediction variance under edge dropping is a crucial indicator for identifying poisoned nodes. With this observation, we propose using random edge dropping to detect backdoors and theoretically show that it can efficiently distinguish poisoned nodes from clean ones. Furthermore, we introduce a novel robust training strategy to efficiently counteract the impact of the triggers. Extensive experiments on real-world datasets show that our framework can effectively identify poisoned nodes, significantly degrade the attack success rate, and maintain clean accuracy when defending against various types of graph backdoor attacks with different properties.
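
A minimal sketch of the detection indicator described above, assuming a node-classification GNN with the common (features, edge_index) calling convention; the number of rounds, drop probability, and thresholding rule are illustrative, not the paper's settings.

```python
import torch

def edge_drop_variance(model, x, edge_index, num_rounds=20, drop_prob=0.2):
    """Per-node prediction variance under random edge dropping; high variance
    is used as an indicator of potentially poisoned (backdoored) nodes."""
    model.eval()
    probs = []
    with torch.no_grad():
        for _ in range(num_rounds):
            keep = torch.rand(edge_index.size(1)) > drop_prob   # drop each edge w.p. drop_prob
            probs.append(model(x, edge_index[:, keep]).softmax(dim=-1))
    probs = torch.stack(probs)                # (rounds, num_nodes, num_classes)
    return probs.var(dim=0).sum(dim=-1)       # one variance score per node

# Usage sketch: flag nodes whose score sits far above the typical level.
# scores = edge_drop_variance(gnn, features, edge_index)
# suspected = (scores > scores.mean() + 2 * scores.std()).nonzero(as_tuple=True)[0]
```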

[LG-62] I Know How: Combining Prior Policies to Solve New Tasks

链接: https://arxiv.org/abs/2406.09835
作者: Malio Li,Elia Piccoli,Vincenzo Lomonaco,Davide Bacciu
关键词: Multi-Task Reinforcement Learning, Reinforcement Learning aims, Multi-Task Reinforcement, Reinforcement Learning, aims at developing
类目: Machine Learning (cs.LG)
*备注: 7 pages, Conference on Games (CoG) 2024

点击查看摘要

Abstract:Multi-Task Reinforcement Learning aims at developing agents that are able to continually evolve and adapt to new scenarios. However, this goal is challenging to achieve due to the phenomenon of catastrophic forgetting and the high demand of computational resources. Learning from scratch for each new task is not a viable or sustainable option, and thus agents should be able to collect and exploit prior knowledge while facing new problems. While several methodologies have attempted to address the problem from different perspectives, they lack a common structure. In this work, we propose a new framework, I Know How (IKH), which provides a common formalization. Our methodology focuses on modularity and compositionality of knowledge in order to achieve and enhance agent’s ability to learn and adapt efficiently to dynamic environments. To support our framework definition, we present a simple application of it in a simulated driving environment and compare its performance with that of state-of-the-art approaches.

[LG-63] Federated Learning driven Large Language Models for Swarm Intelligence: A Survey

链接: https://arxiv.org/abs/2406.09831
作者: Youyang Qu
关键词: training large language, large language models, large language, addressing data privacy, offers a compelling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Federated learning (FL) offers a compelling framework for training large language models (LLMs) while addressing data privacy and decentralization challenges. This paper surveys recent advancements in the federated learning of large language models, with a particular focus on machine unlearning, a crucial aspect for complying with privacy regulations like the Right to be Forgotten. Machine unlearning in the context of federated LLMs involves systematically and securely removing individual data contributions from the learned model without retraining from scratch. We explore various strategies that enable effective unlearning, such as perturbation techniques, model decomposition, and incremental learning, highlighting their implications for maintaining model performance and data privacy. Furthermore, we examine case studies and experimental results from recent literature to assess the effectiveness and efficiency of these approaches in real-world scenarios. Our survey reveals a growing interest in developing more robust and scalable federated unlearning methods, suggesting a vital area for future research in the intersection of AI ethics and distributed machine learning technologies.

[LG-64] HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning

链接: https://arxiv.org/abs/2406.09827
作者: Heejun Lee,Geon Park,Youngwan Lee,Jina Kim,Wonyoung Jeong,Myeongjae Jeon,Sung Ju Hwang
关键词: multi-modal question answering, increasing sequence lengths, handling complex tasks, modern large language, question answering
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 26 pages, 15 figures

点击查看摘要

Abstract:In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism's quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from O(T^2) to O(T log T) and the space complexity from O(T^2) to O(T). To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-k most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility to many long-context LLM applications previously infeasible.
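
The hierarchical tree search is the paper's contribution and is not reproduced here; the sketch below only illustrates the training-free principle of keeping the top-k attention scores per query, using a plain dense top-k purely for clarity.

```python
import torch

def topk_sparse_attention(q, k, v, top_k=64):
    """Training-free sparse attention sketch: for each query, keep only the
    top-k highest-scoring keys and mask out the rest before the softmax.
    (HiP locates these positions hierarchically in O(T log T); this dense
    top-k is an O(T^2) stand-in used only for illustration.)"""
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5        # (..., Tq, Tk)
    top_k = min(top_k, scores.size(-1))
    thresh = scores.topk(top_k, dim=-1).values[..., -1:]        # k-th largest score per query
    masked = scores.masked_fill(scores < thresh, float("-inf"))
    return masked.softmax(dim=-1) @ v

# Toy usage
q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = topk_sparse_attention(q, k, v, top_k=16)
```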

[LG-65] Unraveling Anomalies in Time: Unsupervised Discovery and Isolation of Anomalous Behavior in Bio-regenerative Life Support System Telemetry

链接: https://arxiv.org/abs/2406.09825
作者: Ferdinand Rewicki,Jakob Gawlikowski,Julia Niebling,Joachim Denzler
关键词: condition monitoring, critical system states, states is essential, essential in condition, Life Support Systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 12 pages, + Supplemental Materials, Accepted at ECML PKDD 2024 (European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases)

点击查看摘要

Abstract:The detection of abnormal or critical system states is essential in condition monitoring. While much attention is given to promptly identifying anomalies, a retrospective analysis of these anomalies can significantly enhance our comprehension of the underlying causes of observed undesired behavior. This aspect becomes particularly critical when the monitored system is deployed in a vital environment. In this study, we delve into anomalies within the domain of Bio-Regenerative Life Support Systems (BLSS) for space exploration and analyze anomalies found in telemetry data stemming from the EDEN ISS space greenhouse in Antarctica. We employ time series clustering on anomaly detection results to categorize various types of anomalies in both uni- and multivariate settings. We then assess the effectiveness of these methods in identifying systematic anomalous behavior. Additionally, we illustrate that the anomaly detection methods MDI and DAMP produce complementary results, as previously indicated by research.

[LG-66] An I2I Inpainting Approach for Efficient Channel Knowledge Map Construction

链接: https://arxiv.org/abs/2406.09822
作者: Zhenzhou Jin,Li You,Jue Wang,Xiang-Gen Xia,Xiqi Gao
关键词: emerging enabling technology, Channel knowledge map, Channel knowledge, received widespread attention, location-specific channel knowledge
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:Channel knowledge map (CKM) has received widespread attention as an emerging enabling technology for environment-aware wireless communications. It involves the construction of databases containing location-specific channel knowledge, which are then leveraged to facilitate channel state information (CSI) acquisition and transceiver design. In this context, a fundamental challenge lies in efficiently constructing the CKM based on a given wireless propagation environment. Most existing methods are based on stochastic modeling and sequence prediction, which do not fully exploit the inherent physical characteristics of the propagation environment, resulting in low accuracy and high computational complexity. To address these limitations, we propose a Laplacian pyramid (LP)-based CKM construction scheme to predict the channel knowledge at arbitrary locations in a targeted area. Specifically, we first view the channel knowledge as a 2-D image and transform the CKM construction problem into an image-to-image (I2I) inpainting task, which predicts the channel knowledge at a specific location by recovering the corresponding pixel value in the image matrix. Then, inspired by the reversible and closed-form structure of the LP, we show its natural suitability for our task in designing a fast I2I mapping network. For different frequency components of LP decomposition, we design tailored networks accordingly. Besides, to encode the global structural information of the propagation environment, we introduce self-attention and cross-covariance attention mechanisms in different layers, respectively. Finally, experimental results show that the proposed scheme outperforms the benchmark, achieving higher reconstruction accuracy while with lower computational complexity. Moreover, the proposed approach has a strong generalization ability and can be implemented in different wireless communication scenarios.
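
For readers unfamiliar with the Laplacian pyramid the scheme builds on, the snippet below shows the standard, exactly invertible decomposition using OpenCV's pyrDown/pyrUp; the CKM-specific I2I networks that operate on each frequency band are not shown.

```python
import cv2
import numpy as np

def build_laplacian_pyramid(img, levels=3):
    """Standard Laplacian pyramid: band-pass residuals plus a coarse base.
    This is the reversible, closed-form structure the CKM scheme exploits."""
    gaussian = [img.astype(np.float32)]
    for _ in range(levels):
        gaussian.append(cv2.pyrDown(gaussian[-1]))
    laplacian = []
    for i in range(levels):
        up = cv2.pyrUp(gaussian[i + 1], dstsize=(gaussian[i].shape[1], gaussian[i].shape[0]))
        laplacian.append(gaussian[i] - up)      # high-frequency residual at level i
    laplacian.append(gaussian[-1])              # coarsest (low-frequency) level
    return laplacian

def reconstruct_from_laplacian(pyramid):
    img = pyramid[-1]
    for lap in reversed(pyramid[:-1]):
        img = cv2.pyrUp(img, dstsize=(lap.shape[1], lap.shape[0])) + lap
    return img
```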

[LG-67] DeltaPhi: Learning Physical Trajectory Residual for PDE Solving

链接: https://arxiv.org/abs/2406.09795
作者: Xihang Yue,Linchao Zhu,Yi Yang
关键词: limited generalization capability, generalization capability prevents, correct physical dynamics, learning correct physical, data biases exist
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Although neural operator networks theoretically approximate any operator mapping, the limited generalization capability prevents them from learning correct physical dynamics when potential data biases exist, particularly in the practical PDE solving scenario where the available data amount is restricted or the resolution is extremely low. To address this issue, we propose and formulate the Physical Trajectory Residual Learning (DeltaPhi), which learns to predict the physical residuals between the pending solved trajectory and a known similar auxiliary trajectory. First, we transform the direct operator mapping between input-output function fields in original training data to residual operator mapping between input function pairs and output function residuals. Next, we learn the surrogate model for the residual operator mapping based on existing neural operator networks. Additionally, we design helpful customized auxiliary inputs for efficient optimization. Through extensive experiments, we conclude that, compared to direct learning, physical residual learning is preferred for PDE solving.

[LG-68] Faster Convergence on Heterogeneous Federated Edge Learning: An Adaptive Sidelink-Assisted Data Multicasting Approach

链接: https://arxiv.org/abs/2406.09776
作者: Gang Hu,Yinglei Teng,Nan Wang,Zhu Han
关键词: Federated Edge Learning, machine learning paradigm, Federated Edge, Internet of Things, distributed machine learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Edge Learning (FEEL) emerges as a pioneering distributed machine learning paradigm for the 6G Hyper-Connectivity, harnessing data from the Internet of Things (IoT) devices while upholding data privacy. However, current FEEL algorithms struggle with non-independent and non-identically distributed (non-IID) data, leading to elevated communication costs and compromised model accuracy. To address these statistical imbalances within FEEL, we introduce a clustered data sharing framework, mitigating data heterogeneity by selectively sharing partial data from cluster heads to trusted associates through sidelink-aided multicasting. The collective communication pattern is integral to FEEL training, where both cluster formation and the efficiency of communication and computation impact training latency and accuracy simultaneously. To tackle the strictly coupled data sharing and resource optimization, we decompose the overall optimization problem into the client clustering and effective data sharing subproblems. Specifically, a distribution-based adaptive clustering algorithm (DACA) is devised based on three deductive cluster-forming conditions, which ensures the maximum sharing yield. Meanwhile, we design a stochastic optimization based joint computed frequency and shared data volume optimization (JFVO) algorithm, determining the optimal resource allocation with an uncertain objective function. The experiments show that the proposed framework facilitates FEEL on non-IID datasets with a faster convergence rate and higher model accuracy in a limited communication environment.

[LG-69] Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion

链接: https://arxiv.org/abs/2406.09770
作者: Anke Tang,Li Shen,Yong Luo,Shiwei Liu,Han Hu,Bo Du
关键词: Solving multi-objective optimization, challenging task due, entire Pareto set, Pareto set, Solving multi-objective
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: code is available at this https URL

点击查看摘要

Abstract:Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost of training and evaluating models. Efficient Pareto front approximation of large models enables multi-objective optimization for various tasks such as multi-task learning and trade-off analysis. Existing algorithms for learning the Pareto set fall into two groups: (1) evolutionary, hypernetwork, and hypervolume-maximization methods, which are computationally expensive and have restricted scalability to large models; and (2) scalarization algorithms, where a separate model is trained for each objective ray, which is inefficient for learning the entire Pareto set and fails to capture the objective trade-offs effectively. Inspired by the recent success of model merging, we propose a practical and scalable approach to the Pareto set learning problem via mixture of experts (MoE) based model fusion. By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives and closely approximate the entire Pareto set of large neural networks. Once the routers are learned and a preference vector is set, the MoE module can be unloaded, thus no additional computational cost is introduced during inference. We conduct extensive experiments on vision and language tasks using large-scale models such as CLIP-ViT and GPT-2. The experimental results demonstrate that our method efficiently approximates the entire Pareto front of large models. Using only hundreds of trainable parameters of the MoE routers, our method even has lower memory usage compared to linear scalarization and algorithms that learn a single Pareto optimal solution, and is scalable to both the number of objectives and the size of the model.
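
A highly simplified sketch of preference-conditioned weight fusion: the learned per-layer routers from the paper are replaced here by a single softmax over the preference vector, so this only conveys the flavor of ensembling expert (single-task) model weights under that stated simplification.

```python
import torch

def fuse_experts(expert_state_dicts, preference, temperature=1.0):
    """Blend the weights of specialized single-task ('expert') models according
    to a preference vector over objectives. The paper learns routers; here a
    softmax over the preference vector stands in for the routing weights."""
    weights = torch.softmax(
        torch.as_tensor(preference, dtype=torch.float32) / temperature, dim=0
    )
    fused = {}
    for key in expert_state_dicts[0]:
        fused[key] = sum(w * sd[key].float() for w, sd in zip(weights, expert_state_dicts))
    return fused

# Usage sketch: sweeping the preference vector traces an approximate Pareto front.
# fused = fuse_experts([m1.state_dict(), m2.state_dict()], preference=[0.7, 0.3])
```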

[LG-70] Bayesian Conditioned Diffusion Models for Inverse Problems

链接: https://arxiv.org/abs/2406.09768
作者: Alper Güngör,Bahri Batuhan Bilecen,Tolga Çukur
关键词: forward measurement operator, measurement operator, recently been shown, shown to excel, forward measurement
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Diffusion models have recently been shown to excel in many image reconstruction tasks that involve inverse problems based on a forward measurement operator. A common framework uses task-agnostic unconditional models that are later post-conditioned for reconstruction, an approach that typically suffers from suboptimal task performance. While task-specific conditional models have also been proposed, current methods heuristically inject measured data as a naive input channel that elicits sampling inaccuracies. Here, we address the optimal conditioning of diffusion models for solving challenging inverse problems that arise during image reconstruction. Specifically, we propose a novel Bayesian conditioning technique for diffusion models, BCDM, based on score-functions associated with the conditional distribution of desired images given measured data. We rigorously derive the theory to express and train the conditional score-function. Finally, we show state-of-the-art performance in image dealiasing, deblurring, super-resolution, and inpainting with the proposed technique.

[LG-71] Bootstrapping Language Models with DPO Implicit Rewards

链接: https://arxiv.org/abs/2406.09760
作者: Changyu Chen,Zichen Liu,Chao Du,Tianyu Pang,Qian Liu,Arunesh Sinha,Pradeep Varakantham,Min Lin
关键词: large language models, area of research, large language, active area, implicit reward model
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM model to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate refinements that debias the length of the responses and improve the quality of the preference dataset to further improve our approach. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves performance superior to that of Gemini Pro on AlpacaEval 2, reaching a 27.55% length-controlled win rate against GPT-4 Turbo, but with only 8B parameters and no external feedback. Our code is available at this https URL.
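
A sketch of the bootstrapping step, assuming placeholder helpers that return summed token log-probabilities under the current policy and the reference model; the implicit-reward expression follows the standard DPO formulation, and the paper's length-debiasing refinement is omitted.

```python
def dpo_implicit_reward(policy_logprob, ref_logprob, beta=0.1):
    # DPO's implicit reward: r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    return beta * (policy_logprob - ref_logprob)

def build_preference_pair(prompt, responses, policy_logprob_fn, ref_logprob_fn, beta=0.1):
    """Score self-generated responses with the current model's implicit reward
    and keep the best/worst as a (chosen, rejected) pair for the next DPO round.
    policy_logprob_fn / ref_logprob_fn are hypothetical helpers returning the
    summed log-probability of a response given the prompt."""
    rewards = [
        dpo_implicit_reward(policy_logprob_fn(prompt, r), ref_logprob_fn(prompt, r), beta)
        for r in responses
    ]
    ranked = sorted(zip(rewards, responses), key=lambda t: t[0], reverse=True)
    return ranked[0][1], ranked[-1][1]   # (chosen, rejected)
```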

[LG-72] Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages

链接: https://arxiv.org/abs/2406.09757
作者: Shuvendu K. Lahiri
关键词: prove properties, programming languages, Verification-aware programming languages, user-intent formalization, languages
类目: Programming Languages (cs.PL); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of programs. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the user-intent formalization for programs – that a specification adheres to the user's intent behind the program. The intent or requirement is expressed informally in natural language and the specification is a formal artefact. The advent of large language models (LLMs) has made strides bridging the gap between informal intent and formal program implementations recently, driven in large parts due to benchmarks and automated metrics for evaluation. Recent work has proposed evaluating the user-intent formalization problem for mainstream programming languages [endres-fse24]. However, such an approach does not readily extend to verification-aware languages that support rich specifications (containing quantifiers and ghost variables) that cannot be evaluated through dynamic execution. Previous work also required generating program mutants using LLMs to create the benchmark. We advocate an alternate approach of symbolically testing specifications to provide an intuitive metric for evaluating the quality of specifications for verification-aware languages. We demonstrate that our automated metric agrees closely with a mostly GPT-4 generated and human-labeled dataset of roughly 150 Dafny specifications for the popular MBPP code-generation benchmark, yet demonstrates cases where the human labeling is not perfect. We believe our work provides a stepping stone to enable the establishment of a benchmark and research agenda for the problem of user-intent formalization for programs.

[LG-73] How Does Distribution Matching Help Domain Generalization: An Information-theoretic Analysis

链接: https://arxiv.org/abs/2406.09745
作者: Yuxin Dong,Tieliang Gong,Hong Chen,Shuangyong Song,Weizhan Zhang,Chen Li
关键词: multiple training domains, aims to learn, learn invariance, invariance across multiple, multiple training
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Domain generalization aims to learn invariance across multiple training domains, thereby enhancing generalization against out-of-distribution data. While gradient or representation matching algorithms have achieved remarkable success, these methods generally lack generalization guarantees or depend on strong assumptions, leaving a gap in understanding the underlying mechanism of distribution matching. In this work, we formulate domain generalization from a novel probabilistic perspective, ensuring robustness while avoiding overly conservative solutions. Through comprehensive information-theoretic analysis, we provide key insights into the roles of gradient and representation matching in promoting generalization. Our results reveal the complementary relationship between these two components, indicating that existing works focusing solely on either gradient or representation alignment are insufficient to solve the domain generalization problem. In light of these theoretical findings, we introduce IDM to simultaneously align the inter-domain gradients and representations. Integrated with the proposed PDM method for complex distribution matching, IDM achieves superior performance over various baseline methods.

[LG-74] Deep Symbolic Optimization for Combinatorial Optimization: Accelerating Node Selection by Discovering Potential Heuristics

链接: https://arxiv.org/abs/2406.09740
作者: Hongyu Liu,Haoyang Liu,Yufei Kuang,Jie Wang,Bin Li
关键词: Combinatorial optimization, fundamental mathematical models, real-world applications, Combinatorial, deep symbolic optimization
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Combinatorial optimization (CO) is one of the most fundamental mathematical models in real-world applications. Traditional CO solvers, such as Branch-and-Bound (BB) solvers, heavily rely on expert-designed heuristics, which are reliable but require substantial manual tuning. Recent studies have leveraged deep learning (DL) models as an alternative to capture rich feature patterns for improved performance on GPU machines. Nonetheless, the drawbacks of high training and inference costs, as well as limited interpretability, severely hinder the adoption of DL methods in real-world applications. To address these challenges, we propose a novel deep symbolic optimization learning framework that combines their advantages. Specifically, we focus on the node selection module within BB solvers – namely, deep symbolic optimization for node selection (Dso4NS). With data-driven approaches, Dso4NS guides the search for mathematical expressions within the high-dimensional discrete symbolic space and then incorporates the highest-performing mathematical expressions into a solver. The data-driven model captures the rich feature information in the input data and generates symbolic expressions, while the expressions deployed in solvers enable fast inference with high interpretability. Experiments demonstrate the effectiveness of Dso4NS in learning high-quality expressions, outperforming existing approaches on a CPU machine. Encouragingly, the learned CPU-based policies consistently achieve performance comparable to state-of-the-art GPU-based approaches.

[LG-75] When Will Gradient Regularization Be Harmful?

链接: https://arxiv.org/abs/2406.09723
作者: Yang Zhao,Hao Zhang,Xiuyuan Hu
关键词: deep neural networks, shown promising results, modern over-parameterized deep, over-parameterized deep neural, gradient norm atop
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICML 2024 paper

点击查看摘要

Abstract:Gradient regularization (GR), which aims to penalize the gradient norm atop the loss function, has shown promising results in training modern over-parameterized deep neural networks. However, can we trust this powerful technique? This paper reveals that GR can cause performance degeneration in adaptive optimization scenarios, particularly with learning rate warmup. Our empirical and theoretical analyses suggest this is due to GR inducing instability and divergence in gradient statistics of adaptive optimizers at the initial training stage. Inspired by the warmup heuristic, we propose three GR warmup strategies, each relaxing the regularization effect to a certain extent during the warmup course to ensure the accurate and stable accumulation of gradients. With experiments on the Vision Transformer family, we confirm the three GR warmup strategies can effectively circumvent these issues, thereby largely improving the model performance. Meanwhile, we note that scalable models tend to rely more on the GR warmup, where the performance can be improved by up to 3% on CIFAR-10 compared to baseline GR. Code is available at this https URL.
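
The following sketch shows vanilla gradient regularization via a double backward plus one possible warmup schedule for its coefficient (a linear ramp); the coefficient value and schedule shape are illustrative assumptions, not the paper's three specific strategies.

```python
import torch

def gr_loss(model, loss_fn, batch, lam):
    """Gradient regularization: add the squared gradient norm of the task loss
    as a penalty. Requires a double backward, hence create_graph=True."""
    inputs, targets = batch
    task_loss = loss_fn(model(inputs), targets)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(task_loss, params, create_graph=True)
    grad_norm_sq = sum(g.pow(2).sum() for g in grads)
    return task_loss + lam * grad_norm_sq

def gr_warmup_coefficient(step, warmup_steps, lam_max=0.01):
    # Relax the regularization early in training (a linear ramp is one of
    # several possible strategies) so that the adaptive optimizer's gradient
    # statistics can accumulate stably before full GR kicks in.
    return lam_max * min(1.0, step / max(1, warmup_steps))
```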

[LG-76] Cross-view geo-localization: a survey

链接: https://arxiv.org/abs/2406.09722
作者: Abhilash Durgam,Sidike Paheding,Vikas Dhiman,Vijay Devabhaktuni
关键词: garnered notable attention, machine learning techniques, copious geotagged datasets, computer vision, garnered notable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cross-view geo-localization has garnered notable attention in the realm of computer vision, spurred by the widespread availability of copious geotagged datasets and the advancements in machine learning techniques. This paper provides a thorough survey of cutting-edge methodologies, techniques, and associated challenges that are integral to this domain, with a focus on feature-based and deep learning strategies. Feature-based methods capitalize on unique features to establish correspondences across disparate viewpoints, whereas deep learning-based methodologies deploy convolutional neural networks to embed view-invariant attributes. This work also delineates the multifaceted challenges encountered in cross-view geo-localization, such as variations in viewpoints and illumination, the occurrence of occlusions, and it elucidates innovative solutions that have been formulated to tackle these issues. Furthermore, we delineate benchmark datasets and relevant evaluation metrics, and also perform a comparative analysis of state-of-the-art techniques. Finally, we conclude the paper with a discussion on prospective avenues for future research and the burgeoning applications of cross-view geo-localization in an intricately interconnected global landscape.

[LG-77] Speed-up of Data Analysis with Kernel Trick in Encrypted Domain

链接: https://arxiv.org/abs/2406.09716
作者: Joon Soo Yoo,Baek Kyung Song,Tae Min Ahn,Ji Won Heo,Ji Won Yoon
关键词: Homomorphic encryption, crucial in privacy-preserving, privacy-preserving data analysis, Homomorphic, data
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Submitted as a preprint

点击查看摘要

Abstract:Homomorphic encryption (HE) is pivotal for secure computation on encrypted data, crucial in privacy-preserving data analysis. However, efficiently processing high-dimensional data in HE, especially for machine learning and statistical (ML/STAT) algorithms, poses a challenge. In this paper, we present an effective acceleration method using the kernel method for HE schemes, enhancing time performance in ML/STAT algorithms within encrypted domains. This technique, independent of underlying HE mechanisms and complementing existing optimizations, notably reduces costly HE multiplications, offering near constant time complexity relative to data dimension. Aimed at accessibility, this method is tailored for data scientists and developers with limited cryptography background, facilitating advanced data analysis in secure environments.

[LG-78] Meta-Learning Loss Functions for Deep Neural Networks

链接: https://arxiv.org/abs/2406.09713
作者: Christian Raymond
关键词: efficiently solve complex, quickly and efficiently, small set, efficiently solve, solve complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: PhD thesis

点击查看摘要

Abstract:Humans can often quickly and efficiently solve complex new learning tasks given only a small set of examples. In contrast, modern artificially intelligent systems often require thousands or millions of observations in order to solve even the most basic tasks. Meta-learning aims to resolve this issue by leveraging past experiences from similar learning tasks to embed the appropriate inductive biases into the learning system. Historically, methods for meta-learning components such as optimizers, parameter initializations, and more have led to significant performance increases. This thesis aims to explore the concept of meta-learning to improve performance through the often-overlooked component of the loss function. The loss function is a vital component of a learning system, as it represents the primary learning objective, where success is determined and quantified by the system's ability to optimize for that objective successfully.

[LG-79] Explainable AI for Comparative Analysis of Intrusion Detection Models

链接: https://arxiv.org/abs/2406.09684
作者: Pap M. Corea,Yongxin Liu,Jian Wang,Shuteng Niu,Houbing Song
关键词: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, widely discussed topic, related technologies facilitate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Submitted to IEEE MeditCom 2024 - WS-05

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) has become a widely discussed topic, and the related technologies facilitate better understanding of conventional black-box models such as Random Forests and Neural Networks. However, domain-specific applications of XAI are still insufficient. To fill this gap, this research applies various machine learning models to the tasks of binary and multi-class classification for intrusion detection from network traffic on the same dataset, and analyzes them using occlusion sensitivity. The models evaluated include Linear Regression, Logistic Regression, Linear Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest, Decision Trees, and Multi-Layer Perceptrons (MLP). We trained all models to an accuracy of 90% on the UNSW-NB15 dataset. We found that most classifiers leverage fewer than three critical features to achieve such accuracies, indicating that effective feature engineering could actually be far more important for intrusion detection than applying complicated models. We also discover that Random Forest provides the best performance in terms of accuracy, time efficiency and robustness. Data and code available at this https URL
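
A minimal sketch of occlusion sensitivity adapted to tabular network-traffic features, assuming any fitted scikit-learn-style classifier; occluding with the column mean is one common choice, not necessarily the paper's.

```python
import numpy as np

def occlusion_sensitivity(model, X, y, baseline=None):
    """Feature-wise occlusion sensitivity for tabular data: replace one feature
    at a time with a baseline value (here, the column mean) and measure the
    drop in accuracy. Larger drops indicate more critical features."""
    base_acc = (model.predict(X) == y).mean()
    baseline = X.mean(axis=0) if baseline is None else baseline
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_occ = X.copy()
        X_occ[:, j] = baseline[j]                    # occlude feature j
        drops[j] = base_acc - (model.predict(X_occ) == y).mean()
    return drops

# Usage with any fitted scikit-learn style classifier:
# importance = occlusion_sensitivity(clf, X_test, y_test)
```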

[LG-80] Heterogeneous Federated Learning with Convolutional and Spiking Neural Networks

链接: https://arxiv.org/abs/2406.09680
作者: Yingchao Yu,Yuping Yan,Jisong Cai,Yaochu Jin
关键词: safeguarding data privacy, data privacy, decentralized data, safeguarding data, promising paradigm
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 8 pages, 5 figures, FL@FM-IJCAI’24

点击查看摘要

Abstract:Federated learning (FL) has emerged as a promising paradigm for training models on decentralized data while safeguarding data privacy. Most existing FL systems, however, assume that all machine learning models are of the same type, although it becomes more likely that different edge devices adopt different types of AI models, including both conventional analogue artificial neural networks (ANNs) and biologically more plausible spiking neural networks (SNNs). This diversity empowers the efficient handling of specific tasks and requirements, showcasing the adaptability and versatility of edge computing platforms. One main challenge of such a heterogeneous FL system lies in effectively aggregating models from the local devices in a privacy-preserving manner. To address the above issue, this work benchmarks FL systems containing both convolutional neural networks (CNNs) and SNNs by comparing various aggregation approaches, including federated CNNs, federated SNNs, federated CNNs for SNNs, federated SNNs for CNNs, and federated CNNs with SNN fusion. Experimental results demonstrate that the CNN-SNN fusion framework exhibits the best performance among the above settings on the MNIST dataset. Additionally, intriguing phenomena of competitive suppression are noted during the convergence process of multi-model FL.
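
For context, the sketch below shows the basic weighted FedAvg aggregation that settings such as "federated CNNs" or "federated SNNs" rely on; the CNN-SNN fusion strategy compared in the paper is more involved and is not reproduced here.

```python
import torch

def fedavg(state_dicts, num_samples):
    """Weighted FedAvg aggregation of same-architecture client models:
    average each parameter tensor, weighted by each client's data size."""
    total = float(sum(num_samples))
    keys = state_dicts[0].keys()
    return {
        k: sum(sd[k].float() * (n / total) for sd, n in zip(state_dicts, num_samples))
        for k in keys
    }

# Usage sketch:
# global_state = fedavg([c.state_dict() for c in client_models],
#                       [len(d) for d in client_datasets])
```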

[LG-81] Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency

链接: https://arxiv.org/abs/2406.09675
作者: Ningyi Liao,Haoyu Liu,Zulun Zhu,Siqiang Luo,Laks V.S. Lakshmanan
关键词: demonstrating promising capability, received increasing popularity, graph neural networks, capturing graph signals, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the recent advancements in graph neural networks (GNNs), spectral GNNs have received increasing popularity by virtue of their specialty in capturing graph signals in the frequency domain, demonstrating promising capability in specific tasks. However, few systematic studies have been conducted on assessing their spectral characteristics. This emerging family of models also varies in terms of designs and settings, leading to difficulties in comparing their performance and deciding on the suitable model for specific scenarios, especially for large-scale tasks. In this work, we extensively benchmark spectral GNNs with a focus on the frequency perspective. We analyze and categorize over 30 GNNs with 27 corresponding filters. Then, we implement these spectral models under a unified framework with dedicated graph computations and efficient training schemes. Thorough experiments are conducted on the spectral models with inclusive metrics on effectiveness and efficiency, offering practical guidelines on evaluating and selecting spectral GNNs with desirable performance. Our implementation enables application on larger graphs with comparable performance and less overhead, which is available at: this https URL.

[LG-82] ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

链接: https://arxiv.org/abs/2406.09657
作者: Omer Ronen,Ahmed Imtiaz Humayun,Randall Balestriero,Richard Baraniuk,Bin Yu
关键词: Latent Exploration Score, Scalable Latent Exploration, develop Scalable Latent, Exploration Score, Latent Space Optimization
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We develop Scalable Latent Exploration Score (ScaLES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. ScaLES is an exact and theoretically motivated method leveraging the trained decoder's approximation of the data distribution. ScaLES can be calculated with any existing decoder, e.g. from a VAE, without additional training, architectural changes, or access to the training data. Our evaluation across five LSO benchmark tasks and three VAE architectures demonstrates that ScaLES enhances the quality of the solutions while maintaining high objective values, leading to improvements over existing solutions. We believe that new avenues for LSO will be opened by ScaLES's ability to identify out-of-distribution areas, its differentiability, and its computational tractability. Open source code for ScaLES is available at this https URL.

[LG-83] RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement

链接: https://arxiv.org/abs/2406.09656
作者: Jingcheng Li,Ye Qiao,Haocheng Xu,Sitao Huang
关键词: low quality, scenarios often suffer, suffer from low, Retinex theory, Images captured
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often involve using Retinex theory. Nevertheless, most of them cannot perform well on more complicated datasets like LOL-v2 while consuming too much computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even more time-consuming and tedious. In this paper, we propose a more accurate, concise, and one-stage Retinex theory based framework, RSEND. RSEND first divides the low-light image into the illumination map and reflectance map, then captures the important details in the illumination map and performs light enhancement. After this step, it refines the enhanced gray-scale image and performs element-wise matrix multiplication with the reflectance map. By denoising the output of the previous step, it obtains the final result. In all the steps, RSEND utilizes a Squeeze and Excitation network to better capture the details. Comprehensive quantitative and qualitative experiments show that our efficient Retinex model significantly outperforms other CNN-based models, achieving a PSNR improvement ranging from 0.44 dB to 4.2 dB on different datasets, and even outperforms transformer-based models on the LOL-v2-real dataset.
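
A hand-crafted illustration of the Retinex split the framework builds on (image = reflectance × illumination), using a crude max-RGB illumination estimate; RSEND itself learns this decomposition and the subsequent refinement with networks, so the code below is only a conceptual sketch.

```python
import numpy as np

def naive_retinex_decompose(img, eps=1e-4):
    """Crude Retinex-style split of an RGB image (values in [0, 1]) into an
    illumination map L and a reflectance map R such that img = R * L.
    The max-over-channels illumination estimate is a classic heuristic."""
    illumination = img.max(axis=-1, keepdims=True)     # max over color channels
    reflectance = img / (illumination + eps)
    return reflectance, illumination

def naive_enhance(img, gamma=0.5):
    # Brighten by gamma-correcting the illumination, then recompose.
    r, l = naive_retinex_decompose(img)
    return np.clip(r * np.power(l, gamma), 0.0, 1.0)
```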

[LG-84] Coralai: Intrinsic Evolution of Embodied Neural Cellular Automata Ecosystems

链接: https://arxiv.org/abs/2406.09654
作者: Aidan Barbieux,Rodrigo Canaan
关键词: Neural Cellular Automata, Cellular Automata, paper presents Coralai, Neural Cellular, exploring diverse ecosystems
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 3 pages, 2 figures. ALIFE 2024 Copenhagen

点击查看摘要

Abstract:This paper presents Coralai, a framework for exploring diverse ecosystems of Neural Cellular Automata (NCA). Organisms in Coralai utilize modular, GPU-accelerated Taichi kernels to interact, enact environmental changes, and evolve through local survival, merging, and mutation operations implemented with HyperNEAT and PyTorch. We provide an exploratory experiment implementing physics inspired by slime mold behavior showcasing the emergence of competition between sessile and mobile organisms, cycles of resource depletion and recovery, and symbiosis between diverse organisms. We conclude by outlining future work to discover simulation parameters through measures of multi-scale complexity and diversity. Code for Coralai is available at this https URL , video demos are available at this https URL .

[LG-85] An Intrinsic Vector Heat Network

链接: https://arxiv.org/abs/2406.09648
作者: Alexander Gao,Maurice Chu,Mubbasir Kapadia,Ming C. Lin,Hsueh-Ti Derek Liu
关键词: Vector fields, represent and model, model flows, science and engineering, Vector
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vector fields are widely used to represent and model flows for many science and engineering applications. This paper introduces a novel neural network architecture for learning tangent vector fields that are intrinsically defined on manifold surfaces embedded in 3D. Previous approaches to learning vector fields on surfaces treat vectors as multi-dimensional scalar fields, using traditional scalar-valued architectures to process channels individually, thus fail to preserve fundamental intrinsic properties of the vector field. The core idea of this work is to introduce a trainable vector heat diffusion module to spatially propagate vector-valued feature data across the surface, which we incorporate into our proposed architecture that consists of vector-valued neurons. Our architecture is invariant to rigid motion of the input, isometric deformation, and choice of local tangent bases, and is robust to discretizations of the surface. We evaluate our Vector Heat Network on triangle meshes, and empirically validate its invariant properties. We also demonstrate the effectiveness of our method on the useful industrial application of quadrilateral mesh generation.

[LG-86] Reinforced Decoder: Towards Training Recurrent Neural Networks for Time Series Forecasting

链接: https://arxiv.org/abs/2406.09643
作者: Qi Sima,Xinze Zhang,Yukun Bao,Siyue Yang,Liang Shen
关键词: Recurrent neural network-based, Recurrent neural, time series forecasting, neural network-based, time series
类目: Machine Learning (cs.LG)
*备注: 12 pages,8 figures

点击查看摘要

Abstract:Recurrent neural network-based sequence-to-sequence models have been extensively applied for multi-step-ahead time series forecasting. These models typically involve a decoder trained using either its previous forecasts or the actual observed values as the decoder inputs. However, relying on self-generated predictions can lead to the rapid accumulation of errors over multiple steps, while using the actual observations introduces exposure bias as these values are unavailable during the extrapolation stage. In this regard, this study proposes a novel training approach called reinforced decoder, which introduces auxiliary models to generate alternative decoder inputs that remain accessible when extrapolating. Additionally, a reinforcement learning algorithm is utilized to dynamically select the optimal inputs to improve accuracy. Comprehensive experiments demonstrate that our approach outperforms representative training methods over several datasets. Furthermore, the proposed approach also exhibits promising performance when generalized to self-attention-based sequence-to-sequence forecasting models.

[LG-87] TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs

链接: https://arxiv.org/abs/2406.09639
作者: Julia Gastinger,Shenyang Huang,Mikhail Galkin,Erfan Loghmani,Ali Parviz,Farimah Poursafaei,Jacob Danovitch,Emanuele Rossi,Ioannis Koutis,Heiner Stuckenschmidt,Reihaneh Rabbany,Guillaume Rabusseau
关键词: modeling real-world data, Temporal Graph Benchmark, Multi-relational temporal graphs, TGB, Temporal Knowledge Graphs
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 27 pages, 8 figures

点击查看摘要

Abstract:Multi-relational temporal graphs are powerful tools for modeling real-world data, capturing the evolving and interconnected nature of entities over time. Recently, many novel models have been proposed for ML on such graphs, intensifying the need for robust evaluation and standardized benchmark datasets. However, the availability of such resources remains scarce and evaluation faces added complexity due to reproducibility issues in experimental protocols. To address these challenges, we introduce Temporal Graph Benchmark 2.0 (TGB 2.0), a novel benchmarking framework tailored for evaluating methods for predicting future links on Temporal Knowledge Graphs and Temporal Heterogeneous Graphs with a focus on large-scale datasets, extending the Temporal Graph Benchmark. TGB 2.0 facilitates comprehensive evaluations by presenting eight novel datasets spanning five domains with up to 53 million edges. TGB 2.0 datasets are significantly larger than existing datasets in terms of number of nodes, edges, or timestamps. In addition, TGB 2.0 provides a reproducible and realistic evaluation pipeline for multi-relational temporal graphs. Through extensive experimentation, we observe that 1) leveraging edge-type information is crucial to obtain high performance, 2) simple heuristic baselines are often competitive with more complex methods, and 3) most methods fail to run on our largest datasets, highlighting the need for research on more scalable methods.

[LG-88] RASPNet: A Benchmark Dataset for Radar Adaptive Signal Processing Applications

链接: https://arxiv.org/abs/2406.09638
作者: Shyam Venkatasubramanian,Bosung Kang,Ali Pezeshki,Muralidhar Rangaswamy,Vahid Tarokh
关键词: contiguous United States, aimed at supporting, work presents, data-driven models, United States
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This work presents a large-scale dataset for radar adaptive signal processing (RASP) applications, aimed at supporting the development of data-driven models within the radar community. The dataset, called RASPNet, consists of 100 realistic scenarios compiled over a variety of topographies and land types from across the contiguous United States, designed to reflect a diverse array of real-world environments. Within each scenario, RASPNet consists of 10,000 clutter realizations from an airborne radar setting, which can be utilized for radar algorithm development and evaluation. RASPNet intends to fill a prominent gap in the availability of a large-scale, realistic dataset that standardizes the evaluation of adaptive radar processing techniques. We describe its construction, organization, and several potential applications, which includes a transfer learning example to demonstrate how RASPNet can be leveraged for realistic adaptive radar processing scenarios.

[LG-89] Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

链接: https://arxiv.org/abs/2406.09630
作者: Mehreen Saeed,Adrian Chan,Anupam Mijar,Joseph Moukarzel,Georges Habchi,Carlos Younes,Amin Elias,Chau-Wai Wong,Akram Khater
关键词: historic handwritten page, page images transcribed, machine learning dataset, learning dataset consisting, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present the Manuscripts of Handwritten Arabic (Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.

[LG-90] DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks

链接: https://arxiv.org/abs/2406.09624
作者: Mohamed Elrefaie,Florin Morar,Angela Dai,Faez Ahmed
关键词: comprehensive multimodal dataset, comprehensive multimodal, present DrivAerNet, dataset, diverse car designs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We present DrivAerNet++, the largest and most comprehensive multimodal dataset for aerodynamic car design. DrivAerNet++ comprises 8,000 diverse car designs modeled with high-fidelity computational fluid dynamics (CFD) simulations. The dataset includes diverse car configurations such as fastback, notchback, and estateback, with different underbody and wheel designs to represent both internal combustion engines and electric vehicles. Each entry in the dataset features detailed 3D meshes, parametric models, aerodynamic coefficients, and extensive flow and surface field data, along with segmented parts for car classification and point cloud data. This dataset supports a wide array of machine learning applications including data-driven design optimization, generative modeling, surrogate model training, CFD simulation acceleration, and geometric classification. With more than 39 TB of publicly available engineering data, DrivAerNet++ fills a significant gap in available resources, providing high-quality, diverse data to enhance model training, promote generalization, and accelerate automotive design processes. Along with rigorous dataset validation, we also provide ML benchmarking results on the task of aerodynamic drag prediction, showcasing the breadth of applications supported by our dataset. This dataset is set to significantly impact automotive design and broader engineering disciplines by fostering innovation and improving the fidelity of aerodynamic evaluations.

[LG-91] Automated Molecular Concept Generation and Labeling with Large Language Models

链接: https://arxiv.org/abs/2406.09612
作者: Shichang Zhang,Botao Xia,Zimin Zhang,Qianli Wu,Fang Sun,Ziniu Hu,Yizhou Sun
关键词: Artificial intelligence, significantly transforming scientific, Graph Neural Networks, transforming scientific research, significantly transforming
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is significantly transforming scientific research. Explainable AI methods, such as concept-based models (CMs), are promising for driving new scientific discoveries because they make predictions based on meaningful concepts and offer insights into the prediction process. In molecular science, however, explainable CMs are not as common compared to black-box models like Graph Neural Networks (GNNs), primarily due to their requirement for predefined concepts and manual label for each instance, which demand domain knowledge and can be labor-intensive. This paper introduces a novel framework for Automated Molecular Concept (AutoMolCo) generation and labeling. AutoMolCo leverages the knowledge in Large Language Models (LLMs) to automatically generate predictive molecular concepts and label them for each molecule. Such procedures are repeated through iterative interactions with LLMs to refine concepts, enabling simple linear models on the refined concepts to outperform GNNs and LLM in-context learning on several benchmarks. The whole AutoMolCo framework is automated without any human knowledge inputs in either concept generation, labeling, or refinement, thereby surpassing the limitations of extant CMs while maintaining their explainability and allowing easy intervention. Through systematic experiments on MoleculeNet and High-Throughput Experimentation (HTE) datasets, we demonstrate that the AutoMolCo-induced explainable CMs are beneficial and promising for molecular science research.

[LG-92] Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis

链接: https://arxiv.org/abs/2406.09606
作者: Zongyue Qin,Yunsheng Bai,Atefeh Sograbizadeh,Zijian Ding,Ziniu Hu,Yizhou Sun,Jason Cong
关键词: domain-specific accelerators, recent years, autonomous driving, gained popularity, popularity for applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: 14 pages, 8 figures. arXiv admin note: text overlap with arXiv:2305.10838

点击查看摘要

Abstract:In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design with low-level hardware description languages that eventually synthesize DSAs on circuits. However, creating a high-quality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as pragmas. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, requiring a deeper understanding of the program that consists of original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). But existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, a pre-training method is proposed based on a suite of compiler’s data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to 22%, and identifies designs with an average of 1.10x and 1.26x (up to 8.17x and 13.31x) performance improvement in design space exploration (DSE) task compared to HARP and AutoDSE, respectively.

[LG-93] On Value Iteration Convergence in Connected MDPs

链接: https://arxiv.org/abs/2406.09592
作者: Arsenii Mustafin,Alex Olshevsky,Ioannis Ch. Paschalidis
关键词: unique optimal policy, transition matrix ensures, Iteration algorithm, discount factor, average-reward criteria
类目: Machine Learning (cs.LG)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:This paper establishes that an MDP with a unique optimal policy and ergodic associated transition matrix ensures the convergence of various versions of the Value Iteration algorithm at a geometric rate that exceeds the discount factor \gamma for both discounted and average-reward criteria.
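
For readers who want to see the algorithm whose convergence is analyzed above, here is a minimal value-iteration sketch on a made-up two-state MDP (the transition matrix, rewards, and discount factor are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Minimal value iteration on a toy 2-state, 2-action MDP (illustrative only;
# the transitions and rewards below are made up, not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.0, 2.0]])
gamma = 0.95

V = np.zeros(2)
for it in range(1000):
    Q = R + gamma * P @ V                  # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # the contraction drives the error to zero geometrically
        break
    V = V_new
print("optimal values:", V, "greedy policy:", Q.argmax(axis=1))
```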

[LG-94] Color Equivariant Network

链接: https://arxiv.org/abs/2406.09588
作者: Felix O’Mahony,Yulong Yang,Christine Allen-Blanchette
关键词: convolutional neural networks, Group equivariant convolutional, variety of geometric, equivariant convolutional neural, group equivariant networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Group equivariant convolutional neural networks have been designed for a variety of geometric transformations from 2D and 3D rotation groups, to semi-groups such as scale. Despite the improved interpretability, accuracy and generalizability afforded by these architectures, group equivariant networks have seen limited application in the context of perceptual quantities such as hue and saturation, even though their variation can lead to significant reductions in classification performance. In this paper, we introduce convolutional neural networks equivariant to variations in hue and saturation by design. To achieve this, we leverage the observation that hue and saturation transformations can be identified with the 2D rotation and 1D translation groups respectively. Our hue-, saturation-, and fully color-equivariant networks achieve equivariance to these perceptual transformations without an increase in network parameters. We demonstrate the utility of our networks on synthetic and real world datasets where color and lighting variations are commonplace.
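
The networks themselves are not reproduced here, but the key observation the abstract mentions, that a hue shift can be identified with a 2D rotation, can be illustrated with a few lines of NumPy. The rotation-about-the-gray-axis construction below is an illustrative approximation of a hue shift, not the paper's architecture:

```python
import numpy as np

def hue_rotate(rgb, angle_rad):
    """Apply an approximate hue shift by rotating RGB vectors about the gray axis.

    Illustrates the observation the paper builds on: a hue transformation can be
    identified with a 2-D rotation (here, a rotation about the (1,1,1) axis in RGB
    space, which shifts hue while roughly preserving luminance).
    """
    axis = np.ones(3) / np.sqrt(3)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    # Axis-angle (Rodrigues) rotation matrix: R = I + sin(t) K + (1 - cos(t)) K^2
    R = np.eye(3) + np.sin(angle_rad) * K + (1 - np.cos(angle_rad)) * (K @ K)
    return np.clip(rgb @ R.T, 0.0, 1.0)

img = np.random.rand(4, 4, 3)                 # a dummy RGB image in [0, 1]
shifted = hue_rotate(img, np.deg2rad(120))    # 120-degree hue rotation
print(shifted.shape)
```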

[LG-95] A Review of 315 Benchmark and Test Functions for Machine Learning Optimization Algorithms and Metaheuristics with Mathematical and Visual Descriptions

链接: https://arxiv.org/abs/2406.09581
作者: M.Z. Naser,Mohammad Khaled al-Bashiti,Arash Teymori Gharah Tapeh,Armin Dadras Eslamlou,Ahmed Naser,Venkatesh Kodur,Rami Hawileeh,Jamal Abdalla,Nima Khodadadi,Amir H. Gandomi
关键词: rapidly evolving optimization, rapidly evolving, crucially determined, optimization and metaheuristics, functions
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In the rapidly evolving optimization and metaheuristics domains, the efficacy of algorithms is crucially determined by the benchmark (test) functions. While several functions have been developed and derived over the past decades, little information is available on the mathematical and visual description, range of suitability, and applications of many such functions. To bridge this knowledge gap, this review provides an exhaustive survey of more than 300 benchmark functions used in the evaluation of optimization and metaheuristics algorithms. This review first catalogs benchmark and test functions based on their characteristics, complexity, properties, visuals, and domain implications to offer a wide view that aids in selecting appropriate benchmarks for various algorithmic challenges. This review also lists the 25 most commonly used functions in the open literature and proposes two new, highly dimensional, dynamic and challenging functions that could be used for testing new algorithms. Finally, this review identifies gaps in current benchmarking practices and suggests directions for future research.

[LG-96] Online Bandit Learning with Offline Preference Data

链接: https://arxiv.org/abs/2406.09574
作者: Akhil Agnihotri,Rahul Jain,Deepak Ramachandran,Zheng Wen
关键词: Reinforcement Learning, language and images, RLHF, core of fine-tuning, fine-tuning methods
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as rank or preference feedback from human raters, as opposed to eliciting scores since the latter tends to be very noisy. On the other hand, RL theory and algorithms predominantly assume that a reward feedback is available. In particular, approaches for online learning that can be helpful in adaptive data collection via active learning cannot incorporate offline preference data. In this paper, we adopt a finite-armed linear bandit model as a prototypical model of online learning. We consider an offline preference dataset to be available, generated by an expert of unknown ‘competence’. We propose warmPref-PS, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback. We show that by modeling the competence of the expert that generated it, we are able to use such a dataset most effectively. We support our claims with novel theoretical analysis of its Bayesian regret, as well as extensive empirical evaluation of an approximate algorithm which performs substantially better (almost 25 to 50% regret reduction in our studies) as compared to baselines.
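
As a rough illustration of warm-starting posterior sampling with offline preference data (this is not the paper's warmPref-PS algorithm; the pseudo-update from preference pairs and all constants below are assumptions), one can fold pairwise preferences into the Gaussian prior of a linear Thompson sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 500
theta_true = rng.normal(size=d)
arms = rng.normal(size=(K, d))

# Offline preference data: pairs (i, j) where an expert preferred arm i over arm j.
# We fold them into the Gaussian prior via feature differences, a crude surrogate
# for modeling expert competence; the real warmPref-PS update differs.
prefs = [(i, j) for i, j in rng.integers(0, K, size=(50, 2))
         if arms[i] @ theta_true > arms[j] @ theta_true]
A = np.eye(d)            # prior precision
b = np.zeros(d)
for i, j in prefs:
    x = arms[i] - arms[j]
    A += np.outer(x, x)
    b += x               # preferred direction gets a pseudo-reward of 1

regret = 0.0
for t in range(T):
    theta_sample = rng.multivariate_normal(np.linalg.solve(A, b), np.linalg.inv(A))  # posterior sample
    k = int(np.argmax(arms @ theta_sample))
    r = arms[k] @ theta_true + rng.normal(scale=0.1)
    A += np.outer(arms[k], arms[k]); b += r * arms[k]                                # online update
    regret += np.max(arms @ theta_true) - arms[k] @ theta_true
print(f"cumulative regret after {T} rounds: {regret:.2f}")
```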

[LG-97] Improving Consistency Models with Generator-Induced Coupling

链接: https://arxiv.org/abs/2406.09570
作者: Thibaut Issenhuth,Ludovic Dos Santos,Jean-Yves Franceschi,Alain Rakotomamonjy
关键词: promising generative models, single forward pass, neural network, promising generative, distill the multi-step
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Consistency models are promising generative models as they distill the multi-step sampling of score-based diffusion in a single forward pass of a neural network. Without access to sampling trajectories of a pre-trained diffusion model, consistency training relies on proxy trajectories built on an independent coupling between the noise and data distributions. Refining this coupling is a key area of improvement to make it more adapted to the task and reduce the resulting randomness in the training process. In this work, we introduce a novel coupling associating the input noisy data with their generated output from the consistency model itself, as a proxy to the inaccessible diffusion flow output. Our affordable approach exploits the inherent capacity of consistency models to compute the transport map in a single step. We provide intuition and empirical evidence of the relevance of our generator-induced coupling (GC), which brings consistency training closer to score distillation. Consequently, our method not only accelerates consistency training convergence by significant amounts but also enhances the resulting performance. The code is available at: this https URL.

[LG-98] Towards Domain Adaptive Neural Contextual Bandits

链接: https://arxiv.org/abs/2406.09564
作者: Ziyan Wang,Hao Wang
关键词: decision making problems, Contextual bandit, solving real-world decision, real-world decision making, Contextual bandit algorithms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Contextual bandit algorithms are essential for solving real-world decision making problems. In practice, collecting a contextual bandit’s feedback from different domains may involve different costs. For example, measuring drug reaction from mice (as a source domain) and humans (as a target domain). Unfortunately, adapting a contextual bandit algorithm from a source domain to a target domain with distribution shift still remains a major challenge and largely unexplored. In this paper, we introduce the first general domain adaptation method for contextual bandits. Our approach learns a bandit model for the target domain by collecting feedback from the source domain. Our theoretical analysis shows that our algorithm maintains a sub-linear regret bound even adapting across domains. Empirical results show that our approach outperforms the state-of-the-art contextual bandit algorithms on real-world datasets.

[LG-99] e-COP: Episodic Constrained Optimization of Policies

链接: https://arxiv.org/abs/2406.09563
作者: Akhil Agnihotri,Rahul Jain,Deepak Ramachandran,Sahil Singla
关键词: constrained Reinforcement Learning, finite horizon, Reinforcement Learning, episodic setting, policy optimization algorithm
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present the e-COP algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system’s behavior. We approach this problem by first establishing a policy difference lemma for the episodic setting, which provides the theoretical foundation for the algorithm. Then, we propose to combine a set of established and novel solution ideas to yield the e-COP algorithm that is easy to implement and numerically stable, and provide a theoretical guarantee on optimality under certain scaling assumptions. Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. The scalability of the algorithm opens the door to its application in safety-constrained Reinforcement Learning from Human Feedback for Large Language or Diffusion Models.

[LG-100] Label Noise Robustness for Domain-Agnostic Fair Corrections via Nearest Neighbors Label Spreading

链接: https://arxiv.org/abs/2406.09561
作者: Nathan Stromberg,Rohan Ayyagari,Sanmi Koyejo,Richard Nock,Lalitha Sankar
关键词: correcting existing base, existing base models, Last-layer retraining, efficient framework, base models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Last-layer retraining methods have emerged as an efficient framework for correcting existing base models. Within this framework, several methods have been proposed to deal with correcting models for subgroup fairness with and without group membership information. Importantly, prior work has demonstrated that many methods are susceptible to noisy labels. To this end, we propose a drop-in correction for label noise in last-layer retraining, and demonstrate that it achieves state-of-the-art worst-group accuracy for a broad range of symmetric label noise and across a wide variety of datasets exhibiting spurious correlations. Our proposed approach uses label spreading on a latent nearest neighbors graph and has minimal computational overhead compared to existing methods.
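
The last-layer retraining setup is not reproduced here, but the core primitive the abstract names, label spreading on a nearest-neighbors graph, is available in scikit-learn. Below is a toy sketch on synthetic 2-D data with 20% symmetric label noise (all parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Toy illustration of label spreading on a nearest-neighbors graph.
# The paper applies this idea to last-layer features to correct noisy labels;
# here we simply denoise labels on synthetic 2-D data.
X, y_clean = make_moons(n_samples=500, noise=0.1, random_state=0)
rng = np.random.default_rng(0)
y_noisy = y_clean.copy()
flip = rng.random(len(y_noisy)) < 0.2          # 20% symmetric label noise
y_noisy[flip] = 1 - y_noisy[flip]

spreader = LabelSpreading(kernel="knn", n_neighbors=10, alpha=0.8, max_iter=100)
spreader.fit(X, y_noisy)                        # spreads labels over the k-NN graph
y_corrected = spreader.transduction_
print("noisy label accuracy:    ", (y_noisy == y_clean).mean())
print("corrected label accuracy:", (y_corrected == y_clean).mean())
```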

[LG-101] Decoding the Diversity: A Review of the Indic AI Research Landscape

链接: https://arxiv.org/abs/2406.09559
作者: Sankalp KJ,Vinija Jain,Sreyoshi Bhaduri,Tamoghna Roy,Aman Chadha
关键词: large language model, Indic languages, Indic, Sri Lanka, comprehensive overview
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 27 pages, 1 figure

点击查看摘要

Abstract:This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. With the tremendous market potential and growing demand for natural language processing (NLP) based applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper deep dives into the recent advancements in Indic generative modeling, contributing with a taxonomy of research directions, tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning existing LLMs, development of corpora, benchmarking and evaluation, as well as publications around specific techniques, tools, and applications. We found that researchers across the publications emphasize the challenges associated with limited data availability, lack of standardization, and the peculiar linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and contributes to the development of more accurate and efficient LLM applications for these languages.

[LG-102] S3 – Semantic Signal Separation

链接: https://arxiv.org/abs/2406.09556
作者: Márton Kardos,Jan Kostkan,Arnault-Quentin Vermillet,Kristoffer Nielbo,Roberta Rocca
关键词: large textual corpora, latent semantic structures, discovering latent semantic, textual corpora, tools for discovering
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 9 figures (main manuscript has 9 pages and 4 figures)

点击查看摘要

Abstract:Topic models are useful tools for discovering latent semantic structures in large textual corpora. Topic modeling historically relied on bag-of-words representations of language. This approach makes models sensitive to the presence of stop words and noise, and does not utilize potentially useful contextual information. Recent efforts have been oriented at incorporating contextual neural representations in topic modeling and have been shown to outperform classical topic models. These approaches are, however, typically slow, volatile and still require preprocessing for optimal results. We present Semantic Signal Separation (S^3), a theory-driven topic modeling approach in neural embedding spaces. S^3 conceptualizes topics as independent axes of semantic space, and uncovers these with blind-source separation. Our approach provides the most diverse, highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date. We offer an implementation of S^3, among other approaches, in the Turftopic Python package.
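
The reference implementation lives in the Turftopic package mentioned above; the sketch below only illustrates the underlying idea of recovering topics as independent axes of an embedding space via blind-source separation, using FastICA from scikit-learn and an assumed sentence-transformers encoder:

```python
import numpy as np
from sklearn.decomposition import FastICA
from sentence_transformers import SentenceTransformer  # assumed encoder; any embedding model works

docs = [
    "The central bank raised interest rates again this quarter.",
    "Inflation and monetary policy dominate the financial news.",
    "The team scored twice in the final minutes of the match.",
    "A thrilling penalty shootout decided the championship.",
    "New telescope observations reveal a distant exoplanet.",
    "Astronomers measured the orbit of the newly found planet.",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(docs)

# S^3 views topics as independent axes of the embedding space and recovers them
# with blind-source separation; FastICA is one standard way to do that.
ica = FastICA(n_components=3, random_state=0)
doc_topic = ica.fit_transform(X)            # document scores on each latent "semantic signal"

# Rank a small candidate vocabulary against each axis by projecting word embeddings.
vocab = ["bank", "inflation", "goal", "match", "telescope", "planet"]
word_topic = ica.transform(encoder.encode(vocab))
for k in range(3):
    top = [vocab[i] for i in np.argsort(-word_topic[:, k])[:2]]
    print(f"topic {k}: {top}")
```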

[LG-103] Exploring Syntactic Patterns in Urdu: A Deep Dive into Dependency Analysis

链接: https://arxiv.org/abs/2406.09549
作者: Nudrat Habib
关键词: process of breaking, components and identifying, Urdu, correct sentence structure, grammatical components
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parsing is the process of breaking a sentence into its grammatical components and identifying the syntactic structure of the sentence. The syntactically correct sentence structure is achieved by assigning grammatical labels to its constituents using lexicon and syntactic rules. In linguistics, a parser is extremely useful for a number of different applications such as named entity recognition, QA systems, and information extraction. The two most common techniques used for parsing are phrase structure and dependency structure. Because Urdu is a low-resource language, there has been little progress in building an Urdu parser. A comparison of several parsers revealed that the dependency parsing approach is better suited for order-free languages such as Urdu. We have made significant progress in parsing Urdu, a South Asian language with a complex morphology. For Urdu dependency parsing, a basic feature model consisting of word location, word head, and dependency relation is employed as a starting point, followed by more complex feature models. The dependency tagset is designed after careful consideration of the complex morphological structure of the Urdu language, word order variation, and lexical ambiguity, and it contains 22 tags. Our dataset comprises sentences from news articles, and we tried to include sentences of different complexity (which is quite challenging) to get reliable results. All experiments are performed using MaltParser, exploring all 9 algorithms and classifiers. We have achieved a 70 percent overall best-labeled accuracy (LA), as well as an 84 percent overall best-unlabeled attachment score (UAS) using the Nivreeager algorithm. The comparison of output data with treebank test data that has been manually parsed is then used to carry out error assessment and to identify the errors produced by the parser.

[LG-104] Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale

链接: https://arxiv.org/abs/2406.09548
作者: A. Feder Cooper
关键词: develop rigorous knowledge, develop rigorous, rigorous knowledge, reliable measurement, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: Ph.D. Dissertation

点击查看摘要

Abstract:To develop rigorous knowledge about ML models – and the systems in which they are embedded – we need reliable measurements. But reliable measurement is fundamentally challenging, and touches on issues of reproducibility, scalability, uncertainty quantification, epistemology, and more. This dissertation addresses criteria needed to take reliability seriously: both criteria for designing meaningful metrics, and for methodologies that ensure that we can dependably and efficiently measure these metrics at scale and in practice. In doing so, this dissertation articulates a research vision for a new field of scholarship at the intersection of machine learning, law, and policy. Within this frame, we cover topics that fit under three different themes: (1) quantifying and mitigating sources of arbitrariness in ML, (2) taming randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability, and (3) providing methods for evaluating generative-AI systems, with specific focuses on quantifying memorization in language models and training latent diffusion models on open-licensed data. By making contributions in these three themes, this dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately and inescapably bound up with research in law and policy. These different disciplines pose similar research questions about reliable measurement in machine learning. They are, in fact, two complementary sides of the same research vision, which, broadly construed, aims to construct machine-learning systems that cohere with broader societal values.

[LG-105] FLea: Addressing Data Scarcity and Label Skew in Federated Learning via Privacy-preserving Feature Augmentation

链接: https://arxiv.org/abs/2406.09547
作者: Tong Xia,Abhirup Ghosh,Xinchi Qiu,Cecilia Mascolo
关键词: Federated Learning, enables model development, numerous edge devices, leveraging data distributed, transferring local data
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: This paper has been accepted by KDD 2024. arXiv admin note: text overlap with arXiv:2312.02327

点击查看摘要

Abstract:Federated Learning (FL) enables model development by leveraging data distributed across numerous edge devices without transferring local data to a central server. However, existing FL methods still face challenges when dealing with scarce and label-skewed data across devices, resulting in local model overfitting and drift, consequently hindering the performance of the global model. In response to these challenges, we propose a pioneering framework called FLea, incorporating the following key components: i) A global feature buffer that stores activation-target pairs shared from multiple clients to support local training. This design mitigates local model drift caused by the absence of certain classes; ii) A feature augmentation approach based on local and global activation mix-ups for local training. This strategy enlarges the training samples, thereby reducing the risk of local overfitting; iii) An obfuscation method to minimize the correlation between intermediate activations and the source data, enhancing the privacy of shared features. To verify the superiority of FLea, we conduct extensive experiments using a wide range of data modalities, simulating different levels of local data scarcity and label skew. The results demonstrate that FLea consistently outperforms state-of-the-art FL counterparts (in 13 of the 18 experimented settings, the improvement is over 5%) while concurrently mitigating the privacy vulnerabilities associated with shared features. Code is available at this https URL.
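
A minimal sketch of the activation mix-up idea described in component (ii), assuming Beta-distributed mixing coefficients and one-to-one pairing with a shared global buffer (the released FLea code may differ):

```python
import torch

def mixup_activations(local_feats, local_targets, global_feats, global_targets, alpha=2.0):
    """Mix local activations with activations sampled from a shared global buffer.

    Simplified sketch of FLea's feature augmentation: each local activation is
    interpolated with a randomly drawn global activation (and likewise for the
    soft targets), enlarging the effective training set on data-scarce clients.
    The Beta(alpha, alpha) coefficient and one-to-one pairing are assumptions.
    """
    idx = torch.randint(0, global_feats.size(0), (local_feats.size(0),))
    lam = torch.distributions.Beta(alpha, alpha).sample((local_feats.size(0), 1))
    mixed_feats = lam * local_feats + (1 - lam) * global_feats[idx]
    mixed_targets = lam * local_targets + (1 - lam) * global_targets[idx]
    return mixed_feats, mixed_targets

# Usage with dummy tensors: 32 local activations of width 128, 10-class one-hot labels,
# and a buffer of 256 activation-target pairs shared by other clients.
local_f, local_t = torch.randn(32, 128), torch.eye(10)[torch.randint(0, 10, (32,))]
global_f, global_t = torch.randn(256, 128), torch.eye(10)[torch.randint(0, 10, (256,))]
feats, targets = mixup_activations(local_f, local_t, global_f, global_t)
print(feats.shape, targets.shape)
```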

[LG-106] CircuitVAE: Efficient and Scalable Latent Circuit Optimization

链接: https://arxiv.org/abs/2406.09535
作者: Jialin Song,Aidan Swope,Robert Kirby,Rajarshi Roy,Saad Godil,Jonathan Raiman,Bryan Catanzaro
关键词: Automatically designing fast, space-efficient digital circuits, desired logic, costly to simulate, Automatically designing
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Design Automation Conference (DAC) 2024; the first two authors contributed equally

点击查看摘要

Abstract:Automatically designing fast and space-efficient digital circuits is challenging because circuits are discrete, must exactly implement the desired logic, and are costly to simulate. We address these challenges with CircuitVAE, a search algorithm that embeds computation graphs in a continuous space and optimizes a learned surrogate of physical simulation by gradient descent. By carefully controlling overfitting of the simulation surrogate and ensuring diverse exploration, our algorithm is highly sample-efficient, yet gracefully scales to large problem instances and high sample budgets. We test CircuitVAE by designing binary adders across a large range of sizes, IO timing constraints, and sample budgets. Our method excels at designing large circuits, where other algorithms struggle: compared to reinforcement learning and genetic algorithms, CircuitVAE typically finds 64-bit adders which are smaller and faster using less than half the sample budget. We also find CircuitVAE can design state-of-the-art adders in a real-world chip, demonstrating that our method can outperform commercial tools in a realistic setting.

[LG-107] FeatNavigator: Automatic Feature Augmentation on Tabular Data

链接: https://arxiv.org/abs/2406.09534
作者: Jiaming Liang,Chuan Lei,Xiao Qin,Jiani Zhang,Asterios Katsifodimos,Christos Faloutsos,Huzefa Rangwala
关键词: training machine learning, Data-centric AI focuses, feature, machine learning, Automatic feature augmentation
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: 15 pages, 41 figures

点击查看摘要

Abstract:Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them have limited capabilities in utilizing all useful features as many of them are in candidate tables not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator’s search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.
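
FeatNavigator's learned importance and integration-quality models are not reproduced here; the toy sketch below only shows the basic loop it builds on, joining a candidate table onto the base table and ranking the new columns by model-based feature importance (table and column names are made up):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy feature augmentation: join a candidate table onto the base table and rank the
# new columns by feature importance. FeatNavigator's search over join paths and its
# learned estimators of importance and integration quality are not reproduced here.
base = pd.DataFrame({"customer_id": [1, 2, 3, 4], "spend": [10.0, 50.0, 30.0, 80.0],
                     "target": [0.5, 2.1, 1.2, 3.3]})
candidate = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                          "visits": [3, 9, 5, 12], "region_code": [0, 1, 0, 1]})

augmented = base.merge(candidate, on="customer_id", how="left")
features = augmented.drop(columns=["customer_id", "target"])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(features, augmented["target"])
for name, imp in sorted(zip(features.columns, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```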

[LG-108] Differentiable Reasoning about Knowledge Graphs with Region-based Graph Neural Networks

链接: https://arxiv.org/abs/2406.09529
作者: Aleksandar Pavlovic,Emanuel Sallinger,Steven Schockaert
关键词: capture semantic regularities, infer plausible knowledge, infer plausible, explicitly capture semantic, semantic regularities
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Methods for knowledge graph (KG) completion need to capture semantic regularities and use these regularities to infer plausible knowledge that is not explicitly stated. Most embedding-based methods are opaque in the kinds of regularities they can capture, although region-based KG embedding models have emerged as a more transparent alternative. By modeling relations as geometric regions in high-dimensional vector spaces, such models can explicitly capture semantic regularities in terms of the spatial arrangement of these regions. Unfortunately, existing region-based approaches are severely limited in the kinds of rules they can capture. We argue that this limitation arises because the considered regions are defined as the Cartesian product of two-dimensional regions. As an alternative, in this paper, we propose RESHUFFLE, a simple model based on ordering constraints that can faithfully capture a much larger class of rule bases than existing approaches. Moreover, the embeddings in our framework can be learned by a monotonic Graph Neural Network (GNN), which effectively acts as a differentiable rule base. This approach has the important advantage that embeddings can be easily updated as new knowledge is added to the KG. At the same time, since the resulting representations can be used similarly to standard KG embeddings, our approach is significantly more efficient than existing approaches to differentiable reasoning.

[LG-109] A Systematic Review of Generative AI for Teaching and Learning Practice

链接: https://arxiv.org/abs/2406.09520
作者: Bayode Ogunleye,Kudirat Ibilola Zakariyyah,Oluwaseun Ajao,Olakunle Olayinka,Hemlata Sharma
关键词: generative artificial intelligence, hotly debated topic, artificial intelligence, debated topic, generative artificial
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 20 pages, 10 figures, article published in Education Sciences

点击查看摘要

Abstract:The use of generative artificial intelligence (GenAI) in academia is a subjective and hotly debated topic. Currently, there are no agreed guidelines towards the usage of GenAI systems in higher education (HE) and, thus, it is still unclear how to make effective use of the technology for teaching and learning practice. This paper provides an overview of the current state of research on GenAI for teaching and learning in HE. To this end, this study conducted a systematic review of relevant studies indexed by Scopus, using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines. The search criteria revealed a total of 625 research papers, of which 355 met the final inclusion criteria. The findings from the review showed the current state and the future trends in documents, citations, document sources/authors, keywords, and co-authorship. The research gaps identified suggest that while some authors have looked at understanding the detection of AI-generated text, it may be beneficial to understand how GenAI can be incorporated into supporting the educational curriculum for assessments, teaching, and learning delivery. Furthermore, there is a need for additional interdisciplinary, multidimensional studies in HE through collaboration. This will strengthen the awareness and understanding of students, tutors, and other stakeholders, which will be instrumental in formulating guidelines, frameworks, and policies for GenAI usage.

[LG-110] CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making

链接: https://arxiv.org/abs/2406.09509
作者: Zibin Dong,Yifu Yuan,Jianye Hao,Fei Ni,Yi Ma,Pengyi Li,Yan Zheng
关键词: powerful generative capability, build decision-making agents, Leveraging the powerful, achieved extensive success, diffusion models
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: The first two authors contribute equally to this work. Code and documentation: this https URL

点击查看摘要

Abstract:Leveraging the powerful generative capability of diffusion models (DMs) to build decision-making agents has achieved extensive success. However, there is still a demand for an easy-to-use and modularized open-source library that offers customized and efficient development for DM-based decision-making algorithms. In this work, we introduce CleanDiffuser, the first DM library specifically designed for decision-making algorithms. By revisiting the roles of DMs in the decision-making domain, we identify a set of essential sub-modules that constitute the core of CleanDiffuser, allowing for the implementation of various DM algorithms with simple and flexible building blocks. To demonstrate the reliability and flexibility of CleanDiffuser, we conduct comprehensive evaluations of various DM algorithms implemented with CleanDiffuser across an extensive range of tasks. The analytical experiments provide a wealth of valuable design choices and insights, reveal opportunities and challenges, and lay a solid groundwork for future research. CleanDiffuser will provide long-term support to the decision-making community, enhancing reproducibility and fostering the development of more robust solutions. The code and documentation of CleanDiffuser are open-sourced on the this https URL.

[LG-111] Fair Data Generation via Score-based Diffusion Model

链接: https://arxiv.org/abs/2406.09495
作者: Yujie Lin,Dong Li,Chen Zhao,Minglai Shao
关键词: garnered increasing attention, numerous fairness algorithms, downstream tasks, increasing attention, decision-making has garnered
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The fairness of AI decision-making has garnered increasing attention, leading to the proposal of numerous fairness algorithms. In this paper, we aim not to address this issue by directly introducing fair learning algorithms, but rather by generating entirely new, fair synthetic data from biased datasets for use in any downstream tasks. Additionally, the distribution of test data may differ from that of the training set, potentially impacting the performance of the generated synthetic data in downstream tasks. To address these two challenges, we propose a diffusion model-based framework, FADM: Fairness-Aware Diffusion with Meta-training. FADM introduces two types of gradient induction during the sampling phase of the diffusion model: one to ensure that the generated samples belong to the desired target categories, and another to make the sensitive attributes of the generated samples difficult to classify into any specific sensitive attribute category. To overcome data distribution shifts in the test environment, we train the diffusion model and the two classifiers used for induction within a meta-learning framework. Compared to other baselines, FADM allows for flexible control over the categories of the generated samples and exhibits superior generalization capability. Experiments on real datasets demonstrate that FADM achieves better accuracy and optimal fairness in downstream tasks.

[LG-112] Q-S5: Towards Quantized State Space Models

链接: https://arxiv.org/abs/2406.09477
作者: Steven Abreu,Jens E. Pedersen,Kade M. Heckel,Alessandro Pierro
关键词: State Space Models, State Space, sequence modeling architectures, next-generation sequence modeling, Space Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In the quest for next-generation sequence modeling architectures, State Space Models (SSMs) have emerged as a potent alternative to transformers, particularly for their computational efficiency and suitability for dynamical systems. This paper investigates the effect of quantization on the S5 model to understand its impact on model performance and to facilitate its deployment to edge and resource-constrained platforms. Using quantization-aware training (QAT) and post-training quantization (PTQ), we systematically evaluate the quantization sensitivity of SSMs across different tasks like dynamical systems modeling, Sequential MNIST (sMNIST) and most of the Long Range Arena (LRA). We present fully quantized S5 models whose test accuracy drops less than 1% on sMNIST and most of the LRA. We find that performance on most tasks degrades significantly for recurrent weights below 8-bit precision, but that other components can be compressed further without significant loss of performance. Our results further show that PTQ only performs well on language-based LRA tasks whereas all others require QAT. Our investigation provides necessary insights for the continued development of efficient and hardware-optimized SSMs.
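
As a rough illustration of what post-training quantization of SSM weights involves, here is a generic symmetric uniform quantizer (not the Q-S5 pipeline or its quantization-aware training counterpart):

```python
import numpy as np

def quantize_dequantize(w, num_bits=8):
    """Symmetric uniform post-training quantization of a weight tensor.

    Generic PTQ helper, not the Q-S5 scheme: it maps weights to signed integers
    with `num_bits` bits and back, so the quantization error can be inspected.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return w_int * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256))       # e.g. a recurrent weight matrix
for bits in (8, 6, 4):
    err = np.abs(quantize_dequantize(w, bits) - w).mean()
    print(f"{bits}-bit PTQ, mean absolute error: {err:.5f}")
```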

[LG-113] Optimal Kernel Orchestration for Tensor Programs with Korch

链接: https://arxiv.org/abs/2406.09465
作者: Muyan Hu,Ashwin Venkatram,Shreyashri Biswas,Balamurugan Marimuthu,Bohan Hou,Gabriele Oliaro,Haojie Wang,Liyan Zheng,Xupeng Miao,Jidong Zhai
关键词: deep neural network, Kernel orchestration, neural network, Kernel, task of mapping
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Fix some typos in the ASPLOS version

点击查看摘要

Abstract:Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at this https URL.
Journal reference: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 3 (2024) 755-769. DOI: https://doi.org/10.1145/3620666.3651383

[LG-114] An effective software risk prediction management analysis of data using machine learning and data mining method

链接: https://arxiv.org/abs/2406.09463
作者: Jinxin Xu,Yue Wang,Ruisi Li,Ziyue Wang,Qian Zhao
关键词: software development processes, guarantee higher-quality software, higher-quality software development, development processes, guarantee higher-quality
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For one to guarantee higher-quality software development processes, risk management is essential. Furthermore, risks are those that could negatively impact an organization’s operations or a project’s progress. The appropriate prioritisation of software project risks is a crucial factor in ascertaining the software project’s performance features and eventual success. They can be used harmoniously with the same training samples and have good complement and compatibility. We carried out in-depth tests on four benchmark datasets to confirm the efficacy of our CIA approach in closed-world and open-world scenarios, with and without defence. We also present a sequential augmentation parameter optimisation technique that captures the interdependencies of the latest deep learning state-of-the-art WF attack models. To achieve precise software risk assessment, the enhanced crow search algorithm (ECSA) is used to modify the ANFIS settings. Solutions that very slightly alter the local optimum and stay inside it are extracted using the ECSA. ANFIS variable when utilising the ANFIS technique. An experimental validation with NASA 93 dataset and 93 software project values was performed. This method’s output presents a clear image of the software risk elements that are essential to achieving project performance. The results of our experiments show that, when compared to other current methods, our integrative fuzzy techniques may perform more accurately and effectively in the evaluation of software project risks.

[LG-115] Ad Auctions for LLMs via Retrieval Augmented Generation

链接: https://arxiv.org/abs/2406.09459
作者: MohammadTaghi Hajiaghayi,Sébastien Lahaie,Keivan Rezaei,Suho Shin
关键词: large language models, compromising content integrity, computational advertising, language models, presents an opportunity
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the field of computational advertising, the integration of ads into the outputs of large language models (LLMs) presents an opportunity to support these services without compromising content integrity. This paper introduces novel auction mechanisms for ad allocation and pricing within the textual outputs of LLMs, leveraging retrieval-augmented generation (RAG). We propose a segment auction where an ad is probabilistically retrieved for each discourse segment (paragraph, section, or entire output) according to its bid and relevance, following the RAG framework, and priced according to competing bids. We show that our auction maximizes logarithmic social welfare, a new notion of welfare that balances allocation efficiency and fairness, and we characterize the associated incentive-compatible pricing rule. These results are extended to multi-ad allocation per segment. An empirical evaluation validates the feasibility and effectiveness of our approach over several ad auction scenarios, and exhibits inherent tradeoffs in metrics as we allow the LLM more flexibility to allocate ads.
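
A toy sketch of the segment auction described above, where one ad per segment is sampled with probability proportional to bid times relevance; the concrete second-price-style payment below is an illustrative assumption, not the incentive-compatible rule derived in the paper:

```python
import random

def segment_auction(ads, seed=None):
    """Toy RAG-style segment auction: sample one ad for a segment with probability
    proportional to bid * relevance, and charge a second-price-style payment.

    `ads` is a list of dicts with 'name', 'bid', 'relevance'. The pricing rule
    below is an illustrative assumption, not the paper's derived rule.
    """
    rng = random.Random(seed)
    scores = [ad["bid"] * ad["relevance"] for ad in ads]
    winner_idx = rng.choices(range(len(ads)), weights=scores, k=1)[0]
    runner_up = max(s for i, s in enumerate(scores) if i != winner_idx)
    # Second-price flavour: the bid at which the winner's score would just match the runner-up's.
    price = runner_up / ads[winner_idx]["relevance"]
    return ads[winner_idx]["name"], round(price, 3)

ads = [
    {"name": "shoes", "bid": 2.0, "relevance": 0.9},
    {"name": "laptops", "bid": 3.0, "relevance": 0.2},
    {"name": "coffee", "bid": 1.0, "relevance": 0.7},
]
for segment in ("paragraph-1", "paragraph-2"):
    print(segment, segment_auction(ads, seed=hash(segment) % 1000))
```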

[LG-116] Simulating Realistic Post-Stroke Reaching Kinematics with Generative Adversarial Networks

链接: https://arxiv.org/abs/2406.09451
作者: Aaron J. Hadley,Christopher L. Pulliam
关键词: generalizability of machine, wearable monitoring, limited scale, scale and heterogeneity, Generative Adversarial Networks
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, 2 tables; submitted to IEEE BHI’24

点击查看摘要

Abstract:The generalizability of machine learning (ML) models for wearable monitoring in stroke rehabilitation is often constrained by the limited scale and heterogeneity of available data. Data augmentation addresses this challenge by adding computationally derived data to real data to enrich the variability represented in the training set. Traditional augmentation methods, such as rotation, permutation, and time-warping, have shown some benefits in improving classifier performance, but often fail to produce realistic training examples. This study employs Conditional Generative Adversarial Networks (cGANs) to create synthetic kinematic data from a publicly available dataset, closely mimicking the experimentally measured reaching movements of stroke survivors. This approach not only captures the complex temporal dynamics and common movement patterns after stroke, but also significantly enhances the training dataset. By training deep learning models on both synthetic and experimental data, we achieved a substantial enhancement in task classification accuracy: models incorporating synthetic data attained an overall accuracy of 80.2%, significantly higher than the 63.1% seen in models trained solely with real data. These improvements allow for more precise task classification, offering clinicians the potential to monitor patient progress more accurately and tailor rehabilitation interventions more effectively.

[LG-117] Comment on paper: Position: Rethinking Post-Hoc Search-Based Neural Approaches for Solving Large-Scale Traveling Salesman Problems

链接: https://arxiv.org/abs/2406.09441
作者: Yimeng Min
关键词: inconsistent time measurements, identify two major, failure to run, inconsistent time, time measurements
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: comment on arXiv:2406.03503 , 4 pages, 1 figure and 1 table

点击查看摘要

Abstract:We identify two major issues in the SoftDist paper (Xia et al.): (1) the failure to run all steps of different baselines on the same hardware environment, and (2) the use of inconsistent time measurements when comparing to other baselines. These issues lead to flawed conclusions. When all steps are executed in the same hardware environment, the primary claim made in SoftDist is no longer supported.

[LG-118] Understanding Pedestrian Movement Using Urban Sensing Technologies: The Promise of Audio-based Sensors

链接: https://arxiv.org/abs/2406.09998
作者: Chaeyeon Han,Pavan Seshadri,Yiwei Ding,Noah Posner,Bon Woo Koo,Animesh Agrawal,Alexander Lerch,Subhrajit Guhathakurta
关键词: monitor vehicular flows, deployed to monitor, monitor vehicular, pedestrian, vehicular flows
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
*备注: submitted to Urban Informatics

点击查看摘要

Abstract:While various sensors have been deployed to monitor vehicular flows, sensing pedestrian movement is still nascent. Yet walking is a significant mode of travel in many cities, especially those in Europe, Africa, and Asia. Understanding pedestrian volumes and flows is essential for designing safer and more attractive pedestrian infrastructure and for controlling periodic overcrowding. This study discusses a new approach to scale up urban sensing of people with the help of novel audio-based technology. It assesses the benefits and limitations of microphone-based sensors as compared to other forms of pedestrian sensing. A large-scale dataset called ASPED is presented, which includes high-quality audio recordings along with video recordings used for labeling the pedestrian count data. The baseline analyses highlight the promise of using audio sensors for pedestrian tracking, although algorithmic and technological improvements to make the sensors practically usable continue. This study also demonstrates how the data can be leveraged to predict pedestrian trajectories. Finally, it discusses the use cases and scenarios where audio-based pedestrian sensing can support better urban and transportation planning.

[LG-119] SCKansformer: Fine-Grained Classification of Bone Marrow Cells via Kansformer Backbone and Hierarchical Attention Mechanisms

链接: https://arxiv.org/abs/2406.09931
作者: Yifei Chen,Zhu Zhu,Shenghao Zhu,Linwei Qiu,Binfeng Zou,Fan Jia,Yunpeng Zhu,Chenyan Zhang,Zhaojie Fang,Feiwei Qin,Jin Fan,Changmiao Wang,Yu Gao,Gang Yu
关键词: bone marrow blood, Global-Local Attention Encoder, malignant tumors, diagnose malignant tumors, bone marrow
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redundant feature extraction when processing high-dimensional microimage data. We propose a novel fine-grained classification model, SCKansformer, for bone marrow blood cells, which addresses these challenges and enhances classification accuracy and efficiency. The model integrates the Kansformer Encoder, SCConv Encoder, and Global-Local Attention Encoder. The Kansformer Encoder replaces the traditional MLP layer with the KAN, improving nonlinear feature representation and interpretability. The SCConv Encoder, with its Spatial and Channel Reconstruction Units, enhances feature representation and reduces redundancy. The Global-Local Attention Encoder combines Multi-head Self-Attention with a Local Part module to capture both global and local features. We validated our model using the Bone Marrow Blood Cell Fine-Grained Classification Dataset (BMCD-FGCD), comprising over 10,000 samples and nearly 40 classifications, developed with a partner hospital. Comparative experiments on our private dataset, as well as the publicly available PBC and ALL-IDB datasets, demonstrate that SCKansformer outperforms both typical and advanced microcell classification methods across all datasets. Our source code and private BMCD-FGCD dataset are available at this https URL.

[LG-120] Fundamental operating regimes hyper-parameter fine-tuning and glassiness: towards an interpretable replica-theory for trained restricted Boltzmann machines

链接: https://arxiv.org/abs/2406.09924
作者: Alberto Fachechi,Elena Agliari,Miriam Aquaro,Anthony Coolen,Menno Mulder
关键词: Gaussian hidden layer, restricted Boltzmann machines, binary visible layer, single ground pattern, unlabelled dataset composed
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider restricted Boltzmann machines with a binary visible layer and a Gaussian hidden layer trained by an unlabelled dataset composed of noisy realizations of a single ground pattern. We develop a statistical mechanics framework to describe the network generative capabilities, by exploiting the replica trick and assuming self-averaging of the underlying order parameters (i.e., replica symmetry). In particular, we outline the effective control parameters (e.g., the relative number of weights to be trained, the regularization parameter), whose tuning can yield qualitatively-different operative regimes. Further, we provide analytical and numerical evidence for the existence of a sub-region in the space of the hyperparameters where replica-symmetry breaking occurs.

[LG-121] Towards Full Integration of Artificial Intelligence in Colon Capsule Endoscopy's Pathway

链接: https://arxiv.org/abs/2406.09761
作者: Esmaeil S. Nadimi,Jan-Matthias Braun,Benedicte Schelde-Olesen,Emile Prudhomme,Victoria Blanes-Vidal,Gunnar Baatrup
关键词: colon capsule endoscopy, counterpart optical colonoscopy, deploying colon capsule, current state, CCE pathway
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite a recent surge of interest in deploying colon capsule endoscopy (CCE) for early diagnosis of colorectal diseases, there remains a large gap between the current state of CCE in clinical practice and the state of its counterpart optical colonoscopy (OC). Our study is aimed at closing this gap by focusing on the full integration of AI in CCE’s pathway, where image processing steps linked to the detection, localization and characterisation of important findings are carried out autonomously using various AI algorithms. We developed a recognition network that detected colorectal polyps with an impressive sensitivity of 99.9%, a specificity of 99.4%, and a negative predictive value (NPV) of 99.8%. After recognising a polyp within a sequence of images, only those images containing polyps were fed into two parallel independent networks for characterisation and estimation of the size of those important findings. The characterisation network reached a sensitivity of 82% and a specificity of 80% in classifying polyps into two groups, namely neoplastic vs. non-neoplastic. The size estimation network reached an accuracy of 88% in correctly segmenting the polyps. By automatically incorporating this crucial information into CCE’s pathway, we moved a step closer towards the full integration of AI in CCE’s routine clinical practice.

[LG-122] Large language model validity via enhanced conformal prediction methods

链接: https://arxiv.org/abs/2406.09714
作者: John J. Cherian,Isaac Gibbs,Emmanuel J. Candès
关键词: large language models, obtaining validity guarantees, conformal inference methods, obtaining validity, language models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 20 pages, 8 figures

点击查看摘要

Abstract:We develop new conformal inference methods for obtaining validity guarantees on the output of large language models (LLMs). Prior work in conformal language modeling identifies a subset of the text that satisfies a high-probability guarantee of correctness. These methods work by filtering claims from the LLM’s original response if a scoring function evaluated on the claim fails to exceed a threshold calibrated via split conformal prediction. Existing methods in this area suffer from two deficiencies. First, the guarantee stated is not conditionally valid. The trustworthiness of the filtering step may vary based on the topic of the response. Second, because the scoring function is imperfect, the filtering step can remove many valuable and accurate claims. We address both of these challenges via two new conformal methods. First, we generalize the conditional conformal procedure of Gibbs et al. (2023) in order to adaptively issue weaker guarantees when they are required to preserve the utility of the output. Second, we show how to systematically improve the quality of the scoring function via a novel algorithm for differentiating through the conditional conformal procedure. We demonstrate the efficacy of our approach on both synthetic and real-world datasets.
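
A minimal split-conformal claim-filtering sketch in the spirit of the prior work the abstract describes (synthetic scores, a simplified threshold rule, and none of the paper's conditional or differentiable extensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration set: each claim has a score (higher = more likely correct)
# and a ground-truth correctness label. In practice the score would come from a
# learned claim-scoring function evaluated on LLM outputs.
n_cal = 1000
scores_cal = rng.uniform(size=n_cal)
correct_cal = rng.uniform(size=n_cal) < scores_cal   # scores are (synthetically) informative

alpha = 0.1
# Simplified split-conformal calibration: pick the threshold at the
# ceil((1-alpha)(n+1))-th smallest score among incorrect calibration claims,
# so that only about an alpha fraction of incorrect claims clear it.
wrong_scores = np.sort(scores_cal[~correct_cal])
k = int(np.ceil((1 - alpha) * (len(wrong_scores) + 1))) - 1
tau = wrong_scores[min(k, len(wrong_scores) - 1)]

def filter_claims(claims_with_scores, threshold=tau):
    """Keep only claims whose score clears the calibrated threshold."""
    return [c for c, s in claims_with_scores if s >= threshold]

test = [("claim A", 0.95), ("claim B", 0.40), ("claim C", 0.88)]
print("threshold:", round(float(tau), 3), "kept:", filter_claims(test))
```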

[LG-123] An Efficient Approach to Regression Problems with Tensor Neural Networks

链接: https://arxiv.org/abs/2406.09694
作者: Yongxin Li
关键词: tensor neural network, nonparametric regression problems, address nonparametric regression, Basis Function Networks, neural network
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a tensor neural network (TNN) to address nonparametric regression problems. Characterized by its distinct sub-network structure, the TNN effectively facilitates variable separation, thereby enhancing the approximation of complex, unknown functions. Our comparative analysis reveals that the TNN outperforms conventional Feed-Forward Networks (FFN) and Radial Basis Function Networks (RBN) in terms of both approximation accuracy and generalization potential, despite a similar scale of parameters. A key innovation of our approach is the integration of statistical regression and numerical integration within the TNN framework. This integration allows for the efficient computation of high-dimensional integrals associated with the regression function. The implications of this advancement extend to a broader range of applications, particularly in scenarios demanding precise high-dimensional data analysis and prediction.

[LG-124] Trainability issues in quantum policy gradients

链接: https://arxiv.org/abs/2406.09614
作者: André Sequeira,Luis Paulo Santos,Luis Soares Barbosa
关键词: Parameterized Quantum circuit-based, Reinforcement Learning, Quantum circuit-based policies, Parameterized Quantum, trainability of Parameterized
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This research explores the trainability of Parameterized Quantum circuit-based policies in Reinforcement Learning, an area that has recently seen a surge in empirical exploration. While some studies suggest improved sample complexity using quantum gradient estimation, the efficient trainability of these policies remains an open question. Our findings reveal significant challenges, including standard Barren Plateaus with exponentially small gradients and gradient explosion. These phenomena depend on the type of basis-state partitioning and mapping these partitions onto actions. For a polynomial number of actions, a trainable window can be ensured with a polynomial number of measurements if a contiguous-like partitioning of basis-states is employed. These results are empirically validated in a multi-armed bandit environment.

[LG-125] Causal Fine-Tuning and Effect Calibration of Non-Causal Predictive Models

链接: https://arxiv.org/abs/2406.09567
作者: Carlos Fernández-Loría,Yanfang Hou,Foster Provost,Jennifer Hill
关键词: non-causal models, randomized experiments, enhance the performance, non-causal, paper proposes techniques
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes techniques to enhance the performance of non-causal models for causal inference using data from randomized experiments. In domains like advertising, customer retention, and precision medicine, non-causal models that predict outcomes under no intervention are often used to score individuals and rank them according to the expected effectiveness of an intervention (e.g., an ad, a retention incentive, a nudge). However, these scores may not perfectly correspond to intervention effects due to the inherent non-causal nature of the models. To address this limitation, we propose causal fine-tuning and effect calibration, two techniques that leverage experimental data to refine the output of non-causal models for different causal tasks, including effect estimation, effect ordering, and effect classification. They are underpinned by two key advantages. First, they can effectively integrate the predictive capabilities of general non-causal models with the requirements of a causal task in a specific context, allowing decision makers to support diverse causal applications with a “foundational” scoring model. Second, through simulations and an empirical example, we demonstrate that they can outperform the alternative of building a causal-effect model from scratch, particularly when the available experimental data is limited and the non-causal scores already capture substantial information about the relative sizes of causal effects. Overall, this research underscores the practical advantages of combining experimental data with non-causal models to support causal applications.

[LG-126] Embedding machine-learnt sub-grid variability improves climate model biases

链接: https://arxiv.org/abs/2406.09551
作者: Daniel Giles,James Briant,Cyril J. Morcrette,Serge Guillas
关键词: Unified Model simulations, General Circulation Model, Atmospheric General Circulation, trained MOGP model, current climate models
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The under-representation of cloud formation is a long-standing bias associated with climate simulations. Parameterisation schemes are required to capture cloud processes within current climate models but have known biases. We overcome these biases by embedding a Multi-Output Gaussian Process (MOGP) trained on high resolution Unified Model simulations to represent the variability of temperature and specific humidity within a climate model. A trained MOGP model is coupled in-situ with a simplified Atmospheric General Circulation Model named SPEEDY. The temperature and specific humidity profiles of SPEEDY are perturbed at fixed intervals according to the variability predicted from the MOGP. Ten-year predictions are generated for both control and ML-hybrid models. The hybrid model reduces the global precipitation bias by 18% and over the tropics by 22%. To further understand the drivers of these improvements, physical quantities of interest are explored, such as the distribution of lifted index values and the alteration of the Hadley cell. The control and hybrid set-ups are also run in a plus 4K sea-surface temperature experiment to explore the effects of the approach on patterns relating to cloud cover and precipitation in a warmed climate setting.

[LG-127] Fair GLASSO: Estimating Fair Graphical Models with Unbiased Statistical Behavior

链接: https://arxiv.org/abs/2406.09513
作者: Madeline Navarro,Samuel Rey,Andrei Buciulea,Antonio G. Marques,Santiago Segarra
关键词: estimating Gaussian graphical, propose estimating Gaussian, estimating Gaussian, graphical models, Gaussian graphical models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:We propose estimating Gaussian graphical models (GGMs) that are fair with respect to sensitive nodal attributes. Many real-world models exhibit unfair discriminatory behavior due to biases in data. Such discrimination is known to be exacerbated when data is equipped with pairwise relationships encoded in a graph. Additionally, the effect of biased data on graphical models is largely underexplored. We thus introduce fairness for graphical models in the form of two bias metrics to promote balance in statistical similarities across nodal groups with different sensitive attributes. Leveraging these metrics, we present Fair GLASSO, a regularized graphical lasso approach to obtain sparse Gaussian precision matrices with unbiased statistical dependencies across groups. We also propose an efficient proximal gradient algorithm to obtain the estimates. Theoretically, we express the tradeoff between fair and accurate estimated precision matrices. Critically, this includes demonstrating when accuracy can be preserved in the presence of a fairness regularizer. On top of this, we study the complexity of Fair GLASSO and demonstrate that our algorithm enjoys a fast convergence rate. Our empirical validation includes synthetic and real-world simulations that illustrate the value and effectiveness of our proposed optimization problem and iterative algorithm.
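
作为直观说明,下面是一个玩具示例(an illustrative sketch only: the imbalance measure below is not the paper's bias metric, and no fairness regulariser or Fair GLASSO solver is implemented),展示普通图 lasso 估计在不同节点组之间可能出现的统计差异:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)

# Toy data: 10 nodes, first 5 in group A, last 5 in group B, with stronger
# dependencies inside group A than inside group B.
n, p = 2000, 10
groups = np.array([0] * 5 + [1] * 5)
cov = np.eye(p)
cov[:5, :5] += 0.4
cov[5:, 5:] += 0.1
np.fill_diagonal(cov, 1.0)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Standard (fairness-unaware) graphical lasso estimate of the precision matrix.
Theta = GraphicalLasso(alpha=0.05).fit(X).precision_

# Illustrative group-level imbalance measure (not the paper's metric):
# average absolute off-diagonal dependency strength per group.
off = ~np.eye(5, dtype=bool)

def strength(g):
    sub = Theta[np.ix_(groups == g, groups == g)]
    return np.abs(sub[off]).mean()

print("group A dependency strength:", round(strength(0), 4))
print("group B dependency strength:", round(strength(1), 4))
# Fair GLASSO adds a regulariser that penalises this kind of imbalance while
# keeping the estimated precision matrix sparse.
```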

[LG-128] The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments

链接: https://arxiv.org/abs/2406.09494
作者: Shareef Babu Kalluri,Prachi Singh,Pratik Roy Chowdhuri,Apoorva Kulkarni,Shikha Baghel,Pradyoth Hegde,Swapnil Sontakke,Deepak K T,S. R. Mahadeva Prasanna,Deepu Vijayasenan,Sriram Ganapathy
关键词: challenging multilingual conversational, multilingual conversational speech, Conversational Environments, conversational speech dataset, multilingual conversational
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, Interspeech 2024

点击查看摘要

Abstract:The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of speaker diarization (SD) and language diarization (LD) on a challenging multilingual conversational speech dataset. In the DISPLACE 2024 challenge, we also introduced the task of automatic speech recognition (ASR) on this dataset. The dataset containing 158 hours of speech, consisting of both supervised and unsupervised mono-channel far-field recordings, was released for LD and SD tracks. Further, 12 hours of close-field mono-channel recordings were provided for the ASR track conducted on 5 Indian languages. The details of the dataset, baseline systems and the leader board results are highlighted in this paper. We have also compared our baseline models and the team’s performances on evaluation data of DISPLACE-2023 to emphasize the advancements made in this second version of the challenge.

[LG-129] Lightning-Fast Thunderstorm Warnings: Predicting Severe Convective Environments with Global Neural Weather Models

链接: https://arxiv.org/abs/2406.09474
作者: Monika Feldmann,Tom Beucler,Milton Gomez,Olivia Martius
关键词: recently released suite, models, Deep Layer Shear, produce multi-day, recently released
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 23 pages, 7 Figures. In preparation for submission to Environmental Research Letters

点击查看摘要

Abstract:The recently released suite of AI weather models can produce multi-day, medium-range forecasts within seconds, with a skill on par with state-of-the-art operational forecasts. Traditional AI model evaluation predominantly targets global scores on single levels. Specific prediction tasks, such as severe convective environments, require much more precision on a local scale and with the correct vertical gradients between levels. With a focus on the convective season of global hotspots in 2020, we assess the skill of three top-performing AI models (Pangu-Weather, GraphCast, FourCastNet) for Convective Available Potential Energy (CAPE) and Deep Layer Shear (DLS) at lead-times of up to 10 days against the ERA-5 reanalysis and the IFS operational numerical weather prediction model. Looking at the example of a US tornado outbreak on April 12 and 13, 2020, all models predict elevated CAPE and DLS values multiple days in advance. The spatial structures in the AI models are smoothed in comparison to IFS and ERA-5. The models show differing biases in the prediction of CAPE values, with GraphCast capturing the value distribution the most accurately and FourCastNet showing a consistent underestimation. In seasonal analyses around the globe, we generally see the highest performance by GraphCast and Pangu-Weather, which match or even exceed the performance of IFS. CAPE derived from vertically coarse pressure levels of neural weather models lacks the precision of the vertically fine resolution of numerical models. The promising results here indicate that a direct prediction of CAPE in AI models is likely to be skillful. This would open unprecedented opportunities for fast and inexpensive predictions of severe weather phenomena. By advancing the assessment of AI models towards process-based evaluations we lay the foundation for hazard-driven applications of AI-based weather forecasts.

[LG-130] A topological analysis of the space of recipes

链接: https://arxiv.org/abs/2406.09445
作者: Emerson G. Escolar,Yuta Shimada,Masahiro Yuasa
关键词: recent years, culinary recipes, underlying patterns, patterns and principles, topological data analysis
类目: Algebraic Topology (math.AT); Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

Abstract:In recent years, the use of data-driven methods has provided insights into underlying patterns and principles behind culinary recipes. In this exploratory work, we introduce the use of topological data analysis, especially persistent homology, in order to study the space of culinary recipes. In particular, persistent homology analysis provides a set of recipes surrounding the multiscale “holes” in the space of existing recipes. We then propose a method to generate novel ingredient combinations using combinatorial optimization on this topological information. We made biscuits using the novel ingredient combinations, which were confirmed to be acceptable enough by a sensory evaluation study. Our findings indicate that topological data analysis has the potential for providing new tools and insights in the study of culinary recipes.

[LG-131] Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

链接: https://arxiv.org/abs/2406.09443
作者: Satyam Kumar,Sai Srujana Buddi,Utkarsh Oggy Sarawgi,Vineet Garg,Shivesh Ranjan,Ognjen (Oggi) Rudovic,Ahmed Hussen Abdelaziz,Saurabh Adya
关键词: Voice activity detection, Personalized Voice Activity, hands-free communication systems, Voice activity, personalized VAD systems
类目: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive approach to assess PVAD systems, incorporating various performance metrics such as frame-level and utterance-level error rates, detection latency and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications using a comprehensive set of metrics.

[LG-132] An insertable glucose sensor using a compact and cost-effective phosphorescence lifetime imager and machine learning

链接: https://arxiv.org/abs/2406.09442
作者: Artem Goncharov,Zoltan Gorocs,Ridhi Pradhan,Brian Ko,Ajmal Ajmal,Andres Rodriguez,David Baum,Marcell Veszpremi,Xilin Yang,Maxime Pindrys,Tianle Zheng,Oliver Wang,Jessica C. Ramella-Roman,Michael J. McShane,Aydogan Ozcan
关键词: conventional electrochemical CGMs, continuous glucose monitoring, personalized glucose management, glucose management owing, prolonged durability compared
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Biological Physics (physics.bio-ph)
*备注: 24 Pages, 4 Figures

点击查看摘要

Abstract:Optical continuous glucose monitoring (CGM) systems are emerging for personalized glucose management owing to their lower cost and prolonged durability compared to conventional electrochemical CGMs. Here, we report a computational CGM system, which integrates a biocompatible phosphorescence-based insertable biosensor and a custom-designed phosphorescence lifetime imager (PLI). This compact and cost-effective PLI is designed to capture phosphorescence lifetime images of an insertable sensor through the skin, where the lifetime of the emitted phosphorescence signal is modulated by the local concentration of glucose. Because this phosphorescence signal has a very long lifetime compared to tissue autofluorescence or excitation leakage processes, it completely bypasses these noise sources by measuring the sensor emission over several tens of microseconds after the excitation light is turned off. The lifetime images acquired through the skin are processed by neural network-based models for misalignment-tolerant inference of glucose levels, accurately revealing normal, low (hypoglycemia) and high (hyperglycemia) concentration ranges. Using a 1-mm thick skin phantom mimicking the optical properties of human skin, we performed in vitro testing of the PLI using glucose-spiked samples, yielding 88.8% inference accuracy, also showing resilience to random and unknown misalignments within a lateral distance of ~4.7 mm with respect to the position of the insertable sensor underneath the skin phantom. Furthermore, the PLI accurately identified larger lateral misalignments beyond 5 mm, prompting user intervention for re-alignment. The misalignment-resilient glucose concentration inference capability of this compact and cost-effective phosphorescence lifetime imager makes it an appealing wearable diagnostics tool for real-time tracking of glucose and other biomarkers.
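
核心测量——从衰减曲线中估计磷光寿命——可以用一次指数拟合来示意(toy numbers and a plain curve fit; this is not the authors' calibration or their neural-network inference pipeline):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)

# Toy decay trace sampled after the excitation light is switched off.
t = np.linspace(0, 500e-6, 200)               # seconds
true_tau = 120e-6                             # assumed lifetime; glucose-dependent
signal = 1.0 * np.exp(-t / true_tau) + 0.01 * rng.standard_normal(t.size)

def decay(t, amplitude, tau, offset):
    """Single-exponential phosphorescence decay model."""
    return amplitude * np.exp(-t / tau) + offset

(amp, tau, offset), _ = curve_fit(decay, t, signal, p0=(1.0, 100e-6, 0.0))
print(f"estimated lifetime: {tau * 1e6:.1f} microseconds")
# A calibration curve (or, as in the paper, a neural network operating on
# lifetime images) then maps the estimated lifetime to a glucose concentration.
```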

信息检索

[IR-0] HIRO: Hierarchical Information Retrieval Optimization

链接: https://arxiv.org/abs/2406.09979
作者: Krish Goel,Mahek Chandak
关键词: Large Language Models, Large Language, natural language tasks, face limitations due, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel in natural language tasks but face limitations due to static training datasets, resulting in outdated or contextually shallow responses. Retrieval-Augmented Generation (RAG) addresses this by integrating real-time external knowledge, enhancing model accuracy and credibility, especially for knowledge-intensive tasks. However, RAG-enhanced LLMs struggle with long contexts, causing them to “choke” on information overload, compromising response quality. Recent RAG applications use hierarchical data structures for storing documents, organized at various levels of summarization and information density. In this context, we introduce HIRO (Hierarchical Information Retrieval Optimization), a novel querying approach for RAG applications using hierarchical structures for storing documents. HIRO employs DFS-based recursive similarity score calculation and branch pruning to minimize the context returned to the LLM without informational loss. HIRO outperforms existing querying mechanisms on the NarrativeQA dataset by an absolute performance gain of 10.85%.
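
摘要中描述的"基于 DFS 的递归相似度打分 + 分支剪枝"可以用如下小例子示意(cosine similarity over made-up random embeddings; the actual HIRO scoring and pruning rules may differ):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dfs_retrieve(node, query_vec, threshold=0.3, results=None):
    """Recursively score a document-hierarchy node; prune branches whose
    summary embedding is too dissimilar from the query."""
    if results is None:
        results = []
    score = cosine(node["embedding"], query_vec)
    if score < threshold:
        return results                      # prune this branch entirely
    if not node["children"]:
        results.append((score, node["text"]))
    for child in node["children"]:
        dfs_retrieve(child, query_vec, threshold, results)
    return results

# Tiny hierarchical store: root summary -> section summaries -> leaf chunks.
rng = np.random.default_rng(4)
leaf = lambda txt: {"embedding": rng.normal(size=8), "text": txt, "children": []}
tree = {
    "embedding": rng.normal(size=8),
    "text": "root summary",
    "children": [
        {"embedding": rng.normal(size=8), "text": "section A",
         "children": [leaf("chunk A1"), leaf("chunk A2")]},
        {"embedding": rng.normal(size=8), "text": "section B",
         "children": [leaf("chunk B1")]},
    ],
}
query = rng.normal(size=8)
# threshold=-1.0 disables pruning for this tiny random demo; a real deployment
# would use a positive cut-off so low-scoring branches are never expanded.
print(sorted(dfs_retrieve(tree, query, threshold=-1.0), reverse=True))
# Only the highest-scoring leaves are passed on as context to the LLM.
```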

[IR-1] Harm Mitigation in Recommender Systems under User Preference Dynamics

链接: https://arxiv.org/abs/2406.09882
作者: Jerry Chee,Shankar Kalyanaraman,Sindhu Kiranmai Ernala,Udi Weinsberg,Sarah Dean,Stratis Ioannidis
关键词: harmful content, consume harmful content, recommender system, account the interplay, user interests
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Recommender Systems; Harm Mitigation; Amplification; User Preference Modeling

点击查看摘要

Abstract:We consider a recommender system that takes into account the interplay between recommendations, the evolution of user interests, and harmful content. We model the impact of recommendations on user behavior, particularly the tendency to consume harmful content. We seek recommendation policies that establish a tradeoff between maximizing click-through rate (CTR) and mitigating harm. We establish conditions under which the user profile dynamics have a stationary point, and propose algorithms for finding an optimal recommendation policy at stationarity. We experiment on a semi-synthetic movie recommendation setting initialized with real data and observe that our policies outperform baselines at simultaneously maximizing CTR and mitigating harm.

[IR-2] Unraveling Anomalies in Time: Unsupervised Discovery and Isolation of Anomalous Behavior in Bio-regenerative Life Support System Telemetry

链接: https://arxiv.org/abs/2406.09825
作者: Ferdinand Rewicki,Jakob Gawlikowski,Julia Niebling,Joachim Denzler
关键词: condition monitoring, critical system states, states is essential, essential in condition, Life Support Systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 12 pages, + Supplemental Materials, Accepted at ECML PKDD 2024 (European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases)

点击查看摘要

Abstract:The detection of abnormal or critical system states is essential in condition monitoring. While much attention is given to promptly identifying anomalies, a retrospective analysis of these anomalies can significantly enhance our comprehension of the underlying causes of observed undesired behavior. This aspect becomes particularly critical when the monitored system is deployed in a vital environment. In this study, we delve into anomalies within the domain of Bio-Regenerative Life Support Systems (BLSS) for space exploration and analyze anomalies found in telemetry data stemming from the EDEN ISS space greenhouse in Antarctica. We employ time series clustering on anomaly detection results to categorize various types of anomalies in both uni- and multivariate settings. We then assess the effectiveness of these methods in identifying systematic anomalous behavior. Additionally, we illustrate that the anomaly detection methods MDI and DAMP produce complementary results, as previously indicated by research.

[IR-3] ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures

链接: https://arxiv.org/abs/2406.09818
作者: Tobias Schimanski,Jingwei Ni,Roberto Spacey,Nicola Ranger,Markus Leippold
关键词: Retrieval Augmented Generation, stakeholders increasingly rely, qualitative data produced, Augmented Generation, Retrieval Augmented
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:To handle the vast amounts of qualitative data produced in corporate climate communication, stakeholders increasingly rely on Retrieval Augmented Generation (RAG) systems. However, a significant gap remains in evaluating domain-specific information retrieval - the basis for answer generation. To address this challenge, this work simulates the typical tasks of a sustainability analyst by examining 30 sustainability reports with 16 detailed climate-related questions. As a result, we obtain a dataset with over 8.5K unique question-source-answer pairs labeled by different levels of relevance. Furthermore, we develop a use case with the dataset to investigate the integration of expert knowledge into information retrieval with embeddings. Although we show that incorporating expert knowledge works, we also outline the critical limitations of embeddings in knowledge-intensive downstream domains like climate change communication.

[IR-4] Soil nitrogen forecasting from environmental variables provided by multisensor remote sensing images

链接: https://arxiv.org/abs/2406.09812
作者: Weiying Zhao,Ganzorig Chuluunbat,Aleksei Unagaev,Natalia Efremova
关键词: soil nitrogen content, forecasting soil nitrogen, Area Frame Survey, leveraging multi-modal data, Cover Area Frame
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This study introduces a framework for forecasting soil nitrogen content, leveraging multi-modal data, including multi-sensor remote sensing images and advanced machine learning methods. We integrate the Land Use/Land Cover Area Frame Survey (LUCAS) database, which covers European and UK territory, with environmental variables from satellite sensors to create a dataset of novel features. We further test a broad range of machine learning algorithms, focusing on tree-based models such as CatBoost, LightGBM, and XGBoost. We test the proposed methods with a variety of land cover classes, including croplands and grasslands to ensure the robustness of this approach. Our results demonstrate that the CatBoost model surpasses other methods in accuracy. This research advances the field of agricultural management and environmental monitoring and demonstrates the significant potential of integrating multisensor remote sensing data with machine learning for environmental analysis.
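
摘要描述的模型对比流程大致如下(a generic sketch, assuming the catboost, lightgbm and xgboost packages are installed, with synthetic features standing in for the LUCAS / remote-sensing data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(5)

# Synthetic stand-in for the multisensor features (e.g. spectral indices,
# climate variables) and the soil-nitrogen target.
X = rng.normal(size=(1000, 12))
y = X[:, 0] * 0.8 + np.sin(X[:, 1]) + 0.1 * rng.standard_normal(1000)

models = {
    "CatBoost": CatBoostRegressor(verbose=0, random_state=0),
    "LightGBM": LGBMRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:8s} mean CV R^2 = {r2:.3f}")
```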

[IR-5] IFA: Interaction Fidelity Attention for Entire Lifelong Behaviour Sequence Modeling

链接: https://arxiv.org/abs/2406.09742
作者: Wenhui Yu,Chao Feng,Yanze Zhang,Lantao Hu,Peng Jiang,Han Li
关键词: computational consumption significantly, gains impressive improvement, increases computational consumption, lifelong user behavior, user behavior sequence
类目: Information Retrieval (cs.IR)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:The lifelong user behavior sequence provides abundant information about user preferences and yields impressive improvements in the recommendation task, but it increases computational consumption significantly. To meet the severe latency requirements of online serving, a short sub-sequence is sampled based on similarity to the target item. Unfortunately, items not in the sub-sequence are abandoned, leading to serious information loss. In this paper, we propose a new efficient paradigm to model the full lifelong sequence, named Interaction Fidelity Attention (IFA). In IFA, we input all target items in the candidate set into the model at once, and leverage a linear transformer to reduce the time complexity of the cross attention between the candidate set and the sequence without any loss of interaction information. We also model the relationships among all target items for optimal set generation, and design a loss function for better consistency between training and inference. We demonstrate the effectiveness and efficiency of our model through offline and online experiments in the recommender system of Kuaishou.
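
摘要中提到的线性 Transformer 技巧可以这样示意(NumPy, with the common elu+1 feature map; not necessarily the exact kernel IFA uses):它把 softmax 式的交叉注意力改写成对序列长度线性的形式。

```python
import numpy as np

rng = np.random.default_rng(6)
L_seq, M_targets, d = 10_000, 64, 32          # lifelong sequence vs candidate set

Q = rng.normal(size=(M_targets, d))           # queries: candidate target items
K = rng.normal(size=(L_seq, d))               # keys: full behaviour sequence
V = rng.normal(size=(L_seq, d))               # values: full behaviour sequence

def feature_map(x):
    """elu(x) + 1, a common positive feature map for linearised attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

phi_q, phi_k = feature_map(Q), feature_map(K)

# Linear attention: phi(Q) @ (phi(K)^T V) costs O(L*d^2 + M*d^2)
# instead of the O(M*L*d) cost of materialising the M x L attention matrix.
kv = phi_k.T @ V                              # (d, d) summary of the whole sequence
normaliser = phi_q @ phi_k.sum(axis=0)        # (M,)
out = (phi_q @ kv) / normaliser[:, None]      # (M, d) attended representations
print(out.shape)
```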

[IR-6] Enhancing Text Corpus Exploration with Post Hoc Explanations and Comparative Design

链接: https://arxiv.org/abs/2406.09686
作者: Michael Gleicher,Keaton Leppenan,Yunyu Bai
关键词: include item discovery, Text corpus exploration, Text corpus, exploratory search tasks, exploratory search
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: The system is available at: this https URL . The user guide (including more examples) is at: this https URL

点击查看摘要

Abstract:Text corpus exploration (TCE) spans the range of exploratory search tasks: it goes beyond simple retrieval to include item discovery and learning about the corpus and topic. Systems support TCE with tools such as similarity-based recommendations and embedding-based spatial maps. However, these tools address specific tasks; current systems lack the flexibility to support the range of tasks encountered in practice and the iterative, multiscale, workflows users employ. In this paper, we provide methods that enhance TCE tools with post hoc explanations and multiscale, comparative designs to provide flexible support for user needs. We introduce salience functions as a mechanism to provide post hoc explanations of similarity, recommendations, and spatial placement. This post hoc strategy allows our approach to complement a variety of underlying algorithms; the salience functions provide both exemplar- and feature-based explanations at scales ranging from individual documents through to the entire corpus. These explanations are incorporated into a set of views that operate at multiple scales. The views use design elements that explicitly support comparison to enable flexible integration. Together, these form an approach that provides a flexible toolset that can address a range of tasks. We demonstrate our approach in a prototype system that enables the exploration of corpora of paper abstracts and newspaper archives. Examples illustrate how our approach enables the system to flexibly support a wide range of tasks and workflows that emerge in user scenarios. A user study confirms that researchers are able to use our system to achieve a variety of tasks.

[IR-7] Enhancing Knowledge Retrieval with In-Context Learning and Semantic Search through Generative AI

链接: https://arxiv.org/abs/2406.09621
作者: Mohammed-Khalil Ghali,Abdelrahman Farrag,Daehan Won,Yu Jin
关键词: today information-rich era, extensive research documents, presents significant challenges, Retrieving and extracting, databases presents significant
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieving and extracting knowledge from extensive research documents and large databases presents significant challenges for researchers, students, and professionals in today’s information-rich era. Existing retrieval systems, which rely on general-purpose Large Language Models (LLMs), often fail to provide accurate responses to domain-specific inquiries. Additionally, the high cost of pretraining or fine-tuning LLMs for specific domains limits their widespread adoption. To address these limitations, we propose a novel methodology that combines the generative capabilities of LLMs with the fast and accurate retrieval capabilities of vector databases. This advanced retrieval system can efficiently handle both tabular and non-tabular data, understand natural language user queries, and retrieve relevant information without fine-tuning. The developed model, Generative Text Retrieval (GTR), is adaptable to both unstructured and structured data with minor refinement. GTR was evaluated on both manually annotated and public datasets, achieving over 90% accuracy and delivering truthful outputs in 87% of cases. Our model achieved state-of-the-art performance with a Rouge-L F1 score of 0.98 on the MSMARCO dataset. The refined model, Generative Tabular Text Retrieval (GTR-T), demonstrated its efficiency in large database querying, achieving an Execution Accuracy (EX) of 0.82 and an Exact-Set-Match (EM) accuracy of 0.60 on the Spider dataset, using an open-source LLM. These efforts leverage Generative AI and In-Context Learning to enhance human-text interaction and make advanced AI capabilities more accessible. By integrating robust retrieval systems with powerful LLMs, our approach aims to democratize access to sophisticated AI tools, improving the efficiency, accuracy, and scalability of AI-driven information retrieval and database querying.

[IR-8] Multi-Modal Retrieval For Large Language Model Based Speech Recognition

链接: https://arxiv.org/abs/2406.09618
作者: Jari Kolehmainen,Aditya Gourav,Prashanth Gurunath Shivakumar,Yile Gu,Ankur Gandhe,Ariya Rastrow,Grant Strimel,Ivan Bulyko
关键词: widely adopted approach, improving language models, language models leveraging, leveraging external information, models leveraging external
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Retrieval is a widely adopted approach for improving language models leveraging external information. As the field moves towards multi-modal large language models, it is important to extend the pure text based methods to incorporate other modalities in retrieval as well for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text based retrieval, and yields up to 50 % improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
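
文中 kNN-LM 这一侧可以用一个小例子示意(toy vocabulary and random embeddings; the multi-modal variant would replace text-derived keys with audio-derived keys):把参数化语言模型的下一词分布与外部数据库最近邻诱导的分布按系数插值。

```python
import numpy as np

rng = np.random.default_rng(7)
vocab, d, n_store, k, lam = 50, 16, 1000, 8, 0.4

# External datastore: context embeddings paired with the token that followed.
keys = rng.normal(size=(n_store, d))
next_tokens = rng.integers(0, vocab, n_store)

def knn_lm_distribution(context_vec, p_lm):
    """Interpolate the LM distribution with a kNN distribution from the datastore."""
    dists = np.linalg.norm(keys - context_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances, massed onto the neighbours' tokens.
    w = np.exp(-dists[nearest])
    w /= w.sum()
    p_knn = np.zeros(vocab)
    np.add.at(p_knn, next_tokens[nearest], w)
    return lam * p_knn + (1.0 - lam) * p_lm

p_lm = np.full(vocab, 1.0 / vocab)            # stand-in for the LM's prediction
context = rng.normal(size=d)                  # stand-in for the current context
p = knn_lm_distribution(context, p_lm)
print("sums to one:", np.isclose(p.sum(), 1.0), "| top token:", int(p.argmax()))
```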

[IR-9] A Systematic Review of Generative AI for Teaching and Learning Practice

链接: https://arxiv.org/abs/2406.09520
作者: Bayode Ogunleye,Kudirat Ibilola Zakariyyah,Oluwaseun Ajao,Olakunle Olayinka,Hemlata Sharma
关键词: generative artificial intelligence, hotly debated topic, artificial intelligence, debated topic, generative artificial
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 20 pages, 10 figures, article published in Education Sciences

点击查看摘要

Abstract:The use of generative artificial intelligence (GenAI) in academia is a subjective and hotly debated topic. Currently, there are no agreed guidelines towards the usage of GenAI systems in higher education (HE) and, thus, it is still unclear how to make effective use of the technology for teaching and learning practice. This paper provides an overview of the current state of research on GenAI for teaching and learning in HE. To this end, this study conducted a systematic review of relevant studies indexed by Scopus, using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines. The search criteria revealed a total of 625 research papers, of which 355 met the final inclusion criteria. The findings from the review showed the current state and the future trends in documents, citations, document sources/authors, keywords, and co-authorship. The research gaps identified suggest that while some authors have looked at understanding the detection of AI-generated text, it may be beneficial to understand how GenAI can be incorporated into supporting the educational curriculum for assessments, teaching, and learning delivery. Furthermore, there is a need for additional interdisciplinary, multidimensional studies in HE through collaboration. This will strengthen the awareness and understanding of students, tutors, and other stakeholders, which will be instrumental in formulating guidelines, frameworks, and policies for GenAI usage.

[IR-10] Exploring Traffic Crash Narratives in Jordan Using Text Mining Analytics

链接: https://arxiv.org/abs/2406.09438
作者: Shadi Jaradat,Taqwa I. Alhadidi,Huthaifa I. Ashqar,Ahmed Hossain,Mohammed Elhenawy
关键词: enhance effective traffic, study explores traffic, attempt to inform, inform and enhance, enhance effective
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This study explores traffic crash narratives in an attempt to inform and enhance effective traffic safety policies using text-mining analytics. Text mining techniques are employed to unravel key themes and trends within the narratives, aiming to provide a deeper understanding of the factors contributing to traffic crashes. This study collected crash data from five major freeways in Jordan that cover narratives of 7,587 records from 2018-2022. An unsupervised learning method was adopted to learn the pattern from crash data. Various text mining techniques, such as topic modeling, keyword extraction, and Word Co-Occurrence Network, were also used to reveal the co-occurrence of crash patterns. Results show that text mining analytics is a promising method and underscore the multifactorial nature of traffic crashes, including intertwining human decisions and vehicular conditions. The recurrent themes across all analyses highlight the need for a balanced approach to road safety, merging both proactive and reactive measures. Emphasis on driver education and awareness around animal-related incidents is paramount.

人工智能

[AI-0] Quantifying Variance in Evaluation Benchmarks

链接: https://arxiv.org/abs/2406.10229
作者: Lovish Madaan,Aaditya K. Singh,Rylan Schaeffer,Andrew Poulton,Sanmi Koyejo,Pontus Stenetorp,Sharan Narang,Dieuwke Hupkes
关键词: Evaluation benchmarks, driving progress, Evaluation, variance, benchmarks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models – both openly available and pretrained from scratch – we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale ( \sim 7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.
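
摘要涉及的最基本的两类方差估计可以这样示意(synthetic per-question accuracies, purely illustrative; the paper studies many more metrics):跨随机种子的分数波动,以及由有限题目集合带来的自助法方差。

```python
import numpy as np

rng = np.random.default_rng(8)

# Per-question correctness for 5 models trained from different seeds
# on a benchmark with 1000 questions (synthetic stand-in data).
n_seeds, n_questions = 5, 1000
correct = rng.binomial(1, 0.62, size=(n_seeds, n_questions))

scores = correct.mean(axis=1)
print("seed-to-seed std of the benchmark score:", round(scores.std(ddof=1), 4))

# Bootstrap over questions: variance induced by the finite evaluation set.
boot = np.array([
    correct[0, rng.integers(0, n_questions, n_questions)].mean()
    for _ in range(2000)
])
print("bootstrap std for a single model:", round(boot.std(ddof=1), 4))
# Differences between two models smaller than a few of these standard
# deviations should not be treated as meaningful.
```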

[AI-1] VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

链接: https://arxiv.org/abs/2406.10228
作者: Chenyu Zhou,Mengdan Zhang,Peixian Chen,Chaoyou Fu,Yunhang Shen,Xiawu Zheng,Xing Sun,Rongrong Ji
关键词: Multi-modal Large Models, Multi-modal Large, progress of Multi-modal, tackle tasks blending, tasks blending vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Project Page: this https URL

点击查看摘要

Abstract:The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT4V, only achieved modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an 85.8% accuracy rate in image association and a 0.508 Rouge score. These results validate the effectiveness of our dataset in improving MLLMs capabilities for nuanced image-text comprehension.

[AI-2] VideoGUI: A Benchmark for GUI Automation from Instructional Videos

链接: https://arxiv.org/abs/2406.10227
作者: Kevin Qinghong Lin,Linjie Li,Difei Gao,Qinchen WU,Mingyi Yan,Zhengyuan Yang,Lijuan Wang,Mike Zheng Shou
关键词: Graphical User Interface, Graphical User, User Interface, automation holds significant, holds significant promise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 24 pages, 16 tables, 17 figures

点击查看摘要

Abstract:Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as “Insert a new slide.” In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks, especially for high-level planning.

[AI-3] Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding

链接: https://arxiv.org/abs/2406.10221
作者: Ridouane Ghermi,Xi Wang,Vicky Kalogeiton,Ivan Laptev
关键词: Recent advances, propelled video understanding, advances in vision-language, Recent, video understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos and frequently encounter data leakage given the use of movie forums and other resources in LLM training. To address the above limitations, we propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies, a wide variety of genres and minimal data leakage issues. SFD offers long-term story-oriented video tasks in the form of multiple-choice and open-ended question answering. Our extensive experiments emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find strong signals in movie transcripts leading to the on-par performance of people and LLMs. We also show significantly lower performance of current models compared to people when using vision data alone.

[AI-4] Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

链接: https://arxiv.org/abs/2406.10216
作者: Rui Yang,Ruomeng Ding,Yong Lin,Huan Zhang,Tong Zhang
关键词: aligning Large Language, Large Language Models, aligning Large, human preference data, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 21 pages

点击查看摘要

Abstract:Reward models trained on human preference data have been proven to be effective for aligning Large Language Models (LLMs) with human intent within the reinforcement learning from human feedback (RLHF) framework. However, the generalization capabilities of current reward models to unseen prompts and responses are limited. This limitation can lead to an unexpected phenomenon known as reward over-optimization, where excessive optimization of rewards results in a decline in actual performance. While previous research has advocated for constraining policy optimization, our study proposes a novel approach to enhance the reward model’s generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model’s language model head and incorporate a suite of text-generation losses to preserve the hidden states’ text generation capabilities, while concurrently learning a reward head behind the same hidden states. Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviate the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.
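
上面描述的训练目标可以用一个极简 PyTorch 草图来示意(toy modules and a made-up mixing weight, not the authors' implementation):奖励头与保留的语言模型头共享同一份隐藏状态,并把文本生成损失作为正则项加入。

```python
import torch
import torch.nn as nn

vocab, hidden = 1000, 64
encoder = nn.Sequential(
    nn.Embedding(vocab, hidden),
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
)
lm_head = nn.Linear(hidden, vocab)            # retained language-model head
reward_head = nn.Linear(hidden, 1)            # newly learned reward head
lam = 0.1                                     # assumed regularisation weight

tokens = torch.randint(0, vocab, (8, 32))     # toy batch of responses
h = encoder(tokens)                           # shared hidden states (8, 32, hidden)

# Reward loss: pairwise preference between chosen (first 4) and rejected (last 4).
rewards = reward_head(h[:, -1, :]).squeeze(-1)
chosen, rejected = rewards[:4], rewards[4:]
reward_loss = -nn.functional.logsigmoid(chosen - rejected).mean()

# Text-generation loss on the same hidden states keeps the representation
# close to the original language model (the proposed regularisation).
logits = lm_head(h[:, :-1, :])
lm_loss = nn.functional.cross_entropy(logits.reshape(-1, vocab),
                                      tokens[:, 1:].reshape(-1))

loss = reward_loss + lam * lm_loss
loss.backward()
print(float(reward_loss), float(lm_loss))
```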

[AI-5] Make It Count: Text-to-Image Generation with an Accurate Number of Objects

链接: https://arxiv.org/abs/2406.10210
作者: Lital Binyamin,Yoad Tewel,Hilit Segev,Eran Hirsch,Royi Rassin,Gal Chechik
关键词: controlling the number, surprisingly hard, unprecedented success, number of depicted, text is surprisingly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project page is at this https URL

点击查看摘要

Abstract:Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This is important for various applications from technical documents, to children’s books to illustrating cooking recipes. Generating object-correct counts is fundamentally challenging because the generative model needs to keep a sense of separate identity for every instance of the object, even if several objects look identical or overlap, and then carry out a global computation implicitly during generation. It is still unknown if such representations exist. To address count-correct generation, we first identify features within the diffusion model that can carry the object identity information. We then use them to separate and count instances of objects during the denoising process and detect over-generation and under-generation. We fix the latter by training a model that predicts both the shape and location of a missing object, based on the layout of existing ones, and show how it can be used to guide denoising with correct object count. Our approach, CountGen, does not depend on external source to determine object layout, but rather uses the prior from the diffusion model itself, creating prompt-dependent and seed-dependent layouts. Evaluated on two benchmark datasets, we find that CountGen strongly outperforms the count-accuracy of existing baselines.

[AI-6] SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation

链接: https://arxiv.org/abs/2406.10200
作者: Ziang Xu,Jens Rittscher,Sharib Ali
关键词: early cancer indicators, cancer indicators, removal is critical, early cancer, assessing occurrences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: 12 pages

点击查看摘要

Abstract:Polyps are early cancer indicators, so assessing occurrences of polyps and their removal is critical. They are observed through a colonoscopy screening procedure that generates a stream of video frames. Segmenting polyps in their natural video screening procedure has several challenges, such as the co-existence of imaging artefacts, motion blur, and floating debris. Most existing polyp segmentation algorithms are developed on curated still image datasets that do not represent real-world colonoscopy. Their performance often degrades on video data. We propose a video polyp segmentation method that performs self-supervised learning as an auxiliary task and a spatial-temporal self-attention mechanism for improved representation learning. Our end-to-end configuration and joint optimisation of losses enable the network to learn more discriminative contextual features in videos. Our experimental results demonstrate an improvement with respect to several state-of-the-art (SOTA) methods. Our ablation study also confirms that the choice of the proposed joint end-to-end training improves network accuracy by over 3% and nearly 10% on both the Dice similarity coefficient and intersection-over-union compared to the recently proposed method PNS+ and Polyp-PVT, respectively. Results on previously unseen video data indicate that the proposed method generalises.

[AI-7] Crafting Parts for Expressive Object Composition

链接: https://arxiv.org/abs/2406.10197
作者: Harsh Rangwani,Aishwarya Agarwal,Kuldeep Kulkarni,R. Venkatesh Babu,Srikrishna Karanam
关键词: extensive knowledge bases, Stable Diffusion, large generative models, large generative, tasks due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project Page Will Be Here: this https URL

点击查看摘要

Abstract:Text-to-image generation from large generative models like Stable Diffusion, DALLE-2, etc., have become a common base for various tasks due to their superior quality and extensive knowledge bases. As image composition and generation are creative processes the artists need control over various parts of the images being generated. We find that just adding details about parts in the base text prompt either leads to an entirely different image (e.g., missing/incorrect identity) or the extra part details simply being ignored. To mitigate these issues, we introduce PartCraft, which enables image generation based on fine-grained part-level details specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartCraft first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right object region. After obtaining part masks, we run a localized diffusion process in each of the part regions based on fine-grained part descriptions and combine them to produce the final image. All the stages of PartCraft are based on repurposing a pre-trained diffusion model, which enables it to generalize across various domains without training. We demonstrate the effectiveness of part-level control provided by PartCraft qualitatively through visual examples and quantitatively in comparison to the contemporary baselines.

[AI-8] TRIP-PAL: Travel Planning with Guarantees by Combining Large Language Models and Automated Planners

链接: https://arxiv.org/abs/2406.10196
作者: Tomas de la Rosa,Sriram Gopalakrishnan,Alberto Pozanco,Zhen Zeng,Daniel Borrajo
关键词:
类目: Artificial Intelligence (cs.AI)
*备注: 9 pages, 5 figures

点击查看摘要

[AI-9] Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors

链接: https://arxiv.org/abs/2406.10181
作者: Siyuan Chen,Zelong Guan,Yudong Liu,Phillip B. Gibbons
关键词: requires significant memory, large language models, Fine-tuning large language, requires significant, large language
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU. In this paper, we present an offloading framework, LSP_Offload, that enables near-native speed LLM fine-tuning on commodity hardware through learned subspace projectors. Our data-driven approach involves learning an efficient sparse compressor that minimizes communication with minimal precision loss. Additionally, we introduce a novel layer-wise communication schedule to maximize parallelism between communication and computation. As a result, our framework can fine-tune a 1.3 billion parameter model on a 4GB laptop GPU and a 7 billion parameter model on an NVIDIA RTX 4090 GPU with 24GB memory, achieving only a 31% slowdown compared to fine-tuning with unlimited memory. Compared to state-of-the-art offloading frameworks, our approach increases fine-tuning throughput by up to 3.33 times and reduces end-to-end fine-tuning time by 33.1%~62.5% when converging to the same accuracy.
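
"子空间投影"压缩通信的思路可以这样示意(a simplified NumPy sketch on synthetic low-rank gradients; the paper learns a sparse compressor jointly with training, for which the offline truncated SVD below is only a stand-in):

```python
import numpy as np

rng = np.random.default_rng(9)
d, r, k = 20_000, 32, 64          # parameter-block size, true rank, subspace size

# Gradients that (approximately) live in a low-dimensional subspace, which is
# the structure the learned projector exploits (synthetic stand-in data).
basis = rng.normal(size=(r, d))
sample_grads = rng.normal(size=(200, r)) @ basis + 0.01 * rng.normal(size=(200, d))

# "Learn" the projector offline from sampled gradients via a truncated SVD.
_, _, Vt = np.linalg.svd(sample_grads, full_matrices=False)
P = Vt[:k]                                   # (k, d) learned subspace projector

# At fine-tuning time only k numbers per block cross the slow CPU-GPU link.
new_grad = rng.normal(size=r) @ basis
compressed = P @ new_grad                    # sent over the link
reconstructed = P.T @ compressed             # de-compressed on the other side

err = np.linalg.norm(new_grad - reconstructed) / np.linalg.norm(new_grad)
print(f"compression {d / k:.0f}x, relative reconstruction error {err:.4f}")
```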

[AI-10] MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

链接: https://arxiv.org/abs/2406.10163
作者: Yiwen Chen,Tong He,Di Huang,Weicai Ye,Sijin Chen,Jiaxiang Tang,Xin Chen,Zhongang Cai,Lei Yang,Gang Yu,Guosheng Lin,Chi Zhang
关键词: manually crafted assets, Recently, manually crafted, current mesh extraction, mesh extraction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Recently, 3D assets created via reconstruction and generation have matched the quality of manually crafted assets, highlighting their potential for replacement. However, this potential is largely unrealized because these assets always need to be converted to meshes for 3D industry applications, and the meshes produced by current mesh extraction methods are significantly inferior to Artist-Created Meshes (AMs), i.e., meshes created by human artists. Specifically, current mesh extraction methods rely on dense faces and ignore geometric features, leading to inefficiencies, complicated post-processing, and lower representation quality. To address these issues, we introduce MeshAnything, a model that treats mesh extraction as a generation problem, producing AMs aligned with specified shapes. By converting 3D assets in any 3D representation into AMs, MeshAnything can be integrated with various 3D asset production methods, thereby enhancing their application across the 3D industry. The architecture of MeshAnything comprises a VQ-VAE and a shape-conditioned decoder-only transformer. We first learn a mesh vocabulary using the VQ-VAE, then train the shape-conditioned decoder-only transformer on this vocabulary for shape-conditioned autoregressive mesh generation. Our extensive experiments show that our method generates AMs with hundreds of times fewer faces, significantly improving storage, rendering, and simulation efficiencies, while achieving precision comparable to previous methods.

[AI-11] Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

链接: https://arxiv.org/abs/2406.10162
作者: Carson Denison,Monte MacDiarmid,Fazl Barez,David Duvenaud,Shauna Kravec,Samuel Marks,Nicholas Schiefer,Ryan Soklaski,Alex Tamkin,Jared Kaplan,Buck Shlegeris,Samuel R. Bowman,Ethan Perez,Evan Hubinger
关键词: systems learn undesired, highly rewarded due, learn undesired behaviors, specification gaming, misspecified training goals
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.

[AI-12] One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

链接: https://arxiv.org/abs/2406.10160
作者: Zhaoqing Li,Haoning Xu,Tianzi Wang,Shoukang Hu,Zengrui Jin,Shujie Hu,Jiajun Deng,Mingyu Cui,Mengzhe Geng,Xunying Liu
关键词: one-pass multiple ASR, multiple ASR systems, ASR systems joint, multiple ASR, ASR systems
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted by Interspeech 2024

点击查看摘要

Abstract:We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrate the multiple ASR systems compressed in a single all-in-one model produced a word error rate (WER) comparable to, or lower by up to 1.01% absolute (6.98% relative) than individually trained systems of equal complexity. A 3.4x overall system compression and training time speed-up was achieved. Maximum model size compression ratios of 12.8x and 3.93x were obtained over the baseline Switchboard-300hr Conformer and LibriSpeech-100hr fine-tuned wav2vec2.0 models, respectively, incurring no statistically significant WER increase.

[AI-13] RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

链接: https://arxiv.org/abs/2406.10157
作者: Hantao Zhou,Tianying Ji,Jianwei Zhang,Fuchun Sun,Huazhe Xu
关键词: complex ball motion, compelling real-world testbed, countless court layouts, ball motion, constitutes a compelling
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:Minigolf, a game with countless court layouts and complex ball motion, constitutes a compelling real-world testbed for the study of embodied intelligence, as it not only challenges spatial and kinodynamic reasoning but also requires reflective and corrective capacities to address erroneously designed courses. We introduce RoboGolf, a framework that perceives dual-camera visual inputs with nested VLM-empowered closed-loop control and a reflective equilibrium loop. Extensive experiments demonstrate the effectiveness of RoboGolf on challenging minigolf courts, including those that are impossible to finish.

[AI-14] Automated Design of Linear Bounding Functions for Sigmoidal Nonlinearities in Neural Networks

链接: https://arxiv.org/abs/2406.10154
作者: Matthias König,Xiyue Zhang,Holger H. Hoos,Marta Kwiatkowska,Jan N. van Rijn
关键词: small input perturbations, deep learning algorithms, adversarial attacks, ubiquity of deep, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:The ubiquity of deep learning algorithms in various applications has amplified the need for assuring their robustness against small input perturbations such as those occurring in adversarial attacks. Existing complete verification techniques offer provable guarantees for all robustness queries but struggle to scale beyond small neural networks. To overcome this computational intractability, incomplete verification methods often rely on convex relaxation to over-approximate the nonlinearities in neural networks. Progress in tighter approximations has been achieved for piecewise linear functions. However, robustness verification of neural networks for general activation functions (e.g., Sigmoid, Tanh) remains under-explored and poses new challenges. Typically, these networks are verified using convex relaxation techniques, which involve computing linear upper and lower bounds of the nonlinear activation functions. In this work, we propose a novel parameter search method to improve the quality of these linear approximations. Specifically, we show that using a simple search method, carefully adapted to the given verification problem through state-of-the-art algorithm configuration techniques, improves the average global lower bound by 25% on average over the current state of the art on several commonly used local robustness verification benchmarks.
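
为了直观说明所搜索的线性上下界长什么样,下面给出一个小例子(a toy sketch for the simple case of an interval on which the sigmoid is convex, i.e. x <= 0; the paper's contribution is the automated search over such bound parameters, which this example does not perform):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# On an interval entirely in the region where sigmoid is convex (x <= 0),
# the chord is a valid upper bound and any tangent line is a valid lower bound.
l, u = -4.0, -1.0
chord_slope = (sigmoid(u) - sigmoid(l)) / (u - l)
chord_intercept = sigmoid(l) - chord_slope * l

t = 0.5 * (l + u)                         # tangent point: a tunable parameter
tan_slope = sigmoid_grad(t)
tan_intercept = sigmoid(t) - tan_slope * t

xs = np.linspace(l, u, 1000)
upper = chord_slope * xs + chord_intercept
lower = tan_slope * xs + tan_intercept
assert np.all(lower <= sigmoid(xs) + 1e-9) and np.all(sigmoid(xs) <= upper + 1e-9)
print("max gap between bounds:", float((upper - lower).max()))
# Moving the tangent point t (and, in general, the slopes of both bounds)
# changes the tightness of the relaxation; the paper searches these
# parameters automatically with algorithm-configuration techniques.
```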

[AI-15] BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

链接: https://arxiv.org/abs/2406.10149
作者: Yuri Kuratov,Aydar Bulatov,Petr Anokhin,Ivan Rodkin,Dmitry Sorokin,Artyom Sorokin,Mikhail Burtsev
关键词: input context sizes, large language models, recent years, sizes of large, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models’ ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
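
As a toy illustration of the "reasoning-in-a-haystack" construction described above, the sketch below scatters supporting facts at random positions inside long filler text; the function and field names are illustrative assumptions, not BABILong's actual generator.

```python
import random

def build_haystack_sample(facts, question, answer, filler_sentences, target_len):
    """Hide the supporting facts at random positions inside a long filler context."""
    context = list(filler_sentences[:target_len])
    # insert from the largest position down so earlier insertion points stay valid
    positions = sorted(random.sample(range(len(context) + 1), len(facts)), reverse=True)
    for pos, fact in zip(positions, facts):
        context.insert(pos, fact)
    return {"context": " ".join(context), "question": question, "answer": answer}
```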

[AI-16] Improving rule mining via embedding-based link prediction

链接: https://arxiv.org/abs/2406.10144
作者: N’Dah Jean Kouagou,Arif Yilmaz,Michel Dumontier,Axel-Cyrille Ngonga Ngomo
关键词: explainable link prediction, link prediction, explainable link, prediction, Rule mining
类目: Artificial Intelligence (cs.AI)
*备注: 13 pages, 2 figures, 11 tables

点击查看摘要

Abstract:Rule mining on knowledge graphs allows for explainable link prediction. Contrarily, embedding-based methods for link prediction are well known for their generalization capabilities, but their predictions are not interpretable. Several approaches combining the two families have been proposed in recent years. The majority of the resulting hybrid approaches are usually trained within a unified learning framework, which often leads to convergence issues due to the complexity of the learning task. In this work, we propose a new way to combine the two families of approaches. Specifically, we enrich a given knowledge graph by means of its pre-trained entity and relation embeddings before applying rule mining systems on the enriched knowledge graph. To validate our approach, we conduct extensive experiments on seven benchmark datasets. An analysis of the results generated by our approach suggests that we discover new valuable rules on the enriched graphs. We provide an open source implementation of our approach as well as pretrained models and datasets at this https URL

[AI-17] The Rise and Fall(?) of Software Engineering

链接: https://arxiv.org/abs/2406.10141
作者: Antonio Mastropaolo,Camilo Escobar-Velásquez,Mario Linares-Vásquez
关键词: ten years, revolutionary breakthroughs, everyday lives, experienced an explosion, explosion of revolutionary
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Over the last ten years, the realm of Artificial Intelligence (AI) has experienced an explosion of revolutionary breakthroughs, transforming what seemed like a far-off dream into a reality that is now deeply embedded in our everyday lives. AI’s widespread impact is revolutionizing virtually all aspects of human life, and software engineering (SE) is no exception. As we explore this changing landscape, we are faced with questions about what the future holds for SE and how AI will reshape the roles, duties, and methodologies within the field. The introduction of these groundbreaking technologies highlights the inevitable shift towards a new paradigm, suggesting a future where AI’s capabilities may redefine the boundaries of SE, potentially even more than human input. In this paper, we aim at outlining the key elements that, based on our expertise, are vital for the smooth integration of AI into SE, all while preserving the intrinsic human creativity that has been the driving force behind the field. First, we provide a brief description of SE and AI evolution. Afterward, we delve into the intricate interplay between AI-driven automation and human innovation, exploring how these two components can work together to advance SE practices to new methods and standards.

[AI-18] Evaluation of Large Language Models: STEM education and Gender Stereotypes

链接: https://arxiv.org/abs/2406.10133
作者: Smilla Due,Sneha Das,Marianne Andersen,Berta Plandolit López,Sniff Andersen Nexø,Line Clemmensen
关键词: study support, coding support, writing assistance, increasing impact, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have an increasing impact on our lives with use cases such as chatbots, study support, coding support, ideation, writing assistance, and more. Previous studies have revealed linguistic biases in pronouns used to describe professions or adjectives used to describe men vs women. These issues have to some degree been addressed in updated LLM versions, at least to pass existing tests. However, biases may still be present in the models, and repeated use of gender stereotypical language may reinforce the underlying assumptions and are therefore important to examine further. This paper investigates gender biases in LLMs in relation to educational choices through an open-ended, true to user-case experimental design and a quantitative analysis. We investigate the biases in the context of four different cultures, languages, and educational systems (English/US/UK, Danish/DK, Catalan/ES, and Hindi/IN) for ages ranging from 10 to 16 years, corresponding to important educational transition points in the different countries. We find that there are significant and large differences in the ratio of STEM to non-STEM suggested education paths provided by chatGPT when using typical girl vs boy names to prompt lists of suggested things to become. There are generally fewer STEM suggestions in the Danish, Spanish, and Indian context compared to the English. We also find subtle differences in the suggested professions, which we categorise and report.

[AI-19] Linear Contextual Bandits with Hybrid Payoff: Revisited

链接: https://arxiv.org/abs/2406.10131
作者: Nirjhar Das,Gaurav Sinha
关键词: Linear Contextual Bandit, Contextual Bandit problem, Linear Contextual, Contextual Bandit, texttt
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ECML PKDD 2024 as a Research Track Paper

点击查看摘要

Abstract:We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting every arm’s reward model contains arm specific parameters in addition to parameters shared across the reward models of all the arms. We can reduce this setting to two closely related settings (a) Shared - no arm specific parameters, and (b) Disjoint - only arm specific parameters, enabling the application of two popular state of the art algorithms - LinUCB and DisLinUCB (Algorithm 1 in (Li et al. 2010)). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on the known regret guarantees of these algorithms. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm HyLinUCB that crucially modifies LinUCB (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that HyLinUCB also incurs only O(√T) regret for T rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of HyLinUCB. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that DisLinUCB incurs the lowest regret. In this case, regret of HyLinUCB is the second best and extremely competitive to DisLinUCB. In all other situations, including our real-world dataset, HyLinUCB has significantly lower regret than LinUCB, DisLinUCB and other SOTA baselines we considered. We also empirically observe that the regret of HyLinUCB grows much slower with the number of arms compared to baselines, making it suitable even for very large action spaces.
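
For readers unfamiliar with the baseline being modified, here is a sketch of the standard disjoint LinUCB update (Algorithm 1 of Li et al. 2010): per-arm ridge regression plus an upper-confidence exploration bonus. It is not the paper's HyLinUCB, whose modified exploration coefficient is not reproduced here.

```python
import numpy as np

class DisjointLinUCB:
    """Per-arm ridge regression with a UCB exploration bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm d x d design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward-weighted feature sums

    def select(self, contexts):
        # contexts: one feature vector per arm for the current round
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                    # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```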

[AI-20] Exploration by Learning Diverse Skills through Successor State Measures

链接: https://arxiv.org/abs/2406.10127
作者: Paul-Antoine Le Tolguenec,Yann Besse,Florent Teichteil-Konigsbuch,Dennis G. Wilson,Emmanuel Rachelson
关键词: agents to explore, ability to perform, encourage agents, diverse skills, skills
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The ability to perform different skills can encourage agents to explore. In this work, we aim to construct a set of diverse skills which uniformly cover the state space. We propose a formalization of this search for diverse skills, building on a previous definition based on the mutual information between states and skills. We consider the distribution of states reached by a policy conditioned on each skill and leverage the successor state measure to maximize the difference between these skill distributions. We call this approach LEADS: Learning Diverse Skills through Successor States. We demonstrate our approach on a set of maze navigation and robotic control tasks which show that our method is capable of constructing a diverse set of skills which exhaustively cover the state space without relying on reward or exploration bonuses. Our findings demonstrate that this new formalization promotes more robust and efficient exploration by combining mutual information maximization and exploration bonuses.

[AI-21] Data Ethics in the Era of Healthcare Artificial Intelligence in Africa: An Ubuntu Philosophy Perspective

链接: https://arxiv.org/abs/2406.10121
作者: Abdoul Jalil Djiberou Mahamadou,Aloysius Ochasi,Russ B. Altman
关键词: healthcare artificial intelligence, developing healthcare artificial, artificial intelligence, Data, essential in developing
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 16 pages

点击查看摘要

Abstract:Data are essential in developing healthcare artificial intelligence (AI) systems. However, patient data collection, access, and use raise ethical concerns, including informed consent, data bias, data protection and privacy, data ownership, and benefit sharing. Various ethical frameworks have been proposed to ensure the ethical use of healthcare data and AI, however, these frameworks often align with Western cultural values, social norms, and institutional contexts emphasizing individual autonomy and well-being. Ethical guidelines must reflect political and cultural settings to account for cultural diversity, inclusivity, and historical factors such as colonialism. Thus, this paper discusses healthcare data ethics in the AI era in Africa from the Ubuntu philosophy perspective. It focuses on the contrast between individualistic and communitarian approaches to data ethics. The proposed framework could inform stakeholders, including AI developers, healthcare providers, the public, and policy-makers about healthcare data ethical usage in AI in Africa.

[AI-22] Precipitation Nowcasting Using Physics Informed Discriminator Generative Models

链接: https://arxiv.org/abs/2406.10108
作者: Junzhe Yin,Cristian Meo,Ankush Roy,Zeineh Bou Cher,Yanbo Wang,Ruben Imhoff,Remko Uijlenhoet,Justin Dauwels
关键词: leverages real-time atmospheric, real-time atmospheric conditions, Nowcasting leverages real-time, short periods, Netherlands Meteorological Institute
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Nowcasting leverages real-time atmospheric conditions to forecast weather over short periods. State-of-the-art models, including PySTEPS, encounter difficulties in accurately forecasting extreme weather events because of their unpredictable distribution patterns. In this study, we design a physics-informed neural network to perform precipitation nowcasting using the precipitation and meteorological data from the Royal Netherlands Meteorological Institute (KNMI). This model draws inspiration from the novel Physics-Informed Discriminator GAN (PID-GAN) formulation, directly integrating physics-based supervision within the adversarial learning framework. The proposed model adopts a GAN structure, featuring a Vector Quantization Generative Adversarial Network (VQ-GAN) and a Transformer as the generator, with a temporal discriminator serving as the discriminator. Our findings demonstrate that the PID-GAN model outperforms numerical and SOTA deep generative models in terms of precipitation nowcasting downstream metrics.

[AI-23] SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

链接: https://arxiv.org/abs/2406.10100
作者: Junwei Luo,Zhen Pang,Yongjun Zhang,Tingzhu Wang,Linlin Wang,Bo Dang,Jiangwei Lao,Jian Wang,Jingdong Chen,Yihua Tan,Yansheng Li
关键词: Large Multi-Modal Models, Sensing Large Multi-Modal, Remote Sensing Large, remote sensing imagery, Multi-Modal Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 30 pages, 5 figures, 19 tables, dataset and code see this https URL

点击查看摘要

Abstract:Remote Sensing Large Multi-Modal Models (RSLMMs) are developing rapidly and showcase significant capabilities in remote sensing imagery (RSI) comprehension. However, due to the limitations of existing datasets, RSLMMs have shortcomings in understanding the rich semantic relations among objects in complex remote sensing scenes. To unlock RSLMMs’ complex comprehension ability, we propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples. FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty, ranging from relation reasoning to image-level scene graph generation. Based on FIT-RS, we build the FIT-RSFG benchmark. Furthermore, we establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC. Based on combined instruction data, we propose SkySenseGPT, which achieves outstanding performance on both public datasets and FIT-RSFG, surpassing existing RSLMMs. We hope the FIT-RS dataset can enhance the relation comprehension capability of RSLMMs and provide a large-scale fine-grained data source for the remote sensing community. The dataset will be available at this https URL

[AI-24] ECGMamba: Towards Efficient ECG Classification with BiSSM

链接: https://arxiv.org/abs/2406.10098
作者: Yupeng Qiang,Xunde Dong,Xiuling Liu,Yang Yang,Yihai Fang,Jianhong Dou
关键词: represents a pivotal, ECG, signal analysis represents, Electrocardiogram, ECG classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures. arXiv admin note: text overlap with arXiv:2404.17858 by other authors

点击查看摘要

Abstract:Electrocardiogram (ECG) signal analysis represents a pivotal technique in the diagnosis of cardiovascular diseases. Although transformer-based models have made significant progress in ECG classification, they exhibit inefficiencies in the inference phase, primarily attributable to the quadratic computational complexity of the Transformer’s self-attention mechanism, particularly when processing lengthy sequences. To address this issue, we propose a novel model, ECGMamba, which employs a bidirectional state-space model (BiSSM) to enhance classification efficiency. ECGMamba is based on the innovative Mamba-based block, which incorporates a range of time series modeling techniques to enhance performance while maintaining the efficiency of inference. The experimental results on two publicly available ECG datasets demonstrate that ECGMamba effectively balances the effectiveness and efficiency of classification, achieving competitive performance. This study not only contributes to the body of knowledge in the field of ECG classification but also provides a new research path for efficient and accurate ECG signal analysis. This is of guiding significance for the development of diagnostic models for cardiovascular diseases.

[AI-25] Biomarker based Cancer Classification using an Ensemble with Pre-trained Models

链接: https://arxiv.org/abs/2406.10087
作者: Chongmin Lee,Jihie Kim
关键词: identify cancer efficiently, early stage, sparking the importance, difficult to detect, importance of discovering
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted to the AIAA Workshop at IJCAI 2024

点击查看摘要

Abstract:Certain cancer types, namely pancreatic cancer, are difficult to detect at an early stage, underscoring the importance of discovering the causal relationship between biomarkers and cancer to identify cancer efficiently. By allowing for the detection and monitoring of specific biomarkers through a non-invasive method, liquid biopsies enhance the precision and efficacy of medical interventions, advocating the move towards personalized healthcare. Several machine learning algorithms such as Random Forest and SVM are utilized for classification, yet they are inefficient due to the need for hyperparameter tuning. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 and simultaneously achieving robustness, especially on highly imbalanced datasets, compared to other ML algorithms in several binary classification tasks (e.g. breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining the pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while merely using 500 PCA features, distinguishable from previous studies that used more than 2,000 features for similar results.
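
As a rough illustration of the ensemble idea, the sketch below reduces inputs to 500 PCA features and soft-votes three classifiers; LogisticRegression stands in for the pre-trained Hyperfast model, and simple probability averaging is an assumption, since the abstract does not state the exact combination rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression  # stand-in for the pre-trained Hyperfast model
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

def ensemble_predict(X_train, y_train, X_test, n_components=500):
    # project onto the first 500 principal components, as in the abstract
    pca = PCA(n_components=n_components).fit(X_train)
    Xtr, Xte = pca.transform(X_train), pca.transform(X_test)
    members = [LogisticRegression(max_iter=1000), XGBClassifier(), LGBMClassifier()]
    probas = []
    for clf in members:
        clf.fit(Xtr, y_train)
        probas.append(clf.predict_proba(Xte))
    # soft voting: average class probabilities, then take the argmax
    return np.mean(probas, axis=0).argmax(axis=1)
```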

[AI-26] Localizing Events in Videos with Multimodal Queries

链接: https://arxiv.org/abs/2406.10079
作者: Gengyuan Zhang,Mang Ling Ada Fok,Yan Xia,Yansong Tang,Daniel Cremers,Philip Torr,Volker Tresp,Jindong Gu
关键词: digital era, demanding to process, pivotal task, dynamic and multievent, multievent nature
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages

点击查看摘要

Abstract:Video understanding is a pivotal task in the digital era, yet the dynamic and multievent nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the images’ semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.

[AI-27] First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

链接: https://arxiv.org/abs/2406.10057
作者: Enming Zhang,Ruobing Yao,Huanyong Liu,Junhui Yu,Jiale Wang
关键词: increasingly powerful, general capabilities, capabilities are increasingly, flowcharts, MLLMs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the development of multimodal large language model (MLLM) technology, the general capabilities of these models are increasingly powerful. To evaluate the various abilities of MLLMs, numerous evaluation systems have emerged. However, there is still no comprehensive method for evaluating MLLMs on tasks related to flowcharts, which are very important in daily life and work. We propose the first comprehensive method, FlowCE, to assess MLLMs across various dimensions for tasks related to flowcharts. It encompasses evaluating MLLMs’ abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. However, we find that even the GPT4o model achieves only a score of 56.63. Among open-source models, Phi-3-Vision obtained the highest score of 49.97. We hope that FlowCE can contribute to future research on multimodal large language models (MLLMs) for tasks based on flowcharts. We are open-sourcing this project: this https URL

[AI-28] Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation

链接: https://arxiv.org/abs/2406.10043
作者: Federico Tavella,Aphrodite Galata,Angelo Cangelosi
关键词: people using cameras, physical presence, sign language, Artificial agents, American Sign Language
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Artificial agents, particularly humanoid robots, interact with their environment, objects, and people using cameras, actuators, and physical presence. Their communication methods are often pre-programmed, limiting their actions and interactions. Our research explores acquiring non-verbal communication skills through learning from demonstrations, with potential applications in sign language comprehension and expression. In particular, we focus on imitation learning for artificial agents, exemplified by teaching a simulated humanoid American Sign Language. We use computer vision and deep learning to extract information from videos, and reinforcement learning to enable the agent to replicate observed actions. Compared to other methods, our approach eliminates the need for additional hardware to acquire information. We demonstrate how the combination of these different techniques offers a viable way to learn sign language. Our methodology successfully teaches 5 different signs involving the upper body (i.e., arms and hands). This research paves the way for advanced communication skills in artificial agents.

[AI-29] FZI-WIM at SemEval-2024 Task 2: Self-Consistent CoT for Complex NLI in Biomedical Domain

链接: https://arxiv.org/abs/2406.10040
作者: Jin Liu,Steffen Thoma
关键词: Safe Biomedical Natural, Biomedical Natural Language, Natural Language Inference, Safe Biomedical, Clinical Trials
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper describes the inference system of FZI-WIM at the SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our system utilizes the chain of thought (CoT) paradigm to tackle this complex reasoning problem and further improves the CoT performance with self-consistency. Instead of greedy decoding, we sample multiple reasoning chains with the same prompt and make the final verification with majority voting. The self-consistent CoT system achieves a baseline F1 score of 0.80 (1st), faithfulness score of 0.90 (3rd), and consistency score of 0.73 (12th). We release the code and data publicly at this https URL.
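
The self-consistency step described above amounts to sampling several chains with the same prompt and majority-voting the parsed labels. In the sketch below, generate and extract_label are hypothetical helpers (an LLM sampler with non-zero temperature and a label parser), not the authors' code.

```python
from collections import Counter

def self_consistent_predict(prompt, generate, extract_label, n_samples=10):
    """Sample several CoT chains for one prompt and return the majority-voted label."""
    labels = [extract_label(generate(prompt)) for _ in range(n_samples)]
    return Counter(labels).most_common(1)[0][0]
```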

[AI-30] Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

链接: https://arxiv.org/abs/2406.10034
作者: Tianzi Wang,Xurong Xie,Zhaoqing Li,Shoukang Hu,Zengrui Jing,Jiajun Deng,Mingyu Cui,Shujie Hu,Mengzhe Geng,Guinan Li,Helen Meng,Xunying Liu
关键词: Conformer ASR systems, block-based Attention Mask, Conformer ASR, flexibly balances performance-efficiency, balances performance-efficiency trade-offs
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 2 figures, 2 tables, Interspeech24 conference

点击查看摘要

Abstract:This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x over the baseline CTC+AR decoding, while incurring no statistically significant word error rate (WER) increase on the test sets. When operating with the same decoding real time factors, statistically significant WER reductions of up to 0.7% and 0.3% absolute (5.3% and 6.1% relative) were obtained over the CTC+AR baseline.
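
One way to picture the block-based attention mask: each output position may attend to every position inside its own block (parallel, non-autoregressive) and to everything in earlier blocks (left-to-right across blocks). The sketch below builds such a boolean mask; it illustrates the masking pattern only, not the paper's decoder.

```python
import torch

def block_attention_mask(seq_len, block_size):
    """True = may attend. Positions see their whole block plus all earlier blocks."""
    block_id = torch.arange(seq_len) // block_size
    # position i attends to position j iff j's block is not later than i's block
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)
```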

[AI-31] Intepretative Deep Learning using Domain Adaptation for Fluorescence Spectroscopy

链接: https://arxiv.org/abs/2406.10031
作者: Umberto Michelucci,Francesca Venturini
关键词: food quality control, sciences and chemistry, environmental monitoring, biomedical diagnostics, life sciences
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Fluorescence spectroscopy is a fundamental tool in life sciences and chemistry, widely used for applications such as environmental monitoring, food quality control, and biomedical diagnostics. However, analysis of spectroscopic data with deep learning, in particular of fluorescence excitation-emission matrices (EEMs), presents significant challenges due mainly to the typically small and sparse datasets available. Furthermore, the analysis of EEMs is difficult due to their high dimensionality and overlapping spectral features. This study proposes a new approach that exploits domain adaptation with pretrained vision models, alongside a novel interpretability algorithm to address these challenges. Thanks to specialised feature engineering of the neural networks described in this work, we are now able to provide deeper and meaningful insights into the physico-chemical processes underlying the data. The proposed approach is demonstrated through the analysis of the oxidation process in extra virgin olive oil (EVOO), showing its effectiveness in predicting quality indicators and identifying relevant spectral bands. This work describes significantly innovative results in the use of deep learning for spectroscopy, transforming it from a black box into a tool for understanding complex biological and chemical processes.

[AI-32] Group and Shuffle: Efficient Structured Orthogonal Parametrization

链接: https://arxiv.org/abs/2406.10019
作者: Mikhail Gorbunov,Nikolay Yudin,Vera Soboleva,Aibek Alanov,Alexey Naumov,Maxim Rakhuba
关键词: increasing size, growing demand, orthogonal, efficient fine-tuning, fine-tuning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The increasing size of neural networks has led to a growing demand for methods of efficient fine-tuning. Recently, an orthogonal fine-tuning paradigm was introduced that uses orthogonal matrices for adapting the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal fine-tuning framework, improving parameter and computational efficiency. We empirically validate our method on different domains, including adapting of text-to-image diffusion models and downstream task fine-tuning in language modeling. Additionally, we adapt our construction for orthogonal convolutions and conduct experiments with 1-Lipschitz neural networks.

[AI-33] Tilt and Average: Geometric Adjustment of the Last Layer for Recalibration

链接: https://arxiv.org/abs/2406.10017
作者: Gyusang Cho,Chan-Hyun Youn
关键词: produce overconfident predictions, gained significant importance, neural networks tend, overconfident predictions, significant importance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 20 pages, 11 figures, to appear in International Conference on Machine Learning (ICML2024)

点击查看摘要

Abstract:After the revelation that neural networks tend to produce overconfident predictions, the problem of calibration, which aims to align confidence with accuracy to enhance the reliability of predictions, has gained significant importance. Several solutions based on calibration maps have been proposed to address the problem of recalibrating a trained classifier using additional datasets. In this paper, we offer an algorithm that transforms the weights of the last layer of the classifier, distinct from the calibration-map-based approach. We concentrate on the geometry of the final linear layer, specifically its angular aspect, and adjust the weights of the corresponding layer. We name the method Tilt and Average (TNA), and validate the calibration effect empirically and theoretically. Through this, we demonstrate that our approach, in addition to the existing calibration-map-based techniques, can yield improved calibration performance. Code available: this https URL.

[AI-34] Gradient-based Learning in State-based Potential Games for Self-Learning Production Systems

链接: https://arxiv.org/abs/2406.10015
作者: Steve Yuwono,Marlon Löppenberg,Dorothea Schwung,Andreas Schwung
关键词: state-based potential games, gradient-based optimization methods, potential games, optimization methods, methods for state-based
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:In this paper, we introduce novel gradient-based optimization methods for state-based potential games (SbPGs) within self-learning distributed production systems. SbPGs are recognised for their efficacy in enabling self-optimizing distributed multi-agent systems and offer a proven convergence guarantee, which facilitates collaborative player efforts towards global objectives. Our study strives to replace conventional ad-hoc random exploration-based learning in SbPGs with contemporary gradient-based approaches, which aim for faster convergence and smoother exploration dynamics, thereby shortening training duration while upholding the efficacy of SbPGs. Moreover, we propose three distinct variants for estimating the objective function of gradient-based learning, each developed to suit the unique characteristics of the systems under consideration. To validate our methodology, we apply it to a laboratory testbed, namely Bulk Good Laboratory Plant, which represents a smart and flexible distributed multi-agent production system. The incorporation of gradient-based learning in SbPGs reduces training times and achieves more optimal policies than its baseline.

[AI-35] Beyond Slow Signs in High-fidelity Model Extraction

链接: https://arxiv.org/abs/2406.10011
作者: Hanna Foerster,Robert Mullins,Ilia Shumailov,Jamie Hayes
关键词: Deep neural networks, Deep neural, neural networks, costly to train, compromise their confidentiality
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Deep neural networks, costly to train and rich in intellectual property value, are increasingly threatened by model extraction attacks that compromise their confidentiality. Previous attacks have succeeded in reverse-engineering model parameters up to a precision of float64 for models trained on random data with at most three hidden layers using cryptanalytical techniques. However, the process was identified to be very time consuming and not feasible for larger and deeper models trained on standard benchmarks. Our study evaluates the feasibility of parameter extraction methods of Carlini et al. [1] further enhanced by Canales-Martínez et al. [2] for models trained on standard benchmarks. We introduce a unified codebase that integrates previous methods and reveal that computational tools can significantly influence performance. We develop further optimisations to the end-to-end attack and improve the efficiency of extracting weight signs by up to 14.8 times compared to former methods through the identification of easier and harder to extract neurons. Contrary to prior assumptions, we identify extraction of weights, not extraction of weight signs, as the critical bottleneck. With our improvements, a 16,721 parameter model with 2 hidden layers trained on MNIST is extracted within only 98 minutes compared to at least 150 minutes previously. Finally, addressing methodological deficiencies observed in previous studies, we propose new ways of robust benchmarking for future model extraction attacks.

[AI-36] Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning

链接: https://arxiv.org/abs/2406.09988
作者: Xiaowen Sun,Xufeng Zhao,Jae Hee Lee,Wenhao Lu,Matthias Kerzel,Stefan Wermter
关键词: planning and manipulation, reflects its current, current status, status or condition, object
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The state of an object reflects its current status or condition and is important for a robot’s task planning and manipulation. However, detecting an object’s state and generating a state-sensitive plan for robots is challenging. Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans. However, to the best of our knowledge, there is hardly any investigation on whether LLMs or VLMs can also generate object state-sensitive plans. To study this, we introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks. We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module (dense captioning model, DCM) and a natural language processing model (LLM), and (ii) a monolithic model consisting only of a VLM. To quantitatively evaluate the performances of the two methods, we use tabletop scenarios where the task is to clear the table. We contribute a multimodal benchmark dataset that takes object states into consideration. Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach. The code for OSSA is available at this https URL

[AI-37] Challenges in explaining deep learning models for data with biological variation

链接: https://arxiv.org/abs/2406.09981
作者: Lenka Tětková,Erik Schou Dreier,Robin Malm,Lars Kai Hansen
关键词: learning research progress, research progress, progress is based, based on developing, machine learning research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Much machine learning research progress is based on developing models and evaluating them on a benchmark dataset (e.g., ImageNet for images). However, applying such benchmark-successful methods to real-world data often does not work as expected. This is particularly the case for biological data where we expect variability at multiple time and spatial scales. In this work, we are using grain data and the goal is to detect diseases and damages. Pink fusarium, skinned grains, and other diseases and damages are key factors in setting the price of grains or excluding dangerous grains from food production. Apart from challenges stemming from differences of the data from the standard toy datasets, we also present challenges that need to be overcome when explaining deep learning models. For example, explainability methods have many hyperparameters that can give different results, and the ones published in the papers do not work on dissimilar images. Other challenges are more general: problems with visualization of the explanations and their comparison since the magnitudes of their values differ from method to method. An open fundamental question also is: How to evaluate explanations? It is a non-trivial task because the “ground truth” is usually missing or ill-defined. Also, human annotators may create what they think is an explanation of the task at hand, yet the machine learning model might solve it in a different and perhaps counter-intuitive way. We discuss several of these challenges and evaluate various post-hoc explainability methods on grain data. We focus on robustness, quality of explanations, and similarity to particular “ground truth” annotations made by experts. The goal is to find the methods that overall perform well and could be used in this challenging task. We hope the proposed pipeline will be used as a framework for evaluating explainability methods in specific use cases.

[AI-38] HIRO: Hierarchical Information Retrieval Optimization

链接: https://arxiv.org/abs/2406.09979
作者: Krish Goel,Mahek Chandak
关键词: Large Language Models, Large Language, natural language tasks, face limitations due, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel in natural language tasks but face limitations due to static training datasets, resulting in outdated or contextually shallow responses. Retrieval-Augmented Generation (RAG) addresses this by integrating real-time external knowledge, enhancing model accuracy and credibility, especially for knowledge-intensive tasks. However, RAG-enhanced LLMs struggle with long contexts, causing them to “choke” on information overload, compromising response quality. Recent RAG applications use hierarchical data structures for storing documents, organized at various levels of summarization and information density. In this context, we introduce HIRO (Hierarchical Information Retrieval Optimization), a novel querying approach for RAG applications using hierarchical structures for storing documents. HIRO employs DFS-based recursive similarity score calculation and branch pruning to minimize the context returned to the LLM without informational loss. HIRO outperforms existing querying mechanisms on the NarrativeQA dataset by an absolute performance gain of 10.85%.
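
One plausible reading of "DFS-based recursive similarity score calculation and branch pruning" is sketched below: descend a summary tree, score each node against the query, and skip entire subtrees whose similarity falls under a threshold. The node attributes (embedding, text, children) and cosine scoring are assumptions, not HIRO's documented API.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hiro_retrieve(node, query_emb, threshold, acc=None):
    """Recursive DFS over a hierarchical summary tree with branch pruning."""
    if acc is None:
        acc = []
    score = cosine(node.embedding, query_emb)
    if score < threshold:
        return acc                        # prune: skip this node and its whole subtree
    if not node.children:
        acc.append((score, node.text))    # leaf chunk passed the similarity gate
    for child in node.children:
        hiro_retrieve(child, query_emb, threshold, acc)
    return acc
```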

[AI-39] Robust Model-Based Reinforcement Learning with an Adversarial Auxiliary Model

链接: https://arxiv.org/abs/2406.09976
作者: Siemen Herremans,Ali Anwar,Siegfried Mercelis
关键词: classical arcade games, board games, demonstrated impressive performance, arcade games, Reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Will be presented at the RL Safety Workshop at RLC 2024

点击查看摘要

Abstract:Reinforcement learning has demonstrated impressive performance in various challenging problems such as robotics, board games, and classical arcade games. However, its real-world applications can be hindered by the absence of robustness and safety in the learned policies. More specifically, an RL agent that trains in a certain Markov decision process (MDP) often struggles to perform well in nearly identical MDPs. To address this issue, we employ the framework of Robust MDPs (RMDPs) in a model-based setting and introduce a novel learned transition model. Our method specifically incorporates an auxiliary pessimistic model, updated adversarially, to estimate the worst-case MDP within a Kullback-Leibler uncertainty set. In comparison to several existing works, our work does not impose any additional conditions on the training environment, such as the need for a parametric simulator. To test the effectiveness of the proposed pessimistic model in enhancing policy robustness, we integrate it into a practical RL algorithm, called Robust Model-Based Policy Optimization (RMBPO). Our experimental results indicate a notable improvement in policy robustness on high-dimensional MuJoCo control tasks, with the auxiliary model enhancing the performance of the learned policy in distorted MDPs. We further explore the learned deviation between the proposed auxiliary world model and the nominal model, to examine how pessimism is achieved. By learning a pessimistic world model and demonstrating its role in improving policy robustness, our research contributes towards making (model-based) RL more robust.

[AI-40] Outlier detection in maritime environments using AIS data and deep recurrent architectures

链接: https://arxiv.org/abs/2406.09966
作者: Constantine Maganaris,Eftychios Protopapadakis,Nikolaos Doulamis
关键词: Automatic Identification System, Identification System, Automatic Identification, Recurrent Neural Network, publicly available Automatic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Presented in PETRA '24 The PErvasive Technologies Related to Assistive Environments Conference June 26–28, 2024 Crete, Greece

点击查看摘要

Abstract:A methodology based on deep recurrent models for maritime surveillance, over publicly available Automatic Identification System (AIS) data, is presented in this paper. The setup employs a deep Recurrent Neural Network (RNN)-based model, for encoding and reconstructing the observed ships’ motion patterns. Our approach is based on a thresholding mechanism, over the calculated errors between observed and reconstructed motion patterns of maritime vessels. Specifically, a deep-learning framework, i.e. an encoder-decoder architecture, is trained using the observed motion patterns, enabling the models to learn and predict the expected trajectory, which will be compared to the effective ones. Our models, particularly the bidirectional GRU with recurrent dropouts, showcased superior performance in capturing the temporal dynamics of maritime data, illustrating the potential of deep learning to enhance maritime surveillance capabilities. Our work lays a solid foundation for future research in this domain, highlighting a path toward improved maritime safety through the innovative application of technology.
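
The thresholding mechanism over reconstruction errors can be sketched as follows; the mean-plus-k-standard-deviations rule is an assumption, as the abstract does not specify how the threshold is chosen.

```python
import numpy as np

def flag_outliers(observed, reconstructed, k=3.0):
    """Flag tracks whose reconstruction error is far above the typical error.
    observed / reconstructed: arrays of shape (n_tracks, timesteps, features)."""
    errors = np.mean((observed - reconstructed) ** 2, axis=(1, 2))  # one MSE per track
    threshold = errors.mean() + k * errors.std()
    return errors > threshold, errors
```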

[AI-41] DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

链接: https://arxiv.org/abs/2406.09953
作者: Zeyu Gao,Yao Mu,Jinye Qu,Mengkang Hu,Lingyue Guo,Ping Luo,Yanfeng Lu
关键词: offer enhanced versatility, robots offer enhanced, enabling concurrent manipulation, offer enhanced, enhanced versatility
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 44 pages, 13 figures

点击查看摘要

Abstract:Dual-arm robots offer enhanced versatility and efficiency over single-arm counterparts by enabling concurrent manipulation of multiple objects or cooperative execution of tasks using both arms. However, effectively coordinating the two arms for complex long-horizon tasks remains a significant challenge. Existing task planning methods predominantly focus on single-arm robots or rely on predefined bimanual operations, failing to fully leverage the capabilities of dual-arm systems. To address this limitation, we introduce DAG-Plan, a structured task planning framework tailored for dual-arm robots. DAG-Plan harnesses large language models (LLMs) to decompose intricate tasks into actionable sub-tasks represented as nodes within a directed acyclic graph (DAG). Critically, DAG-Plan dynamically assigns these sub-tasks to the appropriate arm based on real-time environmental observations, enabling parallel and adaptive execution. We evaluate DAG-Plan on the novel Dual-Arm Kitchen Benchmark, comprising 9 sequential tasks with 78 sub-tasks and 26 objects. Extensive experiments demonstrate the superiority of DAG-Plan over directly using LLM to generate plans, achieving nearly 50% higher efficiency compared to the single-arm task planning baseline and nearly double the success rate of the dual-arm task planning baseline.
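
To illustrate the execution side of a sub-task DAG, here is a hedged sketch that dispatches ready nodes to a left or right arm and runs at most one task per arm per step; assign_arm and the task callables are placeholders for the paper's LLM-driven assignment and the robot skills, and real execution would of course be asynchronous.

```python
from collections import deque

def execute_dag(tasks, deps, assign_arm):
    """tasks: id -> callable; deps: id -> set of prerequisite ids; assign_arm: id -> 'left'/'right'."""
    indegree = {t: len(deps[t]) for t in tasks}
    ready = deque(t for t, d in indegree.items() if d == 0)
    done = set()
    while ready:
        batch = {"left": None, "right": None}
        for _ in range(len(ready)):               # fill both arms from the ready queue
            t = ready.popleft()
            arm = assign_arm(t)
            if batch[arm] is None:
                batch[arm] = t
            else:
                ready.append(t)                   # that arm is busy this step; retry later
        for arm, t in batch.items():
            if t is None:
                continue
            tasks[t]()                            # placeholder for executing the robot skill
            done.add(t)
            for succ, prereqs in deps.items():    # release successors whose deps are now met
                if t in prereqs and succ not in done:
                    indegree[succ] -= 1
                    if indegree[succ] == 0:
                        ready.append(succ)
    return done
```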

[AI-42] Neural Concept Binder

链接: https://arxiv.org/abs/2406.09949
作者: Wolfgang Stammer,Antonia Wüst,David Steinmann,Kristian Kersting
关键词: Neural Concept Binder, visual reasoning lies, object-based visual reasoning, distinct concept representations, object-based visual
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:The challenge in object-based visual reasoning lies in generating descriptive yet distinct concept representations. Moreover, doing this in an unsupervised fashion requires human users to understand a model’s learned concepts and potentially revise false concepts. In addressing this challenge, we introduce the Neural Concept Binder, a new framework for deriving discrete concept representations resulting in what we term “concept-slot encodings”. These encodings leverage both “soft binding” via object-centric block-slot encodings and “hard binding” via retrieval-based inference. The Neural Concept Binder facilitates straightforward concept inspection and direct integration of external knowledge, such as human input or insights from other AI models like GPT-4. Additionally, we demonstrate that incorporating the hard binding mechanism does not compromise performance; instead, it enables seamless integration into both neural and symbolic modules for intricate reasoning tasks, as evidenced by evaluations on our newly introduced CLEVR-Sudoku dataset.

[AI-43] Experiments in News Bias Detection with Pre-Trained Neural Transformers

链接: https://arxiv.org/abs/2406.09938
作者: Tim Menzner,Jochen L. Leidner
关键词: World Wide Web, World Wide, Wide Web, Web provides unrivalled, including factual
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The World Wide Web provides unrivalled access to information globally, including factual news reporting and commentary. However, state actors and commercial players increasingly spread biased (distorted) or fake (non-factual) information to promote their agendas. We compare several large, pre-trained language models on the task of sentence-level news bias detection and sub-type classification, providing quantitative and qualitative results.
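
Sentence-level bias detection with a pre-trained transformer reduces, in practice, to standard sequence classification. The sketch below shows the inference path with bert-base-uncased as a stand-in checkpoint; the paper compares several models, and the classification head must first be fine-tuned on labeled bias data before these outputs mean anything.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def classify_sentence(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))  # e.g. 0 = neutral, 1 = biased, after fine-tuning
```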

[AI-44] What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

链接: https://arxiv.org/abs/2406.09933
作者: Adham Ibrahim,Shady Shehata,Ajinkya Kulkarni,Mukhtar Mohamed,Muhammad Abdul-Mageed
关键词: enhancing human-computer interaction, Speech emotion recognition, SER, speech-based applications, essential for enhancing
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: ACCEPTED AT INTERSPEECH 2024, GREECE

点击查看摘要

Abstract:Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements in specific emotional datasets, there is still a research gap in SER’s capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and illustrate a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols for adeptness in the generalization of SER. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.

[AI-45] Personalized Speech Enhancement Without a Separate Speaker Embedding Model

链接: https://arxiv.org/abs/2406.09928
作者: Tanel Pärnamaa,Ando Saabas
关键词: Personalized speech enhancement, Personalized speech, speaker embedding model, speech enhancement, speaker embedding
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Personalized speech enhancement (PSE) models can improve the audio quality of teleconferencing systems by adapting to the characteristics of a speaker’s voice. However, most existing methods require a separate speaker embedding model to extract a vector representation of the speaker from enrollment audio, which adds complexity to the training and deployment process. We propose to use the internal representation of the PSE model itself as the speaker embedding, thereby avoiding the need for a separate model. We show that our approach performs equally well or better than the standard method of using a pre-trained speaker embedding model on noise suppression and echo cancellation tasks. Moreover, our approach surpasses the ICASSP 2023 Deep Noise Suppression Challenge winner by 0.15 in Mean Opinion Score.

[AI-46] CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses Procedures Lab Tests Orders and Prescriptions

链接: https://arxiv.org/abs/2406.09923
作者: Mingyu Derek Ma,Chenchen Ye,Yu Yan,Xiaoxuan Wang,Peipei Ping,Timothy S Chang,Wei Wang
关键词: Large Language Models, Artificial Intelligence, Language Models, Large Language, process offers significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs’ capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLM’s capability on diverse clinical tasks of desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.

[AI-47] Knowledge Editing in Language Models via Adapted Direct Preference Optimization

链接: https://arxiv.org/abs/2406.09920
作者: Amit Rozner,Barak Battash,Lior Wolf,Ofir Lindenbaum
关键词: Large Language Models, Large Language, Direct Preference Optimization, lack updated world, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) can become outdated over time as they may lack updated world knowledge, leading to factual knowledge errors and gaps. Knowledge Editing (KE) aims to overcome this challenge using weight updates that do not require expensive retraining. We propose treating KE as an LLM alignment problem. Toward this goal, we introduce Knowledge Direct Preference Optimization (KDPO), a variation of the Direct Preference Optimization (DPO) that is more effective for knowledge modifications. Our method is based on an online approach that continually updates the knowledge stored in the model. We use the current knowledge as a negative sample and the new knowledge we want to introduce as a positive sample in a process called DPO. We also use teacher-forcing for negative sample generation and optimize using the positive sample, which helps maintain localized changes. We tested our KE method on various datasets and models, comparing it to several cutting-edge methods, with 100 and 500 sequential edits. Additionally, we conducted an ablation study comparing our method to the standard DPO approach. Our experimental results show that our modified DPO method allows for more refined KE, achieving similar or better performance compared to previous methods.
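
For reference, the vanilla DPO objective that KDPO builds on can be written as below, with the new fact as the preferred completion and the currently stored fact as the rejected one; the paper's modifications (teacher-forced negative generation, localized updates) are not captured in this sketch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss: prefer the new knowledge (positive) over the stored knowledge (negative)."""
    pos_margin = logp_pos - ref_logp_pos   # log-prob gain of the new fact vs. the reference model
    neg_margin = logp_neg - ref_logp_neg   # log-prob gain of the old fact vs. the reference model
    return -F.logsigmoid(beta * (pos_margin - neg_margin)).mean()
```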

[AI-48] Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment Problem

链接: https://arxiv.org/abs/2406.09899
作者: Zhentao Tan,Yadong Mu
关键词: Mixed Integer Linear, Integer Linear Programming, Linear Programming Problems, Mixed Integer, Integer Linear
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by ICML 2024

点击查看摘要

Abstract:Recently various optimization problems, such as Mixed Integer Linear Programming Problems (MILPs), have undergone comprehensive investigation, leveraging the capabilities of machine learning. This work focuses on learning-based solutions for efficiently solving the Quadratic Assignment Problem (QAP), which stands as a formidable challenge in combinatorial optimization. While many instances of simpler problems admit a fully polynomial-time approximation scheme (FPTAS), QAP is shown to be strongly NP-hard. Even finding an FPTAS for QAP is difficult, in the sense that the existence of an FPTAS implies P = NP. Current research on QAPs suffers from limited scale and computational inefficiency. To attack the aforementioned issues, we here propose the first solution of its kind for QAP in the learn-to-improve category. This work encodes facility and location nodes separately, instead of forming computationally intensive association graphs prevalent in current approaches. This design choice enables scalability to larger problem sizes. Furthermore, a Solution AWare Transformer (SAWT) architecture integrates the incumbent solution matrix with the attention score to effectively capture higher-order information of the QAPs. Our model’s effectiveness is validated through extensive experiments on self-generated QAP instances of varying sizes and the QAPLIB benchmark.
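
For context, the objective any QAP solver, learned or not, is scoring is a flow-weighted sum of distances under an assignment; a minimal evaluation helper is shown below. A learn-to-improve method repeatedly proposes changes to the assignment and keeps those that lower this cost.

```python
import numpy as np

def qap_cost(flow, dist, assignment):
    """Quadratic assignment objective: facility i sits at location assignment[i];
    cost = sum_{i,j} flow[i, j] * dist[assignment[i], assignment[j]]."""
    perm = np.asarray(assignment)
    return float(np.sum(flow * dist[np.ix_(perm, perm)]))
```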

[AI-49] Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming

链接: https://arxiv.org/abs/2406.09891
作者: Victor-Alexandru Pădurean,Adish Singla
关键词: demonstrated human-level proficiency, natural sciences, general knowledge, demonstrated human-level, human-level proficiency
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is to develop a comprehensive dataset using symbolic methods that capture different skill levels, ranging from recognition of visual elements to multi-choice quizzes to synthesis-style tasks. We showcase how various aspects of symbolic information in synthetic data help improve fine-tuned models’ performance. We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.

[AI-50] Federated Learning with Flexible Architectures

链接: https://arxiv.org/abs/2406.09877
作者: Jong-Ik Park,Carlee Joe-Wong
关键词: Traditional federated learning, Traditional federated, federated learning, communication abilities, leading to inefficiencies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Traditional federated learning (FL) methods have limited support for clients with varying computational and communication abilities, leading to inefficiencies and potential inaccuracies in model training. This limitation hinders the widespread adoption of FL in diverse and resource-constrained environments, such as those with client devices ranging from powerful servers to mobile devices. To address this need, this paper introduces Federated Learning with Flexible Architectures (FedFA), an FL training algorithm that allows clients to train models of different widths and depths. Each client can select a network architecture suitable for its resources, with shallower and thinner networks requiring fewer computing resources for training. Unlike prior work in this area, FedFA incorporates the layer grafting technique to align clients’ local architectures with the largest network architecture in the FL system during model aggregation. Layer grafting ensures that all client contributions are uniformly integrated into the global model, thereby minimizing the risk of any individual client’s data skewing the model’s parameters disproportionately and introducing security benefits. Moreover, FedFA introduces the scalable aggregation method to manage scale variations in weights among different network architectures. Experimentally, FedFA outperforms previous width and depth flexible aggregation strategies. Furthermore, FedFA demonstrates increased robustness against performance degradation in backdoor attack scenarios compared to earlier strategies.
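The layer-grafting idea can be pictured with a toy aggregation step: a narrower client layer is written into the corresponding block of the largest architecture before averaging. This is a simplified sketch under an assumed top-left alignment rule and toy shapes, not FedFA's actual grafting or scalable aggregation procedure.

```python
import numpy as np

def graft_client_layer(global_weight, client_weight):
    """Place a narrower client layer into the top-left block of the global layer
    (illustrative width-flexible sketch, not the paper's exact grafting rule)."""
    grafted = np.array(global_weight, copy=True)
    rows, cols = client_weight.shape
    grafted[:rows, :cols] = client_weight
    return grafted

def aggregate(global_weight, client_weights):
    """Average the grafted client layers to form the new global layer."""
    grafted = [graft_client_layer(global_weight, w) for w in client_weights]
    return np.mean(grafted, axis=0)

global_layer = np.zeros((8, 8))
clients = [np.ones((4, 4)), np.full((8, 8), 2.0)]   # one thin client, one full-width client
print(aggregate(global_layer, clients))
```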

[AI-51] IGL-Bench: Establishing the Comprehensive Benchmark for Imbalanced Graph Learning

链接: https://arxiv.org/abs/2406.09870
作者: Jiawen Qin,Haonan Yuan,Qingyun Sun,Lyujin Xu,Jiaqi Yuan,Pengfeng Huang,Zhaonan Wang,Xingcheng Fu,Hao Peng,Jianxin Li,Philip S. Yu
关键词: Deep graph learning, gained grand popularity, past years due, Imbalanced Graph Learning, graph learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Preprint, under review)

点击查看摘要

Abstract:Deep graph learning has gained grand popularity over the past years due to its versatility and success in representing graph data across a wide range of domains. However, the pervasive issue of imbalanced graph data distributions, where certain parts exhibit disproportionally abundant data while others remain sparse, undermines the efficacy of conventional graph learning algorithms, leading to biased outcomes. To address this challenge, Imbalanced Graph Learning (IGL) has garnered substantial attention, enabling more balanced data distributions and better task performance. Despite the proliferation of IGL algorithms, the absence of consistent experimental protocols and fair performance comparisons pose a significant barrier to comprehending advancements in this field. To bridge this gap, we introduce IGL-Bench, a foundational comprehensive benchmark for imbalanced graph learning, embarking on 16 diverse graph datasets and 24 distinct IGL algorithms with uniform data processing and splitting strategies. Specifically, IGL-Bench systematically investigates state-of-the-art IGL algorithms in terms of effectiveness, robustness, and efficiency on node-level and graph-level tasks, with the scope of class-imbalance and topology-imbalance. Extensive experiments demonstrate the potential benefits of IGL algorithms on various imbalanced conditions, offering insights and opportunities in the IGL field. Further, we have developed an open-sourced and unified package to facilitate reproducible evaluation and inspire further innovative research, which is available at this https URL.

[AI-52] LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data

链接: https://arxiv.org/abs/2406.09864
作者: Grigor Bezirganyan,Sana Sellami,Laure Berti-Équille,Sébastien Fournier
关键词: diverse information sources, integrating diverse information, Learning enhances decision-making, Multimodal Deep Learning, Deep Learning enhances
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as texts, images, audio, and videos. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We introduce LUMA, a unique benchmark dataset, featuring audio, image, and textual data from 50 classes, for learning from uncertain and multimodal data. It extends the well-known CIFAR 10/100 dataset with audio samples extracted from three audio corpora, and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to achieve and tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package including the functions for generating multiple variants of the dataset with controlling the diversity of the data, the amount of noise for each modality, and adding out-of-distribution samples. A baseline pre-trained model is also provided alongside three uncertainty quantification methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning. This comprehensive dataset and its tools are intended to promote and support the development and benchmarking of trustworthy and robust multimodal deep learning approaches.

[AI-53] Dataset Condensation with Latent Quantile Matching

链接: https://arxiv.org/abs/2406.09860
作者: Wei Wei,Tom De Schepper,Kevin Mets
关键词: informative data records, smaller synthesized dataset, machine learning models, synthesized dataset, informative data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CVPR Workshop 2024: 1st Workshop on Dataset Distillation for Computer Vision

点击查看摘要

Abstract:Dataset condensation (DC) methods aim to learn a smaller synthesized dataset with informative data records to accelerate the training of machine learning models. Current distribution matching (DM) based DC methods learn a synthesized dataset by matching the mean of the latent embeddings between the synthetic and the real dataset. However, two distributions with the same mean can still be vastly different. In this work we demonstrate the shortcomings of using Maximum Mean Discrepancy to match latent distributions, i.e., the weak matching power and lack of outlier regularization. To alleviate these shortcomings we propose our new method: Latent Quantile Matching (LQM), which matches the quantiles of the latent embeddings to minimize the goodness-of-fit test statistic between two distributions. Empirical experiments on both image and graph-structured datasets show that LQM matches or outperforms the previous state of the art in distribution matching based DC. Moreover, we show that LQM improves the performance in the continual graph learning (CGL) setting, where memory efficiency and privacy can be important. Our work sheds light on the application of DM based DC for CGL.
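A minimal sketch of the quantile-matching idea: compare per-dimension quantiles of real and synthetic latent embeddings instead of only their means. The quantile grid, the squared-error form, and the toy tensors below are assumptions for illustration; the paper defines the precise goodness-of-fit statistic being minimized.

```python
import torch

def latent_quantile_matching_loss(real_latents, syn_latents, num_quantiles=16):
    """Match per-dimension quantiles of latent embeddings (illustrative sketch)."""
    qs = torch.linspace(0.0, 1.0, num_quantiles, device=real_latents.device)
    real_q = torch.quantile(real_latents, qs, dim=0)   # (num_quantiles, dim)
    syn_q = torch.quantile(syn_latents, qs, dim=0)
    return ((real_q - syn_q) ** 2).mean()

real = torch.randn(512, 64)                            # embeddings of real samples
syn = torch.randn(32, 64, requires_grad=True)          # learnable synthetic embeddings
loss = latent_quantile_matching_loss(real, syn)
loss.backward()                                        # gradients flow into the synthetic set
```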

[AI-54] Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps

链接: https://arxiv.org/abs/2406.09838
作者: Jian Chen,Peilin Zhou,Yining Hua,Dading Chong,Meng Cao,Yaowei Li,Zixuan Yuan,Bing Zhu,Junwei Liang
关键词: protect human lives, weather protect human, extreme weather protect, Extreme Weather Events, Real-time detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time detection and prediction of extreme weather protect human lives and infrastructure. Traditional methods rely on numerical threshold setting and manual interpretation of weather heatmaps with Geographic Information Systems (GIS), which can be slow and error-prone. Our research redefines Extreme Weather Events Detection (EWED) by framing it as a Visual Question Answering (VQA) problem, thereby introducing a more precise and automated solution. Leveraging Vision-Language Models (VLM) to simultaneously process visual and textual data, we offer an effective aid to enhance the analysis process of weather heatmaps. Our initial assessment of general-purpose VLMs (e.g., GPT-4-Vision) on EWED revealed poor performance, characterized by low accuracy and frequent hallucinations due to inadequate color differentiation and insufficient meteorological knowledge. To address these challenges, we introduce ClimateIQA, the first meteorological VQA dataset, which includes 8,760 wind gust heatmaps and 254,040 question-answer pairs covering four question types, both generated from the latest climate reanalysis data. We also propose Sparse Position and Outline Tracking (SPOT), an innovative technique that leverages OpenCV and K-Means clustering to capture and depict color contours in heatmaps, providing ClimateIQA with more accurate color spatial location information. Finally, we present Climate-Zoo, the first meteorological VLM collection, which adapts VLMs to meteorological applications using the ClimateIQA dataset. Experiment results demonstrate that models from Climate-Zoo substantially outperform state-of-the-art general VLMs, achieving an accuracy increase from 0% to over 90% in EWED verification. The datasets and models in this study are publicly available for future climate science research: this https URL.
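As a rough illustration of the colour-localisation step behind SPOT, the sketch below clusters heatmap pixels by colour with K-Means and returns each cluster's mean colour together with its pixel coordinates. The cluster count and random stand-in heatmap are assumptions; the actual method additionally traces outlines with OpenCV.

```python
import numpy as np
from sklearn.cluster import KMeans

def colour_regions(heatmap_rgb, n_clusters=5):
    """Group heatmap pixels by colour and return (mean colour, pixel coords) per cluster
    (simplified sketch of the colour-localisation idea, not the full SPOT pipeline)."""
    h, w, _ = heatmap_rgb.shape
    pixels = heatmap_rgb.reshape(-1, 3).astype(float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(pixels)
    coords = np.indices((h, w)).reshape(2, -1).T       # (row, col) for every pixel
    return [(pixels[labels == c].mean(axis=0), coords[labels == c])
            for c in range(n_clusters)]

heatmap = (np.random.rand(32, 32, 3) * 255).astype(np.uint8)   # stand-in for a wind-gust heatmap
for mean_colour, pixel_coords in colour_regions(heatmap):
    print(mean_colour.round(1), len(pixel_coords))
```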

[AI-55] SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering

链接: https://arxiv.org/abs/2406.09833
作者: Zhe Yang,Wenrui Li,Guanghui Cheng
关键词: Audio-Visual Question Answering, Question Answering, task holds significant, holds significant potential, task holds
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared to traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion processes more challenging. Euclidean space is difficult to effectively represent multi-dimensional relationships of data. Especially when extracting and processing data with a tree structure or hierarchical structure, Euclidean space is not suitable as an embedding space. Additionally, the self-attention mechanism in Transformers is effective in capturing the dynamic relationships between elements in a sequence. However, the self-attention mechanism’s limitations in window modeling and quadratic computational complexity reduce its effectiveness in modeling long sequences. To address these limitations, we propose SHMamba: Structured Hyperbolic State Space Model to integrate the advantages of hyperbolic geometry and state space models. Specifically, SHMamba leverages the intrinsic properties of hyperbolic space to represent hierarchical structures and complex relationships in audio-visual data. Meanwhile, the state space model captures dynamic changes over time by globally modeling the entire sequence. Furthermore, we introduce an adaptive curvature hyperbolic alignment module and a cross fusion block to enhance the understanding of hierarchical structures and the dynamic exchange of cross-modal information, respectively. Extensive experiments demonstrate that SHMamba outperforms previous methods with fewer parameters and computational costs. Our learnable parameters are reduced by 78.12%, while the average performance improves by 2.53%. Experiments show that our method demonstrates superiority among all current major methods and is more suitable for practical application scenarios.

[AI-56] Federated Learning driven Large Language Models for Swarm Intelligence: A Survey

链接: https://arxiv.org/abs/2406.09831
作者: Youyang Qu
关键词: training large language, large language models, large language, addressing data privacy, offers a compelling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Federated learning (FL) offers a compelling framework for training large language models (LLMs) while addressing data privacy and decentralization challenges. This paper surveys recent advancements in the federated learning of large language models, with a particular focus on machine unlearning, a crucial aspect for complying with privacy regulations like the Right to be Forgotten. Machine unlearning in the context of federated LLMs involves systematically and securely removing individual data contributions from the learned model without retraining from scratch. We explore various strategies that enable effective unlearning, such as perturbation techniques, model decomposition, and incremental learning, highlighting their implications for maintaining model performance and data privacy. Furthermore, we examine case studies and experimental results from recent literature to assess the effectiveness and efficiency of these approaches in real-world scenarios. Our survey reveals a growing interest in developing more robust and scalable federated unlearning methods, suggesting a vital area for future research in the intersection of AI ethics and distributed machine learning technologies.

[AI-57] Unraveling Anomalies in Time: Unsupervised Discovery and Isolation of Anomalous Behavior in Bio-regenerative Life Support System Telemetry

链接: https://arxiv.org/abs/2406.09825
作者: Ferdinand Rewicki,Jakob Gawlikowski,Julia Niebling,Joachim Denzler
关键词: condition monitoring, critical system states, states is essential, essential in condition, Life Support Systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 12 pages, + Supplemental Materials, Accepted at ECML PKDD 2024 (European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases)

点击查看摘要

Abstract:The detection of abnormal or critical system states is essential in condition monitoring. While much attention is given to promptly identifying anomalies, a retrospective analysis of these anomalies can significantly enhance our comprehension of the underlying causes of observed undesired behavior. This aspect becomes particularly critical when the monitored system is deployed in a vital environment. In this study, we delve into anomalies within the domain of Bio-Regenerative Life Support Systems (BLSS) for space exploration and analyze anomalies found in telemetry data stemming from the EDEN ISS space greenhouse in Antarctica. We employ time series clustering on anomaly detection results to categorize various types of anomalies in both uni- and multivariate settings. We then assess the effectiveness of these methods in identifying systematic anomalous behavior. Additionally, we illustrate that the anomaly detection methods MDI and DAMP produce complementary results, as previously indicated by research.

[AI-58] From Manifestations to Cognitive Architectures: a Scalable Framework

链接: https://arxiv.org/abs/2406.09823
作者: Alfredo Ibias,Guillem Ramirez-Miranda,Enric Guinovart,Eduard Alarcon
关键词: Artificial Intelligence field, Artificial General Intelligence, Artificial Intelligence, Intelligence field, field is flooded
类目: Artificial Intelligence (cs.AI)
*备注: To be published by AGI 2024 conference proceedings

点击查看摘要

Abstract:The Artificial Intelligence field is flooded with optimisation methods. In this paper, we change the focus to developing modelling methods with the aim of getting us closer to Artificial General Intelligence. To do so, we propose a novel way to interpret reality as an information source, that is later translated into a computational framework able to capture and represent such information. This framework is able to build elements of classical cognitive architectures, like Long Term Memory and Working Memory, starting from a simple primitive that only processes Spatial Distributed Representations. Moreover, it achieves such level of verticality in a seamless scalable hierarchical way.

[AI-59] Retrieval Augmented Fact Verification by Synthesizing Contrastive Arguments

链接: https://arxiv.org/abs/2406.09815
作者: Zhenrui Yue,Huimin Zeng,Lanyu Shang,Yifan Liu,Yang Zhang,Dong Wang
关键词: poses substantial risks, misinformation poses substantial, public interest, rapid propagation, poses substantial
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to ACL 2024

点击查看摘要

Abstract:The rapid propagation of misinformation poses substantial risks to public interest. To combat misinformation, large language models (LLMs) are adapted to automatically verify claim credibility. Nevertheless, existing methods heavily rely on the embedded knowledge within LLMs and / or black-box APIs for evidence collection, leading to subpar performance with smaller LLMs or upon unreliable context. In this paper, we propose retrieval augmented fact verification through the synthesis of contrasting arguments (RAFTS). Upon input claims, RAFTS starts with evidence retrieval, where we design a retrieval pipeline to collect and re-rank relevant documents from verifiable sources. Then, RAFTS forms contrastive arguments (i.e., supporting or refuting) conditioned on the retrieved evidence. In addition, RAFTS leverages an embedding model to identify informative demonstrations, followed by in-context prompting to generate the prediction and explanation. Our method effectively retrieves relevant documents as evidence and evaluates arguments from varying perspectives, incorporating nuanced information for fine-grained decision-making. Combined with informative in-context examples as prior, RAFTS achieves significant improvements to supervised and LLM baselines without complex prompts. We demonstrate the effectiveness of our method through extensive experiments, where RAFTS can outperform GPT-based methods with a significantly smaller 7B LLM.

[AI-60] Evolving Self-Assembling Neural Networks: From Spontaneous Activity to Experience-Dependent Learning

链接: https://arxiv.org/abs/2406.09787
作者: Erwan Plantec,Joachin W.Pedersen,Milton L.Montero,Eleni Nisioti,Sebastian Risi
关键词: Biological neural networks, Biological neural, Neural Developmental Programs, Lifelong Neural Developmental, natural organisms
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Biological neural networks are characterized by their high degree of plasticity, a core property that enables the remarkable adaptability of natural organisms. Importantly, this ability affects both the synaptic strength and the topology of the nervous systems. Artificial neural networks, on the other hand, have been mainly designed as static, fully connected structures that can be notoriously brittle in the face of changing environments and novel inputs. Building on previous works on Neural Developmental Programs (NDPs), we propose a class of self-organizing neural networks capable of synaptic and structural plasticity in an activity and reward-dependent manner which we call Lifelong Neural Developmental Program (LNDP). We present an instance of such a network built on the graph transformer architecture and propose a mechanism for pre-experience plasticity based on the spontaneous activity of sensory neurons. Our results demonstrate the ability of the model to learn from experiences in different control tasks starting from randomly connected or empty networks. We further show that structural plasticity is advantageous in environments necessitating fast adaptation or with non-stationary rewards.

[AI-61] OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst

链接: https://arxiv.org/abs/2406.09779
作者: Jingtao Cao,Zheng Zhang,Hongru Wang,Bin Liang,Hao Wang,Kam-Fai Wong
关键词: rapidly disseminate personal, disseminate personal opinions, propagating social bias, pose significant challenges, bias and prejudice
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Memes, which rapidly disseminate personal opinions and positions across the internet, also pose significant challenges in propagating social bias and prejudice. This study presents a novel approach to detecting harmful memes, particularly within the multicultural and multilingual context of Singapore. Our methodology integrates image captioning, Optical Character Recognition (OCR), and Large Language Model (LLM) analysis to comprehensively understand and classify harmful memes. Utilizing the BLIP model for image captioning, PP-OCR and TrOCR for text recognition across multiple languages, and the Qwen LLM for nuanced language understanding, our system is capable of identifying harmful content in memes created in English, Chinese, Malay, and Tamil. To enhance the system’s performance, we fine-tuned our approach by leveraging additional data labeled using GPT-4V, aiming to distill the understanding capability of GPT-4V for harmful memes to our system. Our framework achieves top-1 at the public leaderboard of the Online Safety Prize Challenge hosted by AI Singapore, with the AUROC as 0.7749 and accuracy as 0.7087, significantly ahead of the other teams. Notably, our approach outperforms previous benchmarks, with FLAVA achieving an AUROC of 0.5695 and VisualBERT an AUROC of 0.5561.

[AI-62] Research on Edge Detection of LiDAR Images Based on Artificial Intelligence Technology

链接: https://arxiv.org/abs/2406.09773
作者: Haowei Yang,Liyang Wang,Jingyu Zhang,Yu Cheng,Ao Xiang
关键词: edge detection, Light Detection, robot navigation, Detection, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the widespread application of Light Detection and Ranging (LiDAR) technology in fields such as autonomous driving, robot navigation, and terrain mapping, the importance of edge detection in LiDAR images has become increasingly prominent. Traditional edge detection methods often face challenges in accuracy and computational complexity when processing LiDAR images. To address these issues, this study proposes an edge detection method for LiDAR images based on artificial intelligence technology. This paper first reviews the current state of research on LiDAR technology and image edge detection, introducing common edge detection algorithms and their applications in LiDAR image processing. Subsequently, a deep learning-based edge detection model is designed and implemented, optimizing the model training process through preprocessing and enhancement of the LiDAR image dataset. Experimental results indicate that the proposed method outperforms traditional methods in terms of detection accuracy and computational efficiency, showing significant practical application value. Finally, improvement strategies are proposed for the current method’s shortcomings, and the improvements are validated through experiments.

[AI-63] Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion

链接: https://arxiv.org/abs/2406.09770
作者: Anke Tang,Li Shen,Yong Luo,Shiwei Liu,Han Hu,Bo Du
关键词: Solving multi-objective optimization, challenging task due, entire Pareto set, Pareto set, Solving multi-objective
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: code is available at this https URL

点击查看摘要

Abstract:Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost of training and evaluating models. Efficient Pareto front approximation of large models enables multi-objective optimization for various tasks such as multi-task learning and trade-off analysis. Existing algorithms for learning Pareto set, including (1) evolutionary, hypernetworks, and hypervolume-maximization methods, are computationally expensive and have restricted scalability to large models; (2) Scalarization algorithms, where a separate model is trained for each objective ray, which is inefficient for learning the entire Pareto set and fails to capture the objective trade-offs effectively. Inspired by the recent success of model merging, we propose a practical and scalable approach to Pareto set learning problem via mixture of experts (MoE) based model fusion. By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives and closely approximate the entire Pareto set of large neural networks. Once the routers are learned and a preference vector is set, the MoE module can be unloaded, thus no additional computational cost is introduced during inference. We conduct extensive experiments on vision and language tasks using large-scale models such as CLIP-ViT and GPT-2. The experimental results demonstrate that our method efficiently approximates the entire Pareto front of large models. Using only hundreds of trainable parameters of the MoE routers, our method even has lower memory usage compared to linear scalarization and algorithms that learn a single Pareto optimal solution, and are scalable to both the number of objectives and the size of the model.
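A stripped-down sketch of preference-conditioned weight fusion: blend the state dicts of single-task experts with a preference vector. The paper learns an MoE router to produce such combinations; the plain linear blending rule and toy experts below are illustrative assumptions rather than the proposed method.

```python
import torch

def fuse_by_preference(expert_state_dicts, preference):
    """Linearly blend expert weights by a normalized preference vector
    (illustrative sketch; the paper uses a learned MoE router instead)."""
    weights = torch.tensor(preference, dtype=torch.float32)
    weights = weights / weights.sum()
    fused = {}
    for name in expert_state_dicts[0]:
        fused[name] = sum(w * sd[name] for w, sd in zip(weights, expert_state_dicts))
    return fused

# Two toy "experts" sharing one linear layer.
expert_a = {"fc.weight": torch.eye(2), "fc.bias": torch.zeros(2)}
expert_b = {"fc.weight": 2 * torch.eye(2), "fc.bias": torch.ones(2)}
fused = fuse_by_preference([expert_a, expert_b], preference=[0.3, 0.7])
print(fused["fc.weight"])
```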

[AI-64] Bayesian Conditioned Diffusion Models for Inverse Problems

链接: https://arxiv.org/abs/2406.09768
作者: Alper Güngör,Bahri Batuhan Bilecen,Tolga Çukur
关键词: forward measurement operator, measurement operator, recently been shown, shown to excel, forward measurement
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Diffusion models have recently been shown to excel in many image reconstruction tasks that involve inverse problems based on a forward measurement operator. A common framework uses task-agnostic unconditional models that are later post-conditioned for reconstruction, an approach that typically suffers from suboptimal task performance. While task-specific conditional models have also been proposed, current methods heuristically inject measured data as a naive input channel that elicits sampling inaccuracies. Here, we address the optimal conditioning of diffusion models for solving challenging inverse problems that arise during image reconstruction. Specifically, we propose a novel Bayesian conditioning technique for diffusion models, BCDM, based on score-functions associated with the conditional distribution of desired images given measured data. We rigorously derive the theory to express and train the conditional score-function. Finally, we show state-of-the-art performance in image dealiasing, deblurring, super-resolution, and inpainting with the proposed technique.

[AI-65] Mix Q-learning for Lane Changing: A Collaborative Decision-Making Method in Multi-Agent Deep Reinforcement Learning

链接: https://arxiv.org/abs/2406.09755
作者: Xiaojun Bi,Mingjie He,Yiwen Sun
关键词: face practical challenges, practical challenges due, vehicle path planning, autonomous vehicle path, path planning
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Lane-changing decisions, which are crucial for autonomous vehicle path planning, face practical challenges due to rule-based constraints and limited data. Deep reinforcement learning has become a major research focus due to its advantages in data acquisition and interpretability. However, current models often overlook collaboration, which not only impacts overall traffic efficiency but also hinders the vehicle’s own normal driving in the long run. To address the aforementioned issue, this paper proposes a method named Mix Q-learning for Lane Changing (MQLC) that integrates a hybrid value Q network, taking into account both collective and individual benefits for the greater good. At the collective level, our method coordinates the individual Q and global Q networks by utilizing global information. This enables agents to effectively balance their individual interests with the collective benefit. At the individual level, we integrated a deep learning-based intent recognition module into our observation and enhanced the decision network. These changes provide agents with richer decision information and more accurate feature extraction for improved lane-changing decisions. This strategy enables the multi-agent system to learn and formulate optimal decision-making strategies effectively. Extensive experimental results show that our MQLC model outperforms other state-of-the-art multi-agent decision-making methods, achieving significantly safer and faster lane-changing decisions.

[AI-66] ControlVAR: Exploring Controllable Visual Autoregressive Modeling

链接: https://arxiv.org/abs/2406.09750
作者: Xiang Li,Kai Qiu,Hao Chen,Jason Kuen,Zhe Lin,Rita Singh,Bhiksha Raj
关键词: witnessed remarkable progress, witnessed remarkable, remarkable progress, advent of diffusion, Conditional
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 24 pages, 19 figures, 4 tables

点击查看摘要

Abstract:Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive (VAR) modeling for flexible and efficient conditional generation. In contrast to traditional conditional models that learn the conditional distribution, ControlVAR jointly models the distribution of image and pixel-level conditions during training and imposes conditional controls during testing. To enhance the joint modeling, we adopt the next-scale AR prediction paradigm and unify control and image representations. A teacher-forcing guidance strategy is proposed to further facilitate controllable generation with joint modeling. Extensive experiments demonstrate the superior efficacy and flexibility of ControlVAR across various conditional generation tasks against popular conditional DMs, e.g., ControlNet and T2I-Adaptor.

[AI-67] When Will Gradient Regularization Be Harmful?

链接: https://arxiv.org/abs/2406.09723
作者: Yang Zhao,Hao Zhang,Xiuyuan Hu
关键词: deep neural networks, shown promising results, modern over-parameterized deep, over-parameterized deep neural, gradient norm atop
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICML 2024 paper

点击查看摘要

Abstract:Gradient regularization (GR), which aims to penalize the gradient norm atop the loss function, has shown promising results in training modern over-parameterized deep neural networks. However, can we trust this powerful technique? This paper reveals that GR can cause performance degeneration in adaptive optimization scenarios, particularly with learning rate warmup. Our empirical and theoretical analyses suggest this is due to GR inducing instability and divergence in gradient statistics of adaptive optimizers at the initial training stage. Inspired by the warmup heuristic, we propose three GR warmup strategies, each relaxing the regularization effect to a certain extent during the warmup course to ensure the accurate and stable accumulation of gradients. With experiments on the Vision Transformer family, we confirm the three GR warmup strategies can effectively circumvent these issues, thereby largely improving the model performance. Meanwhile, we note that scalable models tend to rely more on the GR warmup, where the performance can be improved by up to 3% on Cifar10 compared to baseline GR. Code is available at this https URL.
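The sketch below shows one way to relax a gradient-norm penalty early in training: scale the penalty by min(1, step/warmup_steps) before adding it to the task loss. The linear schedule and coefficient are assumptions for illustration; the paper proposes and compares three specific GR warmup strategies.

```python
import torch

def gr_penalty(loss, params, step, warmup_steps, lam=0.01):
    """Gradient-norm penalty with a linear warmup on its strength (illustrative sketch)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = lam * min(1.0, step / warmup_steps)   # relax the regularization early on
    return scale * grad_norm

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
total = loss + gr_penalty(loss, list(model.parameters()), step=10, warmup_steps=100)
total.backward()
```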

[AI-68] Self-Knowledge Distillation for Learning Ambiguity

链接: https://arxiv.org/abs/2406.09719
作者: Hancheol Park,Soyeong Jeong,Sukmin Cho,Jong C. Park
关键词: Recent language models, natural language understanding, Recent language, shown remarkable performance, language understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Recent language models have shown remarkable performance on natural language understanding (NLU) tasks. However, they are often sub-optimal when faced with ambiguous samples that can be interpreted in multiple ways, over-confidently predicting a single label without consideration for its correctness. To address this issue, we propose a novel self-knowledge distillation method that enables models to learn label distributions more accurately by leveraging knowledge distilled from their lower layers. This approach also includes a learning phase that re-calibrates the unnecessarily strengthened confidence for training samples judged as extremely ambiguous based on the distilled distribution knowledge. We validate our method on diverse NLU benchmark datasets and the experimental results demonstrate its effectiveness in producing better label distributions. Particularly, through the process of re-calibrating the confidence for highly ambiguous samples, the issue of over-confidence when predictions for unseen samples do not match with their ground-truth labels has been significantly alleviated. This has been shown to contribute to generating better distributions than the existing state-of-the-art method. Moreover, our method is more efficient in training the models compared to the existing method, as it does not involve additional training processes to refine label distributions.

[AI-69] Speed-up of Data Analysis with Kernel Trick in Encrypted Domain

链接: https://arxiv.org/abs/2406.09716
作者: Joon Soo Yoo,Baek Kyung Song,Tae Min Ahn,Ji Won Heo,Ji Won Yoon
关键词: Homomorphic encryption, crucial in privacy-preserving, privacy-preserving data analysis, Homomorphic, data
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Submitted as a preprint

点击查看摘要

Abstract:Homomorphic encryption (HE) is pivotal for secure computation on encrypted data, crucial in privacy-preserving data analysis. However, efficiently processing high-dimensional data in HE, especially for machine learning and statistical (ML/STAT) algorithms, poses a challenge. In this paper, we present an effective acceleration method using the kernel method for HE schemes, enhancing time performance in ML/STAT algorithms within encrypted domains. This technique, independent of underlying HE mechanisms and complementing existing optimizations, notably reduces costly HE multiplications, offering near constant time complexity relative to data dimension. Aimed at accessibility, this method is tailored for data scientists and developers with limited cryptography background, facilitating advanced data analysis in secure environments.
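The computational benefit of the kernel trick shows up even in plaintext: in the dual form of kernel ridge regression, the heavy solve acts on an n x n Gram matrix, so its cost does not grow with the feature dimension. The sketch below is a plaintext analogy with an assumed linear kernel and toy data, not the paper's encrypted-domain protocol.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def kernel_ridge_fit_predict(X_train, y_train, X_test, reg=1e-2):
    """Kernel ridge regression in the dual: the solve touches only the n x n Gram
    matrix, independent of feature dimension (plaintext illustration of the kernel trick)."""
    K = linear_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + reg * np.eye(len(X_train)), y_train)
    return linear_kernel(X_test, X_train) @ alpha

rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 1000)), rng.normal(size=50)   # few samples, high dimension
print(kernel_ridge_fit_predict(X, y, X[:3]))
```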

[AI-70] Meta-Learning Loss Functions for Deep Neural Networks

链接: https://arxiv.org/abs/2406.09713
作者: Christian Raymond
关键词: efficiently solve complex, quickly and efficiently, small set, efficiently solve, solve complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: PhD thesis

点击查看摘要

Abstract:Humans can often quickly and efficiently solve complex new learning tasks given only a small set of examples. In contrast, modern artificially intelligent systems often require thousands or millions of observations in order to solve even the most basic tasks. Meta-learning aims to resolve this issue by leveraging past experiences from similar learning tasks to embed the appropriate inductive biases into the learning system. Historically methods for meta-learning components such as optimizers, parameter initializations, and more have led to significant performance increases. This thesis aims to explore the concept of meta-learning to improve performance, through the often-overlooked component of the loss function. The loss function is a vital component of a learning system, as it represents the primary learning objective, where success is determined and quantified by the system’s ability to optimize for that objective successfully.

[AI-71] Fine-Grained Urban Flow Inference with Multi-scale Representation Learning

链接: https://arxiv.org/abs/2406.09710
作者: Shilu Yuan,Dongfeng Li,Wei Liu,Xinxin Zhang,Meng Chen,Junjie Zhang,Yongshun Gong
关键词: crucial transportation service, transportation service aimed, improving traffic efficiency, Fine-grained urban flow, urban flow inference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-grained urban flow inference (FUFI) is a crucial transportation service aimed at improving traffic efficiency and safety. FUFI can infer fine-grained urban traffic flows based solely on observed coarse-grained data. However, most existing methods focus on the influence of single-scale static geographic information on FUFI, neglecting the interactions and dynamic information between different-scale regions within the city. Different-scale geographical features can capture redundant information from the same spatial areas. In order to effectively learn multi-scale information across time and space, we propose an effective fine-grained urban flow inference model called UrbanMSR, which uses self-supervised contrastive learning to obtain dynamic multi-scale representations of neighborhood-level and city-level geographic information, and fuses these multi-scale representations to improve fine-grained accuracy. We validate the performance through extensive experiments on three real-world datasets. The results, compared with state-of-the-art methods, demonstrate the superiority of the proposed model.

[AI-72] Explainable AI for Comparative Analysis of Intrusion Detection Models

链接: https://arxiv.org/abs/2406.09684
作者: Pap M. Corea,Yongxin Liu,Jian Wang,Shuteng Niu,Houbing Song
关键词: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, widely discussed topic, related technologies facilitate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Submitted to IEEE MeditCom 2024 - WS-05

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) has become a widely discussed topic; the related technologies facilitate a better understanding of conventional black-box models such as Random Forests and Neural Networks. However, domain-specific applications of XAI are still insufficient. To fill this gap, this research applies various machine learning models to the tasks of binary and multi-class classification for intrusion detection from network traffic on the same dataset using occlusion sensitivity. The models evaluated include Linear Regression, Logistic Regression, Linear Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest, Decision Trees, and Multi-Layer Perceptrons (MLP). We trained all models to an accuracy of 90% on the UNSW-NB15 Dataset. We found that most classifiers leverage fewer than three critical features to achieve such accuracies, indicating that effective feature engineering could actually be far more important for intrusion detection than applying complicated models. We also discover that Random Forest provides the best performance in terms of accuracy, time efficiency and robustness. Data and code are available at this https URL
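Occlusion sensitivity on tabular traffic features can be approximated by replacing one feature column at a time with its mean and measuring the drop in accuracy, as in the generic sketch below. The synthetic data, classifier settings, and mean-replacement choice are assumptions, not the study's exact protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def occlusion_sensitivity(model, X, y):
    """Per-feature importance: occlude one column with its mean and record the accuracy drop."""
    base = accuracy_score(y, model.predict(X))
    drops = []
    for j in range(X.shape[1]):
        X_occ = X.copy()
        X_occ[:, j] = X[:, j].mean()
        drops.append(base - accuracy_score(y, model.predict(X_occ)))
    return np.array(drops)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)          # only two informative features
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(occlusion_sensitivity(clf, X, y).round(3))
```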

[AI-73] Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency

链接: https://arxiv.org/abs/2406.09675
作者: Ningyi Liao,Haoyu Liu,Zulun Zhu,Siqiang Luo,Laks V.S. Lakshmanan
关键词: demonstrating promising capability, received increasing popularity, graph neural networks, capturing graph signals, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the recent advancements in graph neural networks (GNNs), spectral GNNs have received increasing popularity by virtue of their specialty in capturing graph signals in the frequency domain, demonstrating promising capability in specific tasks. However, few systematic studies have been conducted on assessing their spectral characteristics. This emerging family of models also varies in terms of designs and settings, leading to difficulties in comparing their performance and deciding on the suitable model for specific scenarios, especially for large-scale tasks. In this work, we extensively benchmark spectral GNNs with a focus on the frequency perspective. We analyze and categorize over 30 GNNs with 27 corresponding filters. Then, we implement these spectral models under a unified framework with dedicated graph computations and efficient training schemes. Thorough experiments are conducted on the spectral models with inclusive metrics on effectiveness and efficiency, offering practical guidelines on evaluating and selecting spectral GNNs with desirable performance. Our implementation enables application on larger graphs with comparable performance and less overhead, which is available at: this https URL.
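As background, many spectral GNNs parameterise a polynomial filter of the normalized graph Laplacian, sum_k c_k L^k x. The minimal sketch below applies such a filter to a toy graph signal; the coefficients and the tiny path graph are illustrative assumptions, not any specific model from the benchmark.

```python
import numpy as np

def poly_spectral_filter(adj, signal, coeffs):
    """Apply sum_k c_k L^k x with L the symmetrically normalized Laplacian (sketch)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    out = np.zeros_like(signal)
    power = signal.copy()
    for c in coeffs:                 # accumulate c_k * L^k x term by term
        out = out + c * power
        power = lap @ power
    return out

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # 3-node path graph
x = np.array([[1.0], [0.0], [0.0]])
print(poly_spectral_filter(adj, x, coeffs=[0.5, 0.3, 0.2]))
```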

[AI-74] Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

链接: https://arxiv.org/abs/2406.09671
作者: Nabor C. Mendonça
关键词: Large Language Models, Large Language, National Undergraduate Exam, Computer Science section, Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted for publication

点击查看摘要

Abstract:The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI’s most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil’s 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam’s open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model’s reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. The involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model’s accuracy and ensuring the fairness of high-stakes educational exams. The paper’s research materials are publicly available at this https URL.

[AI-75] Learning Language Structures through Grounding

链接: https://arxiv.org/abs/2406.09662
作者: Freda Shi
关键词: Language, highly structured, learn language structures, structures, propose
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Ph.D. Thesis

点击查看摘要

Abstract:Language is highly structured, with syntactic and semantic structures, to some extent, agreed upon by speakers of the same language. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and generalize to sentences that contain unseen words. Motivated by human language learning, in this dissertation, we consider a family of machine learning tasks that aim to learn language structures through grounding. We seek distant supervision from other data sources (i.e., grounds), including but not limited to other modalities (e.g., vision), execution results of programs, and other languages. We demonstrate the potential of this task formulation and advocate for its adoption through three schemes. In Part I, we consider learning syntactic parses through visual grounding. We propose the task of visually grounded grammar induction, present the first models to induce syntactic structures from visually grounded text and speech, and find that the visual grounding signals can help improve the parsing quality over language-only models. As a side contribution, we propose a novel evaluation metric that enables the evaluation of speech parsing without text or automatic speech recognition systems involved. In Part II, we propose two execution-aware methods to map sentences into corresponding semantic structures (i.e., programs), significantly improving compositional generalization and few-shot program synthesis. In Part III, we propose methods that learn language structures from annotations in other languages. Specifically, we propose a method that sets a new state of the art on cross-lingual word alignment. We then leverage the learned word alignments to improve the performance of zero-shot cross-lingual dependency parsing, by proposing a novel substructure-based projection method that preserves structural knowledge learned from the source language.

[AI-76] Temporal Planning via Interval Logic Satisfiability for Autonomous Systems

链接: https://arxiv.org/abs/2406.09661
作者: Miquel Ramirez,Anubhav Singh,Peter Stuckey,Chris Manzie
关键词: suitably designed abstractions, Allen Interval Logic, automated planning methods, attain computational scalability, rely on suitably
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: This publication is an extended version of a manuscript submitted to ICAPS-24 (and rejected). Please contact the first author for queries, comments or discussion of the paper

点击查看摘要

Abstract:Many automated planning methods and formulations rely on suitably designed abstractions or simplifications of the constrained dynamics associated with agents to attain computational scalability. We consider formulations of temporal planning where intervals are associated with both action and fluent atoms, and relations between these are given as sentences in Allen’s Interval Logic. We propose a notion of planning graphs that can account for complex concurrency relations between actions and fluents as a Constraint Programming (CP) model. We test an implementation of our algorithm on a state-of-the-art framework for CP and compare it with PDDL 2.1 planners that capture plans requiring complex concurrent interactions between agents. We demonstrate our algorithm outperforms existing PDDL 2.1 planners in the case studies. Still, scalability remains challenging when plans must comply with intricate concurrent interactions and the sequencing of actions.
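For orientation, the sketch below checks a handful of Allen's interval relations between two closed intervals. It covers only a few of the 13 basic relations and is a plain illustration of the vocabulary the planner's constraints are written in, not the proposed planning-graph or CP encoding.

```python
def allen_relation(a, b):
    """Return one basic Allen relation between intervals a=(a0,a1) and b=(b0,b1)
    (only a subset of the 13 relations, for illustration)."""
    a0, a1 = a
    b0, b1 = b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if a0 == b0 and a1 == b1:
        return "equals"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 < a1 < b1:
        return "overlaps"
    return "other"

print(allen_relation((0, 2), (2, 5)))   # meets
print(allen_relation((1, 3), (0, 5)))   # during
```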

[AI-77] RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement

链接: https://arxiv.org/abs/2406.09656
作者: Jingcheng Li,Ye Qiao,Haocheng Xu,Sitao Huang
关键词: low quality, scenarios often suffer, suffer from low, Retinex theory, Images captured
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Images captured under low-light scenarios often suffer from low quality. Previous CNN-based deep learning methods often involve using Retinex theory. Nevertheless, most of them cannot perform well in more complicated datasets like LOL-v2 while consuming too much computational resources. Besides, some of these methods require sophisticated training at different stages, making the procedure even more time-consuming and tedious. In this paper, we propose a more accurate, concise, and one-stage Retinex theory based framework, RSEND. RSEND first divides the low-light image into the illumination map and reflectance map, then captures the important details in the illumination map and performs light enhancement. After this step, it refines the enhanced gray-scale image and does element-wise matrix multiplication with the reflectance map. By denoising the output it has from the previous step, it obtains the final result. In all the steps, RSEND utilizes Squeeze and Excitation network to better capture the details. Comprehensive quantitative and qualitative experiments show that our Efficient Retinex model significantly outperforms other CNN-based models, achieving a PSNR improvement ranging from 0.44 dB to 4.2 dB in different datasets and even outperforms transformer-based models in the LOL-v2-real dataset.
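A classical single-scale Retinex split, estimating illumination with a Gaussian blur and reflectance as the residual ratio, gives a feel for the decomposition RSEND learns with a network. The blur width, toy image, and naive brightening step below are assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(gray_image, sigma=15.0, eps=1e-6):
    """Classic single-scale Retinex split into illumination and reflectance (baseline sketch)."""
    illumination = gaussian_filter(gray_image, sigma=sigma)   # smooth, low-frequency lighting
    reflectance = gray_image / (illumination + eps)           # detail left after removing lighting
    return illumination, reflectance

low_light = np.clip(np.random.rand(64, 64) * 0.2, 0, 1)      # toy dark image in [0, 1]
illum, refl = retinex_decompose(low_light)
enhanced = np.clip(refl * np.clip(illum * 4.0, 0, 1), 0, 1)  # naive illumination boost
print(enhanced.mean() > low_light.mean())
```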

[AI-78] A Survey of Video Datasets for Grounded Event Understanding

链接: https://arxiv.org/abs/2406.09646
作者: Kate Sanders,Benjamin Van Durme
关键词: well-rounded common-sense reasoning, common-sense reasoning akin, specialized downstream tasks, retrieval or question-answering, contemporary multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While existing video benchmarks largely consider specialized downstream tasks like retrieval or question-answering (QA), contemporary multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding. A critical component of human temporal-visual perception is our ability to identify and cognitively model “things happening”, or events. Historically, video benchmark tasks have implicitly tested for this ability (e.g., video captioning, in which models describe visual events with natural language), but they do not consider video event understanding as a task in itself. Recent work has begun to explore video analogues to textual event extraction but consists of competing task definitions and datasets limited to highly specific event types. Therefore, while there is a rich domain of event-centric video research spanning the past 10+ years, it is unclear how video event understanding should be framed and what resources we have to study it. In this paper, we survey 105 video datasets that require event understanding capability, consider how they contribute to the study of robust event understanding in video, and assess proposed video event extraction tasks in the context of this body of research. We propose suggestions informed by this survey for dataset curation and task framing, with an emphasis on the uniquely temporal nature of video events and ambiguity in visual content.

[AI-79] RobustSAM: Segment Anything Robustly on Degraded Images

链接: https://arxiv.org/abs/2406.09627
作者: Wei-Ting Chen,Yu-Jiet Vong,Sy-Yen Kuo,Sizhuo Ma,Jian Wang
关键词: flexible prompting system, prompting system, transformative approach, capabilities and flexible, flexible prompting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: Accepted by CVPR2024 (Highlight); Project Page: this https URL

点击查看摘要

Abstract:Segment Anything Model (SAM) has emerged as a transformative approach in image segmentation, acclaimed for its robust zero-shot segmentation capabilities and flexible prompting system. Nonetheless, its performance is challenged by images with degraded quality. Addressing this limitation, we propose the Robust Segment Anything Model (RobustSAM), which enhances SAM’s performance on low-quality images while preserving its promptability and zero-shot generalization. Our method leverages the pre-trained SAM model with only marginal parameter increments and computational requirements. The additional parameters of RobustSAM can be optimized within 30 hours on eight GPUs, demonstrating its feasibility and practicality for typical research laboratories. We also introduce the Robust-Seg dataset, a collection of 688K image-mask pairs with different degradations designed to train and evaluate our model optimally. Extensive experiments across various segmentation tasks and datasets confirm RobustSAM’s superior performance, especially under zero-shot conditions, underscoring its potential for extensive real-world application. Additionally, our method has been shown to effectively improve the performance of SAM-based downstream tasks such as single image dehazing and deblurring.

[AI-80] DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks

链接: https://arxiv.org/abs/2406.09624
作者: Mohamed Elrefaie,Florin Morar,Angela Dai,Faez Ahmed
关键词: comprehensive multimodal dataset, comprehensive multimodal, present DrivAerNet, dataset, diverse car designs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We present DrivAerNet++, the largest and most comprehensive multimodal dataset for aerodynamic car design. DrivAerNet++ comprises 8,000 diverse car designs modeled with high-fidelity computational fluid dynamics (CFD) simulations. The dataset includes diverse car configurations such as fastback, notchback, and estateback, with different underbody and wheel designs to represent both internal combustion engines and electric vehicles. Each entry in the dataset features detailed 3D meshes, parametric models, aerodynamic coefficients, and extensive flow and surface field data, along with segmented parts for car classification and point cloud data. This dataset supports a wide array of machine learning applications including data-driven design optimization, generative modeling, surrogate model training, CFD simulation acceleration, and geometric classification. With more than 39 TB of publicly available engineering data, DrivAerNet++ fills a significant gap in available resources, providing high-quality, diverse data to enhance model training, promote generalization, and accelerate automotive design processes. Along with rigorous dataset validation, we also provide ML benchmarking results on the task of aerodynamic drag prediction, showcasing the breadth of applications supported by our dataset. This dataset is set to significantly impact automotive design and broader engineering disciplines by fostering innovation and improving the fidelity of aerodynamic evaluations.

[AI-81] DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer

链接: https://arxiv.org/abs/2406.09622
作者: Wei-Ting Chen,Gurunandan Krishnan,Qiang Gao,Sy-Yen Kuo,Sizhuo Ma,Jian Wang
关键词: Image Quality Assessment, selecting high-quality face, Quality Assessment, improving image restoration, image restoration algorithms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: Accepted by CVPR 2024, Project Page: this https URL

点击查看摘要

Abstract:Generic Face Image Quality Assessment (GFIQA) evaluates the perceptual quality of facial images, which is crucial in improving image restoration algorithms and selecting high-quality face images for downstream tasks. We present a novel transformer-based method for GFIQA, which is aided by two unique mechanisms. First, a Dual-Set Degradation Representation Learning (DSL) mechanism uses facial images with both synthetic and real degradations to decouple degradation from content, ensuring generalizability to real-world scenarios. This self-supervised method learns degradation features on a global scale, providing a robust alternative to conventional methods that use local patch information in degradation learning. Second, our transformer leverages facial landmarks to emphasize visually salient parts of a face image in evaluating its perceptual quality. We also introduce a balanced and diverse Comprehensive Generic Face IQA (CGFIQA-40k) dataset of 40K images carefully designed to overcome the biases, in particular the imbalances in skin tone and gender representation, in existing datasets. Extensive analysis and evaluation demonstrate the robustness of our method, marking a significant improvement over prior methods.

[AI-82] Multi-Modal Retrieval For Large Language Model Based Speech Recognition

链接: https://arxiv.org/abs/2406.09618
作者: Jari Kolehmainen,Aditya Gourav,Prashanth Gurunath Shivakumar,Yile Gu,Ankur Gandhe,Ariya Rastrow,Grant Strimel,Ivan Bulyko
关键词: widely adopted approach, improving language models, language models leveraging, leveraging external information, models leveraging external
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Retrieval is a widely adopted approach for improving language models leveraging external information. As the field moves towards multi-modal large language models, it is important to extend the pure text based methods to incorporate other modalities in retrieval as well for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text based retrieval, and yields up to 50 % improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
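
The kNN-LM side of the proposal can be pictured with a small interpolation sketch: retrieve the nearest datastore entries for the current decoding state and mix their next-token distribution with the base LM distribution. The datastore contents, the interpolation weight, and the temperature below are toy assumptions, not the paper's multi-modal configuration.

```python
# Minimal sketch of kNN-LM style interpolation (illustrative only; the paper's
# multi-modal variant builds the datastore from audio-derived representations).
import numpy as np

def knn_lm_probs(query, keys, values, p_lm, vocab_size, k=4, lam=0.3, temperature=1.0):
    """Interpolate a base LM distribution with a kNN distribution from a datastore.

    query:  (d,) hidden state of the current decoding step
    keys:   (n, d) datastore keys (hidden states seen during indexing)
    values: (n,) next-token ids associated with each key
    p_lm:   (vocab_size,) base language-model distribution
    """
    dists = np.linalg.norm(keys - query, axis=1)    # L2 distance to every key
    nn = np.argsort(dists)[:k]                      # indices of the k nearest keys
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, values[nn]):         # scatter neighbor mass onto their tokens
        p_knn[tok] += w
    return lam * p_knn + (1.0 - lam) * p_lm         # final interpolated distribution

# Toy usage with random data
rng = np.random.default_rng(0)
vocab, d, n = 10, 8, 100
keys = rng.normal(size=(n, d))
values = rng.integers(0, vocab, size=n)
p_lm = np.full(vocab, 1.0 / vocab)
probs = knn_lm_probs(rng.normal(size=d), keys, values, p_lm, vocab)
assert abs(probs.sum() - 1.0) < 1e-9
```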

[AI-83] Automated Molecular Concept Generation and Labeling with Large Language Models

链接: https://arxiv.org/abs/2406.09612
作者: Shichang Zhang,Botao Xia,Zimin Zhang,Qianli Wu,Fang Sun,Ziniu Hu,Yizhou Sun
关键词: Artificial intelligence, significantly transforming scientific, Graph Neural Networks, transforming scientific research, significantly transforming
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is significantly transforming scientific research. Explainable AI methods, such as concept-based models (CMs), are promising for driving new scientific discoveries because they make predictions based on meaningful concepts and offer insights into the prediction process. In molecular science, however, explainable CMs are not as common compared to black-box models like Graph Neural Networks (GNNs), primarily due to their requirement for predefined concepts and manual label for each instance, which demand domain knowledge and can be labor-intensive. This paper introduces a novel framework for Automated Molecular Concept (AutoMolCo) generation and labeling. AutoMolCo leverages the knowledge in Large Language Models (LLMs) to automatically generate predictive molecular concepts and label them for each molecule. Such procedures are repeated through iterative interactions with LLMs to refine concepts, enabling simple linear models on the refined concepts to outperform GNNs and LLM in-context learning on several benchmarks. The whole AutoMolCo framework is automated without any human knowledge inputs in either concept generation, labeling, or refinement, thereby surpassing the limitations of extant CMs while maintaining their explainability and allowing easy intervention. Through systematic experiments on MoleculeNet and High-Throughput Experimentation (HTE) datasets, we demonstrate that the AutoMolCo-induced explainable CMs are beneficial and promising for molecular science research.

[AI-84] Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis

链接: https://arxiv.org/abs/2406.09606
作者: Zongyue Qin,Yunsheng Bai,Atefeh Sograbizadeh,Zijian Ding,Ziniu Hu,Yizhou Sun,Jason Cong
关键词: domain-specific accelerators, recent years, autonomous driving, gained popularity, popularity for applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: 14 pages, 8 figures. arXiv admin note: text overlap with arXiv:2305.10838

点击查看摘要

Abstract:In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design with low-level hardware description languages that eventually synthesize DSAs on circuits. However, creating a high-quality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as pragmas. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, requiring a deeper understanding of the program that consists of original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). But existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, a pre-training method is proposed based on a suite of compiler's data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to 22%, and identifies designs with an average of 1.10× and 1.26× (up to 8.17× and 13.31×) performance improvement in design space exploration (DSE) task compared to HARP and AutoDSE, respectively.

[AI-85] Analyzing Gender Polarity in Short Social Media Texts with BERT: The Role of Emojis and Emoticons

链接: https://arxiv.org/abs/2406.09573
作者: Saba Yousefian Jazi,Amir Mirzaeinia,Sina Yousefian Jazi
关键词: based on BERT, BERT to detect, effort we fine, fine tuned, polarity of twitter
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this effort we fine-tuned different models based on BERT to detect the gender polarity of Twitter accounts. We focused in particular on analyzing the effect of using emojis and emoticons on our model's performance in the classification task. We demonstrate that the use of these non-word inputs, alongside the mention of other accounts in a short text format such as a tweet, has an impact on detecting the account holder's gender.

[AI-86] Improving Consistency Models with Generator-Induced Coupling

链接: https://arxiv.org/abs/2406.09570
作者: Thibaut Issenhuth,Ludovic Dos Santos,Jean-Yves Franceschi,Alain Rakotomamonjy
关键词: promising generative models, single forward pass, neural network, promising generative, distill the multi-step
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Consistency models are promising generative models as they distill the multi-step sampling of score-based diffusion in a single forward pass of a neural network. Without access to sampling trajectories of a pre-trained diffusion model, consistency training relies on proxy trajectories built on an independent coupling between the noise and data distributions. Refining this coupling is a key area of improvement to make it more adapted to the task and reduce the resulting randomness in the training process. In this work, we introduce a novel coupling associating the input noisy data with their generated output from the consistency model itself, as a proxy to the inaccessible diffusion flow output. Our affordable approach exploits the inherent capacity of consistency models to compute the transport map in a single step. We provide intuition and empirical evidence of the relevance of our generator-induced coupling (GC), which brings consistency training closer to score distillation. Consequently, our method not only accelerates consistency training convergence by significant amounts but also enhances the resulting performance. The code is available at: this https URL.

[AI-87] Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

链接: https://arxiv.org/abs/2406.09569
作者: Frank Seide,Morrie Doulaty,Yangyang Shi,Yashesh Gaur,Junteng Jia,Chunyang Wu
关键词: make multimodal LLM, multimodal LLM architectures, LLM architectures capable, ASR architecture, ASR architecture designed
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:We introduce Speech ReaLLM, a new ASR architecture that marries “decoder-only” ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first “decoder-only” ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM (“real-time LLM”) approach, also introduced here for the first time. The idea is inspired by RNN-T: Instead of generating a response only at the end of a user prompt, generate after every input token received in real time (it is often empty). On Librispeech “test”, an 80M Speech ReaLLM achieves WERs of 3.0% and 7.4% in real time (without an external LM or auxiliary loss). This is only slightly above a 3x larger Attention-Encoder-Decoder baseline. We also show that this way, an LLM architecture can learn to represent and reproduce the flow of time; and that a pre-trained 7B LLM can be fine-tuned to do reasonably well on this task.
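
A minimal toy loop can convey the "generate after every input token" idea: after each incoming frame, the model is asked to emit tokens until it produces a blank. The model interface and blank token below are hypothetical placeholders, not the paper's RNN-T-style implementation.

```python
# Toy sketch of the "generate after every input token" decoding loop that the
# ReaLLM idea describes (hypothetical interfaces; not the paper's implementation).
BLANK = "<blank>"  # assumed special token meaning "nothing to emit yet"

def dummy_model(history):
    """Stand-in for an LLM step: emit a word after every third frame, otherwise blank."""
    n_frames = sum(1 for t in history if t.startswith("frame"))
    if n_frames % 3 == 0 and history[-1].startswith("frame"):
        return f"word{n_frames // 3}"
    return BLANK

def stream_decode(input_stream, model):
    history, outputs = [], []
    for frame in input_stream:              # frames arrive one at a time, in real time
        history.append(frame)
        tok = model(history)
        while tok != BLANK:                 # emit until the model has nothing more to say
            history.append(tok)
            outputs.append(tok)
            tok = model(history)
    return outputs

print(stream_decode([f"frame{i}" for i in range(1, 10)], dummy_model))
```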

[AI-88] Towards Domain Adaptive Neural Contextual Bandits

链接: https://arxiv.org/abs/2406.09564
作者: Ziyan Wang,Hao Wang
关键词: decision making problems, Contextual bandit, solving real-world decision, real-world decision making, Contextual bandit algorithms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Contextual bandit algorithms are essential for solving real-world decision making problems. In practice, collecting a contextual bandit’s feedback from different domains may involve different costs. For example, measuring drug reaction from mice (as a source domain) and humans (as a target domain). Unfortunately, adapting a contextual bandit algorithm from a source domain to a target domain with distribution shift still remains a major challenge and largely unexplored. In this paper, we introduce the first general domain adaptation method for contextual bandits. Our approach learns a bandit model for the target domain by collecting feedback from the source domain. Our theoretical analysis shows that our algorithm maintains a sub-linear regret bound even adapting across domains. Empirical results show that our approach outperforms the state-of-the-art contextual bandit algorithms on real-world datasets.

[AI-89] Label Noise Robustness for Domain-Agnostic Fair Corrections via Nearest Neighbors Label Spreading

链接: https://arxiv.org/abs/2406.09561
作者: Nathan Stromberg,Rohan Ayyagari,Sanmi Koyejo,Richard Nock,Lalitha Sankar
关键词: correcting existing base, existing base models, Last-layer retraining, efficient framework, base models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Last-layer retraining methods have emerged as an efficient framework for correcting existing base models. Within this framework, several methods have been proposed to deal with correcting models for subgroup fairness with and without group membership information. Importantly, prior work has demonstrated that many methods are susceptible to noisy labels. To this end, we propose a drop-in correction for label noise in last-layer retraining, and demonstrate that it achieves state-of-the-art worst-group accuracy for a broad range of symmetric label noise and across a wide variety of datasets exhibiting spurious correlations. Our proposed approach uses label spreading on a latent nearest neighbors graph and has minimal computational overhead compared to existing methods.
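
The core recipe (label spreading over a nearest-neighbors graph, followed by last-layer retraining) can be sketched with off-the-shelf scikit-learn components; the synthetic features, noise rate, and hyperparameters here are illustrative assumptions rather than the paper's setup.

```python
# Hedged sketch: denoise labels with label spreading on a kNN graph over
# last-layer features, then retrain the final (linear) layer on the cleaned labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=16, n_informative=8, random_state=0)

# Simulate symmetric label noise on 30% of the labels
noisy = y.copy()
flip = rng.random(len(y)) < 0.3
noisy[flip] = 1 - noisy[flip]

# Spread labels over a k-nearest-neighbor graph; alpha controls how much a point
# trusts its neighbors versus its own (possibly noisy) label.
spreader = LabelSpreading(kernel="knn", n_neighbors=10, alpha=0.8)
spreader.fit(X, noisy)
cleaned = spreader.transduction_

# "Last-layer retraining" stand-in: a linear classifier on the cleaned labels
clf = LogisticRegression(max_iter=1000).fit(X, cleaned)
print("accuracy vs. true labels:", clf.score(X, y))
```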

[AI-90] Decoding the Diversity: A Review of the Indic AI Research Landscape

链接: https://arxiv.org/abs/2406.09559
作者: Sankalp KJ,Vinija Jain,Sreyoshi Bhaduri,Tamoghna Roy,Aman Chadha
关键词: large language model, Indic languages, Indic, Sri Lanka, comprehensive overview
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 27 pages, 1 figure

点击查看摘要

Abstract:This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. With the tremendous market potential and growing demand for natural language processing (NLP) based applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper deep dives into the recent advancements in Indic generative modeling, contributing with a taxonomy of research directions, tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning existing LLMs, development of corpora, benchmarking and evaluation, as well as publications around specific techniques, tools, and applications. We found that researchers across the publications emphasize the challenges associated with limited data availability, lack of standardization, and the peculiar linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and contributes to the development of more accurate and efficient LLM applications for these languages.

[AI-91] My Body My Choice: Human-Centric Full-Body Anonymization

链接: https://arxiv.org/abs/2406.09553
作者: Umur Aybars Ciftci,Ali Kemal Tanriverdi,Ilke Demir
关键词: increasing privacy concerns, online presence, era of increasing, increasing privacy, piece of content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: AI for Content Creation Workshop @ CVPR 2024

点击查看摘要

Abstract:In an era of increasing privacy concerns for our online presence, we propose that the decision to appear in a piece of content should only belong to the owner of the body. Although some automatic approaches for full-body anonymization have been proposed, human-guided anonymization can adapt to various contexts, such as cultural norms, personal relations, esthetic concerns, and security issues. "My Body My Choice" (MBMC) enables physical and adversarial anonymization by removal and swapping approaches aimed for four tasks, designed by single or multi, ControlNet or GAN modules, combining several diffusion models. We evaluate anonymization on seven datasets; compare with SOTA inpainting and anonymization methods; evaluate by image, adversarial, and generative metrics; and conduct reidentification experiments.

[AI-92] Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale

链接: https://arxiv.org/abs/2406.09548
作者: A. Feder Cooper
关键词: develop rigorous knowledge, develop rigorous, rigorous knowledge, reliable measurement, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: Ph.D. Dissertation

点击查看摘要

Abstract:To develop rigorous knowledge about ML models – and the systems in which they are embedded – we need reliable measurements. But reliable measurement is fundamentally challenging, and touches on issues of reproducibility, scalability, uncertainty quantification, epistemology, and more. This dissertation addresses criteria needed to take reliability seriously: both criteria for designing meaningful metrics, and for methodologies that ensure that we can dependably and efficiently measure these metrics at scale and in practice. In doing so, this dissertation articulates a research vision for a new field of scholarship at the intersection of machine learning, law, and policy. Within this frame, we cover topics that fit under three different themes: (1) quantifying and mitigating sources of arbitrariness in ML, (2) taming randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability, and (3) providing methods for evaluating generative-AI systems, with specific focuses on quantifying memorization in language models and training latent diffusion models on open-licensed data. By making contributions in these three themes, this dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately and inescapably bound up with research in law and policy. These different disciplines pose similar research questions about reliable measurement in machine learning. They are, in fact, two complementary sides of the same research vision, which, broadly construed, aims to construct machine-learning systems that cohere with broader societal values.

[AI-93] Differentiable Reasoning about Knowledge Graphs with Region-based Graph Neural Networks

链接: https://arxiv.org/abs/2406.09529
作者: Aleksandar Pavlovic,Emanuel Sallinger,Steven Schockaert
关键词: capture semantic regularities, infer plausible knowledge, infer plausible, explicitly capture semantic, semantic regularities
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Methods for knowledge graph (KG) completion need to capture semantic regularities and use these regularities to infer plausible knowledge that is not explicitly stated. Most embedding-based methods are opaque in the kinds of regularities they can capture, although region-based KG embedding models have emerged as a more transparent alternative. By modeling relations as geometric regions in high-dimensional vector spaces, such models can explicitly capture semantic regularities in terms of the spatial arrangement of these regions. Unfortunately, existing region-based approaches are severely limited in the kinds of rules they can capture. We argue that this limitation arises because the considered regions are defined as the Cartesian product of two-dimensional regions. As an alternative, in this paper, we propose RESHUFFLE, a simple model based on ordering constraints that can faithfully capture a much larger class of rule bases than existing approaches. Moreover, the embeddings in our framework can be learned by a monotonic Graph Neural Network (GNN), which effectively acts as a differentiable rule base. This approach has the important advantage that embeddings can be easily updated as new knowledge is added to the KG. At the same time, since the resulting representations can be used similarly to standard KG embeddings, our approach is significantly more efficient than existing approaches to differentiable reasoning.

[AI-94] A Systematic Review of Generative AI for Teaching and Learning Practice

链接: https://arxiv.org/abs/2406.09520
作者: Bayode Ogunleye,Kudirat Ibilola Zakariyyah,Oluwaseun Ajao,Olakunle Olayinka,Hemlata Sharma
关键词: generative artificial intelligence, hotly debated topic, artificial intelligence, debated topic, generative artificial
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 20 pages, 10 figures, article published in Education Sciences

点击查看摘要

Abstract:The use of generative artificial intelligence (GenAI) in academia is a subjective and hotly debated topic. Currently, there are no agreed guidelines towards the usage of GenAI systems in higher education (HE) and, thus, it is still unclear how to make effective use of the technology for teaching and learning practice. This paper provides an overview of the current state of research on GenAI for teaching and learning in HE. To this end, this study conducted a systematic review of relevant studies indexed by Scopus, using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines. The search criteria revealed a total of 625 research papers, of which 355 met the final inclusion criteria. The findings from the review showed the current state and the future trends in documents, citations, document sources/authors, keywords, and co-authorship. The research gaps identified suggest that while some authors have looked at understanding the detection of AI-generated text, it may be beneficial to understand how GenAI can be incorporated into supporting the educational curriculum for assessments, teaching, and learning delivery. Furthermore, there is a need for additional interdisciplinary, multidimensional studies in HE through collaboration. This will strengthen the awareness and understanding of students, tutors, and other stakeholders, which will be instrumental in formulating guidelines, frameworks, and policies for GenAI usage.

[AI-95] Talking Heads: Understanding Inter-layer Communication in Transformer Language Models

链接: https://arxiv.org/abs/2406.09519
作者: Jack Merullo,Carsten Eickhoff,Ellie Pavlick
关键词: information is represented, represented and routed, transformer language models, pass features, model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. By analyzing a particular mechanism LMs use to accomplish this, we find that it is also used to recall items from a list, and show that this mechanism can explain an otherwise arbitrary-seeming sensitivity of the model to the order of items in the prompt. Specifically, we find that models write into low-rank subspaces of the residual stream to represent features which are then read out by specific later layers, forming low-rank communication channels between layers. By decomposing attention head weight matrices with the Singular Value Decomposition (SVD), we find that previously described interactions between heads separated by one or more layers can be predicted via analysis of their weight matrices. We show that it is possible to manipulate the internal model representations as well as edit model weights based on the mechanism we discover in order to significantly improve performance on our synthetic Laundry List task, which requires recall from a list, often improving task accuracy by over 20%. Our analysis reveals a surprisingly intricate interpretable structure learned from language model pretraining, and helps us understand why sophisticated LMs sometimes fail in simple domains, facilitating future analysis of more complex behaviors.
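
The low-rank "communication channel" picture can be illustrated with a toy SVD check: if an early head writes into a low-dimensional subspace of the residual stream, a later head that reads from it should have most of its weight mass inside that subspace. The matrices below are random stand-ins, not actual model weights.

```python
# Toy sketch of inspecting low-rank "communication channels" between heads with SVD.
# The matrices here are random stand-ins for actual attention head weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 3

# Simulate an early head that writes a rank-3 signal into the residual stream,
# and a later head that reads from (roughly) the same subspace.
U = rng.normal(size=(d_model, rank))
W_write = U @ rng.normal(size=(rank, d_head))          # d_model x d_head ("writer" side)
W_read = U @ rng.normal(size=(rank, d_head)) + 0.05 * rng.normal(size=(d_model, d_head))

# SVD of the writer: its column space is the subspace written into the residual stream
Uw, Sw, _ = np.linalg.svd(W_write, full_matrices=False)
print("writer singular values:", np.round(Sw, 2))       # only ~3 are non-negligible

# How much of the reader's weights lie inside the writer's top-k subspace?
k = 3
proj = Uw[:, :k] @ (Uw[:, :k].T @ W_read)
overlap = np.linalg.norm(proj) / np.linalg.norm(W_read)
print(f"fraction of reader norm inside writer's rank-{k} subspace: {overlap:.2f}")
```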

[AI-96] CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making

链接: https://arxiv.org/abs/2406.09509
作者: Zibin Dong,Yifu Yuan,Jianye Hao,Fei Ni,Yi Ma,Pengyi Li,Yan Zheng
关键词: powerful generative capability, build decision-making agents, Leveraging the powerful, achieved extensive success, diffusion models
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: The first two authors contribute equally to this work. Code and documentation: this https URL

点击查看摘要

Abstract:Leveraging the powerful generative capability of diffusion models (DMs) to build decision-making agents has achieved extensive success. However, there is still a demand for an easy-to-use and modularized open-source library that offers customized and efficient development for DM-based decision-making algorithms. In this work, we introduce CleanDiffuser, the first DM library specifically designed for decision-making algorithms. By revisiting the roles of DMs in the decision-making domain, we identify a set of essential sub-modules that constitute the core of CleanDiffuser, allowing for the implementation of various DM algorithms with simple and flexible building blocks. To demonstrate the reliability and flexibility of CleanDiffuser, we conduct comprehensive evaluations of various DM algorithms implemented with CleanDiffuser across an extensive range of tasks. The analytical experiments provide a wealth of valuable design choices and insights, reveal opportunities and challenges, and lay a solid groundwork for future research. CleanDiffuser will provide long-term support to the decision-making community, enhancing reproducibility and fostering the development of more robust solutions. The code and documentation of CleanDiffuser are open-sourced at this https URL.

[AI-97] You are what you eat? Feeding foundation models a regionally diverse food dataset of World Wide Dishes

链接: https://arxiv.org/abs/2406.09496
作者: Jabez Magomere,Shu Ishida,Tejumade Afonja,Aya Salama,Daniel Kochin,Foutse Yuehgoh,Imane Hamzaoui,Raesetje Sefala,Aisha Alaagib,Elizaveta Semenova,Lauren Crais,Siobhan Mackenzie Hall
关键词: World Wide Dishes, interactions with chatbots, daily lives, text-image searches, World Wide
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Foundation models are increasingly ubiquitous in our daily lives, used in everyday tasks such as text-image searches, interactions with chatbots, and content generation. As use increases, so does concern over the disparities in performance and fairness of these models for different people in different parts of the world. To assess these growing regional disparities, we present World Wide Dishes, a mixed text and image dataset consisting of 765 dishes, with dish names collected in 131 local languages. World Wide Dishes has been collected purely through human contribution and decentralised means, by creating a website widely distributed through social networks. Using the dataset, we demonstrate a novel means of operationalising capability and representational biases in foundation models such as language models and text-to-image generative models. We enrich these studies with a pilot community review to understand, from a first-person perspective, how these models generate images for people in five African countries and the United States. We find that these models generally do not produce quality text and image outputs of dishes specific to different regions. This is true even for the US, which is typically considered to be more well-resourced in training data - though the generation of US dishes does outperform that of the investigated African countries. The models demonstrate a propensity to produce outputs that are inaccurate as well as culturally misrepresentative, flattening, and insensitive. These failures in capability and representational bias have the potential to further reinforce stereotypes and disproportionately contribute to erasure based on region. The dataset and code are available at this https URL.

[AI-98] Fair Data Generation via Score-based Diffusion Model

链接: https://arxiv.org/abs/2406.09495
作者: Yujie Lin,Dong Li,Chen Zhao,Minglai Shao
关键词: garnered increasing attention, numerous fairness algorithms, downstream tasks, increasing attention, decision-making has garnered
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The fairness of AI decision-making has garnered increasing attention, leading to the proposal of numerous fairness algorithms. In this paper, we aim not to address this issue by directly introducing fair learning algorithms, but rather by generating entirely new, fair synthetic data from biased datasets for use in any downstream tasks. Additionally, the distribution of test data may differ from that of the training set, potentially impacting the performance of the generated synthetic data in downstream tasks. To address these two challenges, we propose a diffusion model-based framework, FADM: Fairness-Aware Diffusion with Meta-training. FADM introduces two types of gradient induction during the sampling phase of the diffusion model: one to ensure that the generated samples belong to the desired target categories, and another to make the sensitive attributes of the generated samples difficult to classify into any specific sensitive attribute category. To overcome data distribution shifts in the test environment, we train the diffusion model and the two classifiers used for induction within a meta-learning framework. Compared to other baselines, FADM allows for flexible control over the categories of the generated samples and exhibits superior generalization capability. Experiments on real datasets demonstrate that FADM achieves better accuracy and optimal fairness in downstream tasks.

[AI-99] SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets

链接: https://arxiv.org/abs/2406.09486
作者: Shenghua Wan,Ziyuan Chen,Le Gan,Shuai Feng,De-Chuan Zhan
关键词: offline reinforcement Learning, involving high-dimensional inputs, Model-based offline reinforcement, reinforcement Learning, Separated Model-based Offline
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 23 pages, 10 figures

点击查看摘要

Abstract:Model-based offline reinforcement Learning (RL) is a promising approach that leverages existing data effectively in many real-world applications, especially those involving high-dimensional inputs like images and videos. To alleviate the distribution shift issue in offline RL, existing model-based methods heavily rely on the uncertainty of learned dynamics. However, the model uncertainty estimation becomes significantly biased when observations contain complex distractors with non-trivial dynamics. To address this challenge, we propose a new approach - Separated Model-based Offline Policy Optimization (SeMOPO) - decomposing latent states into endogenous and exogenous parts via conservative sampling and estimating model uncertainty on the endogenous states only. We provide a theoretical guarantee of model uncertainty and performance bound of SeMOPO. To assess the efficacy, we construct the Low-Quality Vision Deep Data-Driven Datasets for RL (LQV-D4RL), where the data are collected by non-expert policy and the observations include moving distractors. Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method. The project website is this https URL.

[AI-100] Distributed genetic algorithm for application placement in the compute continuum leveraging infrastructure nodes for optimization

链接: https://arxiv.org/abs/2406.09478
作者: Carlos Guerrero,Isaac Lera,Carlos Juiz
关键词: computing environments calls, fog computing environments, resource optimization techniques, efficient resource optimization, fog computing
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing complexity of fog computing environments calls for efficient resource optimization techniques. In this paper, we propose and evaluate three distributed designs of a genetic algorithm (GA) for resource optimization in fog computing, within an increasing degree of distribution. The designs leverage the execution of the GA in the fog devices themselves by dealing with the specific features of this domain: constrained resources and widely geographical distribution of the devices. For their evaluation, we implemented a benchmark case using the NSGA-II for the specific problem of optimizing the fog service placement, according to the guidelines of our three distributed designs. These three experimental scenarios were compared with a control case, a traditional centralized version of this GA algorithm, considering solution quality and network overhead. The results show that the design with the lowest distribution degree, which keeps centralized storage of the objective space, achieves comparable solution quality to the traditional approach but incurs a higher network load. The second design, which completely distributes the population between the workers, reduces network overhead but exhibits lower solution diversity while keeping enough good results in terms of optimization objective minimization. Finally, the proposal with a distributed population and that only interchanges solution between the workers’ neighbors achieves the lowest network load but with compromised solution quality.
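
For orientation, a minimal single-objective GA for a toy service-placement problem is sketched below; the paper's designs instead distribute NSGA-II (a multi-objective GA) across the fog devices themselves, so this is only a centralized, simplified reference point with made-up demands, capacities, and latencies.

```python
# Minimal, centralized GA sketch for a toy fog service placement problem
# (single objective, for illustration only; not the paper's distributed NSGA-II).
import random

random.seed(0)
N_SERVICES, N_DEVICES = 12, 5
demand   = [random.randint(1, 4) for _ in range(N_SERVICES)]
capacity = [8] * N_DEVICES
latency  = [[random.randint(1, 20) for _ in range(N_DEVICES)] for _ in range(N_SERVICES)]

def fitness(assign):
    """Lower is better: total latency plus a penalty for overloaded devices."""
    load = [0] * N_DEVICES
    total = 0
    for s, d in enumerate(assign):
        load[d] += demand[s]
        total += latency[s][d]
    penalty = sum(max(0, load[d] - capacity[d]) for d in range(N_DEVICES)) * 100
    return total + penalty

def evolve(pop_size=40, generations=200, p_mut=0.1):
    pop = [[random.randrange(N_DEVICES) for _ in range(N_SERVICES)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        for _ in range(pop_size):
            a, b = random.sample(pop, 2)              # tournament selection (parent 1)
            parent1 = min(a, b, key=fitness)
            a, b = random.sample(pop, 2)              # tournament selection (parent 2)
            parent2 = min(a, b, key=fitness)
            cut = random.randrange(1, N_SERVICES)     # one-point crossover
            child = parent1[:cut] + parent2[cut:]
            for i in range(N_SERVICES):               # mutation: reassign a service
                if random.random() < p_mut:
                    child[i] = random.randrange(N_DEVICES)
            new_pop.append(child)
        pop = new_pop
    return min(pop, key=fitness)

best = evolve()
print("best placement:", best, "fitness:", fitness(best))
```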

[AI-101] Q-S5: Towards Quantized State Space Models

链接: https://arxiv.org/abs/2406.09477
作者: Steven Abreu,Jens E. Pedersen,Kade M. Heckel,Alessandro Pierro
关键词: State Space Models, State Space, sequence modeling architectures, next-generation sequence modeling, Space Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:In the quest for next-generation sequence modeling architectures, State Space Models (SSMs) have emerged as a potent alternative to transformers, particularly for their computational efficiency and suitability for dynamical systems. This paper investigates the effect of quantization on the S5 model to understand its impact on model performance and to facilitate its deployment to edge and resource-constrained platforms. Using quantization-aware training (QAT) and post-training quantization (PTQ), we systematically evaluate the quantization sensitivity of SSMs across different tasks like dynamical systems modeling, Sequential MNIST (sMNIST) and most of the Long Range Arena (LRA). We present fully quantized S5 models whose test accuracy drops less than 1% on sMNIST and most of the LRA. We find that performance on most tasks degrades significantly for recurrent weights below 8-bit precision, but that other components can be compressed further without significant loss of performance. Our results further show that PTQ only performs well on language-based LRA tasks whereas all others require QAT. Our investigation provides necessary insights for the continued development of efficient and hardware-optimized SSMs.
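
Post-training quantization of a weight tensor can be sketched in a few lines; the symmetric per-tensor scheme below is a generic illustration and not the specific S5 quantization recipe evaluated in the paper.

```python
# Minimal post-training quantization (PTQ) sketch: symmetric uniform quantization
# of a weight matrix to a given bit width, then measuring the reconstruction error.
import numpy as np

def quantize_dequantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8-bit signed
    scale = np.max(np.abs(w)) / qmax            # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256))      # stand-in for a recurrent weight matrix

for bits in (8, 6, 4):
    w_hat = quantize_dequantize(w, bits)
    err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"{bits}-bit relative error: {err:.4f}")
```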

[AI-102] GPT-ology, Computational Models, Silicon Sampling: How should we think about LLMs in Cognitive Science?

链接: https://arxiv.org/abs/2406.09464
作者: Desmond C. Ong
关键词: Large Language Models, Large Language, Language Models, cognitive science world, world by storm
类目: Artificial Intelligence (cs.AI)
*备注: CogSci 2024; 6 pages + 2 page of references

点击查看摘要

Abstract:Large Language Models have taken the cognitive science world by storm. It is perhaps timely now to take stock of the various research paradigms that have been used to make scientific inferences about "cognition" in these models or about human cognition. We review several emerging research paradigms -- GPT-ology, LLMs-as-computational-models, and "silicon sampling" -- and review recent papers that have used LLMs under these paradigms. In doing so, we discuss their claims as well as challenges to scientific inference under these various paradigms. We highlight several outstanding issues about LLMs that have to be addressed to push our science forward: closed-source vs open-sourced models; (the lack of visibility of) training data; and reproducibility in LLM research, including forming conventions on new task "hyperparameters" like instructions and prompts.

[AI-103] SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video

链接: https://arxiv.org/abs/2406.09462
作者: Hector A. Valdez,Kyle Min,Subarna Tripathi
关键词: improving downstream egocentric, Pretraining egocentric vision-language, egocentric video-text tasks, downstream egocentric video-text, essential to improving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture. The memory footprint of these models during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE, instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, yet pretrainable on memory-limited devices.

[AI-104] Ad Auctions for LLMs via Retrieval Augmented Generation

链接: https://arxiv.org/abs/2406.09459
作者: MohammadTaghi Hajiaghayi,Sébastien Lahaie,Keivan Rezaei,Suho Shin
关键词: large language models, compromising content integrity, computational advertising, language models, presents an opportunity
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the field of computational advertising, the integration of ads into the outputs of large language models (LLMs) presents an opportunity to support these services without compromising content integrity. This paper introduces novel auction mechanisms for ad allocation and pricing within the textual outputs of LLMs, leveraging retrieval-augmented generation (RAG). We propose a segment auction where an ad is probabilistically retrieved for each discourse segment (paragraph, section, or entire output) according to its bid and relevance, following the RAG framework, and priced according to competing bids. We show that our auction maximizes logarithmic social welfare, a new notion of welfare that balances allocation efficiency and fairness, and we characterize the associated incentive-compatible pricing rule. These results are extended to multi-ad allocation per segment. An empirical evaluation validates the feasibility and effectiveness of our approach over several ad auction scenarios, and exhibits inherent tradeoffs in metrics as we allow the LLM more flexibility to allocate ads.
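
The allocation step of the segment auction can be sketched as sampling an ad with probability proportional to bid times relevance; the advertisers and numbers below are hypothetical, and the paper's incentive-compatible pricing rule is not reproduced here.

```python
# Toy sketch of the allocation step of a segment auction: for each segment, an ad
# is sampled with probability proportional to bid x relevance. Values are illustrative.
import random

ads = {  # hypothetical advertisers: bid (currency) and relevance to the segment (0..1)
    "ad_coffee": {"bid": 2.0, "relevance": 0.9},
    "ad_travel": {"bid": 3.0, "relevance": 0.2},
    "ad_books":  {"bid": 1.0, "relevance": 0.6},
}

def sample_ad(ads, rng=random):
    names = list(ads)
    weights = [ads[n]["bid"] * ads[n]["relevance"] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

random.seed(0)
counts = {n: 0 for n in ads}
for _ in range(10_000):                      # empirical allocation frequencies
    counts[sample_ad(ads)] += 1
print(counts)   # roughly proportional to bid x relevance, i.e. 1.8 : 0.6 : 0.6
```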

[AI-105] Updating CLIP to Prefer Descriptions Over Captions

链接: https://arxiv.org/abs/2406.09458
作者: Amir Zur,Elisa Kreiss,Karel D’Oosterlinck,Christopher Potts,Atticus Geiger
关键词: powerful generic metric, meant to complement, meant to replace, replace an image, powerful generic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Although CLIPScore is a powerful generic metric that captures the similarity between a text and an image, it fails to distinguish between a caption that is meant to complement the information in an image and a description that is meant to replace an image entirely, e.g., for accessibility. We address this shortcoming by updating the CLIP model with the Concadia dataset to assign higher scores to descriptions than captions using parameter efficient fine-tuning and a loss objective derived from work on causal interpretability. This model correlates with the judgements of blind and low-vision people while preserving transfer capabilities and has interpretable structure that sheds light on the caption–description distinction.

[AI-106] Pandora: Towards General World Model with Natural Language Actions and Video States

链接: https://arxiv.org/abs/2406.09455
作者: Jiannan Xiang,Guangyi Liu,Yi Gu,Qiyue Gao,Yuting Ning,Yuheng Zha,Zeyu Feng,Tianhua Tao,Shibo Hao,Yemin Shi,Zhengzhong Liu,Eric P. Xing,Zhiting Hu
关键词: World, general world, general world models, World models, simulate future states
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Website: this https URL

点击查看摘要

Abstract:World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provide a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper takes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training-from-scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential of building stronger general world models with larger-scale training.

[AI-107] Advancing High Resolution Vision-Language Models in Biomedicine

链接: https://arxiv.org/abs/2406.09454
作者: Zekai Chen,Arda Pekis,Kevin Brown
关键词: significantly advanced generative, vision-language modeling, learning has significantly, significantly advanced, advanced generative
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注: 15 pages

点击查看摘要

Abstract:Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling. Innovations like GPT-4V and open-source projects such as LLaVA have enabled robust conversational agents capable of zero-shot task completions. However, applying these technologies in the biomedical field presents unique challenges. Recent initiatives like LLaVA-Med have started to adapt instruction-tuning for biomedical contexts using large datasets such as PMC-15M. Our research offers three key contributions: (i) we present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B, (ii) we propose a novel image encoding strategy using hierarchical representations to improve fine-grained biomedical visual comprehension, and (iii) we develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question answering benchmarks, with an average performance improvement of over 10% compared to previous methods. These advancements provide more accurate and reliable tools for medical professionals, bridging gaps in current multi-modal conversational assistants and promoting further innovations in medical AI.

[AI-108] Comment on paper: Position: Rethinking Post-Hoc Search-Based Neural Approaches for Solving Large-Scale Traveling Salesman Problems

链接: https://arxiv.org/abs/2406.09441
作者: Yimeng Min
关键词: inconsistent time measurements, identify two major, failure to run, inconsistent time, time measurements
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: comment on arXiv:2406.03503 , 4 pages, 1 figure and 1 table

点击查看摘要

Abstract:We identify two major issues in the SoftDist paper (Xia et al.): (1) the failure to run all steps of different baselines on the same hardware environment, and (2) the use of inconsistent time measurements when comparing to other baselines. These issues lead to flawed conclusions. When all steps are executed in the same hardware environment, the primary claim made in SoftDist is no longer supported.

[AI-109] LooPIN: A PinFi protocol for decentralized computing

链接: https://arxiv.org/abs/2406.09422
作者: Yunwei Mao,Qi He,Ju Li
关键词: Networked computing power, Physical Infrastructure Finance, Networked computing, artificial intelligence, critical utility
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Networked computing power is a critical utility in the era of artificial intelligence. This paper presents a novel Physical Infrastructure Finance (PinFi) protocol designed to facilitate the distribution of computing power within networks in a decentralized manner. Addressing the core challenges of coordination, pricing, and liquidity in decentralized physical infrastructure networks (DePIN), the PinFi protocol introduces a distinctive dynamic pricing mechanism. It enables providers to allocate excess computing resources to a “dissipative” PinFi liquidity pool, distinct from traditional DeFi liquidity pools, ensuring seamless access for clients at equitable, market-based prices. This approach significantly reduces the costs of accessing computing power, potentially to as low as 1% compared to existing services, while simultaneously enhancing security and dependability. The PinFi protocol is poised to transform the dynamics of supply and demand in computing power networks, setting a new standard for efficiency and accessibility.

[AI-110] Understanding Pedestrian Movement Using Urban Sensing Technologies: The Promise of Audio-based Sensors

链接: https://arxiv.org/abs/2406.09998
作者: Chaeyeon Han,Pavan Seshadri,Yiwei Ding,Noah Posner,Bon Woo Koo,Animesh Agrawal,Alexander Lerch,Subhrajit Guhathakurta
关键词: monitor vehicular flows, deployed to monitor, monitor vehicular, pedestrian, vehicular flows
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
*备注: submitted to Urban Informatics

点击查看摘要

Abstract:While various sensors have been deployed to monitor vehicular flows, sensing pedestrian movement is still nascent. Yet walking is a significant mode of travel in many cities, especially those in Europe, Africa, and Asia. Understanding pedestrian volumes and flows is essential for designing safer and more attractive pedestrian infrastructure and for controlling periodic overcrowding. This study discusses a new approach to scale up urban sensing of people with the help of novel audio-based technology. It assesses the benefits and limitations of microphone-based sensors as compared to other forms of pedestrian sensing. A large-scale dataset called ASPED is presented, which includes high-quality audio recordings along with video recordings used for labeling the pedestrian count data. The baseline analyses highlight the promise of using audio sensors for pedestrian tracking, although algorithmic and technological improvements to make the sensors practically usable continue. This study also demonstrates how the data can be leveraged to predict pedestrian trajectories. Finally, it discusses the use cases and scenarios where audio-based pedestrian sensing can support better urban and transportation planning.

[AI-111] Implementing engrams from a machine learning perspective: XOR as a basic motif

链接: https://arxiv.org/abs/2406.09940
作者: Jesus Marco de Lucas,Maria Peña Fernandez,Lara Lloret Iglesias
关键词: complex multimodal information, machine learning tools, compressed form, previously presented, complex multimodal
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 9 pages, short comment

点击查看摘要

Abstract:We have previously presented the idea of how complex multimodal information could be represented in our brains in a compressed form, following mechanisms similar to those employed in machine learning tools, like autoencoders. In this short comment note we reflect, mainly with a didactical purpose, upon the basic question for a biological implementation: what could be the mechanism working as a loss function, and how it could be connected to a neuronal network providing the required feedback to build a simple training configuration. We present our initial ideas based on a basic motif that implements an XOR switch, using a few excitatory and inhibitory neurons. Such a motif is guided by a principle of homeostasis, and it implements a loss function that could provide feedback to other neuronal structures, establishing a control system. We analyse the presence of this XOR motif in the connectome of C. elegans, and indicate the relationship with the well-known lateral inhibition motif. We then explore how to build a basic biological neuronal structure with learning capacity integrating this XOR motif. Guided by the computational analogy, we show an initial example that indicates the feasibility of this approach, applied to learning binary sequences, as is the case for simple melodies. In summary, we provide didactical examples exploring the parallelism between biological and computational learning mechanisms, identifying basic motifs and training procedures, and how an engram encoding a melody could be built using a simple recurrent network involving both excitatory and inhibitory neurons.
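
The XOR motif the abstract describes is easy to sketch with threshold units: an excitatory OR-like hidden neuron drives the output, while an AND-like hidden neuron inhibits it when both inputs are active. The weights and thresholds below are illustrative choices, not values taken from the paper.

```python
# Didactic sketch of an XOR motif built from threshold units, with one excitatory
# hidden neuron (OR-like) and one inhibitory hidden neuron (AND-like) projecting
# onto the output. Weights and thresholds are illustrative.
def step(x, threshold):
    return 1 if x >= threshold else 0

def xor_motif(a, b):
    or_unit  = step(a + b, 1)        # excitatory: fires if at least one input is active
    and_unit = step(a + b, 2)        # fires only if both inputs are active
    # Output receives excitation from the OR unit and strong inhibition from the AND unit
    return step(1 * or_unit - 2 * and_unit, 1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_motif(a, b))   # prints the 0,1,1,0 pattern of XOR
```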

[AI-112] Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition

链接: https://arxiv.org/abs/2406.09873
作者: Yicong Jiang,Tianzi Wang,Xurong Xie,Juan Liu,Wei Sun,Nan Yan,Hui Chen,Lan Wang,Xunying Liu,Feng Tian
关键词: Disordered speech recognition, Disordered speech, recognition profound implications, profound implications, implications for improving
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Accepted by interspeech 2024

点击查看摘要

Abstract:Disordered speech recognition has profound implications for improving the quality of life for individuals afflicted with, for example, dysarthria. Dysarthric speech recognition encounters challenges including limited data, substantial dissimilarities between dysarthric and non-dysarthric speakers, and significant speaker variations stemming from the disorder. This paper introduces Perceiver-Prompt, a method for speaker adaptation that utilizes P-Tuning on the Whisper large-scale model. We first fine-tune Whisper using LoRA and then integrate a trainable Perceiver to generate fixed-length speaker prompts from variable-length inputs, to improve model recognition of Chinese dysarthric speech. Experimental results from our Chinese dysarthric speech dataset demonstrate consistent improvements in recognition performance with Perceiver-Prompt. A relative reduction of up to 13.04% in CER is obtained over the fine-tuned Whisper.
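
The LoRA fine-tuning step mentioned in the abstract can be sketched with Hugging Face transformers and peft; the checkpoint, target modules, and LoRA hyperparameters below are assumptions for illustration, and the trainable Perceiver prompt generator from the paper is not included.

```python
# Hedged sketch of LoRA fine-tuning setup on Whisper using transformers + peft.
# Model size, target modules, and LoRA hyperparameters are illustrative assumptions.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank update dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Whisper blocks
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the LoRA adapters remain trainable
```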

附件下载

点击下载今日全部论文列表