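This post lists the latest papers fetched daily from the arXiv website. It is updated automatically every morning at around 11:30 and is organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily list by email (also sent at around 11:30), leave your email address in the comments.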

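Overview (2024-06-28)

422 papers were updated today, including:

  • Natural language processing: 78 papers (Computation and Language (cs.CL))
  • Computer vision: 135 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial intelligence: 114 papers (Artificial Intelligence (cs.AI))
  • Machine learning: 131 papers (Machine Learning (cs.LG))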
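
A listing like this could be assembled from arXiv's public Atom API. The post does not say how its data is actually collected, so the following is only a minimal sketch under that assumption; the helper names `build_query_url` and `parse_feed` are hypothetical and chosen for illustration.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: the public arXiv API endpoint is used; the blog's real pipeline is unknown.
API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category, max_results=10):
    """Build a query URL for the newest submissions in one category (e.g. cs.CL)."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": str(max_results),
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract title, link, and authors from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        papers.append({
            # Titles may be wrapped across lines in the feed; collapse whitespace.
            "title": " ".join(raw_title.split()),
            "link": entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip(),
            "authors": [
                a.findtext("atom:name", default="", namespaces=ATOM_NS).strip()
                for a in entry.findall("atom:author", ATOM_NS)
            ],
        })
    return papers
```

To fetch live data, the URL from `build_query_url("cs.CL")` can be opened with `urllib.request.urlopen` and the response body passed to `parse_feed`. Counting the returned entries per category would give figures like those in the overview.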

Natural Language Processing

[NLP-0] Taming Data and Transformers for Audio Generation

Link: https://arxiv.org/abs/2406.19388
Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, Vicente Ordonez
Keywords: Generating ambient sounds, Generating ambient, challenging problem due, employ large-scale generative, making it difficult
Categories: Sound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Notes: Project Webpage: this https URL

Abstract:Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.

[NLP-1] The Remarkable Robustness of LLMs: Stages of Inference?

Link: https://arxiv.org/abs/2406.19384
Authors: Vedang Lad, Wes Gurnee, Max Tegmark
Keywords: Large Language Models, Large Language, swapping adjacent layers, Language Models, deleting and swapping
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95% of the original model’s prediction accuracy without fine-tuning, whereas models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesize the existence of four universal stages of inference across eight different models: detokenization, feature engineering, prediction ensembling, and residual sharpening. The first stage integrates local information, lifting raw token representations into higher-level contextual representations. Next is the iterative refinement of task and entity-specific features. Then, the second half of the model begins with a phase transition, where hidden representations align more with the vocabulary space due to specialized model components. Finally, the last layer sharpens the following token distribution by eliminating obsolete features that add noise to the prediction.
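
The two interventions named in the abstract, deleting a layer and swapping adjacent layers, can be illustrated on a toy residual stack. This is only a sketch for intuition, not the authors' code: the "layers" here are plain Python functions on a number, and `run`, `delete_layer`, and `swap_adjacent` are hypothetical helper names.

```python
def run(layers, x):
    """Apply a stack of layers with residual connections: x <- x + layer(x)."""
    for layer in layers:
        x = x + layer(x)
    return x


def delete_layer(layers, i):
    """Return a copy of the stack with layer i removed."""
    return layers[:i] + layers[i + 1:]


def swap_adjacent(layers, i):
    """Return a copy of the stack with layers i and i+1 exchanged."""
    out = list(layers)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


# Toy stack: the first layer adds a constant, the second doubles the stream.
toy_layers = [lambda x: 1.0, lambda x: x]
baseline = run(toy_layers, 0.0)                    # (0 + 1) then (1 + 1)
deleted = run(delete_layer(toy_layers, 0), 0.0)    # only the doubling layer
swapped = run(swap_adjacent(toy_layers, 0), 0.0)   # doubling first, then +1
```

In the paper's setting the same edits would be applied to a transformer's layer list and the outputs compared against the unmodified model; the toy above only shows the mechanics of the interventions.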

[NLP-2] Suri: Multi-constraint Instruction Following for Long-form Text Generation

Link: https://arxiv.org/abs/2406.19371
Authors: Chau Minh Pham, Simeng Sun, Mohit Iyyer
Keywords: Existing research, largely focuses, focuses on tasks, tasks with simple, Existing
Categories: Computation and Language (cs.CL)

Abstract:Existing research on instruction following largely focuses on tasks with simple instructions and short responses. In this work, we explore multi-constraint instruction following for generating long-form text. We create Suri, a dataset with 20K human-written long-form texts paired with LLM-generated backtranslated instructions that contain multiple complex constraints. Because of prohibitive challenges associated with collecting human preference judgments on long-form texts, preference-tuning algorithms such as DPO are infeasible in our setting; thus, we propose Instructional ORPO (I-ORPO), an alignment method based on the ORPO algorithm. Instead of receiving negative feedback from dispreferred responses, I-ORPO obtains negative feedback from synthetically corrupted instructions generated by an LLM. Using Suri, we perform supervised and I-ORPO fine-tuning on Mistral-7b-Instruct-v0.2. The resulting models, Suri-SFT and Suri-I-ORPO, generate significantly longer texts (~5K tokens) than base models without significant quality deterioration. Our human evaluation shows that while both SFT and I-ORPO models satisfy most constraints, Suri-I-ORPO generations are generally preferred for their coherent and informative incorporation of the constraints. We release our code at this https URL.

[NLP-3] The Model Arena for Cross-lingual Sentiment Analysis: A Comparative Study in the Era of Large Language Models

Link: https://arxiv.org/abs/2406.19358
Authors: Xiliang Zhu, Shayna Gardiner, Tere Roldán, David Rossouw
Keywords: Natural Language Processing, Language Processing, component in Natural, Natural Language, Sentiment analysis serves
Categories: Computation and Language (cs.CL)
Notes: Accepted to WASSA workshop at ACL 2024

Abstract:Sentiment analysis serves as a pivotal component in Natural Language Processing (NLP). Advancements in multilingual pre-trained models such as XLM-R and mT5 have contributed to the increasing interest in cross-lingual sentiment analysis. The recent emergence in Large Language Models (LLM) has significantly advanced general NLP tasks, however, the capability of such LLMs in cross-lingual sentiment analysis has not been fully studied. This work undertakes an empirical analysis to compare the cross-lingual transfer capability of public Small Multilingual Language Models (SMLM) like XLM-R, against English-centric LLMs such as Llama-3, in the context of sentiment analysis across English, Spanish, French and Chinese. Our findings reveal that among public models, SMLMs exhibit superior zero-shot cross-lingual performance relative to LLMs. However, in few-shot cross-lingual settings, public LLMs demonstrate an enhanced adaptive potential. In addition, we observe that proprietary GPT-3.5 and GPT-4 lead in zero-shot cross-lingual capability, but are outpaced by public models in few-shot scenarios.

[NLP-4] DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions

Link: https://arxiv.org/abs/2406.19356
Authors: Nigel Fernandez, Alexander Scarlatos, Simon Woodhead, Andrew Lan
Keywords: anticipate knowledge deficiencies, High-quality distractors, assessment and pedagogical, manually crafting, anticipate knowledge
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.

[NLP-5] Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?

Link: https://arxiv.org/abs/2406.19354
Authors: Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, Mohit Bansal
Keywords: model editing, model editing problem, editing, model, editing problem concerns
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 23 pages, 4 figures

Abstract:The model editing problem concerns how language models should learn new facts about the world over time. While empirical research on model editing has drawn widespread attention, the conceptual foundations of model editing remain shaky – perhaps unsurprisingly, since model editing is essentially belief revision, a storied problem in philosophy that has eluded succinct solutions for decades. Model editing nonetheless demands a solution, since we need to be able to control the knowledge within language models. With this goal in mind, this paper critiques the standard formulation of the model editing problem and proposes a formal testbed for model editing research. We first describe 12 open problems with model editing, based on challenges with (1) defining the problem, (2) developing benchmarks, and (3) assuming LLMs have editable beliefs in the first place. Many of these challenges are extremely difficult to address, e.g. determining far-reaching consequences of edits, labeling probabilistic entailments between facts, and updating beliefs of agent simulators. Next, we introduce a semi-synthetic dataset for model editing based on Wikidata, where we can evaluate edits against labels given by an idealized Bayesian agent. This enables us to say exactly how belief revision in language models falls short of a desirable epistemic standard. We encourage further research exploring settings where such a gold standard can be compared against. Our code is publicly available at: this https URL

[NLP-6] IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Link: https://arxiv.org/abs/2406.19349
Authors: Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya
Keywords: Hate speech poses, Hate speech, Indonesian hate speech, social harmony, poses a significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.

[NLP-7] Jump Starting Bandits with LLM-Generated Prior Knowledge

Link: https://arxiv.org/abs/2406.19317
Authors: Parand A. Alamdari, Yanshuai Cao, Kevin H. Wilson
Keywords: integrating Large Language, Large Language Models, Large Language, present substantial evidence, substantial evidence demonstrating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
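
The jump-start idea in the abstract, pre-training a contextual bandit on approximate preferences before it sees real users, can be sketched with a very small tabular bandit. This is an illustrative simplification and not the paper's algorithm: the class `WarmStartBandit` is hypothetical, it acts greedily on per-(context, arm) mean rewards, and the synthetic rows are hard-coded stand-ins for what an LLM would be prompted to produce.

```python
from collections import defaultdict


class WarmStartBandit:
    """Greedy tabular contextual bandit that can be seeded with synthetic data."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = defaultdict(int)    # (context, arm) -> number of observations
        self.sums = defaultdict(float)    # (context, arm) -> total reward

    def update(self, context, arm, reward):
        self.counts[(context, arm)] += 1
        self.sums[(context, arm)] += reward

    def pretrain(self, synthetic_rows):
        """Seed the estimates with (context, arm, reward) rows, e.g. LLM-simulated."""
        for context, arm, reward in synthetic_rows:
            self.update(context, arm, reward)

    def mean(self, context, arm):
        n = self.counts[(context, arm)]
        return self.sums[(context, arm)] / n if n else 0.0

    def choose(self, context):
        """Pick the arm with the highest estimated mean reward for this context."""
        return max(self.arms, key=lambda arm: self.mean(context, arm))


bandit = WarmStartBandit(["email_a", "email_b"])
# Hypothetical simulated preferences standing in for LLM output.
bandit.pretrain([("student", "email_b", 1.0), ("student", "email_a", 0.0)])
first_choice = bandit.choose("student")
```

Because the estimates are already populated before the first real interaction, early choices follow the simulated preferences rather than starting from scratch; real feedback passed to `update` then gradually overrides the prior.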

[NLP-8] LiveBench: A Challenging, Contamination-Free LLM Benchmark

Link: https://arxiv.org/abs/2406.19314
Authors: Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum
Keywords: Test set contamination, fair LLM evaluation, render benchmarks obsolete, quickly render benchmarks, newer model training
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Test set contamination, wherein test data from a benchmark ends up in a newer model’s training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

[NLP-9] The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning

Link: https://arxiv.org/abs/2406.19307
Authors: Shaobo Cui, Zhijing Jin, Bernhard Schölkopf, Boi Faltings
Keywords: Understanding commonsense causality, Understanding commonsense, intelligence for humans, unique mark, mark of intelligence
Categories: Computation and Language (cs.CL)
Notes: 42 pages

Abstract:Understanding commonsense causality is a unique mark of intelligence for humans. It helps people understand the principles of the real world better and benefits the decision-making process related to causation. For instance, commonsense causality is crucial in judging whether a defendant’s action causes the plaintiff’s loss in determining legal liability. Despite its significance, a systematic exploration of this topic is notably lacking. Our comprehensive survey bridges this gap by focusing on taxonomies, benchmarks, acquisition methods, qualitative reasoning, and quantitative measurements in commonsense causality, synthesizing insights from over 200 representative articles. Our work aims to provide a systematic overview, update scholars on recent advancements, provide a pragmatic guide for beginners, and highlight promising future research directions in this vital field.

[NLP-10] From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Link: https://arxiv.org/abs/2406.19292
Authors: Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos
Keywords: Large Language Models, Large Language, Recent studies, shown that Large, accurately retrieve information
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs’ information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., 10.5% improvement on 20 documents MDQA at position 10 for GPT-3.5 Turbo). We also find that finetuned LLMs’ performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data cause no performance drop while other baseline data can cause a drop that ranges from 2.33% to 6.19% ). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
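
A numerical key-value retrieval example of the kind the abstract describes might be generated as follows. The prompt wording, value ranges, and the helper name `make_kv_task` are assumptions made for illustration; the paper's actual dataset format may differ.

```python
import random


def make_kv_task(num_pairs=5, seed=0):
    """Build one synthetic retrieval example: a prompt listing numeric
    key-value pairs and a question about one key, plus the expected answer."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    keys = rng.sample(range(10_000, 100_000), num_pairs)  # distinct 5-digit keys
    kv = {k: rng.randrange(10_000, 100_000) for k in keys}
    target = rng.choice(keys)
    lines = [f"{k}: {v}" for k, v in kv.items()]
    prompt = (
        "Here is a dictionary of key-value pairs:\n"
        + "\n".join(lines)
        + f"\nWhat is the value for key {target}?"
    )
    return prompt, str(kv[target])


prompt, answer = make_kv_task(num_pairs=8, seed=1)
```

Increasing `num_pairs` lengthens the context, which is how such a generator could be used to probe or train retrieval at different positions and depths.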

[NLP-11] HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Link: https://arxiv.org/abs/2406.19280
Authors: Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
Keywords: large language models, multimodal large language, rapid development, large language, medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

[NLP-12] VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation

Link: https://arxiv.org/abs/2406.19276
Authors: Yixiao Song, Yekyung Kim, Mohit Iyyer
Keywords: base like Wikipedia, Existing metrics, decompose an input, input text, knowledge base
Categories: Computation and Language (cs.CL)

Abstract:Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input text into “atomic claims” and verify each against a knowledge base like Wikipedia. These metrics are not suitable for most generation tasks because they assume that every claim is verifiable (i.e., can plausibly be proven true or false). We address this issue with VERISCORE, a metric for diverse long-form generation tasks that contain both verifiable and unverifiable content. VERISCORE can be effectively implemented with either closed or fine-tuned open-weight language models, and human evaluation confirms that VERISCORE’s extracted claims are more sensible than those from competing methods across eight different long-form tasks. We use VERISCORE to evaluate generations from 16 different models across multiple long-form tasks and find that while GPT-4o is the best-performing model overall, open-weight models such as Mixtral-8x22 are closing the gap. We show that an LM’s VERISCORE on one task (e.g., biography generation) does not necessarily correlate to its VERISCORE on a different task (e.g., long-form QA), highlighting the need for expanding factuality evaluation across tasks with varying fact density.

[NLP-13] AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

Link: https://arxiv.org/abs/2406.19271
Authors: Praneeth Vadlapati
Keywords: Large Language Models, reliable Large Language, Large Language, Language Models, reliable Large
Categories: Computation and Language (cs.CL)
Notes: Initial version

Abstract: Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed. However, the training data continually becomes outdated. Enabling automatic training of AI using web data involves significant concerns regarding data quality and safety due to bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models. Training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system's effectiveness in purifying the data.

[NLP-14] Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Link: https://arxiv.org/abs/2406.19263
Authors: Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
Keywords: Graphical User Interfaces, Graphical User, User Interfaces, ToL agent, digital devices
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the SPR task. Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen, distinguishing our ToL agent from other screen reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Last but not least, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along the path of agent execution trajectories. Code and data: this http URL
摘要:图形用户界面是我们与数字设备交互的核心。最近,为各种图形用户界面理解任务构建模型的工作越来越多。然而,这些努力在很大程度上忽略了一项重要的涉及图形用户界面的任务:基于用户指示点的屏幕阅读,我们将其称为屏幕指向并阅读(SPR)任务。这项任务主要由刚性的可访问屏幕阅读工具来处理,在多模式大型语言模型(MLLMS)的进步推动下,迫切需要新的模型。在本文中,我们提出了一种透镜树(TOL)代理,利用一种新的TOL接地机制来解决SPR任务。基于输入点坐标和相应的图形用户界面截图,我们的TOL代理构建了一个层次布局树。在树的基础上,我们的TOL代理不仅理解指定区域的内容,而且清楚地表达了元素之间的布局和空间关系。这样的布局信息对于准确解释屏幕上的信息至关重要,从而将我们的TOL代理与其他屏幕阅读工具区分开来。我们还根据新提出的SPR基准对TOL代理进行了彻底的评估,该基准包括来自移动、网络和操作系统的图形用户界面。最后但同样重要的是,我们在移动图形用户界面导航任务上测试了TOL代理,展示了它在识别代理执行轨迹路径上的错误操作方面的有效性。代码和数据:此http URL
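
The abstract says the ToL agent builds a Hierarchical Layout Tree from a point and a screenshot. A minimal sketch of the lookup side of that idea follows, assuming the tree of nested regions is already available; the `Region` class, the box format, and the toy screen are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A rectangular GUI region; box is (left, top, right, bottom). Hypothetical type."""
    name: str
    box: tuple
    children: list = field(default_factory=list)

    def contains(self, x, y):
        left, top, right, bottom = self.box
        return left <= x <= right and top <= y <= bottom


def path_to_point(root, x, y):
    """Return the names of the nested regions containing (x, y), outermost first."""
    if not root.contains(x, y):
        return []
    path = [root.name]
    for child in root.children:
        sub_path = path_to_point(child, x, y)
        if sub_path:
            # Descend into the first child that contains the point.
            return path + sub_path
    return path


# Toy layout, invented for illustration only.
screen = Region("screen", (0, 0, 100, 200), [
    Region("toolbar", (0, 0, 100, 20), [Region("back_button", (0, 0, 20, 20))]),
    Region("content", (0, 20, 100, 200)),
])

print(path_to_point(screen, 10, 10))
```

The returned path is the kind of layout context (which container holds the pointed element) that the abstract describes as distinguishing ToL from plain screen readers.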

[NLP-15] Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
[NLP-15] 通过结构时空对齐增强视频语言表示

链接: https://arxiv.org/abs/2406.19255
作者: Hao Fei,Shengqiong Wu,Meishan Zhang,Min Zhang,Tat-Seng Chua,Shuicheng Yan
关键词: coarse-grained cross-modal aligning, shown remarkable potential, detached video-language view, large-scale video-language models, pre-training large-scale video-language
中文关键词: 粗粒度的跨模式对齐,显示出显着的潜力,独立的视频语言视图,大规模视频语言模型,预训练大规模视频语言
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted by IEEE TPAMI 2024

Abstract:While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning, under-modeling of temporal dynamics, and a detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment respectively, enhancing the video-language grounding in both the spatiality and temporality. We design our method as a plug-and-play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves the existing 13 strong-performing VLMs, and refreshes the current state-of-the-art end task performance significantly in both the fine-tuning and zero-shot settings.
摘要:尽管预训练的大规模视频语言模型在各种下游视频语言任务中显示出了巨大的潜力,但现有的视频语言模型仍然存在一些常见的局限性,如粗粒度的跨模态对齐、时间动态建模不足、视频与语言视角割裂。在这项工作中,我们的目标是用一种细粒度的结构化时空对齐学习方法(即Finsta)来增强VLM。首先,我们用细粒度场景图(SG)结构来表示输入的文本和视频,并将两者进一步统一为一个整体场景图(HSG),用于连接两种模态。在此基础上,建立了一个基于SG的框架,其中文本SG(TSG)用图Transformer编码,而视频动态SG(DSG)和HSG用一种新的递归图Transformer建模,用于时空特征的传播。进一步设计了一种时空高斯差分图Transformer,以增强对物体在空间和时间维度上变化的感知。然后,基于TSG和DSG的细粒度结构特征,分别进行了以对象为中心的空间对齐和以谓词为中心的时间对齐,增强了视频语言在空间性和时间性上的基础。我们将我们的方法设计成一个即插即用系统,它可以集成到现有的训练有素的VLM中进行进一步的表示增强,而不需要从头开始训练,也不需要在下游应用中依赖SG注释。在标准和长视频场景中的12个数据集上的6个代表性VL建模任务上,Finsta持续改进了现有的13个性能强劲的VLM,并在微调和零样本设置下显著刷新了当前最先进的终端任务性能。

[NLP-16] AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation
[NLP-16] AutoRAG-HP:用于检索增强生成的自动在线超参数调整

链接: https://arxiv.org/abs/2406.19251
作者: Jia Fu,Xiaoting Qin,Fangkai Yang,Lu Wang,Jue Zhang,Qingwei Lin,Yubo Chen,Dongmei Zhang,Saravan Rajmohan,Qi Zhang
关键词: Large Language Models, Language Models, Recent advancements, Retrieval-Augmented Generation, Models have transformed
中文关键词: 大型语言模型、语言模型、最新进展、检索增强一代、模型已发生转变
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Recent advancements in Large Language Models have transformed ML/AI development, necessitating a reevaluation of AutoML principles for the Retrieval-Augmented Generation (RAG) systems. To address the challenges of hyper-parameter optimization and online adaptation in RAG, we propose the AutoRAG-HP framework, which formulates the hyper-parameter tuning as an online multi-armed bandit (MAB) problem and introduces a novel two-level Hierarchical MAB (Hier-MAB) method for efficient exploration of large search spaces. We conduct extensive experiments on tuning hyper-parameters, such as top-k retrieved documents, prompt compression ratio, and embedding methods, using the ALCE-ASQA and Natural Questions datasets. Our evaluation from jointly optimizing all three hyper-parameters demonstrates that MAB-based online learning methods can achieve Recall@5 ≈ 0.8 for scenarios with prominent gradients in search space, using only ~20% of the LLM API calls required by the Grid Search approach. Additionally, the proposed Hier-MAB approach outperforms other baselines in more challenging optimization scenarios. The code will be made available at this https URL.
摘要:大型语言模型的最新进展改变了ML/AI的发展,需要重新评估检索增强生成(RAG)系统的AutoML原则。为了解决RAG中超参数优化和在线自适应的问题,我们提出了AutoRAG-HP框架,将超参数调整问题描述为一个在线多臂老虎机(MAB)问题,并引入了一种新的两级分层MAB(Hier-MAB)方法来有效地探索大搜索空间。我们使用ALCE-ASQA和Natural Questions数据集在调整超参数方面进行了广泛的实验,如top-k检索文档数、提示压缩比和嵌入方法。三个超参数的联合优化结果表明,对于搜索空间中梯度较大的场景,基于MAB的在线学习方法可以达到约0.8的Recall@5,只需使用网格搜索方法所需LLM API调用的约20%。此外,提出的Hier-MAB方法在更具挑战性的优化场景中的性能优于其他基线。代码将在此HTTPS URL上提供。
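
To make the bandit framing concrete, here is a minimal sketch that treats each candidate top-k value as one arm and selects arms with plain UCB1. This is a flat, single-level bandit, not the paper's two-level Hier-MAB, and the reward probabilities in `TRUE_QUALITY` are invented for the simulation; in a real system the reward would come from evaluating the RAG pipeline's answer.

```python
import math
import random


def ucb1(arms, reward_fn, rounds, seed=0):
    """Run UCB1 for a fixed number of rounds and return pull counts per arm."""
    rng = random.Random(seed)
    counts = {arm: 0 for arm in arms}
    totals = {arm: 0.0 for arm in arms}
    for t in range(1, rounds + 1):
        untried = [arm for arm in arms if counts[arm] == 0]
        if untried:
            arm = untried[0]  # pull every arm once before using the bound
        else:
            # Mean reward plus an exploration bonus that shrinks with more pulls.
            arm = max(arms, key=lambda a: totals[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        totals[arm] += reward_fn(arm, rng)
        counts[arm] += 1
    return counts


# Hypothetical success probability for each top-k setting (assumption, not paper data).
TRUE_QUALITY = {1: 0.2, 3: 0.5, 5: 0.8, 10: 0.4}


def simulated_reward(top_k, rng):
    """Stand-in for 'did the RAG answer pass the check' with this top-k."""
    return 1.0 if rng.random() < TRUE_QUALITY[top_k] else 0.0


counts = ucb1(list(TRUE_QUALITY), simulated_reward, rounds=2000)
best = max(counts, key=counts.get)
print(counts, "most pulled top-k:", best)
```

Each round costs one simulated LLM call, so concentrating pulls on the promising arm is what saves calls relative to evaluating every configuration equally often.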

[NLP-17] Revealing Fine-Grained Values and Opinions in Large Language Models
[NLP-17] 揭示大型语言模型中的细粒度价值观和观点

链接: https://arxiv.org/abs/2406.19238
作者: Dustin Wright,Arnav Arora,Nadav Borenstein,Srishti Yadav,Serge Belongie,Isabelle Augenstein
关键词: mitigate potential harm, Uncovering latent, potential harm, biases and mitigate, mitigate potential
中文关键词: 减轻潜在的伤害,揭露潜在的、潜在的伤害,偏见和减轻,减轻潜力
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 28 pages, 20 figures, 7 tables

Abstract:Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we propose to address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT) generated by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of their generated stances and fine-grained analysis of the plain text justifications for those stances. For fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, as well as disparities between the results of tests when eliciting closed-form vs. open domain responses. Additionally, patterns in the plain text rationales via tropes show that similar justifications are repeatedly generated across models and prompts even with disparate stances.
摘要:挖掘大型语言模型(LLM)中的潜在价值观和观点有助于识别偏见并减轻潜在的危害。最近,研究者通过向LLM提出调查问题并量化它们对带有道德和政治色彩的陈述的立场来研究这一问题。然而,LLM生成的立场可能会因提示方式的不同而有很大差异,而且有许多方法可以支持或反对给定的立场。在这项工作中,我们建议通过分析一个大型且稳健的数据集来解决这一问题,该数据集由6个LLM使用420种提示变体生成,包含156,000个LLM对政治罗盘测试(PCT)的62个命题的响应。我们对它们生成的立场执行粗粒度分析,并对这些立场的纯文本理由进行细粒度分析。对于细粒度分析,我们建议识别回答中的套路(trope):语义相似的短语,这些短语在不同的提示中反复出现并保持一致,揭示了给定LLM容易产生的文本模式。我们发现,添加到提示中的人口统计特征显著影响了PCT的结果,反映了偏见,同时在引出封闭式与开放域回答时,测试结果之间也存在差异。此外,通过套路呈现的纯文本理由中的模式表明,即使立场不同,类似的理由也会在模型和提示中重复生成。

[NLP-18] FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
[NLP-18] FlowVQA:使用流程图在视觉问题回答中映射多模式逻辑

链接: https://arxiv.org/abs/2406.19237
作者: Shubhankar Singh,Purvi Chaurasia,Yerram Varun,Pranshu Pandya,Vatsal Gupta,Vivek Gupta,Dan Roth
关键词: question answering lack, spatial reasoning skills, visual question answering, evaluating spatial reasoning, Existing benchmarks
中文关键词: 缺乏问题回答、空间推理技能、视觉问题回答、评估空间推理、现有基准
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

Abstract:Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark’s potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
摘要:现有的视觉问答基准缺乏视觉基础和复杂性,尤其是在评估空间推理能力方面。我们引入了FlowVQA,这是一个新的基准,旨在评估可视化问答多通道语言模型在以流程图为可视上下文的推理中的能力。FlowVQA包括来自三个不同内容来源的2,272个精心生成和人工验证的流程图图像,以及22,413个不同的问题-答案对,以测试一系列推理任务,包括信息本地化、决策和逻辑级数。我们使用不同的策略对一套开源和专有的多模式语言模型进行了彻底的基线评估,然后分析了方向偏差。这些结果突出了基准作为推进多模式建模领域的重要工具的潜力,为提高视觉和逻辑推理任务中的模型性能提供了一个集中和具有挑战性的环境。

[NLP-19] RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
[NLP-19] RuBLiMP:俄语语言学最小对基准

链接: https://arxiv.org/abs/2406.19232
作者: Ekaterina Taktasheva,Maxim Bazhukov,Kirill Koncha,Alena Fenogenova,Ekaterina Artemova
关键词: Linguistic Minimal Pairs, Minimal pairs, minimal pairs address, Linguistic Minimal, well-established approach
中文关键词: 语言最小对、最小对、最小对地址、语言最小、成熟的方法
类目: Computation and Language (cs.CL)
备注:

Abstract:Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying linguistic perturbations to automatically annotated sentences from open text corpora and carefully curating test data. We describe the data collection protocol and present the results of evaluating 25 language models in various scenarios. We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense. RuBLiMP, the codebase, and other materials are publicly available.
摘要:最小对是评价语言模型语法知识的一种行之有效的方法。然而,现有的最小配对资源涉及的语言数量有限,缺乏针对特定语言的语法现象的多样性。本文介绍了俄语语言最小对基准(RuBLiMP),它包括45k对语法不同的句子,并分离出一种形态、句法或语义现象。与现有的语言最小对基准不同,RuBLiMP是通过对开放文本语料库中自动标注的句子应用语言扰动并仔细挑选测试数据来创建的。我们描述了数据收集协议,并给出了在不同场景下对25种语言模型进行评估的结果。我们发现,广泛使用的俄语语言模型对形态和一致取向的对比很敏感,但在需要理解结构关系、否定、及物性和时态的现象上落后于人类。RuBLiMP、代码库和其他材料都是公开可用的。
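
A minimal sketch of how minimal-pair benchmarks are commonly scored: a model gets a pair right when it scores the grammatical sentence above the ungrammatical one. The abstract does not spell out RuBLiMP's metric, so this scoring rule is an assumption, and `toy_score` with its bad-bigram list is a hypothetical stand-in for a language model's sentence log-probability (English toy pairs are used only for readability).

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs where the first scores higher."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical scorer: penalizes bigrams it "knows" are ungrammatical.
BAD_BIGRAMS = {("cats", "sleeps"), ("dog", "bark")}


def toy_score(sentence):
    words = sentence.lower().split()
    return -sum((a, b) in BAD_BIGRAMS for a, b in zip(words, words[1:]))


pairs = [
    ("The cats sleep", "The cats sleeps"),
    ("The dog barks", "The dog bark"),
    ("She has left", "She have left"),  # the toy scorer cannot tell these apart
]

accuracy = minimal_pair_accuracy(pairs, toy_score)
print(f"accuracy = {accuracy:.2f}")
```

Because ties count as errors, a scorer that is blind to a phenomenon is penalized on it, which is how per-phenomenon gaps such as those reported for negation or tense would show up.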

[NLP-20] Spiking Convolutional Neural Networks for Text Classification
[NLP-20] 用于文本分类的尖峰卷积神经网络

链接: https://arxiv.org/abs/2406.19230
作者: Changze Lv,Jianhan Xu,Xiaoqing Zheng
关键词: Spiking neural networks, deep neural networks, implement deep neural, neural networks, Spiking neural
中文关键词: 尖峰神经网络,深度神经网络,实现深度神经,神经网络,尖峰神经
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL)
备注:

Abstract:Spiking neural networks (SNNs) offer a promising pathway to implement deep neural networks (DNNs) in a more energy-efficient manner since their neurons are sparsely activated and inferences are event-driven. However, there have been very few works that have demonstrated the efficacy of SNNs in language tasks partially because it is non-trivial to represent words in the forms of spikes and to deal with variable-length texts by SNNs. This work presents a “conversion + fine-tuning” two-step method for training SNNs for text classification and proposes a simple but effective way to encode pre-trained word embeddings as spike trains. We show empirically that after fine-tuning with surrogate gradients, the converted SNNs achieve comparable results to their DNN counterparts with much less energy consumption across multiple datasets for both English and Chinese. We also show that such SNNs are more robust to adversarial attacks than DNNs.
摘要:尖峰神经网络(SNN)为以更节能的方式实施深度神经网络(DNN)提供了一种有希望的途径,因为它们的神经元是稀疏激活的,并且推理是事件驱动的。然而,很少有作品证明了SNN在语言任务中的功效,部分原因是以尖峰形式表示单词并通过SNN处理变长文本并不是小事。这项工作提出了一种“转换+微调”两步方法来训练SNN进行文本分类,并提出了一种简单但有效的方法来将预训练的单词嵌入编码为尖峰序列。我们经验表明,在使用替代梯度进行微调后,转换后的SNN可以实现与DNN对应的结果,并且英语和中文的多个数据集的能耗要低得多。我们还表明,此类SNN比DNN对对抗攻击更稳健。
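
One simple way to turn an embedding into spike trains is rate coding: rescale each dimension to [0, 1] and emit a spike at each time step with that probability. The sketch below illustrates that generic technique; whether the paper uses exactly this normalization and sampling scheme is an assumption, and `rate_encode` is a hypothetical helper name.

```python
import random


def rate_encode(embedding, steps, seed=0):
    """Encode a real-valued vector as `steps` binary spike vectors (rate coding)."""
    rng = random.Random(seed)
    low, high = min(embedding), max(embedding)
    span = (high - low) or 1.0  # avoid division by zero for constant vectors
    rates = [(value - low) / span for value in embedding]  # firing probabilities
    return [[1 if rng.random() < rate else 0 for rate in rates]
            for _ in range(steps)]


spikes = rate_encode([-0.5, 0.0, 1.5], steps=200)
# Empirical firing rate of each dimension over all time steps.
firing = [sum(column) / len(spikes) for column in zip(*spikes)]
print(firing)
```

The smallest dimension never fires and the largest fires at every step, while intermediate values fire in proportion to their rescaled magnitude.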

[NLP-21] Tools Fail: Detecting Silent Errors in Faulty Tools
[NLP-21] 工具失效:检测故障工具中的无声错误

链接: https://arxiv.org/abs/2406.19228
作者: Jimin Sun,So Yeon Min,Yingshan Chang,Yonatan Bisk
关键词: control robots, retrieve knowledge, perform tasks, mainstay of LLMs, Abstract
中文关键词: 控制机器人、检索知识、执行任务、LLM的支柱、摘要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 12 figures

Abstract:Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model’s ability to detect “silent” tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.
摘要:工具已成为LLM的支柱,使它们能够检索其权重中没有的知识,在网络上执行任务,甚至控制机器人。然而,大多数关于工具使用的本体和综述都假设LLM的核心挑战是选择工具。相反,我们引入了一个更广泛的工具框架,引导我们探索模型检测“无声”工具错误的能力,并反思如何规划。这与越来越流行的将模型用作工具的做法更加直接一致。我们提供了一种故障恢复的初步方法,在受控计算器设置和具身智能体规划上都取得了令人满意的结果。

[NLP-22] Aligning Teacher with Student Preferences for Tailored Training Data Generation
[NLP-22] 将教师与学生偏好保持一致,以生成量身定制的培训数据

链接: https://arxiv.org/abs/2406.19227
作者: Yantao Liu,Zhao Zhang,Zijun Yao,Shulin Cao,Lei Hou,Juanzi Li
关键词: Large Language Models, Large Language, shown significant promise, Language Models, teacher model
中文关键词: 大型语言模型,大型语言,显示出巨大的前景,语言模型,教师模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have shown significant promise as copilots in various tasks. Local deployment of LLMs on edge devices is necessary when handling privacy-sensitive data or latency-sensitive tasks. The computational constraints of such devices make direct deployment of powerful large-scale LLMs impractical, necessitating the Knowledge Distillation from large-scale models to lightweight models. Lots of work has been done to elicit diverse, high-quality training examples from LLMs, but little attention has been paid to aligning teacher instructional content based on student preferences, akin to “responsive teaching” in pedagogy. Thus, we propose ARTE, dubbed Aligning TeacheR with StudenT PreferencEs, a framework that aligns the teacher model with student preferences to generate tailored training examples for Knowledge Distillation. Specifically, we elicit draft questions and rationales from the teacher model, then collect student preferences on these questions and rationales using students’ performance with in-context learning as a proxy, and finally align the teacher model with student preferences. In the end, we repeat the first step with the aligned teacher model to elicit tailored training examples for the student model on the target task. Extensive experiments on academic benchmarks demonstrate the superiority of ARTE over existing instruction-tuning datasets distilled from powerful LLMs. Moreover, we thoroughly investigate the generalization of ARTE, including the generalization of fine-tuned student models in reasoning ability and the generalization of aligned teacher models to generate tailored training data across tasks and students. In summary, our contributions lie in proposing a novel framework for tailored training example generation, demonstrating its efficacy in experiments, and investigating the generalization of both student and aligned teacher models in ARTE.
摘要:大型语言模型(LLM)在各种任务中显示出作为副驾驶的巨大前景。在处理隐私敏感数据或延迟敏感任务时,需要在边缘设备上本地部署LLM。这些设备的计算限制使得直接部署功能强大的大规模LLM是不现实的,这就需要从大规模模型到轻量级模型的知识蒸馏。为了从LLM中获取多样且高质量的训练样本,人们做了大量的工作,但很少有人注意到根据学生的偏好调整教师的教学内容,这类似于教育学中的“响应式教学”。因此,我们提出了ARTE(Aligning TeacheR with StudenT PreferencEs),这是一个将教师模型与学生偏好对齐的框架,以生成用于知识蒸馏的定制训练样本。具体地说,我们从教师模型中引出问题草稿和推理依据,然后以学生在上下文学习中的表现作为代理,收集学生对这些问题和推理依据的偏好,最后使教师模型与学生的偏好对齐。最后,我们用对齐后的教师模型重复第一步,为目标任务上的学生模型得出定制的训练样本。在学术基准上的大量实验表明,ARTE优于现有的从强大LLM中蒸馏出的指令微调数据集。此外,我们对ARTE的泛化进行了深入的研究,包括微调后的学生模型在推理能力上的泛化,以及对齐后的教师模型在跨任务和跨学生生成定制训练数据方面的泛化。综上所述,我们的贡献在于提出了一种新的定制训练样本生成框架,在实验中证明了其有效性,并研究了ARTE中学生模型和对齐后的教师模型的泛化能力。

[NLP-23] Simulating Classroom Education with LLM-Empowered Agents
[NLP-23] 使用LLM赋能的智能体模拟课堂教育

链接: https://arxiv.org/abs/2406.19226
作者: Zheyuan Zhang,Daniel Zhang-Li,Jifan Yu,Linlu Gong,Jinchang Zhou,Zhiyuan Liu,Lei Hou,Juanzi Li
关键词: Large language models, Large language, intelligent educational tasks, language models, educational tasks
中文关键词: 大型语言模型、大型语言、智能教育任务、语言模型、教育任务
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

Abstract:Large language models (LLMs) have been employed in various intelligent educational tasks to assist teaching. While preliminary explorations have focused on independent LLM-empowered agents for specific educational tasks, the potential for LLMs within a multi-agent collaborative framework to simulate a classroom with real user participation remains unexplored. In this work, we propose SimClass, a multi-agent classroom simulation framework involving user participation. We recognize representative class roles and introduce a novel class control mechanism for automatic classroom teaching, and conduct user experiments in two real-world courses. Utilizing the Flanders Interactive Analysis System and Community of Inquiry theoretical frameworks from educational analysis, we demonstrate that LLMs can simulate traditional classroom interaction patterns effectively while enhancing the user’s experience. We also observe emergent group behaviors among agents in SimClass, where agents collaborate to create enlivening interactions in classrooms to improve the user learning process. We hope this work pioneers the application of LLM-empowered multi-agent systems in virtual classroom teaching.
摘要:大语言模型(LLM)已被应用于各种智能教育任务中以辅助教学。虽然初步的探索集中在面向特定教育任务的独立LLM赋能智能体上,但在多智能体协作框架内利用LLM模拟有真实用户参与的课堂的潜力仍未被探索。在这项工作中,我们提出了SimClass,一个有用户参与的多智能体课堂模拟框架。我们识别了具有代表性的课堂角色,并引入了一种新的课堂控制机制来实现自动课堂教学,并在两门真实的课程中进行了用户实验。利用教育分析中的弗兰德斯互动分析系统和探究共同体理论框架,我们证明了LLM可以有效地模拟传统的课堂互动模式,同时提升用户体验。我们还观察到SimClass中智能体之间涌现的群体行为,其中智能体相互协作,在课堂上创造活跃的互动,以改善用户的学习过程。我们希望这项工作能够开创LLM赋能的多智能体系统在虚拟课堂教学中的应用。

[NLP-24] T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
[NLP-24] T-FREE:通过稀疏表示实现内存高效嵌入的无分词器生成式LLM

链接: https://arxiv.org/abs/2406.19223
作者: Björn Deiseroth,Manuel Brack,Patrick Schramowski,Kristian Kersting,Samuel Weinbach
关键词: Large Language Models, Tokenizers are crucial, Language Models, recently stagnated, inherent weaknesses
中文关键词: 大型语言模型,令牌器至关重要,语言模型,最近停滞不前,固有弱点
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
摘要:标记化器对于在大型语言模型中编码信息是至关重要的,但它们的发展最近停滞不前,而且存在固有的弱点。主要限制包括计算开销、无效的词汇使用以及不必要的大嵌入层和头层。此外,它们的表现偏向于参考语料库,导致对代表性不足的语言的有效性降低。为了解决这些问题,我们提出了T-Free,它通过字符三元组上的稀疏激活模式直接嵌入单词,而不需要参考语料库。T-Free固有地利用了形态上的相似性,并允许对嵌入层进行强大的压缩。在我们详尽的实验评估中,我们在这些层上获得了具有竞争力的下行性能,参数减少了85%以上。此外,T-Free在跨语言迁移学习中显示出显著的进步。
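
The idea of embedding a word through a sparse activation pattern over character triplets can be sketched as follows: split the padded word into trigrams and hash each trigram to a few rows of a fixed-size embedding table. The padding character, lowercasing, number of hashes per trigram, and the use of SHA-256 are all assumptions made for illustration and may differ from the paper's actual scheme.

```python
import hashlib


def trigrams(word):
    """Character triplets of a word padded with boundary markers (assumed '_')."""
    padded = f"_{word}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def active_rows(word, table_size, hashes_per_trigram=2):
    """Sorted row indices of the embedding table activated by this word."""
    rows = set()
    for tri in trigrams(word.lower()):
        for k in range(hashes_per_trigram):
            digest = hashlib.sha256(f"{k}:{tri}".encode("utf-8")).digest()
            rows.add(int.from_bytes(digest[:8], "big") % table_size)
    return sorted(rows)


TABLE_SIZE = 8000  # hypothetical number of embedding rows
walk_rows = active_rows("walk", TABLE_SIZE)
walks_rows = active_rows("walks", TABLE_SIZE)
shared = set(walk_rows) & set(walks_rows)
print(trigrams("walk"), len(shared), "shared rows with 'walks'")
```

A word's embedding would then be the sum (or mean) of the activated rows. Morphologically related words such as "walk" and "walks" share trigrams and therefore rows, which is the similarity the abstract says T-FREE exploits, and no vocabulary learned from a reference corpus is needed.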

[NLP-25] SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
[NLP-25] SeaKR:自适应检索增强生成的自我意识知识检索

链接: https://arxiv.org/abs/2406.19215
作者: Zijun Yao,Weijian Qi,Liangming Pan,Shulin Cao,Linmei Hu,Weichuan Liu,Lei Hou,Juanzi Li
关键词: paper introduces Self-aware, introduces Self-aware Knowledge, self-aware uncertainty, extracts self-aware uncertainty, internal states
中文关键词: 论文介绍自我意识,介绍自我意识知识,自我意识的不确定性,提取自我意识的不确定性,内部状态
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts self-aware uncertainty of LLMs from their internal states. SeaKR activates retrieval when the LLMs present high self-aware uncertainty for generation. To effectively integrate retrieved knowledge snippets, SeaKR re-ranks them based on LLM’s self-aware uncertainty to preserve the snippet that reduces their uncertainty to the utmost. To facilitate solving complex tasks that require multiple retrievals, SeaKR utilizes their self-aware uncertainty to choose among different reasoning strategies. Our experiments on both complex and simple Question Answering datasets show that SeaKR outperforms existing adaptive RAG methods. We release our code at this https URL.
摘要:本文介绍了自我意识知识检索(SeaKR),这是一种新型的自适应RAG模型,可以从LLM的内部状态中提取其自我意识的不确定性。当LLM呈现出高度的自我意识不确定性时,SeaKR会激活检索。为了有效集成检索到的知识片段,SeaKR根据LLM的自我意识不确定性对它们进行重新排名,以保留最大限度地降低其不确定性的片段。为了促进解决需要多次检索的复杂任务,SeaKR利用其自我意识的不确定性来选择不同的推理策略。我们在复杂和简单问题解答数据集上的实验表明,SeaKR优于现有的自适应RAG方法。我们在此https URL上发布我们的代码。
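
The two decisions described in the abstract (retrieve only when uncertainty is high, then keep the snippet that lowers uncertainty the most) can be sketched as below. The `uncertainty` callable is a hypothetical stand-in for SeaKR's internal-state uncertainty estimate, and the threshold and toy values are invented for illustration.

```python
def adaptive_retrieve(question, uncertainty, retrieve, threshold):
    """Return the snippet to condition on, or None when retrieval is skipped.

    uncertainty(question, snippet) -> float, where snippet=None means the model
    answers from its own knowledge (hypothetical interface).
    """
    if uncertainty(question, None) <= threshold:
        return None  # model is confident enough without retrieval
    snippets = retrieve(question)
    if not snippets:
        return None
    # Re-rank: keep the snippet that reduces uncertainty the most.
    return min(snippets, key=lambda s: uncertainty(question, s))


# Toy stand-ins so the sketch runs end to end (not real model outputs).
TOY_UNCERTAINTY = {
    ("capital of France?", None): 0.1,
    ("who won in 1987?", None): 0.9,
    ("who won in 1987?", "snippet A"): 0.6,
    ("who won in 1987?", "snippet B"): 0.2,
}


def toy_uncertainty(question, snippet):
    return TOY_UNCERTAINTY[(question, snippet)]


def toy_retrieve(question):
    return ["snippet A", "snippet B"]


easy = adaptive_retrieve("capital of France?", toy_uncertainty, toy_retrieve, 0.5)
hard = adaptive_retrieve("who won in 1987?", toy_uncertainty, toy_retrieve, 0.5)
print(easy, hard)
```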

[NLP-26] Annotation Errors and NER: A Study with OntoNotes 5.0
[NLP-26] 注释错误和NER:使用OntoNotes 5.0的研究

链接: https://arxiv.org/abs/2406.19172
作者: Gabriel Bernier-Colborne,Sowmya Vajjala
关键词: Named Entity Recognition, problem in NLP, Named Entity, Entity Recognition, NER
中文关键词: 命名实体识别,NLP中的问题,命名实体,实体识别,NER
类目: Computation and Language (cs.CL)
备注: Unpublished report. Originally submitted to LREC 2022

Abstract:Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (10%) improvements for some of the entity types. While our annotation error detection methods are not exhaustive and there is some manual annotation effort involved, they are largely language agnostic and can be employed with other NER datasets, and other sequence labelling tasks.
摘要:命名实体识别(NER)是自然语言处理领域研究较多的问题。然而,与开发新的NER模型相比,对NER数据集的研究要少得多。在本文中,我们使用了三种简单的方法来检测OntoNotes5.0英语语料库中的标注错误,该语料库是目前可用的最大的英语语料库。我们的技术纠正了训练/开发/测试数据中约10%的句子。在实体提及方面,我们更正了数据集中约8%的提及的范围和/或类型,同时添加/删除/拆分/合并了更多。考虑到OntoNotes的大小,这些都是大量的更改。我们使用三个NER库来训练、评估和比较用原始和重新标注的数据集训练的模型,结果显示总体F分数平均提高了1.23%,其中一些实体类型的改进幅度很大(10%)。虽然我们的标注错误检测方法不是穷尽的,并且需要一些人工标注工作,但它们在很大程度上是语言不可知的,可以用于其他NER数据集和其他序列标注任务。

[NLP-27] The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems
[NLP-27] 能力错觉:评估解释对视觉问答系统用户心理模型的影响

链接: https://arxiv.org/abs/2406.19170
作者: Judith Sieker,Simeon Junker,Ronja Utescher,Nazia Attari,Heiko Wersing,Hendrik Buschmeier,Sina Zarrieß
关键词: providing explanations alongside, answers aids users, perform perfectly, mental model, system
中文关键词: 同时提供解释、答案帮助用户、完美表现、心理模型、系统
类目: Computation and Language (cs.CL)
备注: 16 pages (including Appendix); under review

Abstract:We examine how users perceive the limitations of an AI system when it encounters a task that it cannot perform perfectly and whether providing explanations alongside its answers aids users in constructing an appropriate mental model of the system’s capabilities and limitations. We employ a visual question answer and explanation task where we control the AI system’s limitations by manipulating the visual inputs: during inference, the system either processes full-color or grayscale images. Our goal is to determine whether participants can perceive the limitations of the system. We hypothesize that explanations will make limited AI capabilities more transparent to users. However, our results show that explanations do not have this effect. Instead of allowing users to more accurately assess the limitations of the AI system, explanations generally increase users’ perceptions of the system’s competence - regardless of its actual performance.
摘要:我们研究了当人工智能系统遇到无法完美执行的任务时,用户如何感知人工智能系统的局限性,以及在其答案的同时提供解释是否有助于用户构建系统能力和局限性的适当心理模型。我们采用视觉问题回答和解释任务,通过操纵视觉输入来控制人工智能系统的局限性:在推理过程中,系统要么处理全彩色图像,要么处理灰度图像。我们的目标是确定参与者是否能够感知到系统的局限性。我们假设解释将使有限的人工智能能力对用户更加透明。然而,我们的结果表明解释并没有这种效果。解释通常不会让用户更准确地评估人工智能系统的局限性,而是会增加用户对系统能力的看法–无论其实际性能如何。

[NLP-28] Resolving Discrepancies in Compute-Optimal Scaling of Language Models
[NLP-28] 解决语言模型的计算最优缩放中的差异

链接: https://arxiv.org/abs/2406.19146
作者: Tomer Porian,Mitchell Wortsman,Jenia Jitsev,Ludwig Schmidt,Yair Carmon
关键词: laws yield substantially, developed influential scaling, Kaplan scaling law, influential scaling laws, compute budget
中文关键词: 定律产生了大量的影响力,发展了有影响力的缩放定律,卡普兰缩放定律,有影响力的缩放定律,计算预算
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., “Chinchilla”) scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW β₂ parameter is essential at lower batch sizes.
摘要:Kaplan等人和Hoffmann等人针对最优模型大小随计算预算的变化提出了有影响力的缩放定律,但这些定律产生了截然不同的预测。我们通过在两个数据集(OpenWebText2和RefinedWeb)上复现Kaplan缩放定律并确定导致差异的三个因素来解释这种差异:最后一层计算成本、预热持续时间和依赖规模的优化器调优。纠正这些因素后,我们得到了与Hoffmann等人(即“Chinchilla”)缩放定律高度一致的结果。与Hoffmann等人的假设相反,我们发现,仔细的学习率衰减对于其缩放定律的有效性并不是必需的。作为次要结果,我们推导出最佳学习率和批量大小的缩放定律,发现在较小批量大小下调整AdamW的β₂参数至关重要。
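
To illustrate what "optimal model size as a function of the compute budget" means in practice, here is a back-of-the-envelope sketch. It relies on two widely used approximations that are not stated in this abstract and are labeled as assumptions: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style rule of thumb of roughly 20 tokens per parameter. The paper's fitted laws are not reproduced here.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20      # assumption: Chinchilla-style rule of thumb


def compute_optimal_split(compute_flops):
    """Return (params N, tokens D) satisfying 6*N*D = C and D = 20*N."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


n, d = compute_optimal_split(5.76e23)
print(f"params ≈ {n:.3g}, tokens ≈ {d:.3g}")
```

Under these assumptions both N and D grow as the square root of C, and a budget of 5.76e23 FLOPs lands near 70 billion parameters and 1.4 trillion tokens. A Kaplan-style law allocates a larger share of extra compute to model size, which is the discrepancy the paper sets out to explain.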

[NLP-29] CHEW: A Dataset of CHanging Events in Wikipedia
[NLP-29] CHEW:维基百科中变化事件的数据集

链接: https://arxiv.org/abs/2406.19116
作者: Hsuvas Borkakoty,Luis Espinosa-Anke
关键词: naturally occurring text, occurring text, introduce CHEW, Wikipedia expressed, dataset of changing
中文关键词: 自然发生的文本,发生的文本,介绍CHEW,维基百科表达,变化的数据集
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Short Paper

Abstract:We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their timeline understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.
摘要:我们介绍CHEW,这是维基百科中变化事件的一个新颖数据集,以自然发生的文本表达。我们使用CHEW来探索LLM在生成和分类实验中对维基百科实体和事件的时间轴理解。我们的结果表明,尽管LLM有可用的时间信息,但很难构建准确的时间线。我们进一步展示了CHEW衍生的嵌入对于识别意义转变的有用性。

[NLP-30] Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
[NLP-30] Statements:利用大型语言模型从表格中进行通用信息抽取(面向ESG KPI)

链接: https://arxiv.org/abs/2406.19102
作者: Lokesh Mishra,Sohayl Dhibi,Yusik Kim,Cesar Berrospi Ramis,Shubham Gupta,Michele Dolfi,Peter Staar
关键词: greenhouse gas emissions, water consumption, waste management, KPIs assess, climate change
中文关键词: 温室气体排放、水消耗、废物管理、KPI评估、气候变化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

Abstract:Environment, Social, and Governance (ESG) KPIs assess an organization’s performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.
摘要:环境、社会和治理(ESG)关键绩效指标评估一个组织在气候变化、温室气体排放、用水量、废物管理、人权、多样性和政策等问题上的表现。ESG报告通过表格传达这一有价值的定量信息。不幸的是,由于表结构和内容的高度可变性,提取这些信息很困难。我们提出了一种新的领域不可知的数据结构–语句,用于提取量化事实和相关信息。我们建议将表到语句的转换作为一种新的有监督的深度学习通用信息提取任务。我们介绍了SemTabNet-一个包含超过10万个注释表的数据集。研究了一系列基于T5的语句提取模型,我们的最佳模型生成的语句与基本事实相似82%(而基线为21%)。我们通过将我们的模型应用于ESG报告中的2700多个表来演示语句的优势。报表的同质性允许对大量ESG报告集合中的大量信息进行探索性数据分析。

[NLP-31] Fairness and Bias in Multimodal AI: A Survey
[NLP-31] 多模式人工智能的公平性和偏见:一项调查

链接: https://arxiv.org/abs/2406.19097
作者: Tosin Adewumi,Lama Alkhaled,Namrata Gurung,Goya van Boven,Irene Pagliai
关键词: Large Language Models, Large Multimodal Models, fairness and bias, bias in Large, Large Language
中文关键词: 大型语言模型,大型多模式模型,公平和偏见,大型语言中的偏见
类目: Computation and Language (cs.CL)
备注: 8 pages

Abstract:The importance of addressing fairness and bias in artificial intelligence (AI) systems cannot be over-emphasized. Mainstream media has been awash with news of incidents around stereotypes and bias in many of these systems in recent years. In this survey, we fill a gap with regard to the minimal study of fairness and bias in Large Multimodal Models (LMMs) compared to Large Language Models (LLMs), providing 50 examples of datasets and models along with the challenges affecting them; we identify a new category of quantifying bias (preuse), in addition to the two well-known ones in the literature: intrinsic and extrinsic; we critically discuss the various ways researchers are addressing these challenges. Our method involved two slightly different search queries on Google Scholar, which revealed that 33,400 and 538,000 links are the results for the terms “Fairness and bias in Large Multimodal Models” and “Fairness and bias in Large Language Models”, respectively. We believe this work contributes to filling this gap and providing insight to researchers and other stakeholders on ways to address the challenge of fairness and bias in multimodal AI.
摘要:在人工智能(AI)系统中解决公平和偏见的重要性怎么强调都不为过。近年来,主流媒体上充斥着关于其中许多系统中刻板印象和偏见事件的新闻。在这项综述中,我们填补了一个空白:与大型语言模型(LLM)相比,针对大型多模态模型(LMM)中公平性和偏见的研究极少;我们提供了50个数据集和模型的例子以及影响它们的挑战;除了文献中众所周知的内在和外在两个类别之外,我们还确定了一种新的偏见量化类别(preuse);我们批判性地讨论了研究人员解决这些挑战的各种方法。我们的方法涉及谷歌学术上的两个略有不同的搜索查询,结果显示,“Fairness and bias in Large Multimodal Models”和“Fairness and bias in Large Language Models”这两个检索词分别得到33,400个和538,000个链接。我们相信,这项工作有助于填补这一空白,并为研究人员和其他利益相关者提供洞察,了解如何应对多模态人工智能中公平和偏见的挑战。

[NLP-32] AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries
[NLP-32] AMBROSIA:将歧义问题解析为数据库查询的基准

链接: https://arxiv.org/abs/2406.19073
作者: Irina Saparina,Mirella Lapata
关键词: Practical semantic parsers, understand user utterances, Practical semantic, executable programs, expected to understand
中文关键词: 实用的语义解析器,理解用户话语,实用的语义、可执行程序,期望理解
类目: Computation and Language (cs.CL)
备注:

Abstract:Practical semantic parsers are expected to understand user utterances and map them to executable programs, even when these are ambiguous. We introduce a new benchmark, AMBROSIA, which we hope will inform and inspire the development of text-to-SQL parsers capable of recognizing and interpreting ambiguous requests. Our dataset contains questions showcasing three different types of ambiguity (scope ambiguity, attachment ambiguity, and vagueness), their interpretations, and corresponding SQL queries. In each case, the ambiguity persists even when the database context is provided. This is achieved through a novel approach that involves controlled generation of databases from scratch. We benchmark various LLMs on AMBROSIA, revealing that even the most advanced models struggle to identify and interpret ambiguity in questions.
摘要:实用的语义解析器需要理解用户的话语并将其映射到可执行程序,即使这些话语存在歧义。我们引入了一个新的基准测试AMBROSIA,我们希望它能够为能够识别和解释歧义请求的文本到SQL解析器的开发提供信息和启发。我们的数据集包含展示三种不同类型歧义(范围歧义、附着歧义和模糊性)的问题、它们的解释以及相应的SQL查询。在每种情况下,即使提供了数据库上下文,歧义也会持续存在。这是通过一种新颖的方法实现的,该方法从零开始受控地生成数据库。我们在AMBROSIA上对各种LLM进行基准测试,发现即使是最先进的模型也很难识别和解释问题中的歧义。

[NLP-33] EmPO: Theory-Driven Dataset Construction for Empathetic Response Generation through Preference Optimization
[NLP-33] EmPO:理论驱动的数据集构建,通过偏好优化生成同理心反应

链接: https://arxiv.org/abs/2406.19071
作者: Ondrej Sotolar
关键词: emotionally intelligent multi-turn, intelligent multi-turn conversations, Empathetic response generation, conversational agents, crucial for facilitating
中文关键词: 情商高的多回合、智能的多回合对话、同理心的响应生成、对话代理,对于促进至关重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: v01, 4 pages short paper, ACL style

Abstract:Empathetic response generation is a desirable aspect of conversational agents, crucial for facilitating engaging and emotionally intelligent multi-turn conversations between humans and machines. Leveraging large language models for this task has shown promising results, yet challenges persist in ensuring both the empathetic quality of the responses and retention of the generalization performance of the models. In this paper, we propose a novel approach where we construct theory-driven preference datasets and use them to align LLMs with preference optimization algorithms to address these challenges. To measure empathetic response generation, we employ the EmpatheticDialogues dataset, assessing empathy with the diff-EPITOME and BERTscore metrics, and evaluate the generalization performance on the MMLU benchmark. We make all datasets, source code, and models publicly available.
摘要:同理心响应生成是对话代理的一个理想方面,对于促进人类和机器之间引人入胜且具有情感智能的多回合对话至关重要。利用大型语言模型来完成这项任务已经显示出有希望的结果,但在确保响应的同理心质量和保持模型的概括性能方面仍然存在挑战。在本文中,我们提出了一种新颖的方法,构建理论驱动的偏好数据集,并使用它们将LLM与偏好优化算法对齐,以应对这些挑战。为了衡量同理心反应的生成,我们使用EmpatheticDialogues数据集,通过diff-EPITOME和BERTscore指标评估同理心,并评估MMLU基准的概括性能。我们公开所有数据集、源代码和模型。

[NLP-34] STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis
[NLP-34] STBench:评估大型语言模型在时空分析中的能力

链接: https://arxiv.org/abs/2406.19065
作者: Wenbin Li,Di Yao,Ruibo Zhao,Wenjie Chen,Zijie Xu,Chengxue Luo,Chang Gong,Quanliang Jing,Haining Tan,Jingping Bi
关键词: spatio-temporal data mining, holds promise, large language models, rapid evolution, evolution of large
中文关键词: 时空数据挖掘,有希望,大型语言模型,快速进化,大型进化
类目: Computation and Language (cs.CL)
备注:

Abstract:The rapid evolution of large language models (LLMs) holds promise for reforming the methodology of spatio-temporal data mining. However, current works for evaluating the spatio-temporal understanding capability of LLMs are somewhat limited and biased. These works either fail to incorporate the latest language models or only focus on assessing the memorized spatio-temporal knowledge. To address this gap, this paper dissects LLMs’ capability of spatio-temporal data into four distinct dimensions: knowledge comprehension, spatio-temporal reasoning, accurate computation, and downstream applications. We curate several natural language question-answer tasks for each category and build the benchmark dataset, namely STBench, containing 13 distinct tasks and over 60,000 QA pairs. Moreover, we have assessed the capabilities of 13 LLMs, such as GPT-4o, Gemma and Mistral. Experimental results reveal that existing LLMs show remarkable performance on knowledge comprehension and spatio-temporal reasoning tasks, with potential for further enhancement on other tasks through in-context learning, chain-of-thought prompting, and fine-tuning. The code and datasets of STBench are released on this https URL.
摘要:大型语言模型(LLM)的快速发展为时空数据挖掘的方法论改革带来了希望。然而,目前对LLM时空理解能力的评价工作存在一定的局限性和偏颇。这些工作要么没有纳入最新的语言模型,要么只专注于评估记忆的时空知识。为了弥补这一差距,本文将LLM的时空数据处理能力分解为四个不同的维度:知识理解、时空推理、精确计算和下游应用。我们为每个类别挑选了几个自然语言问答任务,并构建了基准数据集STBench,包含13个不同的任务和超过60,000个QA对。此外,我们还评估了GPT-4o、Gemma和Mistral等13个LLM的能力。实验结果表明,现有的LLM在知识理解和时空推理任务上表现出了显著的性能,并有可能通过上下文学习、思维链提示和微调来进一步提高其他任务的性能。STBench的代码和数据集已在此https URL上发布。

[NLP-35] Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
[NLP-35] 通过可靠性感知对齐改进弱到强泛化

链接: https://arxiv.org/abs/2406.19032
作者: Yue Guo,Yi Yang
关键词: natural language tasks, Large language models, Large language, surpassing human abilities, language tasks
中文关键词: 自然语言任务、大型语言模型、大型语言、超越人类能力、语言任务
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) are now rapidly advancing and surpassing human abilities on many natural language tasks. However, aligning these super-human LLMs with human knowledge remains challenging because the supervision signals from human annotators may be wrong. This issue, known as the “super-alignment” problem, requires enhancing weak-to-strong generalization, where a strong LLM must generalize from imperfect supervision provided by a weaker source. To address this issue, we propose an approach to improve weak-to-strong generalization by involving the reliability of weak supervision signals in the alignment process. In our method, we query the weak supervisor for multiple answers, estimate the answer reliability, and enhance the alignment process by filtering out uncertain data or re-weighting reliable data. Experiments on four datasets demonstrate that our methods effectively identify the quality of weak labels and significantly enhance weak-to-strong generalization. Our work presents effective techniques for error-robust model alignment, reducing error propagation from noisy supervision and enhancing the accuracy and reliability of LLMs. Codes are publicly available at this http URL.
摘要:大语言模型(LLM)在许多自然语言任务上正在迅速发展并超越人类的能力。然而,将这些超人类的LLM与人类的知识对齐仍然具有挑战性,因为来自人类标注者的监督信号可能是错误的。这个问题被称为“超级对齐”问题,需要加强从弱到强的泛化,其中强大的LLM必须从较弱来源提供的不完美监督中进行泛化。为了解决这个问题,我们提出了一种方法,通过在对齐过程中引入弱监督信号的可靠性来提高从弱到强的泛化能力。在我们的方法中,我们向弱监督者查询多个答案,估计答案的可靠性,并通过过滤不确定数据或重新加权可靠数据来增强对齐过程。在四个数据集上的实验表明,我们的方法有效地识别了弱标签的质量,并显著地提高了从弱到强的泛化能力。我们的工作提出了有效的抗错误模型对齐技术,减少了噪声监督带来的误差传播,提高了LLM的准确性和可靠性。代码在此http URL上公开提供。
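
The query, estimate, then filter-or-re-weight loop described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how reliability is estimated, so the agreement rate among repeated weak answers is used as a stand-in, and the names `estimate_reliability`, `build_training_set`, and the `threshold` value are hypothetical. The sketch also applies filtering and weighting together, whereas the abstract presents them as alternatives.

```python
from collections import Counter


def estimate_reliability(answers):
    """Return (majority_answer, agreement_rate) for repeated weak-supervisor answers."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def build_training_set(weak_answers, threshold=0.6):
    """Drop low-agreement questions and keep the agreement rate as a weight.

    weak_answers maps a question id to the answers sampled from the weak
    supervisor. The threshold is an illustrative assumption.
    """
    kept = []
    for question, answers in weak_answers.items():
        label, agreement = estimate_reliability(answers)
        if agreement >= threshold:  # filtering step: discard uncertain labels
            kept.append((question, label, agreement))  # agreement doubles as the weight
    return kept


# Hypothetical weak-supervisor outputs: four sampled answers per question.
weak_answers = {
    "q1": ["A", "A", "A", "B"],
    "q2": ["A", "B", "C", "D"],
    "q3": ["C", "C", "C", "C"],
}
training_set = build_training_set(weak_answers)
print(training_set)  # [('q1', 'A', 0.75), ('q3', 'C', 1.0)]
```

In a real pipeline the third element of each tuple would scale that example's loss when fine-tuning the strong model.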

[NLP-36] RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
[NLP-36] RoboUniView:具有统一视图表示的机器人操纵的视觉语言模型

链接: https://arxiv.org/abs/2406.18977
作者: Fanfan Liu,Feng Yan,Liming Zheng,Chengjian Feng,Yiyang Huang,Lin Ma
关键词: Utilizing Vision-Language Models, Utilizing Vision-Language, robotic manipulation represents, unified view representation, aiming to enhance
中文关键词: 利用视觉语言模型,利用视觉语言,机器人操纵表示,统一视图表示,旨在增强
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model’s ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform’s camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the D→D setting from 88.7% to 96.2%, and in the ABC→D setting from 82.4% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation. this https URL
摘要:利用视觉语言模型进行机器人操作是一种新的范式,旨在增强模型对新对象和新指令的泛化能力。然而,由于摄像机规格和安装位置的不同,现有方法在不同的机器人平台上表现出显著的性能差异。为了应对这一挑战,我们在本文中提出了RoboUniView,这是一种将视觉特征提取与动作学习解耦的创新方法。我们首先通过对容易访问的数据进行预训练,从多个透视图中学习统一的视图表示,然后从该统一的视图表示中派生动作来控制机器人操作。这种统一的视图表示更准确地反映了物理世界,并且不受机器人平台的摄像头参数的限制。得益于这种方法,我们在要求苛刻的CALVIN基准测试中实现了最先进的性能,将D→D设置的成功率从88.7%提高到96.2%,将ABC→D设置的成功率从82.4%提高到94.2%。此外,我们的模型表现出出色的适应性和灵活性:它在看不见的摄像机参数下保持高性能,可以利用具有不同摄像机参数的多个数据集,并且能够跨数据集联合跨任务学习。提供代码以供重新实现。此HTTPS URL

[NLP-37] UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
[NLP-37] UniGen:使用大型语言模型生成文本数据集的统一框架

链接: https://arxiv.org/abs/2406.18966
作者: Siyuan Wu,Yue Huang,Chujie Gao,Dongping Chen,Qihui Zhang,Yao Wan,Tianyi Zhou,Xiangliang Zhang,Jianfeng Gao,Chaowei Xiao,Lichao Sun
关键词: Large Language Models, Large Language, Language Models, expensive human-generated datasets, high-quality synthetic data
中文关键词: 大型语言模型、大型语言、语言模型、昂贵的人类生成数据集、高质量合成数据
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
摘要:大型语言模型,如GPT-4和Llama3,通过实现高质量的合成数据生成和减少对昂贵的人工生成数据集的依赖,对各个领域产生了重大影响。尽管如此,在现有的生成框架内,在概括性、可控性、多样性和真实性方面仍然存在挑战。为了应对这些挑战,本文提出了UniGen,这是一个由LLM支持的全面框架,旨在生成多样化、准确和高度可控的数据集。UniGen具有适应性,支持所有类型的文本数据集,并通过创新机制增强生成过程。为了增加数据多样性,UniGen加入了一个属性引导生成模块和一个组检查功能。为了准确,它采用了基于代码的数学评估来进行标签验证,同时使用了检索增强的生成技术来进行事实验证。该框架还允许用户指定的约束,允许定制数据生成过程以满足特定要求。广泛的实验表明,UniGen生成的数据具有卓越的质量,UniGen中的每个模块在这一增强中发挥着关键作用。此外,UniGen还应用于两个实际场景:基准LLMS和数据增强。结果表明,UniGen有效地支持动态和演化的基准测试,数据增强提高了LLM在各个领域的能力,包括面向主体的能力和推理能力。

[NLP-38] The single-use restriction for register automata and transducers over infinite alphabets
[NLP-38] 无限字母表上寄存器自动机和转换器的一次性使用限制

链接: https://arxiv.org/abs/2406.18934
作者: Rafał Stefański
关键词: register automata, single-use, automata, register, single-use Mealy machines
中文关键词: 寄存器自动机,一次性使用,自动机,寄存器,一次性使用Mealy机
类目: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)
备注: PhD Thesis at University of Warsaw. Supervisor: Mikołaj Bojańczyk

Abstract:This thesis studies the single-use restriction for register automata and transducers over infinite alphabets. The restriction requires that a read-access to a register should have the side effect of destroying its contents. This constraint results in robust classes of languages and transductions. For automata models, we show that one-way register automata, two-way register automata, and orbit-finite monoids have the same expressive power. For transducer models, we show that single-use Mealy machines and single-use two-way transducers admit versions of the Krohn-Rhodes decomposition theorem. Moreover, single-use Mealy machines are equivalent to an algebraic model called local algebraic semigroup transductions. Additionally, we show that single-use two-way transducers are equivalent to single-use streaming string transducers (SSTs) over infinite alphabets and to regular list functions with atoms. Compared with the previous work arXiv:1907.10504, this thesis offers a coherent narrative on the single-use restriction. We introduce an abstract notion of single-use functions and use them to define all the discussed single-use models. We also introduce and study the algebraic models of local semigroup transduction and local rational semigroup transduction.
摘要:本论文研究了无限字母表上寄存器自动机和转换器的一次性使用限制。该限制要求对寄存器的读访问应该具有销毁其内容的副作用。这种约束产生了健壮的语言类和转换类。对于自动机模型,我们证明了单向寄存器自动机、双向寄存器自动机和轨道有限幺半群具有相同的表达能力。对于转换器模型,我们证明了一次性使用的Mealy机和一次性使用的双向转换器满足相应版本的Krohn-Rhodes分解定理。此外,一次性使用的Mealy机等价于一个称为局部代数半群转换的代数模型。此外,我们还证明了一次性使用的双向转换器等价于无限字母表上的一次性使用流式字符串转换器(SST),并且等价于带原子的正则列表函数。与先前的工作arXiv:1907.10504相比,本论文对一次性使用限制进行了连贯的叙述。我们引入了一次性使用函数的抽象概念,并使用它们来定义所有讨论的一次性使用模型。我们还介绍并研究了局部半群转换和局部有理半群转换的代数模型。
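
The rule that a read must destroy the register's contents can be made concrete with a toy class. This is only a sketch of the read-destroys-contents semantics stated in the abstract, not the thesis's formal definitions: `SingleUseRegister` and `first_equals_last` are hypothetical names introduced here for illustration.

```python
class SingleUseRegister:
    """A register whose contents are destroyed as a side effect of reading."""

    def __init__(self):
        self._value = None
        self._full = False

    def write(self, value):
        self._value = value
        self._full = True

    def read(self):
        if not self._full:
            raise ValueError("register is empty: its value was already consumed")
        value = self._value
        self._value, self._full = None, False  # the read destroys the contents
        return value

    def is_empty(self):
        return not self._full


def first_equals_last(word):
    """Toy check that needs only one read: is the first letter equal to the last?"""
    if not word:
        return False
    register = SingleUseRegister()
    register.write(word[0])             # remember the first letter
    return register.read() == word[-1]  # single comparison consumes the register


print(first_equals_last("abca"))  # True
print(first_equals_last("abcb"))  # False
```

A second `read()` on the same register raises `ValueError`, which is the point of the restriction: a stored value can be compared against the input only once unless it is written again.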

[NLP-39] Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network
[NLP-39] 利用前端自适应网络增强ASR对数据包丢失的鲁棒性

链接: https://arxiv.org/abs/2406.18928
作者: Yehoshua Dissen,Shiry Yonash,Israel Cohen,Joseph Keshet
关键词: automatic speech recognition, ASR models, ASR, speech recognition, significant challenge
中文关键词: 自动语音识别,ASR模型,ASR,语音识别,重大挑战
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted for publication at INTERSPEECH 2024

Abstract:In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper’s criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal effect on the Whisper model’s foundational performance, underscoring our method’s practicality and potential in enhancing ASR models in challenging acoustic environments.
摘要:在自动语音识别(ASR)领域,在嘈杂环境中的鲁棒性仍然是一个重大挑战。最近的ASR模型(例如Whisper)已经显示出了希望,但它们在噪音条件下的功效可以进一步增强。本研究的重点是从数据包丢失中恢复,以改善ASR模型的词错误率(WER)。我们建议使用连接到冻结的ASR模型的前端自适应网络。除了增强损失函数之外,自适应网络还被训练为通过最小化ASR模型的标准来修改损坏的输入频谱。我们的实验表明,根据Whisper标准训练的自适应网络在数据包丢失情况下显著降低了跨领域和跨语言的词错误率。实现这一改进对Whisper模型的基本性能影响最小,强调了我们的方法在具有挑战性的声学环境中增强ASR模型的实用性和潜力。
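
As a hedged reading of the training objective described in this abstract, the adaptation network is optimized on the frozen ASR model's own criterion plus an enhancement term. The symbols below are assumptions introduced for illustration, not notation from the paper: $g_\theta$ is the adaptation network, $f_{\mathrm{ASR}}$ the frozen ASR model, $\tilde{x}$ the packet-loss-corrupted spectrum, $x$ the clean spectrum, $y$ the reference transcript, and $\lambda$ a weighting factor that the abstract does not mention.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{ASR}}\big(f_{\mathrm{ASR}}(g_\theta(\tilde{x})),\, y\big)
  + \lambda\, \mathcal{L}_{\mathrm{enh}}\big(g_\theta(\tilde{x}),\, x\big)
```

Only $\theta$ is updated. Because $f_{\mathrm{ASR}}$ stays frozen, its behavior on clean input is unchanged, which is consistent with the abstract's claim of minimal effect on Whisper's foundational performance.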

[NLP-40] Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
[NLP-40] 选择性视觉是视觉推理的挑战:视觉论证理解的基准

链接: https://arxiv.org/abs/2406.18925
作者: Jiwan Chung,Sungjae Lee,Minseo Kim,Seungju Han,Ashkan Yousefpour,Jack Hessel,Youngjae Yu
关键词: Visual, Visual arguments, advertising or social, persuade viewers, arguments
中文关键词: 视觉,视觉论点,广告或社交,说服观众,论点
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

Abstract:Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today’s AI capable of similar understanding? We collect and release VisArgs, an annotated corpus designed to make explicit the (usually implicit) structures underlying visual arguments. VisArgs includes 1,611 images accompanied by three types of textual annotations: 5,112 visual premises (with region annotations), 5,574 commonsense premises, and reasoning trees connecting them to a broader argument. We propose three tasks over VisArgs to probe machine capacity for visual argument understanding: localization of premises, identification of premises, and deduction of conclusions. Experiments demonstrate that 1) machines cannot fully identify the relevant visual cues. The top-performing model, GPT-4-O, achieved an accuracy of only 78.5%, whereas humans reached 98.0%. All models showed a performance drop, with an average decrease in accuracy of 19.5%, when the comparison set was changed from objects outside the image to irrelevant objects within the image. Furthermore, 2) this limitation is the greatest factor impacting their performance in understanding visual arguments. Most models improved the most when given relevant visual premises as additional inputs, compared to other inputs, for deducing the conclusion of the visual argument.
摘要:视觉论证经常被用于广告或社会事业中,它依靠图像来说服观众去做或相信某事。理解这些论证需要选择性的视觉:只有图像中的特定视觉刺激与论证相关,而且相关性只能在更广泛的论证结构的背景下才能理解。虽然视觉论证很容易被人类观众理解,但我们问:今天的人工智能有能力进行类似的理解吗?我们收集并发布VisArgs,这是一个带注释的语料库,旨在使视觉论证背后(通常是隐含的)结构显式化。VisArgs包括1,611幅图像,并带有三种类型的文本注释:5,112个视觉前提(带有区域注释)、5,574个常识前提,以及将它们与更广泛的论证联系起来的推理树。我们在VisArgs上提出了三项任务来探索机器对视觉论证的理解能力:前提的定位、前提的识别和结论的推导。实验表明,1)机器不能完全识别相关的视觉线索。性能最好的模型GPT-4-O的准确率只有78.5%,而人类达到了98.0%。当比较集从图像外的对象改变为图像内不相关的对象时,所有模型都表现出性能下降,准确率平均下降19.5%。此外,2)这一局限性是影响它们理解视觉论证的最大因素。在推断视觉论证的结论时,与其他输入相比,大多数模型在获得相关视觉前提作为额外输入时改进最大。

[NLP-41] Capturing Minds Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data
[NLP-41] 捕捉思想而不仅仅是言语:用个性指示数据增强角色扮演语言模型

链接: https://arxiv.org/abs/2406.18921
作者: Yiting Ran,Xintao Wang,Rui Xu,Xinfeng Yuan,Jiaqing Liang,Yanghua Xiao,Deqing Yang
关键词: large language models, role-playing language models, attracting significant interest, language models, popular application area
中文关键词: 大型语言模型、角色扮演语言模型、吸引人们的兴趣、语言模型、流行的应用领域
类目: Computation and Language (cs.CL)
备注: 10 pages

Abstract:Role-playing agents (RPA) have been a popular application area for large language models (LLMs), attracting significant interest from both industry and academia. While existing RPAs well portray the characters’ knowledge and tones, they face challenges in capturing their minds, especially for small role-playing language models (RPLMs). In this paper, we propose to enhance RPLMs via personality-indicative data. Specifically, we leverage questions from psychological scales and distill advanced RPAs to generate dialogues that grasp the minds of characters. Experimental results validate that RPLMs trained with our dataset exhibit advanced role-playing capabilities for both general and personality-related evaluations. Code and data are available at this https URL.
摘要:角色扮演代理(RPA)一直是大型语言模型(LLM)的热门应用领域,引起了业界和学术界的极大兴趣。虽然现有的RPA很好地描绘了角色的知识和语气,但它们在捕捉他们的思想方面面临着挑战,尤其是对于小型角色扮演语言模型(RPLM)。在本文中,我们建议通过个性指示数据增强RPLM。具体来说,我们利用心理量表中的问题并提取高级RPA来生成抓住角色思想的对话。实验结果证实,使用我们的数据集训练的RPLM在一般评估和个性相关评估方面都表现出先进的角色扮演能力。代码和数据可在此https URL上获取。

[NLP-42] TrustUQA: A Trustful Framework for Unified Structured Data Question Answering
[NLP-42] TrustUQA:统一结构化数据问答的可信框架

链接: https://arxiv.org/abs/2406.18916
作者: Wen Zhang,Long Jin,Yushan Zhu,Jiaoyan Chen,Zhiwei Huang,Junjie Wang,Yin Hua,Lei Liang,Huajun Chen
关键词: Large Language Models, Natural language question, Natural language, Language Models, language question answering
中文关键词: 大型语言模型、自然语言问题、自然语言、语言模型、语言问答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Natural language question answering (QA) over structured data sources such as tables and knowledge graphs (KGs) has been widely investigated, for example with Large Language Models (LLMs). The main solutions include question-to-formal-query parsing and retrieval-based answer generation. However, current methods of the former often suffer from weak generalization, failing to deal with multiple sources simultaneously, while the latter is limited in trustfulness. In this paper, we propose UnifiedTQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated UnifiedTQA with 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods and in comparison with the baselines that are specific to a data type, it achieves state-of-the-art on 2 of them. Furthermore, we demonstrate the potential of our method for more general QA tasks, QA over mixed structured data and QA across structured data.
摘要:表格和知识图等结构化数据源上的自然语言问答已被广泛研究,例如使用大型语言模型(LLM)。主要的解决方案包括问题到形式查询的解析和基于检索的答案生成。然而,前者的现有方法往往泛化能力较弱,不能同时处理多个信源,而后者的可信性有限。在本文中,我们提出了UnifiedTQA,一个可以同时以统一的方式支持多种类型的结构化数据的可信的QA框架。为此,它采用了一种LLM友好的统一知识表示方法–条件图(CG),并使用LLM和基于演示的两级方法进行CG查询。为了增强功能,它还配备了动态演示检索。我们用涵盖3种结构化数据的5个基准对UnifiedTQA进行了评估。它比现有的两种统一结构化数据QA方法性能更好,与特定于数据类型的基线相比,它在其中两种方法上达到了最先进的水平。此外,我们还展示了我们的方法在更一般的QA任务、混合结构化数据上的QA和结构化数据上的QA上的潜力。

[NLP-43] Factor-Conditioned Speaking-Style Captioning
[NLP-43] 因素条件说话风格字幕

链接: https://arxiv.org/abs/2406.18910
作者: Atsushi Ando,Takafumi Moriya,Shota Horiguchi,Ryo Masumura
关键词: predicting speaking-style information, accurately predicting speaking-style, speaking-style information, speaking-style captioning method, paper presents
中文关键词: 预测说话风格信息,准确预测说话风格,说话风格信息,说话风格字幕方法,论文提出
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024

Abstract:This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based training, and with GtS, it generates more diverse captions while keeping style prediction performance.
摘要:本文提出了一种新颖的说话风格字幕方法,在准确预测说话风格信息的同时,生成多样化的描述。传统的学习标准直接使用原始字幕,这些字幕不仅包含说话风格因素词,还包含句法词,这干扰了说话风格信息的学习。为了解决这个问题,我们引入了因素条件字幕(FCC),它首先输出一个代表说话风格因素(如性别、音调等)的短语,然后生成一个字幕,以确保模型明确地学习说话风格因素。我们还提出了先贪婪后采样(GtS)的解码方法,它首先确定性地预测说话风格因素以保证语义的准确性,然后基于因素条件采样来生成字幕以确保多样性。实验表明,FCC优于基于原始字幕的训练,并且在保持风格预测性能的同时,使用GtS生成更多样化的字幕。
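
The two-phase greedy-then-sampling (GtS) idea in this abstract, a deterministic choice of style factors followed by sampled caption tokens conditioned on them, can be sketched with toy probability tables. Everything below is hypothetical scaffolding: the distributions, `toy_factor_dists`, and `toy_caption_dists` stand in for a real captioning model's outputs, and the paper's actual decoder is not described in the abstract.

```python
import random


def greedy_pick(dist):
    """Deterministically return the most probable token."""
    return max(dist, key=dist.get)


def sample_pick(dist, rng):
    """Sample one token in proportion to its probability."""
    tokens = list(dist)
    weights = [dist[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy_then_sample(factor_dists, caption_dists_for, rng):
    """Phase 1: greedy factors. Phase 2: sampled caption conditioned on them."""
    factors = tuple(greedy_pick(dist) for dist in factor_dists)
    caption = [sample_pick(dist, rng) for dist in caption_dists_for(factors)]
    return factors, caption


# Hypothetical model outputs for one utterance.
toy_factor_dists = [
    {"female": 0.7, "male": 0.3},
    {"high pitch": 0.6, "low pitch": 0.4},
]


def toy_caption_dists(factors):
    """Toy caption distributions that depend on the chosen factors."""
    gender, pitch = factors
    return [
        {"A": 0.5, "The": 0.5},
        {f"{gender} voice": 0.5, f"{gender} speaker": 0.5},
        {f"with {pitch}": 1.0},
    ]


for seed in range(3):
    factors, caption = greedy_then_sample(
        toy_factor_dists, toy_caption_dists, random.Random(seed)
    )
    print(factors, " ".join(caption))
```

Across seeds the factors stay fixed at `('female', 'high pitch')` while the caption wording varies, which mirrors the accuracy-plus-diversity trade-off the abstract describes.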

[NLP-44] Historia Magistra Vitae: Dynamic Topic Modeling of Roman Literature using Neural Embeddings
[NLP-44] Historia Magistra Vitae:使用神经嵌入的罗马文学动态主题建模

链接: https://arxiv.org/abs/2406.18907
作者: Michael Ginn,Mans Hulden
关键词: dynamic topic modeling, Dynamic topic, limited usefulness, difficult to configure, Dynamic topic models
中文关键词: 动态主题建模、动态主题、有用性有限、难以配置、动态主题模型
类目: Computation and Language (cs.CL)
备注: 6 pages, 2 figures

Abstract:Dynamic topic models have been proposed as a tool for historical analysis, but traditional approaches have had limited usefulness, being difficult to configure, interpret, and evaluate. In this work, we experiment with a recent approach for dynamic topic modeling using BERT embeddings. We compare topic models built using traditional statistical models (LDA and NMF) and the BERT-based model, modeling topics over the entire surviving corpus of Roman literature. We find that while quantitative metrics prefer statistical models, qualitative evaluation finds better insights from the neural model. Furthermore, the neural topic model is less sensitive to hyperparameter configuration and thus may make dynamic topic modeling more viable for historical researchers.
摘要:动态主题模型已被提出作为历史分析的工具,但传统方法的用途有限,难以配置、解释和评估。在这项工作中,我们尝试了一种使用BERT嵌入进行动态主题建模的最新方法。我们比较了使用传统统计模型(LDA和NMF)和基于BERT的模型构建的主题模型,对整个现存罗马文学文集的主题进行建模。我们发现,虽然量化指标更喜欢统计模型,但定性评估可以从神经模型中找到更好的见解。此外,神经主题模型对超参数配置不太敏感,因此可能使动态主题建模对于历史研究人员来说更可行。

[NLP-45] Sonnet or Not Bot? Poetry Evaluation for Large Models and Datasets
[NLP-45] 十四行诗还是不是机器人?大型模型和数据集的诗歌评估

链接: https://arxiv.org/abs/2406.18906
作者: Melanie Walsh,Anna Preus,Maria Antoniak
关键词: including highly specialized, Large language models, Large language, highly specialized, wide range
中文关键词: 包括高度专业化、大型语言模型、大型语言、高度专业化、广泛
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) can now generate and recognize text in a wide range of styles and genres, including highly specialized, creative genres like poetry. But what do LLMs really know about poetry? What can they know about poetry? We develop a task to evaluate how well LLMs recognize a specific aspect of poetry, poetic form, for more than 20 forms and formal elements in the English language. Poetic form captures many different poetic features, including rhyme scheme, meter, and word or line repetition. We use this task to reflect on LLMs’ current poetic capabilities, as well as the challenges and pitfalls of creating NLP benchmarks for poetry and for other creative tasks. In particular, we use this task to audit and reflect on the poems included in popular pretraining datasets. Our findings have implications for NLP researchers interested in model evaluation, digital humanities and cultural analytics scholars, and cultural heritage professionals.
摘要:大型语言模型(LLM)现在可以生成和识别各种风格和流派的文本,包括诗歌等高度专业化、创造性的流派。但LLM对诗歌真正了解多少?它们对诗歌能够了解多少?我们制定了一项任务,评估LLM对诗歌的一个特定方面(即诗歌形式)的识别能力,涵盖英语中20多种形式和形式要素。诗歌形式体现了许多不同的诗歌特征,包括押韵方案、格律以及词或行重复。我们利用这项任务来反思LLM当前的诗歌能力,以及为诗歌和其他创意任务创建NLP基准的挑战和陷阱。特别是,我们使用此任务来审核和反思流行预训练数据集中包含的诗歌。我们的发现对关注模型评估的NLP研究人员、数字人文和文化分析学者以及文化遗产专业人士具有启示意义。

[NLP-46] Can we teach language models to gloss endangered languages?
[NLP-46] 我们可以教语言模型为濒危语言做行间注释吗?

链接: https://arxiv.org/abs/2406.18895
作者: Michael Ginn,Mans Hulden,Alexis Palmer
关键词: Interlinear glossed text, language documentation projects, Interlinear glossed, glossed text, documentation projects
中文关键词: 行间注释文本、语言文档项目、行间注释、注释文本、文档项目
类目: Computation and Language (cs.CL)
备注:

Abstract:Interlinear glossed text (IGT) is a popular format in language documentation projects, where each morpheme is labeled with a descriptive annotation. Automating the creation of interlinear glossed text can be desirable to reduce annotator effort and maintain consistency across annotated corpora. Prior research has explored a number of statistical and neural methods for automatically producing IGT. As large language models (LLMs) have shown promising results across multilingual tasks, even for rare, endangered languages, it is natural to wonder whether they can be utilized for the task of generating IGT. We explore whether LLMs can be effective at the task of interlinear glossing with in-context learning, without any traditional training. We propose new approaches for selecting examples to provide in-context, observing that targeted selection can significantly improve performance. We find that LLM-based methods beat standard transformer baselines, despite requiring no training at all. These approaches still underperform state-of-the-art supervised systems for the task, but are highly practical for researchers outside of the NLP community, requiring minimal effort to use.
摘要:行间注释文本(IGT)是语言文档项目中的一种流行格式,其中每个语素都有一个描述性注释。自动创建行间注释文本可能是可取的,以减少注释员的工作量并保持带注释的语料库的一致性。先前的研究已经探索了一些统计和神经方法来自动生成IGT。由于大型语言模型(LLM)在多语言任务中显示了良好的结果,即使是对于稀有的濒危语言,人们自然会想知道它们是否可以用于生成IGT的任务。我们探索了LLM是否可以在没有任何传统训练的情况下,通过上下文学习有效地完成行间注释任务。我们提出了选择上下文示例的新方法,观察到有针对性的选择可以显著提高性能。我们发现,基于LLM的方法优于标准Transformer基线,尽管根本不需要训练。这些方法在这项任务中的表现仍然不如最先进的监督系统,但对于NLP社区以外的研究人员来说非常实用,只需要最少的努力就可以使用。
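
The abstract reports that targeted selection of in-context examples helps, without saying which selection criteria were used. As one plausible illustration only, the sketch below ranks a pool of glossed sentences by word-type overlap with the target sentence. The overlap criterion, the function names, and the placeholder tokens (`w1`, `g1`, and so on) are all assumptions and do not come from the paper.

```python
def overlap(target, candidate):
    """Count word types shared by two whitespace-tokenized sentences."""
    return len(set(target.split()) & set(candidate.split()))


def pick_examples(target, pool, k):
    """Return the k pool entries whose text shares the most words with target."""
    ranked = sorted(pool, key=lambda ex: overlap(target, ex["text"]), reverse=True)
    return ranked[:k]


def build_prompt(target, examples):
    """Format the selected examples and the target as a few-shot prompt."""
    shots = [f"Text: {ex['text']}\nGloss: {ex['gloss']}" for ex in examples]
    return "\n\n".join(shots + [f"Text: {target}\nGloss:"])


# Placeholder tokens stand in for real words and glosses.
pool = [
    {"text": "w1 w9", "gloss": "g1 g9"},
    {"text": "w2 w3 w7", "gloss": "g2 g3 g7"},
    {"text": "w8", "gloss": "g8"},
]
chosen = pick_examples("w1 w2 w3", pool, k=2)
print(build_prompt("w1 w2 w3", chosen))
```

With this pool, the sentence sharing two words with the target is ranked first and the unrelated one is left out of the prompt.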

[NLP-47] SSP: Self-Supervised Prompting for Cross-Lingual Transfer to Low-Resource Languages using Large Language Models
[NLP-47] SSP:使用大型语言模型向低资源语言进行跨语言迁移的自监督提示

链接: https://arxiv.org/abs/2406.18880
作者: Vipul Rathore,Aniruddha Deb,Ankish Chandresh,Parag Singla,Mausam
关键词: English NLP tasks, English NLP, shown exceptional performance, NLP tasks, large language models
中文关键词: 英语NLP任务,英语NLP,表现出出色的性能,NLP任务,大型语言模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Recently, very large language models (LLMs) have shown exceptional performance on several English NLP tasks with just in-context learning (ICL), but their utility in other languages is still underexplored. We investigate their effectiveness for NLP tasks in low-resource languages (LRLs), especially in the setting of zero-labelled cross-lingual transfer (0-CLT), where no labelled training data for the target language is available – however training data from one or more related medium-resource languages (MRLs) is utilized, alongside the available unlabeled test data for a target language. We introduce Self-Supervised Prompting (SSP), a novel ICL approach tailored for the 0-CLT setting. SSP is based on the key observation that LLMs output more accurate labels if in-context exemplars are from the target language (even if their labels are slightly noisy). To operationalize this, since target language training data is not available in 0-CLT, SSP operates in two stages. In Stage I, using source MRL training data, target language’s test data is noisily labeled. In Stage II, these noisy test data points are used as exemplars in ICL for further improved labelling. Additionally, our implementation of SSP uses a novel Integer Linear Programming (ILP)-based exemplar selection that balances similarity, prediction confidence (when available) and label coverage. Experiments on three tasks and eleven LRLs (from three regions) demonstrate that SSP strongly outperforms existing SOTA fine-tuned and prompting-based baselines in 0-CLT setup.
摘要:最近,超大型语言模型(LLM)仅凭上下文学习(ICL)就在多项英语自然语言处理任务上表现出了优异的性能,但它们在其他语言中的应用仍未得到充分的探索。我们考察了它们在低资源语言(LRL)中的NLP任务中的有效性,特别是在零标记跨语言迁移(0-CLT)的情况下,目标语言没有标记的训练数据可用–然而来自一种或多种相关的中等资源语言(MRL)的训练数据与目标语言的可用的无标记测试数据一起被利用。我们引入了自监督提示(SSP),这是一种为0-CLT环境量身定做的ICL方法。SSP是基于这样一个关键观察:如果上下文中的样本来自目标语言,LLM会输出更准确的标签(即使它们的标签略有噪音)。为了实现这一点,由于目标语言训练数据在0-CLT中不可用,SSP分两个阶段进行操作。在第一阶段,使用源MRL训练数据,对目标语言的测试数据进行噪声标记。在第二阶段,这些有噪声的测试数据点被用作ICL中的样本,以进一步改进标签。此外,我们的SSP实现使用了一种新的基于整数线性规划(ILP)的样本选择,该选择平衡了相似性、预测置信度(当可用时)和标签覆盖。在三个任务和11个LRL(来自三个地区)上的实验表明,SSP在0-CLT设置中的表现明显优于现有的SOTA微调基线和基于提示的基线。
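
The Stage II exemplar selection described in the SSP abstract balances similarity, prediction confidence, and label coverage. The sketch below is only an illustration of that trade-off: it uses a simple greedy rule as a stand-in for the paper's ILP, and the field names (`label`, `similarity`, `confidence`), the additive score, and the function name `select_exemplars` are all assumptions, not the authors' implementation.

```python
def select_exemplars(candidates, k):
    """Greedy stand-in for an ILP: cover every noisy label first, then fill by score.

    Each candidate is a dict with hypothetical keys:
      'label'      - noisy label assigned in Stage I
      'similarity' - similarity to the test query, in [0, 1]
      'confidence' - Stage I prediction confidence, in [0, 1]
    """
    def score(c):
        # Assumed scoring rule: unweighted sum of similarity and confidence.
        return c["similarity"] + c["confidence"]

    ranked = sorted(candidates, key=score, reverse=True)
    chosen, seen_labels = [], set()

    # Pass 1: take the best-scoring candidate for each label (label coverage).
    for c in ranked:
        if len(chosen) < k and c["label"] not in seen_labels:
            chosen.append(c)
            seen_labels.add(c["label"])

    # Pass 2: fill any remaining slots with the best unused candidates.
    for c in ranked:
        if len(chosen) < k and all(c is not d for d in chosen):
            chosen.append(c)
    return chosen


# Hypothetical noisily labeled target-language test points from Stage I.
candidates = [
    {"id": 1, "label": "POS", "similarity": 0.9, "confidence": 0.8},
    {"id": 2, "label": "POS", "similarity": 0.8, "confidence": 0.9},
    {"id": 3, "label": "NEG", "similarity": 0.4, "confidence": 0.5},
    {"id": 4, "label": "NEU", "similarity": 0.3, "confidence": 0.3},
    {"id": 5, "label": "POS", "similarity": 0.7, "confidence": 0.6},
]
picked = select_exemplars(candidates, k=3)
print([c["id"] for c in picked])
```

With three slots and three labels, the greedy rule spends every slot on coverage, so lower-scoring NEG and NEU points displace the second-best POS point. An ILP can weigh that trade-off globally rather than in a fixed order.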

[NLP-48] Efficacy of Language Model Self-Play in Non-Zero-Sum Games
[NLP-48] 非零和博弈中语言模型自我对弈的有效性

链接: https://arxiv.org/abs/2406.18872
作者: Austen Liao,Nicholas Tomlin,Dan Klein
关键词: yield optimal policies, Game-playing agents, achieved superhuman performance, agents like AlphaGo, AlphaGo have achieved
中文关键词: 产生最佳策略,游戏代理,取得超人表现,AlphaGo等代理,AlphaGo已经取得
类目: Computation and Language (cs.CL)
备注:

Abstract:Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives. Contrary to expectations, we find that language model self-play leads to significant performance gains in both cooperation and competition with humans, suggesting that self-play and related techniques have promise despite a lack of theoretical guarantees.
摘要:像AlphaGo这样的博弈智能体通过自我对弈实现了超人的表现,而自我对弈在理论上保证能在竞争性博弈中产生最优策略。然而,大多数语言任务都是部分或完全合作的,因此自我对弈等技术是否可以有效地用于改进语言模型是一个悬而未决的问题。我们在名为“交易或不交易”(DoND)的谈判游戏环境中对这个问题进行了实证研究。至关重要的是,DoND中的目标可以被修改,以产生完全合作的博弈、严格竞争的博弈或介于两者之间的任何博弈。我们针对每一个目标,在DoND中通过多轮过滤行为克隆,以自我对弈的方式微调语言模型。与预期相反,我们发现语言模型自我对弈在与人类的合作和竞争中都会带来显著的性能提升,这表明自我对弈和相关技术尽管缺乏理论保证,但仍有前景。
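
The abstract notes that the DoND objective can be tuned anywhere between fully cooperative and strictly competitive, and that training uses rounds of filtered behavior cloning. A minimal sketch of one way to express this follows. The single mixing parameter `lam`, the `keep_fraction` filter, and the episode fields `own` and `partner` are assumptions made for illustration, not the paper's actual formulation.

```python
def reward(own_points, partner_points, lam):
    """Blend self-interest with the partner's outcome (assumed parameterization).

    lam =  1.0 -> fully cooperative (maximize joint points)
    lam =  0.0 -> purely self-interested
    lam = -1.0 -> strictly competitive (the two players' rewards sum to zero)
    """
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam must be in [-1, 1]")
    return own_points + lam * partner_points


def filter_for_cloning(episodes, lam, keep_fraction=0.5):
    """Keep the highest-reward self-play episodes as fine-tuning data for the next round."""
    ranked = sorted(
        episodes,
        key=lambda e: reward(e["own"], e["partner"], lam),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical self-play outcomes: points earned by the agent and by its partner.
episodes = [
    {"own": 6, "partner": 4},
    {"own": 8, "partner": 0},
    {"own": 3, "partner": 3},
    {"own": 5, "partner": 5},
]
cooperative_data = filter_for_cloning(episodes, lam=1.0)
competitive_data = filter_for_cloning(episodes, lam=-1.0)
```

Under the cooperative setting the filter keeps the balanced 6/4 and 5/5 deals, while the competitive setting keeps the lopsided 8/0 deal first. The same filtering loop therefore yields very different training data depending on the objective.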

[NLP-49] Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification
[NLP-49] 放射学报告简化中ChatGPT自我纠正的双管齐下人工评估

链接: https://arxiv.org/abs/2406.18859
作者: Ziyu Yang,Santhosh Cherian,Slobodan Vucetic
关键词: highly technical documents, technical documents aimed, documents aimed primarily, Radiology reports, doctor-doctor communication
中文关键词: 高度技术性文件、针对的技术文件、主要针对的文件、放射学报告、医生与医生的沟通
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Radiology reports are highly technical documents aimed primarily at doctor-doctor communication. There has been an increasing interest in sharing those reports with patients, necessitating providing them patient-friendly simplifications of the original reports. This study explores the suitability of large language models in automatically generating those simplifications. We examine the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain. We also propose a new evaluation protocol that employs radiologists and laypeople, where radiologists verify the factual correctness of simplifications, and laypeople assess simplicity and comprehension. Our experimental results demonstrate the effectiveness of self-correction prompting in producing high-quality simplifications. Our findings illuminate the preferences of radiologists and laypeople regarding text simplification, informing future research on this topic.
摘要:放射学报告是一种高度技术性的文件,主要用于医生与医生之间的沟通。与患者分享这些报告的兴趣越来越大,这就需要为他们提供便于患者理解的原始报告简化版本。这项研究探讨了大型语言模型在自动生成这些简化方面的适用性。我们研究了思维链和自我纠正提示机制在该领域的有用性。我们还提出了一种由放射科医生和非专业人士共同参与的新评估方案,放射科医生验证简化的事实正确性,而非专业人士则评估简单性和可理解性。我们的实验结果证明了自我纠正提示在产生高质量简化方面的有效性。我们的研究结果阐明了放射科医生和非专业人士对文本简化的偏好,为未来关于该主题的研究提供信息。

[NLP-50] FFN: a Fine-grained Chinese-English Financial Domain Parallel Corpus
[NLP-50] FFN:一个细粒度的中英金融领域平行语料库

链接: https://arxiv.org/abs/2406.18856
作者: Yuxin Fu,Shijing Si,Leyi Mai,Xi-ang Li
关键词: Large Language Models, Large Language, financial domain remains, remains largely underexplored, domain remains largely
中文关键词: 大型语言模型、大型语言、金融领域仍然存在、仍然基本上未充分开发、领域仍然存在
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: a simplified version of this paper is accepted by International Conference on Asian Language Processing 2024

Abstract:Large Language Models (LLMs) have stunningly advanced the field of machine translation, though their effectiveness within the financial domain remains largely underexplored. To probe this issue, we constructed a fine-grained Chinese-English parallel corpus of financial news called FFN. We acquired financial news articles spanning January 1, 2014, to December 31, 2023, from mainstream media websites such as CNN, FOX, and China Daily. The dataset consists of 1,013 main texts and 809 titles, all of which have been manually corrected. We measured the translation quality of two LLMs, ChatGPT and ERNIE-bot, using BLEU, TER and chrF scores as the evaluation metrics. For comparison, we also trained an OpenNMT model based on our dataset. We detail problems of LLMs and provide in-depth analysis, intending to stimulate further research and solutions in this largely uncharted territory. Our research underlines the need to optimize LLMs within the specific field of financial translation to ensure accuracy and quality.
摘要:大型语言模型(LLM)在机器翻译领域取得了令人惊叹的进展,但其在金融领域的有效性仍未得到充分的研究。为了探讨这一问题,我们构建了一个细粒度的汉英财经新闻平行语料库FFN。我们从CNN、FOX、China Daily等主流媒体网站上获取了从2014年1月1日到2023年12月31日的财经新闻文章。该数据集由1013个正文和809个标题组成,所有这些内容都经过了手动更正。我们以BLEU、TER和chrF分数为评价指标,对两个LLM(ChatGPT和ERNIE-bot)的翻译质量进行了测试。为了进行比较,我们还基于我们的数据集训练了一个OpenNMT模型。我们详细介绍了LLM存在的问题,并提供了深入的分析,意在鼓励在这个基本上未知的领域进行进一步的研究和解决方案。我们的研究强调了在金融翻译的特定领域内优化LLM的必要性,以确保准确性和质量。

[NLP-51] Learning Retrieval Augmentation for Personalized Dialogue Generation
[NLP-51] 用于个性化对话生成的学习检索增强

链接: https://arxiv.org/abs/2406.18847
作者: Qiushi Huang,Shuai Fu,Xubo Liu,Wenwu Wang,Tom Ko,Yu Zhang,Lilian Tang
关键词: generating highly tailored, gained significant attention, persona dialogue generation, Personalized dialogue generation
中文关键词: 生成高度定制、获得极大关注、人物对话生成、个性化对话生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP-2023

Abstract:Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose Learning Retrieval Augmentation for Personalized Dialogue Generation (LAPDOG), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration.
摘要:个性化对话生成通过利用人物档案和对话上下文来生成高度定制的响应,在对话式人工智能应用中得到了极大的关注。然而,人物档案是当前个性化对话数据集中的一种流行设置,通常只由四到五句话组成,可能不能提供关于智能体人物角色的全面描述,这给生成真正个性化的对话带来了挑战。为了解决这个问题,我们提出了面向个性化对话生成的学习检索增强方法(LAPDOG),它研究了利用外部知识生成人物角色对话的可能性。具体地说,提出的LAPDOG模型由一个故事检索器和一个对话生成器组成。故事检索器使用给定的人物档案作为查询来从故事文档中检索相关信息,这些信息用作补充上下文以增强人物档案。对话生成器利用对话历史和增强的人物档案两者来生成个性化响应。对于优化,我们采用了协作学习故事检索器和对话生成器的联合训练框架,其中故事检索器针对期望的最终度量(例如,BLEU)进行优化,以检索对话生成器的内容以生成个性化的响应。在以ROCStory为辅助数据源的CONVAI2数据集上进行的实验表明,LAPDOG方法的性能明显优于基线,表明了该方法的有效性。LAPDOG的模型代码已经公开,可供进一步探索。

[NLP-52] The global landscape of academic guidelines for generative AI and Large Language Models
[NLP-52] 生成性人工智能和大型语言模型学术指南的全球格局

链接: https://arxiv.org/abs/2406.18842
作者: Junfeng Jiao,Saleh Afroogh,Kevin Chen,David Atkinson,Amit Dhurandhar
关键词: Generative Artificial Intelligence, Large Language Models, Artificial Intelligence, Language Models, Generative Artificial
中文关键词: 生成人工智能,大型语言模型,人工智能,语言模型,生成人工
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The integration of Generative Artificial Intelligence (GAI) and Large Language Models (LLMs) in academia has spurred a global discourse on their potential pedagogical benefits and ethical considerations. Positive reactions highlight some potential, such as collaborative creativity, increased access to education, and empowerment of trainers and trainees. However, negative reactions raise concerns about ethical complexities, balancing innovation and academic integrity, unequal access, and misinformation risks. Through a systematic survey and text-mining-based analysis of global and national directives, insights from independent research, and eighty university-level guidelines, this study provides a nuanced understanding of the opportunities and challenges posed by GAI and LLMs in education. It emphasizes the importance of balanced approaches that harness the benefits of these technologies while addressing ethical considerations and ensuring equitable access and educational outcomes. The paper concludes with recommendations for fostering responsible innovation and ethical practices to guide the integration of GAI and LLMs in academia.
摘要:产生式人工智能(GAI)和大型语言模型(LLM)的结合在学术界引发了一场关于它们潜在的教学益处和伦理考量的全球讨论。积极的反应突出了一些潜力,如合作创造力、增加接受教育的机会以及增强培训者和受训者的能力。然而,负面反应引发了人们对伦理复杂性、平衡创新和学术诚信、机会不平等以及错误信息风险的担忧。通过对全球和国家指令的系统调查和基于文本挖掘的分析,来自独立研究的见解,以及80个大学层面的指导方针,本研究提供了对GAI和LLMS在教育方面所带来的机遇和挑战的细微差别的理解。它强调必须采取平衡的办法,利用这些技术的好处,同时处理道德方面的考虑,并确保平等的机会和教育成果。文章最后提出了培养负责任的创新和道德实践的建议,以指导学术界对GAI和LLMS的整合。

[NLP-53] Navigating LLM Ethics: Advancements, Challenges, and Future Directions
[NLP-53] 驾驭LLM伦理:进展、挑战与未来方向

链接: https://arxiv.org/abs/2406.18841
作者: Junfeng Jiao,Saleh Afroogh,Yiming Xu,Connor Phillips
关键词: Large Language Models, surrounding Large Language, issues surrounding Large, Language Models, Large Language
中文关键词: 大型语言模型,围绕大型语言,围绕大型的问题,语言模型,大型语言
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:This study addresses ethical issues surrounding Large Language Models (LLMs) within the field of artificial intelligence. It explores the common ethical challenges posed by both LLMs and other AI systems, such as privacy and fairness, as well as ethical challenges uniquely arising from LLMs. It highlights challenges such as hallucination, verifiable accountability, and decoding censorship complexity, which are unique to LLMs and distinct from those encountered in traditional AI systems. The study underscores the need to tackle these complexities to ensure accountability, reduce biases, and enhance transparency in the influential role that LLMs play in shaping information dissemination. It proposes mitigation strategies and future directions for LLM ethics, advocating for interdisciplinary collaboration. It recommends ethical frameworks tailored to specific domains and dynamic auditing systems adapted to diverse contexts. This roadmap aims to guide responsible development and integration of LLMs, envisioning a future where ethical considerations govern AI advancements in society.
摘要:这项研究探讨了人工智能领域中围绕大型语言模型(LLM)的伦理问题。它探讨了LLM和其他人工智能系统共同面临的伦理挑战,如隐私和公平,以及LLM独有的伦理挑战。它强调了幻觉、可验证的问责和解码审查复杂性等挑战,这些挑战是LLM独有的,不同于传统人工智能系统中遇到的挑战。这项研究强调需要处理这些复杂性,以确保问责制,减少偏见,并提高LLM在塑造信息传播方面所发挥的有影响力的作用的透明度。它为LLM伦理提出了缓解策略和未来方向,倡导跨学科合作。它建议采用为特定领域量身定做的伦理框架和适应不同情境的动态审计系统。这一路线图旨在指导LLM的负责任开发和整合,展望一个由伦理考量主导社会中人工智能进步的未来。

[NLP-54] OutlierTune: Efficient Channel-Wise Quantization for Large Language Models
[NLP-54] OutlierTune:大型语言模型的高效逐通道量化

链接: https://arxiv.org/abs/2406.18832
作者: Jinguang Wang,Yuexi Yin,Haifeng Sun,Qi Qi,Jingyu Wang,Zirui Zhuang,Tingting Yang,Jianxin Liao
关键词: significant challenge due, large language models, structured outliers, large language, significant challenge
中文关键词: 重大挑战,大型语言模型,结构化异常值,大型语言,重大挑战
类目: Computation and Language (cs.CL)
备注:

Abstract:Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights by the activation scaling factors, avoiding the internal scaling and costly additional computational overheads brought by the per-channel activation quantization. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring the balanced numerical ranges across different activation channels. OutlierTune is easy to implement and hardware-efficient, introducing almost no additional computational overheads during the inference. Extensive experiments show that the proposed framework outperforms existing methods across multiple different tasks. Demonstrating better generalization, this framework improves the Int6 quantization of the instruction-tuning LLMs, such as OPT-IML, to the same level as half-precision (FP16). Moreover, we have shown that the proposed framework is 1.48x faster than the FP16 implementation while reducing approximately 2x memory usage.
摘要:由于结构化离群值的存在,对大型语言模型(LLM)的激活进行量化一直是一个巨大的挑战。现有的大多数方法都集中在逐令牌或逐张量的激活量化上,这使得很难同时达到精度和硬件效率。为了解决这一问题,我们提出了OutlierTune,一种面向LLM激活的高效逐通道训练后量化(PTQ)方法。OutlierTune由两个部分组成:反量化预执行和对称化。反量化的预执行通过激活缩放因子来更新模型权重,避免了逐通道激活量化带来的内部缩放和昂贵的额外计算开销。对称化通过确保不同激活通道之间的均衡数值范围,进一步减少了因权重更新而产生的量化差异。OutlierTune易于实现且硬件效率高,在推理过程中几乎不会引入额外的计算开销。大量实验表明,该框架在多个不同任务上的性能优于已有的方法。该框架展示了更好的泛化能力,将指令微调LLM(如OPT-IML)的Int6量化提升到与半精度(FP16)相同的水平。此外,我们已经证明,所提出的框架比FP16实现快1.48倍,同时减少了大约2倍的内存使用量。
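
The "pre-execution of dequantization" step in the OutlierTune abstract folds per-channel activation scales into the weights. The underlying identity is that multiplying a quantized activation by its channel scale and then by the weight matrix gives the same result as multiplying the quantized activation by a weight matrix whose rows were pre-scaled. The sketch below demonstrates only that identity in plain Python. The scale values, helper names, and toy shapes are assumptions, the weights are left unquantized for clarity, and the symmetrization component is not shown.

```python
def matvec(x, w):
    """Multiply a row vector x (length C) by a matrix w (C rows, N columns)."""
    return [sum(x[c] * w[c][n] for c in range(len(x))) for n in range(len(w[0]))]


def quantize_per_channel(x, scales):
    """Symmetric integer quantization with one scale per input channel."""
    return [round(v / s) for v, s in zip(x, scales)]


def fold_scales_into_weights(w, scales):
    """Pre-multiply row c of W by scale s_c, so dequantization is done offline."""
    return [[s * v for v in row] for row, s in zip(w, scales)]


x = [0.5, -12.0, 3.0]       # toy activation; channel 1 plays the outlier channel
scales = [0.25, 2.0, 0.5]   # hypothetical calibrated per-channel scales
w = [[1.0, 2.0], [0.5, -1.0], [-2.0, 4.0]]

x_q = quantize_per_channel(x, scales)

# Explicit path: dequantize each channel at run time, then multiply.
explicit = matvec([q * s for q, s in zip(x_q, scales)], w)

# Folded path: scales already absorbed into the weights, no run-time scaling.
folded = matvec(x_q, fold_scales_into_weights(w, scales))

print(x_q, explicit, folded)
```

Because the scaling moves into a one-time weight update, the run-time matrix multiply consumes integer activations directly. That is consistent with the abstract's claim of avoiding internal scaling overhead.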

[NLP-55] Psychological Profiling in Cybersecurity: A Look at LLMs and Psycholinguistic Features
[NLP-55] 网络安全中的心理剖析:LLM与心理语言学特征

链接: https://arxiv.org/abs/2406.18783
作者: Jean Marie Tshimula,D’Jeff K. Nkashama,Jean Tshibangu Muabila,René Manassé Galekwa,Hugues Kanda,Maximilien V. Dialufuma,Mbuyi Mukendi Didier,Kalala Kalonji,Serge Mundele,Patience Kinshie Lenye,Tighana Wenge Basele,Aristarque Ilunga,Christian N. Mayemba,Nathanaël M. Kasoro,Selain K. Kasereka,Hardy Mikese,Pierre-Martin Tardif,Marc Frappier,Froduald Kabanza,Belkacem Chikhaoui,Shengrui Wang,Ali Mulenda Sumbu,Xavier Ndona,Raoul Kienge-Kienge Intudi
关键词: necessitates innovative approaches, Large Language Models, cyber threats necessitates, threats necessitates innovative, increasing sophistication
中文关键词: 需要创新方法、大型语言模型、网络威胁、威胁需要创新、不断提高的复杂性
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:The increasing sophistication of cyber threats necessitates innovative approaches to cybersecurity. In this paper, we explore the potential of psychological profiling techniques, particularly focusing on the utilization of Large Language Models (LLMs) and psycholinguistic features. We investigate the intersection of psychology and cybersecurity, discussing how LLMs can be employed to analyze textual data for identifying psychological traits of threat actors. We explore the incorporation of psycholinguistic features, such as linguistic patterns and emotional cues, into cybersecurity frameworks. Our research underscores the importance of integrating psychological perspectives into cybersecurity practices to bolster defense mechanisms against evolving threats.
摘要:网络威胁日益复杂,需要创新的网络安全方法。在本文中,我们探索了心理剖析技术的潜力,特别关注大型语言模型(LLM)和心理语言特征的利用。我们研究心理学和网络安全的交叉点,讨论如何利用LLM来分析文本数据以识别威胁行为者的心理特征。我们探索将心理语言特征(例如语言模式和情感线索)融入网络安全框架。我们的研究强调了将心理学观点融入网络安全实践,以加强针对不断演变的威胁的防御机制的重要性。

[NLP-56] Implicit Discourse Relation Classification For Nigerian Pidgin
[NLP-56] 尼日利亚皮钦语的隐性话语关系分类

链接: https://arxiv.org/abs/2406.18776
作者: Muhammed Saeed,Peter Bourgonje,Vera Demberg
关键词: Large Language Models, Language Models multi-lingual, make Large Language, Models multi-lingual, make Large
中文关键词: 大型语言模型,语言模型多语言,制作大型语言,模型多语言,制作大型
类目: Computation and Language (cs.CL)
备注:

Abstract:Despite attempts to make Large Language Models multi-lingual, many of the world’s languages are still severely under-resourced. This widens the performance gap between NLP and AI applications aimed at well-financed, and those aimed at less-resourced languages. In this paper, we focus on Nigerian Pidgin (NP), which is spoken by nearly 100 million people, but has comparatively very few NLP resources and corpora. We address the task of Implicit Discourse Relation Classification (IDRC) and systematically compare an approach translating NP data to English and then using a well-resourced IDRC tool and back-projecting the labels versus creating a synthetic discourse corpus for NP, in which we translate PDTB and project PDTB labels, and then train an NP IDR classifier. The latter approach of learning a “native” NP classifier outperforms our baseline by 13.27% and 33.98% in F1 score for 4-way and 11-way classification, respectively.
摘要:尽管人们试图使大型语言模型具有多语言功能,但世界上许多语言仍然严重资源不足。这扩大了针对资金充足的语言的NLP和人工智能应用程序与针对资源较少的语言的应用程序之间的性能差距。在本文中,我们重点关注尼日利亚皮钦语(Nigerian Pidgin,NP),该语言有近1亿人使用,但NLP资源和语料库相对较少。我们解决了隐性话语关系分类(IDRC)的任务,并系统地比较了将NP数据翻译为英语,然后使用资源丰富的IDRC工具并反向投影标签的方法与为NP创建合成话语语料库的方法,其中我们翻译PDTB并投影PDTB标签,然后训练NP IDR分类器。后一种学习“原生”NP分类器的方法在4类和11类分类的F1分数方面分别比我们的基线高出13.27%和33.98%。

[NLP-57] Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism
[NLP-57] 重新审视范畴三段论:LLM分析范畴三段论的逻辑推理能力综述

链接: https://arxiv.org/abs/2406.18762
作者: Shi Zong,Jimmy Lin
关键词: logic inference tasks, large language models, categorical syllogisms, behave for logic, inference tasks
中文关键词: 逻辑推理任务、大型语言模型、类别演绎、逻辑行为、推理任务
类目: Computation and Language (cs.CL)
备注:

Abstract:There have been a huge number of benchmarks proposed to evaluate how large language models (LLMs) behave for logic inference tasks. However, it remains an open question how to properly evaluate this ability. In this paper, we provide a systematic overview of prior works on the logical reasoning ability of LLMs for analyzing categorical syllogisms. We first investigate all the possible variations for the categorical syllogisms from a purely logical perspective and then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets. Our results indicate that compared to template-based synthetic datasets, crowdsourcing approaches normally sacrifice the coverage of configurations (i.e., mood and figure) of categorical syllogisms for more language variations, thus bringing challenges to fully testing LLMs under different situations. We then proceed to summarize the findings and observations for the performances of LLMs to infer the validity of syllogisms from the current literature. The error rate breakdown analyses suggest that the interpretation of the quantifiers seems to be the current bottleneck that limits the performances of the LLMs and is thus worth more attention. Finally, we discuss several points that might be worth considering when researchers plan on the future release of categorical syllogism datasets. We hope our work will not only provide a timely review of the current literature regarding categorical syllogisms, but also motivate more interdisciplinary research between communities, specifically computational linguists and logicians.
摘要:已经提出了大量的基准来评估大型语言模型(LLM)在逻辑推理任务中的表现。然而,如何正确评估这一能力仍然是一个悬而未决的问题。在这篇文章中,我们系统地综述了前人关于LLM分析范畴三段论的逻辑推理能力的工作。我们首先从纯逻辑的角度考察了范畴三段论的所有可能的变体,然后考察了现有数据集所测试的底层配置(即式和格)。我们的结果表明,与基于模板的合成数据集相比,众包方法通常会牺牲范畴三段论的配置(即式和格)覆盖率来换取更多的语言变体,从而给在不同情况下全面测试LLM带来了挑战。然后,我们总结了当前文献中关于LLM推断三段论有效性的表现的发现和观察。错误率分解分析表明,对量词的解释似乎是目前限制LLM性能的瓶颈,因此值得更多关注。最后,我们讨论了研究人员在计划未来发布范畴三段论数据集时可能值得考虑的几点。我们希望我们的工作不仅能及时回顾当前关于范畴三段论的文献,还能推动不同社区之间更多的跨学科研究,特别是计算语言学家和逻辑学家之间的研究。

[NLP-58] Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models
[NLP-58] 逐步重新排名:研究使用大型语言模型重新排名的预过滤

链接: https://arxiv.org/abs/2406.18740
作者: Baharan Nouriinanloo,Maxime Lamothe
关键词: natural language processing, language processing tasks, Large Language Models, natural language, language processing
中文关键词: 自然语言处理、语言处理任务、大型语言模型、自然语言、语言处理
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

Abstract:Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities. Indeed, existing work has shown that LLMs can be used to great effect for many tasks, such as information retrieval (IR), and passage ranking. However, current state-of-the-art results heavily lean on the capabilities of the LLM being used. Currently, proprietary, and very large LLMs such as GPT-4 are the highest performing passage re-rankers. Hence, users without the resources to leverage top of the line LLMs, or ones that are closed source, are at a disadvantage. In this paper, we investigate the use of a pre-filtering step before passage re-ranking in IR. Our experiments show that by using a small number of human generated relevance scores, coupled with LLM relevance scoring, it is effectively possible to filter out irrelevant passages before re-ranking. Our experiments also show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task. Indeed, our results show that smaller models such as Mixtral can become competitive with much larger proprietary models (e.g., ChatGPT and GPT-4).
摘要:大型语言模型(LLM)以其多样化的零样本能力,给众多的自然语言处理任务带来了革命性的变化。事实上,现有的工作已经表明,LLM可以非常有效地用于许多任务,如信息检索(IR)和段落排序。然而,目前最先进的结果在很大程度上依赖于所使用的LLM的能力。目前,专有的超大型LLM(如GPT-4)是性能最好的段落重排序器。因此,没有资源使用顶级LLM或闭源LLM的用户处于劣势。在本文中,我们研究了在信息检索的段落重排序之前使用预过滤步骤。我们的实验表明,通过使用少量人工给出的相关性分数,结合LLM相关性评分,可以有效地在重排序之前过滤掉不相关的段落。我们的实验还表明,这种预过滤可以使LLM在重排序任务中表现得明显更好。事实上,我们的结果表明,像Mixtral这样的较小模型可以与大得多的专有模型(如ChatGPT和GPT-4)竞争。
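
The pre-filtering idea in the abstract combines a few human relevance judgments with LLM relevance scores to drop irrelevant passages before re-ranking. One plausible reading, sketched below, uses the human-labeled examples to calibrate a cutoff on the LLM score. The F1-based threshold search, the field name `llm_score`, and both function names are assumptions for illustration, and the abstract does not specify how the two signals are actually combined.

```python
def pick_threshold(llm_scores, human_labels):
    """Pick the LLM-score cutoff that maximizes F1 on a small human-labeled set."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(llm_scores)):
        pairs = list(zip(llm_scores, human_labels))
        tp = sum(1 for s, y in pairs if s >= t and y)
        fp = sum(1 for s, y in pairs if s >= t and not y)
        fn = sum(1 for s, y in pairs if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest cutoff on ties
            best_t, best_f1 = t, f1
    return best_t


def prefilter(passages, threshold):
    """Drop passages whose LLM relevance score falls below the cutoff."""
    return [p for p in passages if p["llm_score"] >= threshold]


# Hypothetical calibration set: LLM scores with human relevance judgments.
calib_scores = [0.1, 0.3, 0.6, 0.7, 0.9]
calib_labels = [False, False, True, False, True]
threshold = pick_threshold(calib_scores, calib_labels)

# Hypothetical candidate passages for one query.
passages = [
    {"id": "a", "llm_score": 0.2},
    {"id": "b", "llm_score": 0.65},
    {"id": "c", "llm_score": 0.95},
    {"id": "d", "llm_score": 0.5},
]
survivors = prefilter(passages, threshold)
print(threshold, [p["id"] for p in survivors])
```

Only the surviving passages would then be handed to the LLM re-ranker, which shortens its input and removes distractors.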

[NLP-59] Jailbreaking LLMs with Arabic Transliteration and Arabizi
[NLP-59] 使用阿拉伯语音译和Arabizi越狱LLM

链接: https://arxiv.org/abs/2406.18725
作者: Mansour Al Ghanim,Saleh Almohaimeed,Mengxin Zheng,Yan Solihin,Qian Lou
关键词: Large Language Models, vulnerabilities of Large, Large Language, Arabic language, specifically focusing
中文关键词: 大型语言模型、大型的漏洞、大型语言、阿拉伯语、特别关注
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 14 pages, 4 figures

Abstract:This study identifies the potential vulnerabilities of Large Language Models (LLMs) to ‘jailbreak’ attacks, specifically focusing on the Arabic language and its various forms. While most research has concentrated on English-based prompt manipulation, our investigation broadens the scope to investigate the Arabic language. We initially tested the AdvBench benchmark in Standardized Arabic, finding that even with prompt manipulation techniques like prefix injection, it was insufficient to provoke LLMs into generating unsafe content. However, when using Arabic transliteration and chatspeak (or arabizi), we found that unsafe content could be produced on platforms like OpenAI GPT-4 and Anthropic Claude 3 Sonnet. Our findings suggest that using Arabic and its various forms could expose information that might remain hidden, potentially increasing the risk of jailbreak attacks. We hypothesize that this exposure could be due to the model’s learned connection to specific words, highlighting the need for more comprehensive safety training across all language forms.
摘要:这项研究确定了大型语言模型(LLM)对‘越狱’攻击的潜在漏洞,特别是关注阿拉伯语及其各种形式。虽然大多数研究都集中在基于英语的提示操纵上,但我们的调查将范围扩大到了阿拉伯语。我们最初用标准阿拉伯语测试了AdvBench基准,发现即使使用前缀注入等提示操纵技术,也不足以诱使LLM生成不安全的内容。然而,当使用阿拉伯语音译和聊天用语(即Arabizi)时,我们发现在OpenAI GPT-4和Anthropic Claude 3 Sonnet等平台上可能会产生不安全的内容。我们的发现表明,使用阿拉伯语及其各种形式可能会暴露原本可能保持隐藏的信息,潜在地增加越狱攻击的风险。我们假设,这种暴露可能是由于模型对特定单词的习得关联,这凸显了在所有语言形式中进行更全面的安全训练的必要性。

[NLP-60] Learn it or Leave it: Module Composition and Pruning for Continual Learning
[NLP-60] 学习或放弃:持续学习的模块构成和修剪

链接: https://arxiv.org/abs/2406.18708
作者: Mingyang Wang,Heike Adel,Lukas Lange,Jannik Strötgen,Hinrich Schütze
关键词: real-world environments, machine learning models, continual learning, essential for machine, learning
中文关键词: 现实世界环境、机器学习模型、持续学习、机器学习至关重要
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facilitating knowledge transfer, and maintaining parameter efficiency. In this paper, we introduce MoCL-P, a novel lightweight continual learning method that addresses these challenges simultaneously. Unlike traditional approaches that continuously expand parameters for newly arriving tasks, MoCL-P integrates task representation-guided module composition with adaptive pruning, effectively balancing knowledge integration and computational overhead. Our evaluation across three continual learning benchmarks with up to 176 tasks shows that MoCL-P achieves state-of-the-art performance and improves parameter efficiency by up to three times, demonstrating its potential for practical applications where resource requirements are constrained.
摘要:在现实环境中,持续学习对于机器学习模型是必不可少的,因为它们需要在不忘记已学知识的情况下增量地获取新知识。尽管预训练语言模型在各种静态任务中表现出了令人印象深刻的能力,但将它们应用于持续学习面临着巨大的挑战,包括避免灾难性遗忘、促进知识迁移和保持参数效率。在本文中,我们介绍了一种新的轻量级持续学习方法MoCL-P,它同时解决了这些挑战。与传统的为新到达的任务不断扩展参数的方法不同,MoCL-P将任务表示引导的模块组合与自适应剪枝相结合,有效地平衡了知识集成和计算开销。我们在包含多达176个任务的三个持续学习基准上的评估表明,MoCL-P实现了最先进的性能,并将参数效率最多提高了三倍,展示了其在资源受限的实际应用中的潜力。

[NLP-61] Simulating The U.S. Senate: An LLM-Driven Agent Approach to Modeling Legislative Behavior and Bipartisanship
[NLP-61] 模拟美国参议院:LLM驱动的代理方法来建模立法行为和两党合作

链接: https://arxiv.org/abs/2406.18702
作者: Zachary R. Baker,Zarif L. Azher
关键词: Senate Intelligence Committee, Senate Intelligence, Intelligence Committee, study introduces, simulated committee discussions
中文关键词: 参议院情报委员会,参议院情报,情报委员会,研究介绍,模拟委员会讨论
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

Abstract:This study introduces a novel approach to simulating legislative processes using LLM-driven virtual agents, focusing on the U.S. Senate Intelligence Committee. We developed agents representing individual senators and placed them in simulated committee discussions. The agents demonstrated the ability to engage in realistic debate, provide thoughtful reflections, and find bipartisan solutions under certain conditions. Notably, the simulation also showed promise in modeling shifts towards bipartisanship in response to external perturbations. Our results indicate that this LLM-driven approach could become a valuable tool for understanding and potentially improving legislative processes, supporting a broader pattern of findings highlighting how LLM-based agents can usefully model real-world phenomena. Future works will focus on enhancing agent complexity, expanding the simulation scope, and exploring applications in policy testing and negotiation.
摘要:本研究引入了一种使用LLM驱动的虚拟代理模拟立法过程的新颖方法,重点关注美国参议院情报委员会。我们开发了代表个别参议员的代理人,并将他们置于模拟委员会讨论中。代理人展示了参与现实辩论、提供深思熟虑的反思并在某些条件下找到两党解决方案的能力。值得注意的是,该模拟还显示出了为应对外部扰动而向两党合作转变进行建模的希望。我们的结果表明,这种LLM驱动的方法可能成为理解和潜在改进立法流程的宝贵工具,支持更广泛的调查结果模式,强调基于LLM的代理人如何有效地建模现实世界的现象。未来的工作将重点关注提高代理复杂性、扩大模拟范围以及探索在政策测试和谈判中的应用。

[NLP-62] Sequence Graph Network for Online Debate Analysis
[NLP-62] 用于在线辩论分析的序列图网络

链接: https://arxiv.org/abs/2406.18696
作者: Quan Mai,Susan Gauch,Douglas Adams,Miaoqing Huang
关键词: respond with counterarguments, opponents’ arguments, compelling arguments, ideas over time, discussion unfolds
中文关键词: 用反驳、对手论点、令人信服的论点、随着时间的推移的想法来回应,讨论展开
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures

Abstract:Online debates involve a dynamic exchange of ideas over time, where participants need to actively consider their opponents’ arguments, respond with counterarguments, reinforce their own points, and introduce more compelling arguments as the discussion unfolds. Modeling such a complex process is not a simple task, as it necessitates the incorporation of both sequential characteristics and the capability to capture interactions effectively. To address this challenge, we employ a sequence-graph approach. Building the conversation as a graph allows us to effectively model interactions between participants through directed edges. Simultaneously, the propagation of information along these edges in a sequential manner enables us to capture a more comprehensive representation of context. We also introduce a Sequence Graph Attention layer to illustrate the proposed information update scheme. The experimental results show that sequence graph networks achieve superior results to existing methods in online debates.

[NLP-63] Learning to Correct for QA Reasoning with Black-box LLMs

Link: https://arxiv.org/abs/2406.18695
Authors: Jaehyung Kim, Dongyoung Kim, Yiming Yang
Keywords: output token probabilities, recent machine learning, large language models, token probabilities, open challenge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint, 18 pages

Abstract: An open challenge in recent machine learning is how to improve the reasoning capability of large language models (LLMs) in a black-box setting, i.e., without access to detailed information such as output token probabilities. Existing approaches either rely on accessibility (which is often unrealistic) or involve significantly increased training- and inference-time costs. This paper addresses those limitations by proposing a novel approach, namely CoBB (Correct for improving QA reasoning of Black-Box LLMs). It uses a trained adaptation model to perform a seq2seq mapping from the often-imperfect reasonings of the original black-box LLM to the correct or improved reasonings. Specifically, the adaptation model is initialized with a relatively small open-source LLM and adapted over a collection of sub-sampled training pairs. To select the representative pairs of correct and incorrect reasonings, we formulated the dataset construction as an optimization problem that minimizes the statistical divergence between the sampled subset and the entire collection, and solved it via a genetic algorithm. We then train the adaptation model over the sampled pairs by contrasting the likelihoods of correct and incorrect reasonings. Our experimental results demonstrate that CoBB significantly improves reasoning accuracy across various QA benchmarks, compared to the best-performing adaptation baselines.

[NLP-64] The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm

Link: https://arxiv.org/abs/2406.18682
Authors: Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker
Keywords: key concern, implicit question, alignment, languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:A key concern with the concept of “alignment” is the implicit question of “alignment to what?”. AI systems are increasingly used across the world, yet safety alignment is often focused on homogeneous monolingual settings. Additionally, preference training and safety measures often overfit to harms common in Western-centric datasets. Here, we explore the viability of different alignment approaches when balancing dual objectives: addressing and optimizing for a non-homogeneous set of languages and cultural preferences while minimizing both global and local harms. We collect the first set of human annotated red-teaming prompts in different languages distinguishing between global and local harm, which serve as a laboratory for understanding the reliability of alignment techniques when faced with preference distributions that are non-stationary across geographies and languages. While this setting is seldom covered by the literature to date, which primarily centers on English harm mitigation, it captures real-world interactions with AI systems around the world. We establish a new precedent for state-of-the-art alignment techniques across 6 languages with minimal degradation in general performance. Our work provides important insights into cross-lingual transfer and novel optimization approaches to safeguard AI systems designed to serve global populations.

[NLP-65] Few-shot Personalization of LLMs with Mis-aligned Responses

Link: https://arxiv.org/abs/2406.18678
Authors: Jaehyung Kim, Yiming Yang
Keywords: large language models, providing personalized responses, language models, increasingly important, capability of providing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint, 30 pages

Abstract: As the diversity of users increases, the capability of providing personalized responses by large language models (LLMs) has become increasingly important. Existing approaches have had only limited success in LLM personalization, due to the absence of personalized learning or the reliance on shared personal data. This paper proposes a new approach for a few-shot personalization of LLMs with their mis-aligned responses (Fermi). Our key idea is to learn a set of personalized prompts for each user by progressively improving the prompts using LLMs, based on the user profile (e.g., demographic information) and a few examples of previous opinions. During an iterative process of prompt improvement, we incorporate the contexts of mis-aligned responses by LLMs, which are especially crucial for the effective personalization of LLMs. In addition, we develop an effective inference method to further leverage the context of the test query and the personalized prompts. Our experimental results demonstrate that Fermi significantly improves performance across various benchmarks, compared to the best-performing baselines.

[NLP-66] Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2406.18676
Authors: Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, Ji-Rong Wen
Keywords: large language models, Retrieval-augmented generation, RAG systems, reliable RAG system, RAG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress

Abstract: Retrieval-augmented generation (RAG) has demonstrated effectiveness in mitigating the hallucination problem of large language models (LLMs). However, the difficulty of aligning the retriever with the diverse LLMs’ knowledge preferences poses an inevitable challenge in developing a reliable RAG system. To address this issue, we propose DPA-RAG, a universal framework designed to align diverse knowledge preferences within RAG systems. Specifically, we initially introduce a preference knowledge construction pipeline and incorporate five novel query augmentation strategies to alleviate preference data scarcity. Based on preference data, DPA-RAG accomplishes both external and internal preference alignment: 1) It jointly integrates pair-wise, point-wise, and contrastive preference alignment abilities into the reranker, achieving external preference alignment among RAG components. 2) It further introduces a pre-aligned stage before vanilla Supervised Fine-tuning (SFT), enabling LLMs to implicitly capture knowledge aligned with their reasoning preferences, achieving LLMs’ internal alignment. Experimental results across four knowledge-intensive QA datasets demonstrate that DPA-RAG outperforms all baselines and seamlessly integrates both black-box and open-sourced LLM readers. Further qualitative analysis and discussions also provide empirical guidance for achieving reliable RAG systems. Our code is publicly available at this https URL.

[NLP-67] Human-AI Collaborative Taxonomy Construction: A Case Study in Profession-Specific Writing Assistants

Link: https://arxiv.org/abs/2406.18675
Authors: Minhwa Lee, Zae Myung Kim, Vivek A. Khetan, Dongyeop Kang
Keywords: Large Language Models, Large Language, Language Models, including text revision, including text
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to CHI 2024 In2Writing Workshop

Abstract:Large Language Models (LLMs) have assisted humans in several writing tasks, including text revision and story generation. However, their effectiveness in supporting domain-specific writing, particularly in business contexts, is relatively less explored. Our formative study with industry professionals revealed the limitations in current LLMs’ understanding of the nuances in such domain-specific writing. To address this gap, we propose an approach of human-AI collaborative taxonomy development to perform as a guideline for domain-specific writing assistants. This method integrates iterative feedback from domain experts and multiple interactions between these experts and LLMs to refine the taxonomy. Through larger-scale experiments, we aim to validate this methodology and thus improve LLM-powered writing assistance, tailoring it to meet the unique requirements of different stakeholder needs.

[NLP-68] RouteLLM: Learning to Route LLMs with Preference Data

Link: https://arxiv.org/abs/2406.18665
Authors: Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica
Keywords: Large language models, Large language, exhibit impressive capabilities, exhibit impressive, range of tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs, by over 2 times in certain cases, without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
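
The routing idea in the abstract above, choosing between a stronger and a weaker LLM per query, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Router` class, the `win_prob` scorer, and the toy length-based heuristic are hypothetical and are not the paper's router models, which the abstract says are trained on human preference data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    # Hypothetical scorer: estimated probability that the strong model's
    # answer would be preferred over the weak model's for this prompt.
    win_prob: Callable[[str], float]
    # Raising the threshold sends more traffic to the cheaper weak model.
    threshold: float = 0.5

    def route(self, prompt: str) -> str:
        """Return which model ("strong" or "weak") should serve the prompt."""
        return "strong" if self.win_prob(prompt) >= self.threshold else "weak"


def toy_win_prob(prompt: str) -> float:
    # Toy stand-in for a learned router: treat longer prompts as harder.
    return min(len(prompt.split()) / 20.0, 1.0)


router = Router(win_prob=toy_win_prob, threshold=0.5)
choices = [router.route(p) for p in ("What is 2 + 2?", " ".join(["step"] * 30))]
```

In this sketch the threshold is the single knob trading cost against quality; a learned scorer would replace `toy_win_prob`.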

[NLP-69] Evaluating Copyright Takedown Methods for Language Models

Link: https://arxiv.org/abs/2406.18664
Authors: Boyi Wei, Weijia Shi, Yangsibo Huang, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, Peter Henderson
Keywords: potentially copyrighted material, including potentially copyrighted, Language models, derive their capabilities, copyrighted material
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 31 pages, 9 figures, 14 tables

Abstract: Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure copyright takedowns for LMs, noting the conceptual similarity to (but legal distinction from) the DMCA takedown. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework to assess the effectiveness of copyright takedown methods, the impact on the model’s ability to retain uncopyrightable factual knowledge from the training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.

[NLP-70] Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Link: https://arxiv.org/abs/2406.18629
Authors: Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia
Keywords: Large Language Models, Large Language, challenge for Large, Mathematical reasoning presents, Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code, data, and models are available at this https URL

Abstract:Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter’s out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at this https URL.
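
For readers unfamiliar with DPO, the contrast the abstract draws between whole-answer and step-wise preference optimization can be written out as follows. The first line is the standard DPO objective. The second is only a sketch of what a step-level variant might look like, assuming the preference pair consists of a preferred and a dispreferred next step given the prompt and the preceding steps. The symbols $s_{<k}$, $s_{\text{win}}$, and $s_{\text{lose}}$ are notation introduced here, not taken from the abstract.

```latex
% Standard DPO: preference over complete answers y_w (preferred) and y_l (dispreferred)
\mathcal{L}_{\text{DPO}}(\theta) =
  -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]

% Assumed step-wise form: preference over a single next step, conditioned on
% the prompt x and the preceding reasoning steps s_{<k}
\mathcal{L}_{\text{step}}(\theta) =
  -\mathbb{E}_{(x,\, s_{<k},\, s_{\text{win}},\, s_{\text{lose}})}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(s_{\text{win}} \mid x, s_{<k})}{\pi_{\text{ref}}(s_{\text{win}} \mid x, s_{<k})}
      - \beta \log \frac{\pi_\theta(s_{\text{lose}} \mid x, s_{<k})}{\pi_{\text{ref}}(s_{\text{lose}} \mid x, s_{<k})}
    \right)
  \right]
```

Here $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference.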

[NLP-71] Towards Large Language Model Aided Program Refinement

Link: https://arxiv.org/abs/2406.18616
Authors: Yufan Cai, Zhe Hou, Xiaokun Luan, David Miguel Sanan Baena, Yun Lin, Jun Sun, Jin Song Dong
Keywords: involves correctness-preserving transformations, refinement involves correctness-preserving, Program refinement involves, high-level specification statements, Program refinement
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Program refinement involves correctness-preserving transformations from formal high-level specification statements into executable programs. Traditional verification tool support for program refinement is highly interactive and lacks automation. On the other hand, the emergence of large language models (LLMs) enables automatic code generation from informal natural language specifications. However, code generated by LLMs is often unreliable. Moreover, the opaque procedure from specification to code provided by an LLM is an uncontrolled black box. We propose LLM4PR, a tool that combines formal program refinement techniques with informal LLM-based methods to (1) transform the specification to preconditions and postconditions, (2) automatically build prompts based on refinement calculus, (3) interact with the LLM to generate code, and finally, (4) verify that the generated code satisfies the conditions of refinement calculus, thus guaranteeing the correctness of the code. We have implemented our tool using GPT4, Coq, and Coqhammer, and evaluated it on the HumanEval and EvalPlus datasets.

[NLP-72] Self-Supervised Time-Series Anomaly Detection Using Learnable Data Augmentation

Link: https://arxiv.org/abs/2406.12260
Authors: Kukjin Choi, Jihun Yi, Jisoo Mok, Sungroh Yoon
Keywords: Continuous efforts, advance anomaly detection, industrial sites, anomaly detection, made to advance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, IEEE Transactions on Emerging Topics in Computational Intelligence

Abstract:Continuous efforts are being made to advance anomaly detection in various manufacturing processes to increase the productivity and safety of industrial sites. Deep learning replaced rule-based methods and recently emerged as a promising method for anomaly detection in diverse industries. However, in the real world, the scarcity of abnormal data and difficulties in obtaining labeled data create limitations in the training of detection models. In this study, we addressed these shortcomings by proposing a learnable data augmentation-based time-series anomaly detection (LATAD) technique that is trained in a self-supervised manner. LATAD extracts discriminative features from time-series data through contrastive learning. At the same time, learnable data augmentation produces challenging negative samples to enhance learning efficiency. We measured anomaly scores of the proposed technique based on latent feature similarities. As per the results, LATAD exhibited comparable or improved performance to the state-of-the-art anomaly detection assessments on several benchmark datasets and provided a gradient-based diagnosis technique to help identify root causes.

[NLP-73] Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over

Link: https://arxiv.org/abs/2406.18972
Authors: Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix
Keywords: Large language models, automatic speech recognition, Large language, N-best ASR hypotheses, rescoring automatic speech
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: 5 pages

Abstract:Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we reveal it by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LLMs, and the CHiME-7 DASR task provides datasets of casual conversations between multiple participants. We investigate the effects of domain adaptation of the LLM and context carry-over when performing N-best rescoring. Experimental results show that, even without domain adaptation, Llama2 outperforms a standard-size domain-adapted Transformer-LM, especially when using a long context. Domain adaptation shortens the context length needed with Llama2 to achieve its best performance, i.e., it reduces the computational cost of Llama2.
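
As a rough illustration of N-best rescoring with context carry-over, the sketch below combines each hypothesis's ASR score with a language-model score computed against the preceding conversation and keeps the highest-scoring hypothesis. The function names, the log-linear interpolation, and the toy scorer are assumptions for illustration. The abstract does not give the paper's scoring formula, and a real setup would obtain `lm_log_prob` from an LLM such as Llama2.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    hypotheses: List[Tuple[str, float]],
    lm_log_prob: Callable[[str, str], float],
    context: str = "",
    lm_weight: float = 0.5,
) -> str:
    """Pick the hypothesis maximizing asr_score + lm_weight * lm_log_prob.

    hypotheses: (text, asr_log_score) pairs from the recognizer's N-best list.
    context: preceding utterances carried over from the conversation.
    """
    best_text, _ = max(
        hypotheses,
        key=lambda h: h[1] + lm_weight * lm_log_prob(context, h[0]),
    )
    return best_text


def toy_lm_log_prob(context: str, text: str) -> float:
    # Toy stand-in for an LLM: words already seen in the context cost nothing,
    # unseen words cost one unit of log-probability each.
    seen = set(context.lower().split())
    return sum(0.0 if word in seen else -1.0 for word in text.lower().split())


nbest = [("i sore the movie", -1.0), ("i saw the movie", -1.2)]
history = "we saw a movie last night"
best = rescore_nbest(nbest, toy_lm_log_prob, context=history, lm_weight=0.5)
```

With the conversation history supplied, the second hypothesis overtakes the recognizer's top choice; with `lm_weight=0.0` the original ASR ranking is kept.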

[NLP-74] DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

Link: https://arxiv.org/abs/2406.18871
Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
Keywords: typically incorporate pre-trained, Recent speech language, incorporate pre-trained speech, typically incorporate, large language models
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2024

Abstract:Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.

[NLP-75] WavRx: a Disease-Agnostic, Generalizable, and Privacy-Preserving Speech Health Diagnostic Model

Link: https://arxiv.org/abs/2406.18731
Authors: Yi Zhu, Tiago Falk
Keywords: carry health-related attributes, long-term health monitoring, health-related attributes, carry health-related, venue for remote
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review; Model script available at this https URL

Abstract:Speech is known to carry health-related attributes, which has emerged as a novel venue for remote and long-term health monitoring. However, existing models are usually tailored for a specific type of disease, and have been shown to lack generalizability across datasets. Furthermore, concerns have been raised recently towards the leakage of speaker identity from health embeddings. To mitigate these limitations, we propose WavRx, a speech health diagnostics model that captures the respiration and articulation related dynamics from a universal speech representation. Our in-domain and cross-domain experiments on six pathological speech datasets demonstrate WavRx as a new state-of-the-art health diagnostic model. Furthermore, we show that the amount of speaker identity entailed in the WavRx health embeddings is significantly reduced without extra guidance during training. An in-depth analysis of the model was performed, thus providing physiological interpretation of its improved generalizability and privacy-preserving ability.

[NLP-76] Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

Link: https://arxiv.org/abs/2406.18679
Authors: Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan
Keywords: embedding-based Speaker Diarization, neural diarization, traditional embedding-based Speaker, models offer significant, Speaker Diarization
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at INTERSPEECH 2024

Abstract: End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but fall short on generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reduction of 13% and 10% over the conventional 1-pass EEND on Callhome American English and RT03-CTS datasets respectively and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.

[NLP-77] An LLM-based Knowledge Synthesis and Scientific Reasoning Framework for Biomedical Discovery

Link: https://arxiv.org/abs/2406.18626
Authors: Oskar Wysocki, Magdalena Wysocka, Danilo Carvalho, Alex Teodor Bogatu, Danilo Miranda Gusicuma, Maxime Delmas, Harriet Unsworth, Andre Freitas
Keywords: supporting biological analyses, Large Language Models, Lunar framework, molecular-level evidence enrichment, integrates Large Language
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: accepted for ACL 2024 System Demonstration Track

Abstract:We present BioLunar, developed using the Lunar framework, as a tool for supporting biological analyses, with a particular emphasis on molecular-level evidence enrichment for biomarker discovery in oncology. The platform integrates Large Language Models (LLMs) to facilitate complex scientific reasoning across distributed evidence spaces, enhancing the capability for harmonizing and reasoning over heterogeneous data sources. Demonstrating its utility in cancer research, BioLunar leverages modular design, reusable data access and data analysis components, and a low-code user interface, enabling researchers of all programming levels to construct LLM-enabled scientific workflows. By facilitating automatic scientific discovery and inference from heterogeneous evidence, BioLunar exemplifies the potential of the integration between LLMs, specialised databases and biomedical tools to support expert-level knowledge synthesis and discovery.

Computer Vision

[CV-0] Dataset Size Recovery from LoRA Weights

Link: https://arxiv.org/abs/2406.19395
Authors: Mohammad Salama, Jonathan Kahana, Eliahu Horwitz, Yedid Hoshen
Keywords: membership inference attacks, inference attacks aim, inversion and membership, membership inference, reconstruct and verify
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Model inversion and membership inference attacks aim to reconstruct and verify the data which a model was trained on. However, they are not guaranteed to find all training samples as they do not know the size of the training set. In this paper, we introduce a new task: dataset size recovery, that aims to determine the number of samples used to train a model, directly from its weights. We then propose DSiRe, a method for recovering the number of images used to fine-tune a model, in the common case where fine-tuning uses LoRA. We discover that both the norm and the spectrum of the LoRA matrices are closely linked to the fine-tuning dataset size; we leverage this finding to propose a simple yet effective prediction algorithm. To evaluate dataset size recovery of LoRA weights, we develop and release a new benchmark, LoRA-WiSE, consisting of over 25000 weight snapshots from more than 2000 diverse LoRA fine-tuned models. Our best classifier can predict the number of fine-tuning images with a mean absolute error of 0.36 images, establishing the feasibility of this attack.
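
The headline number in this abstract is a mean absolute error (MAE) measured in images. As a reminder of what that metric computes, here is a minimal sketch. The dataset sizes below are made-up illustration values, not results from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and true dataset sizes."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical example: true fine-tuning set sizes and a predictor's guesses.
true_sizes = [1, 2, 4, 6]
pred_sizes = [1, 3, 4, 6]
mae = mean_absolute_error(pred_sizes, true_sizes)
```

An MAE below one image, as the abstract reports, means the predicted dataset size is off by less than a single image on average.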

[CV-1] HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection

Link: https://arxiv.org/abs/2406.19394
Authors: Liujuan Cao,Jianghang Lin,Zebo Hong,Yunhang Shen,Shaohui Lin,Chao Chen,Rongrong Ji
Keywords: poor local optimum, generate candidate regions, WSOD methods rely, local optimum, traditional object proposals
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Most WSOD methods rely on traditional object proposals to generate candidate regions and are confronted with unstable training, which easily gets stuck in a poor local optimum. In this paper, we introduce a unified, high-capacity weakly supervised object detection (WSOD) network called HUWSOD, which utilizes a comprehensive self-training framework without needing external modules or additional supervision. HUWSOD innovatively incorporates a self-supervised proposal generator and an autoencoder proposal generator with a multi-rate resampling pyramid to replace traditional object proposals, enabling end-to-end WSOD training and inference. Additionally, we implement a holistic self-training scheme that refines detection scores and coordinates through step-wise entropy minimization and consistency-constraint regularization, ensuring consistent predictions across stochastic augmentations of the same image. Extensive experiments on PASCAL VOC and MS COCO demonstrate that HUWSOD competes with state-of-the-art WSOD methods, eliminating the need for offline proposals and additional data. The peak performance of HUWSOD approaches that of fully-supervised Faster R-CNN. Our findings also indicate that randomly initialized boxes, although significantly different from well-designed offline object proposals, are effective for WSOD training.
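
The self-training scheme described above combines entropy minimization with a consistency constraint across augmentations of the same image. A minimal sketch of what such a combined objective can look like follows; it is an assumption-laden illustration, not HUWSOD's actual loss. The function names, the squared-difference consistency term, and the `weight` parameter are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats); low values mean confident predictions."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def consistency(probs_a, probs_b):
    """Mean squared difference between two predictions for the same box."""
    return sum((a - b) ** 2 for a, b in zip(probs_a, probs_b)) / len(probs_a)


def self_training_loss(logits_a, logits_b, weight=1.0):
    """Entropy of both augmented views plus a weighted consistency penalty."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return 0.5 * (entropy(pa) + entropy(pb)) + weight * consistency(pa, pb)


# Two views of the same proposal: confident and agreeing vs. uncertain.
print(self_training_loss([5.0, 0.0], [5.0, 0.0]))
print(self_training_loss([0.0, 0.0], [0.0, 0.0]))
```

Minimizing this quantity pushes each view toward a confident prediction and both views toward the same prediction.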

[CV-2] Looking 3D: Anomaly Detection with 2D-3D Alignment

Link: https://arxiv.org/abs/2406.19393
Authors: Ankan Bhunia,Changjian Li,Hakan Bilen
Keywords: product quality assessment, Automatic anomaly detection, visual cues holds, cues holds practical, holds practical significance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR’24. Code and dataset available at this https URL

Abstract:Automatic anomaly detection based on visual cues holds practical significance in various domains, such as manufacturing and product quality assessment. This paper introduces a new conditional anomaly detection problem, which involves identifying anomalies in a query image by comparing it to a reference shape. To address this challenge, we have created a large dataset, BrokenChairs-180K, consisting of around 180K images, with diverse anomalies, geometries, and textures paired with 8,143 reference 3D shapes. To tackle this task, we have proposed a novel transformer-based approach that explicitly learns the correspondence between the query image and reference 3D shape via feature alignment and leverages a customized attention mechanism for anomaly detection. Our approach has been rigorously evaluated through comprehensive experiments, serving as a benchmark for future research in this domain.

[CV-3] ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

Link: https://arxiv.org/abs/2406.19392
Authors: Jr-Jen Chen,Yu-Chien Liao,Hsi-Che Lin,Yu-Chu Yu,Yen-Chun Chen,Yu-Chiang Frank Wang
Keywords: designed to rigorously, models’ ability, ability to perform, video segments, video events
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce ReXTime, a benchmark designed to rigorously test AI models’ ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e. human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, requiring advanced understanding of cause-and-effect relationships across video segments, poses significant challenges to even the frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine generated samples without manual effort, which empirical studies suggest can enhance the across-time reasoning via fine-tuning.

[CV-4] Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads

Link: https://arxiv.org/abs/2406.19391
Authors: Ali Khaleghi Rahimian,Manish Kumar Govind,Subhajit Maity,Dominick Reilly,Christian Kümmerle,Srijan Das,Aritra Dutta
Keywords: solved by Vision, Vision Transformer, computational bottleneck due, predominantly solved, bottleneck due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is publicly available at this https URL

Abstract:Visual perception tasks are predominantly solved by Vision Transformer (ViT) architectures, which, despite their effectiveness, encounter a computational bottleneck due to the quadratic complexity of computing self-attention. This inefficiency is largely due to the self-attention heads capturing redundant token interactions, reflecting inherent redundancy within visual data. Many works have aimed to reduce the computational complexity of self-attention in ViTs, leading to the development of efficient and sparse transformer architectures. In this paper, viewing through the efficiency lens, we realized that introducing any sparse self-attention strategy in ViTs can keep the computational overhead low. However, these strategies are sub-optimal as they often fail to capture fine-grained visual details. This observation leads us to propose a general, efficient, sparse architecture, named Fibottention, for approximating self-attention with superlinear complexity that is built upon Fibonacci sequences. The key strategies in Fibottention include: it excludes proximate tokens to reduce redundancy, employs structured sparsity by design to decrease computational demands, and incorporates inception-like diversity across attention heads. This diversity ensures the capture of complementary information through non-overlapping token interactions, optimizing both performance and resource utilization in ViTs for visual representation learning. We embed our Fibottention mechanism into multiple state-of-the-art transformer architectures dedicated to visual tasks. Leveraging only 2-6% of the elements in the self-attention heads, Fibottention, in conjunction with ViT and its variants, consistently achieves significant performance boosts compared to standard ViTs in nine datasets across three domains: image classification, video understanding, and robot learning tasks.
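
To make the notion of a Fibonacci-structured sparse attention pattern more concrete, here is a small sketch that builds a boolean mask in which token i may attend to token j only when their distance |i - j| is a term of a Fibonacci-like sequence. The abstract does not spell out the exact construction, so this is only a plausible reading: the function names, the `start` parameter used to vary the pattern per head, and the symmetric distance rule are assumptions.

```python
def fibonacci_offsets(max_offset, start=(1, 2)):
    """Terms of a Fibonacci-like sequence, seeded by `start`, up to max_offset."""
    a, b = start
    offsets = []
    while a <= max_offset:
        offsets.append(a)
        a, b = b, a + b
    return offsets


def fibonacci_mask(num_tokens, start=(1, 2)):
    """mask[i][j] is True when |i - j| is one of the allowed offsets."""
    allowed = set(fibonacci_offsets(num_tokens - 1, start))
    return [
        [abs(i - j) in allowed for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def density(mask):
    """Fraction of query-key pairs that are kept by the mask."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


# Different seeds give different heads different patterns; a larger seed
# also skips the nearest neighbors of each token.
head_a = fibonacci_mask(64, start=(1, 2))
head_b = fibonacci_mask(64, start=(2, 3))
print(density(head_a), density(head_b))
```

Because the allowed offsets grow roughly geometrically, the number of kept entries per row grows only logarithmically with sequence length, which is where the sparsity comes from in this toy version.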

[CV-5] SALVe: Semantic Alignment Verification for Floorplan Reconstruction from Sparse Panoramas

Link: https://arxiv.org/abs/2406.19390
Authors: John Lambert,Yuguang Li,Ivaylo Boyadzhiev,Lambert Wixson,Manjunath Narayana,Will Hutchcroft,James Hays,Frank Dellaert,Sing Bing Kang
Keywords: learned alignment verifier, pairwise learned alignment, alignment verifier, learned alignment, pairwise learned
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2022

Abstract:We propose a new system for automatic 2D floorplan reconstruction that is enabled by SALVe, our novel pairwise learned alignment verifier. The inputs to our system are sparsely located 360° panoramas, whose semantic features (windows, doors, and openings) are inferred and used to hypothesize pairwise room adjacency or overlap. SALVe initializes a pose graph, which is subsequently optimized using GTSAM. Once the room poses are computed, room layouts are inferred using HorizonNet, and the floorplan is constructed by stitching the most confident layout boundaries. We validate our system qualitatively and quantitatively as well as through ablation studies, showing that it outperforms state-of-the-art SfM systems in completeness by over 200%, without sacrificing accuracy. Our results point to the significance of our work: poses of 81% of panoramas are localized in the first 2 connected components (CCs), and 89% in the first 3 CCs. Code and models are publicly available at this https URL.

[CV-6] OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Link: https://arxiv.org/abs/2406.19389
Authors: Tao Zhang,Xiangtai Li,Hao Fei,Haobo Yuan,Shengqiong Wu,Shunping Ji,Chen Change Loy,Shuicheng Yan
Keywords: Current universal segmentation, demonstrate strong capabilities, methods demonstrate strong, Current universal, demonstrate strong
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user’s text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.

[CV-7] Taming Data and Transformers for Audio Generation

Link: https://arxiv.org/abs/2406.19388
Authors: Moayed Haji-Ali,Willi Menapace,Aliaksandr Siarohin,Guha Balakrishnan,Sergey Tulyakov,Vicente Ordonez
Keywords: Generating ambient sounds, Generating ambient, challenging problem due, employ large-scale generative, making it difficult
Categories: Sound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Project Webpage: this https URL

Abstract:Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.

[CV-8] Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

Link: https://arxiv.org/abs/2406.19369
Authors: Haobo Yuan,Xiangtai Li,Lu Qi,Tao Zhang,Ming-Hsuan Yang,Shuicheng Yan,Chen Change Loy
Keywords: Transformer-based segmentation methods, Transformer-based segmentation, high-resolution images, face the challenge, inference when dealing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages; 8 figures

Abstract:Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operation, which achieves the best for both accuracy and efficiency. In addition, we design an efficient decoder to utilize the multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model using this benchmark. Based on the benchmark results, our RWKV-SAM achieves outstanding performance in efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with the same-scale transformer model, RWKV-SAM achieves more than 2x speedup and can achieve better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models with better classification and semantic segmentation results. Code and models will be publicly available.

[CV-9] SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues

Link: https://arxiv.org/abs/2406.19364
Authors: Yuxin Xie,Tao Zhou,Yi Zhou,Geng Chen
Keywords: Weakly-supervised medical image, Weakly-supervised medical, aims to reduce, reduce the annotation, annotation cost
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Weakly-supervised medical image segmentation is a challenging task that aims to reduce the annotation cost while maintaining segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and simultaneously studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks: colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance.

[CV-10] STAL3D: Unsupervised Domain Adaptation for 3D Object Detection via Collaborating Self-Training and Adversarial Learning

Link: https://arxiv.org/abs/2406.19362
Authors: Yanan Zhang,Chao Zhou,Di Huang
Keywords: expensive annotation costs, unknown data due, generalize detection models, detection models trained, Unsupervised Domain Adaptation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE-TIV

Abstract:Existing 3D object detection suffers from expensive annotation costs and poor transferability to unknown data due to the domain gap. Unsupervised Domain Adaptation (UDA) aims to generalize detection models trained in labeled source domains to perform robustly on unexplored target domains, providing a promising solution for cross-domain 3D object detection. Although Self-Training (ST) based cross-domain 3D detection methods with the assistance of pseudo-labeling techniques have achieved remarkable progress, they still face the issue of low-quality pseudo-labels when there are significant domain disparities due to the absence of a process for feature distribution alignment. While Adversarial Learning (AL) based methods can effectively align the feature distributions of the source and target domains, the inability to obtain labels in the target domain forces the adoption of asymmetric optimization losses, resulting in a challenging issue of source domain bias. To overcome these limitations, we propose a novel unsupervised domain adaptation framework for 3D object detection via collaborating ST and AL, dubbed STAL3D, unleashing the complementary advantages of pseudo labels and feature distribution alignment. Additionally, a Background Suppression Adversarial Learning (BS-AL) module and a Scale Filtering Module (SFM) are designed specifically for 3D cross-domain scenes, effectively alleviating the issues of the large proportion of background interference and source domain size bias. Our STAL3D achieves state-of-the-art performance on multiple cross-domain tasks and even surpasses the Oracle results on Waymo → KITTI and Waymo → KITTI-rain.

[CV-11] CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement

Link: https://arxiv.org/abs/2406.19353
Authors: Chengwen Zhang,Yun Liu,Ruofan Xing,Bingda Tang,Li Yi
Keywords: humans cooperatively rearrange, cooperatively rearrange household, Understanding how humans, rearrange household objects, humans cooperatively
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Understanding how humans cooperatively rearrange household objects is critical for VR/AR and human-robot interaction. However, in-depth studies on modeling these behaviors are under-researched due to the lack of relevant datasets. We fill this gap by presenting CORE4D, a novel large-scale 4D human-object-human interaction dataset focusing on collaborative object rearrangement, which encompasses diverse compositions of various object geometries, collaboration modes, and 3D scenes. With 1K human-object-human motion sequences captured in the real world, we enrich CORE4D by contributing an iterative collaboration retargeting strategy to augment motions to a variety of novel objects. Leveraging this approach, CORE4D comprises a total of 11K collaboration sequences spanning 3K real and virtual object shapes. Benefiting from extensive motion patterns provided by CORE4D, we benchmark two tasks aiming at generating human-object interaction: human-object motion forecasting and interaction synthesis. Extensive experiments demonstrate the effectiveness of our collaboration retargeting strategy and indicate that CORE4D has posed new challenges to existing human-object interaction generation methodologies. Our dataset and code are available at this https URL.

[CV-12] Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation

Link: https://arxiv.org/abs/2406.19341
Authors: Yushun Tang,Shuoshuo Chen,Zhehan Kan,Yi Zhang,Qinghai Guo,Zhihai He
Keywords: Fully test-time adaptation, deep neural networks, test-time adaptation aims, test-time adaptation performance, network model based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by TMM

Abstract:Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. This work is based on the following interesting finding: in transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation. This learned token, when combined with input image patch embeddings, is able to gradually remove the domain-specific information from the feature representations of input samples during the transformer encoding process, thereby significantly improving the test-time adaptation performance of the source model across different domains. We refer to this class token as visual conditioning token (VCT). To successfully learn the VCT, we propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics. Experimental results on the benchmark datasets demonstrate that our proposed bi-level visual conditioning token learning method is able to achieve significantly improved test-time adaptation performance by up to 1.9%.

[CV-13] Efficient World Models with Context-Aware Tokenization

Link: https://arxiv.org/abs/2406.19320
Authors: Vincent Micheli,Eloi Alonso,François Fleuret
Keywords: deep Reinforcement Learning, Reinforcement Learning, Scaling up deep, deep Reinforcement, methods presents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024

Abstract:Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose Δ-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, Δ-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at this https URL.

[CV-14] Enhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation

Link: https://arxiv.org/abs/2406.19316
Authors: KuanChao Chu,Satoshi Yamazaki,Hideki Nakayama
Keywords: Scene Graph Generation, Graph Generation, Scene Graph, informative relational triplets, Soft Transfer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEICE Transactions on Information and Systems in April 2024

Abstract:This work focuses on training dataset enhancement of informative relational triplets for Scene Graph Generation (SGG). Due to the lack of effective supervision, the current SGG model predictions perform poorly for informative relational triplets with inadequate training samples. Therefore, we propose two novel training dataset enhancement modules: Feature Space Triplet Augmentation (FSTA) and Soft Transfer. FSTA leverages a feature generator trained to generate representations of an object in relational triplets. The biased-prediction-based sampling in FSTA efficiently augments artificial triplets focusing on the challenging ones. In addition, we introduce Soft Transfer, which assigns soft predicate labels to general relational triplets to effectively provide more supervision for informative predicate classes. Experimental results show that integrating FSTA and Soft Transfer achieves high levels of both Recall and mean Recall on the Visual Genome dataset. The mean of Recall and mean Recall is the highest among all the existing model-agnostic methods.
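
The difference between Recall and mean Recall, and why their average is reported, is easy to see on a toy example. The sketch below is a simplified illustration: real SGG evaluation computes Recall@K per image over ranked predictions, whereas here each ground-truth triplet is simply marked as recovered or not. The function name and the data are invented.

```python
from collections import defaultdict


def recall_and_mean_recall(predicates, recovered):
    """Return (Recall, mean Recall, their average).

    predicates: predicate label of each ground-truth triplet.
    recovered:  parallel list of booleans, True if the model found it.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for predicate, hit in zip(predicates, recovered):
        totals[predicate] += 1
        hits[predicate] += int(hit)

    # Recall pools all triplets, so frequent predicates dominate.
    recall = sum(hits.values()) / len(predicates)
    # Mean Recall averages per-predicate recall, so rare predicates count equally.
    mean_recall = sum(hits[p] / totals[p] for p in totals) / len(totals)
    return recall, mean_recall, (recall + mean_recall) / 2


# A frequent predicate ("on") is always found; a rare one ("riding") is missed.
predicates = ["on", "on", "on", "on", "riding"]
recovered = [True, True, True, True, False]
print(recall_and_mean_recall(predicates, recovered))
```

In this toy case Recall is 0.8 while mean Recall is only 0.5, which shows how a model can score well on Recall while failing on rare, informative predicates.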

[CV-15] Mapping Land Naturalness from Sentinel-2 using Deep Contextual and Geographical Priors

Link: https://arxiv.org/abs/2406.19302
Authors: Burak Ekim,Michael Schmitt
Keywords: recent decades, affecting our planet, unprecedented scale, climate change, combating climate change
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, ICLR 2024 Tackling Climate Change with Machine Learning Workshop

Abstract:In recent decades, the causes and consequences of climate change have accelerated, affecting our planet on an unprecedented scale. This change is closely tied to the ways in which humans alter their surroundings. As our actions continue to impact natural areas, using satellite images to observe and measure these effects has become crucial for understanding and combating climate change. Aiming to map land naturalness on the continuum of modern human pressure, we have developed a multi-modal supervised deep learning framework that addresses the unique challenges of satellite data and the task at hand. We incorporate contextual and geographical priors, represented by corresponding coordinate information and broader contextual information, including and surrounding the immediate patch to be predicted. Our framework improves the model’s predictive performance in mapping land naturalness from Sentinel-2 data, a type of multi-spectral optical satellite imagery. Recognizing that our protective measures are only as effective as our understanding of the ecosystem, quantifying naturalness serves as a crucial step toward enhancing our environmental stewardship.

[CV-16] PNeRV: A Polynomial Neural Representation for Videos

Link: https://arxiv.org/abs/2406.19299
Authors: Sonam Gupta,Snehal Singh Tomar,Grigorios G Chrysos,Sukhendu Das,A. N. Rajagopalan
Keywords: Extracting Implicit Neural, additional temporal dimension, Polynomial Neural Representation, Implicit Neural Representations, Extracting Implicit
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 17 figures, published at TMLR, Feb 2024

Abstract:Extracting Implicit Neural Representations (INRs) on video data poses unique challenges due to the additional temporal dimension. In the context of videos, INRs have predominantly relied on a frame-only parameterization, which sacrifices the spatiotemporal continuity observed in pixel-level (spatial) representations. To mitigate this, we introduce Polynomial Neural Representation for Videos (PNeRV), a parameter-wise efficient, patch-wise INR for videos that preserves spatiotemporal continuity. PNeRV leverages the modeling capabilities of Polynomial Neural Networks to perform the modulation of a continuous spatial (patch) signal with a continuous time (frame) signal. We further propose a custom Hierarchical Patch-wise Spatial Sampling Scheme that ensures spatial continuity while retaining parameter efficiency. We also employ a carefully designed Positional Embedding methodology to further enhance PNeRV’s performance. Our extensive experimentation demonstrates that PNeRV outperforms the baselines in conventional Implicit Neural Representation tasks like compression along with downstream applications that require spatiotemporal continuity in the underlying representation. PNeRV not only addresses the challenges posed by video data in the realm of INRs but also opens new avenues for advanced video processing and analysis.

[CV-17] Compositional Image Decomposition with Diffusion Models

Link: https://arxiv.org/abs/2406.19298
Authors: Jocelin Su,Nan Liu,Yanbo Wang,Joshua B. Tenenbaum,Yilun Du
Keywords: scene, set, quickly decompose, components, natural scene
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2024, Webpage: this https URL

Abstract:Given an image of a natural scene, we are able to quickly decompose it into a set of components such as objects, lighting, shadows, and foreground. We can then envision a scene where we combine certain components with those from other images, for instance a set of objects from our bedroom and animals from a zoo under the lighting conditions of a forest, even if we have never encountered such a scene before. In this paper, we present a method to decompose an image into such compositional components. Our approach, Decomp Diffusion, is an unsupervised method which, when given a single image, infers a set of different components in the image, each represented by a diffusion model. We demonstrate how components can capture different factors of the scene, ranging from global scene descriptors like shadows or facial expression to local scene descriptors like constituent objects. We further illustrate how inferred factors can be flexibly composed, even with factors inferred from other models, to generate a variety of scenes sharply different than those seen in training time. Website and code at this https URL.

[CV-18] Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Link: https://arxiv.org/abs/2406.19297
Authors: Malvina Nikandrou,Georgios Pantazopoulos,Ioannis Konstas,Alessandro Suglia
Keywords: minimizing performance drop, Visual Question Answering, Continual learning focuses, Continual learning, multimodal continual learning
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that each modality evolves at different rates across a continuum of tasks and that this behavior occurs in established encoder-only models as well as modern recipes for developing Vision Language (VL) models. Motivated by this observation, we propose a modality-aware feature distillation (MAFED) approach which outperforms existing baselines across models of varying scale in three multimodal continual learning settings. Furthermore, we provide ablations showcasing that modality-aware distillation complements experience replay. Overall, our results emphasize the importance of addressing modality-specific dynamics to prevent forgetting in multimodal continual learning.

[CV-19] Human Modelling and Pose Estimation Overview

Link: https://arxiv.org/abs/2406.19290
Authors: Pawel Knap
Keywords: Computer Vision, Computer Graphics, Machine Learning, crossroads of Computer, Human modelling
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Human modelling and pose estimation stands at the crossroads of Computer Vision, Computer Graphics, and Machine Learning. This paper presents a thorough investigation of this interdisciplinary field, examining various algorithms, methodologies, and practical applications. It explores the diverse range of sensor technologies relevant to this domain and delves into a wide array of application areas. Additionally, we discuss the challenges and advancements in 2D and 3D human modelling methodologies, along with popular datasets, metrics, and future research directions. The main contribution of this paper lies in its up-to-date comparison of state-of-the-art (SOTA) human pose estimation algorithms in both 2D and 3D domains. By providing this comprehensive overview, the paper aims to enhance understanding of 3D human modelling and pose estimation, offering insights into current SOTA achievements, challenges, and future prospects within the field.

[CV-20] HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Link: https://arxiv.org/abs/2406.19280
Authors: Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
Keywords: large language models, multimodal large language, rapid development, large language, medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

[CV-21] Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Link: https://arxiv.org/abs/2406.19263
Authors: Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
Keywords: Graphical User Interfaces, Graphical User, User Interfaces, ToL agent, digital devices
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the SPR task. Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen, distinguishing our ToL agent from other screen reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Last but not least, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along the path of agent execution trajectories. Code and data: this http URL

[CV-22] Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

Link: https://arxiv.org/abs/2406.19255
Authors: Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, Shuicheng Yan
Keywords: coarse-grained cross-modal aligning, shown remarkable potential, detached video-language view, large-scale video-language models, pre-training large-scale video-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: Accepted by IEEE TPAMI 2024

Abstract:While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning, under-modeling of temporal dynamics, and a detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment respectively, enhancing the video-language grounding in both the spatiality and temporality. We design our method as a plug-and-play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves 13 existing strong-performing VLMs, and significantly advances the current state-of-the-art end task performance in both the fine-tuning and zero-shot settings.

[CV-23] Local Manifold Learning for No-Reference Image Quality Assessment

Link: https://arxiv.org/abs/2406.19247
Authors: Timin Gao, Wensheng Pan, Yan Zhang, Sicheng Zhao, Shengchuan Zhang, Xiawu Zheng, Ke Li, Liujuan Cao, Rongrong Ji
Keywords: widely adopted technique, Image Quality Assessment, Contrastive learning, Quality Assessment, adopted technique
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Contrastive learning has considerably advanced the field of Image Quality Assessment (IQA), emerging as a widely adopted technique. The core mechanism of contrastive learning involves minimizing the distance between quality-similar (positive) examples while maximizing the distance between quality-dissimilar (negative) examples. Despite its successes, current contrastive learning methods often neglect the importance of preserving the local manifold structure. This oversight can result in a high degree of similarity among hard examples within the feature space, thereby impeding effective differentiation and assessment. To address this issue, we propose an innovative framework that integrates local manifold learning with contrastive learning for No-Reference Image Quality Assessment (NR-IQA). Our method begins by sampling multiple crops from a given image, identifying the most visually salient crop. This crop is then used to cluster other crops from the same image as the positive class, while crops from different images are treated as negative classes to increase inter-class distance. Uniquely, our approach also considers non-saliency crops from the same image as intra-class negative classes to preserve their distinctiveness. Additionally, we employ a mutual learning framework, which further enhances the model’s ability to adaptively learn and identify visual saliency regions. Our approach outperforms state-of-the-art methods on 7 standard datasets, achieving PLCC values of 0.942 on TID2013 (compared to 0.908) and 0.914 on LIVEC (compared to 0.894).

[CV-24] FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

Link: https://arxiv.org/abs/2406.19237
Authors: Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth
Keywords: question answering lack, spatial reasoning skills, visual question answering, evaluating spatial reasoning, Existing benchmarks
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark’s potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.

[CV-25] Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Link: https://arxiv.org/abs/2406.19236
Authors: Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann
Keywords: aims to develop, navigate based, dynamic human activities, current VLN frameworks, VLN
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: 30 pages, 18 figures, Project Page: this https URL

Abstract:Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN’s unique challenges, underscores the need for further research to enhance HA-VLN agents’ real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.

[CV-26] ProtoGMM: Multi-prototype Gaussian-Mixture-based Domain Adaptation Model for Semantic Segmentation

Link: https://arxiv.org/abs/2406.19225
Authors: Nazanin Moradinasab, Laura S. Shankman, Rebecca A. Deaton, Gary K. Owens, Donald E. Brown
Keywords: unlabeled target domain, labeled source domain, supervised model trained, target domain, adaptive semantic segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Domain adaptive semantic segmentation aims to generate accurate and dense predictions for an unlabeled target domain by leveraging a supervised model trained on a labeled source domain. The prevalent self-training approach involves retraining the dense discriminative classifier of p(class|pixel feature) using the pseudo-labels from the target domain. While many methods focus on mitigating the issue of noisy pseudo-labels, they often overlook the underlying data distribution p(pixel feature|class) in both the source and target domains. To address this limitation, we propose the multi-prototype Gaussian-Mixture-based (ProtoGMM) model, which incorporates the GMM into contrastive losses to perform guided contrastive learning. Contrastive losses are commonly executed in the literature using memory banks, which can lead to class biases due to underrepresented classes. Furthermore, memory banks often have fixed capacities, potentially restricting the model’s ability to capture diverse representations of the target/source domains. An alternative approach is to use global class prototypes (i.e. averaged features per category). However, the global prototypes are based on the unimodal distribution assumption per class, disregarding within-class variation. To address these challenges, we propose the ProtoGMM model. This novel approach involves estimating the underlying multi-prototype source distribution by utilizing the GMM on the feature space of the source samples. The components of the GMM model act as representative prototypes. To achieve increased intra-class semantic similarity, decreased inter-class similarity, and domain alignment between the source and target domains, we employ multi-prototype contrastive learning between source distribution and target samples. The experiments show the effectiveness of our method on UDA benchmarks.

[CV-27] Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

Link: https://arxiv.org/abs/2406.19217
Authors: Zhimin Shao, Jialang Xu, Danail Stoyanov, Evangelos B. Mazomenos, Yueming Jin
Keywords: minimally invasive surgery, robot-assisted minimally invasive, surgical data science, data science, ensuring safe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: 8 pages, 4 figures

Abstract:Despite significant advancements in robotic systems and surgical data science, ensuring safe and optimal execution in robot-assisted minimally invasive surgery (RMIS) remains a complex challenge. Current surgical error detection methods involve two parts: identifying surgical gestures and then detecting errors within each gesture clip. These methods seldom consider the rich contextual and semantic information inherent in surgical videos, limiting their performance due to reliance on accurate gesture identification. Motivated by the chain-of-thought prompting in natural language processing, this letter presents a novel and real-time end-to-end error detection framework, Chain-of-Gesture (COG) prompting, leveraging contextual information from surgical videos. This encompasses two reasoning modules designed to mimic the decision-making processes of expert surgeons. Concretely, we first design a Gestural-Visual Reasoning module, which utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction. We extensively validate our method on the public benchmark RMIS dataset JIGSAWS. Our method encapsulates the reasoning processes inherent to surgical activities, enabling it to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on average, demonstrating the great potential of our approach in enhancing the safety and efficacy of RMIS procedures and surgical education. The code will be available.

[CV-28] Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data

Link: https://arxiv.org/abs/2406.19175
Authors: Lukas Malte Kemeter, Rasmus Hvingelby, Paulina Sierak, Tobias Schön, Bishwajit Gosswam
Keywords: machine learning, data, manufacturing settings, vision is costly, synthetic data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for effectively training models that perform well on real-world data, primarily due to domain shifts between the synthetic and real-world data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels. Using both simulated and real-world X-ray images, we train an object detection model with different strategies to identify the training approach that generates the best detection results while minimising the demand for annotated real-world training samples. Our preliminary findings suggest that the sim-2-real domain adaptation approach is more cost-efficient than a fully supervised oracle - if the total number of available annotated samples is fixed. Given a certain number of labeled real-world samples, training on a mix of synthetic and unlabeled real-world data achieved comparable or even better detection results at significantly lower cost. We argue that future research into the cost-efficiency of different training strategies is important for a better understanding of how to allocate budget in applied machine learning projects.

[CV-29] Single Image Estimation of Cell Migration Direction by Deep Circular Regression

Link: https://arxiv.org/abs/2406.19162
Authors: Lennart Bruns, Lucas Lamparter, Milos Galic, Xiaoyi Jiang
Keywords: paper we study, estimating the migration, migration direction, direction of cells, cells based
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper we study the problem of estimating the migration direction of cells based on a single image. To the best of our knowledge, there is only one related work that uses a classification CNN for four classes (quadrants). This approach does not allow detailed directional resolution. We solve the single image estimation problem using deep circular regression with special attention to cycle-sensitive methods. On two databases we achieve an average accuracy of approximately 17 degrees, which is a significant improvement over the previous work.
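
The abstract mentions "cycle-sensitive methods" without spelling them out. As a rough illustration only (not the authors' implementation), the sketch below shows two common ingredients of circular regression: an angular error that wraps around at 360 degrees, and a sine/cosine encoding of the target direction. The function names are hypothetical.

```python
import math


def angular_error_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute difference between two directions, in [0, 180] degrees."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def angle_to_unit_vector(angle_deg: float) -> tuple[float, float]:
    """Encode a direction as (cos, sin) so that 0 and 360 degrees coincide."""
    rad = math.radians(angle_deg)
    return math.cos(rad), math.sin(rad)


def unit_vector_to_angle(x: float, y: float) -> float:
    """Decode a (cos, sin)-style prediction back to an angle in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0


# A naive difference would report 340 degrees here; the wrapped error is 20.
print(angular_error_deg(350.0, 10.0))
```

Regressing the two vector components instead of the raw angle avoids the discontinuity at 0/360 degrees, which is one plausible reading of "cycle-sensitive".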

[CV-30] RAVEN: Multitask Retrieval Augmented Vision-Language Learning

Link: https://arxiv.org/abs/2406.19150
Authors: Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju
Keywords: exacerbated resource barriers, large language models, scaling of large, large language, world knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:The scaling of large language models to encode all the world’s knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they are limited by the need for resource-intensive pre-training, additional parameter requirements, unaddressed modality prioritization, and a lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly +3% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.

[CV-31] BackMix: Mitigating Shortcut Learning in Echocardiography with Minimal Supervision

Link: https://arxiv.org/abs/2406.19148
Authors: Kit Mills Bransby, Arian Beqiri, Woo-Jin Cho Kim, Jorge Oliveira, Agisilaos Chartsias, Alberto Gomez
Keywords: learn spurious correlations, Neural networks, Clever Hans effect, correct prediction, wrong reason
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted at MICCAI 2024 (Pre-print)

Abstract:Neural networks can learn spurious correlations that lead to the correct prediction in a validation set, but generalise poorly because the predictions are right for the wrong reason. This undesired learning of naive shortcuts (Clever Hans effect) can happen for example in echocardiogram view classification when background cues (e.g. metadata) are biased towards a class and the model learns to focus on those background features instead of on the image content. We propose a simple, yet effective random background augmentation method called BackMix, which samples random backgrounds from other examples in the training set. By enforcing the background to be uncorrelated with the outcome, the model learns to focus on the data within the ultrasound sector and becomes invariant to the regions outside this. We extend our method in a semi-supervised setting, finding that the positive effects of BackMix are maintained with as few as 5% of segmentation labels. A loss weighting mechanism, wBackMix, is also proposed to increase the contribution of the augmented examples. We validate our method on both in-distribution and out-of-distribution datasets, demonstrating significant improvements in classification accuracy, region focus and generalisability. Our source code is available at: this https URL
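
The background-swapping idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released BackMix code: images are plain nested lists of pixel values, each image comes with a hypothetical boolean sector mask (True inside the ultrasound sector), and the donor background is drawn at random from the same batch.

```python
import random


def backmix(image, sector_mask, donor_image):
    """Keep pixels inside the sector mask; take all other pixels from the donor."""
    return [
        [pixel if inside else donor_pixel
         for pixel, inside, donor_pixel in zip(image_row, mask_row, donor_row)]
        for image_row, mask_row, donor_row in zip(image, sector_mask, donor_image)
    ]


def backmix_batch(images, sector_masks, rng):
    """Give every image a background from a randomly chosen batch member.

    The donor may be the image itself, in which case the image is unchanged.
    """
    return [
        backmix(image, mask, rng.choice(images))
        for image, mask in zip(images, sector_masks)
    ]


image = [[1, 2], [3, 4]]
mask = [[True, False], [False, True]]
donor = [[9, 9], [9, 9]]
print(backmix(image, mask, donor))
```

Because the class label stays with the foreground image while the background is random, background cues stop being predictive of the label, which is the effect the abstract describes.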

[CV-32] CELLO: Causal Evaluation of Large Vision-Language Models

Link: https://arxiv.org/abs/2406.19131
Authors: Meiqi Chen, Bo Peng, Yan Zhang, Chaochao Lu
Keywords: real-world environments, intelligence and crucial, crucial for effective, effective decision-making, decision-making in real-world
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Causal reasoning is fundamental to human intelligence and crucial for effective decision-making in real-world environments. Despite recent advancements in large vision-language models (LVLMs), their ability to comprehend causality remains unclear. Previous work typically focuses on commonsense causality between events and/or actions, which is insufficient for applications like embodied agents and lacks the explicitly defined causal graphs required for formal causal reasoning. To overcome these limitations, we introduce a fine-grained and unified definition of causality involving interactions between humans and/or objects. Building on the definition, we construct a novel dataset, CELLO, consisting of 14,094 causal questions across all four levels of causality: discovery, association, intervention, and counterfactual. This dataset surpasses traditional commonsense causality by including explicit causal graphs that detail the interactions between humans and objects. Extensive experiments on CELLO reveal that current LVLMs still struggle with causal reasoning tasks, but they can benefit significantly from our proposed CELLO-CoT, a causally inspired chain-of-thought prompting strategy. Both quantitative and qualitative analyses from this study provide valuable insights for future research. Our project page is at this https URL.

[CV-33] Evidential Concept Embedding Models: Towards Reliable Concept Explanations for Skin Disease Diagnosis

Link: https://arxiv.org/abs/2406.19130
Authors: Yibo Gao, Zheyao Gao, Xin Gao, Yuanye Liu, Bomin Wang, Xiahai Zhuang
Keywords: medical image analysis, Concept, stakes in medical, medical image, interpretable deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by MICCAI 2024

Abstract:Due to the high stakes in medical decision-making, there is a compelling demand for interpretable deep learning methods in medical image analysis. Concept Bottleneck Models (CBM) have emerged as an active interpretable framework incorporating human-interpretable concepts into decision-making. However, their concept predictions may lack reliability when applied to clinical diagnosis, impeding concept explanations’ quality. To address this, we propose an evidential Concept Embedding Model (evi-CEM), which employs evidential learning to model the concept uncertainty. Additionally, we offer to leverage the concept uncertainty to rectify concept misalignments that arise when training CBMs using vision-language models without complete concept supervision. With the proposed methods, we can enhance concept explanations’ reliability for both supervised and label-efficient settings. Furthermore, we introduce concept uncertainty for effective test-time intervention. Our evaluation demonstrates that evi-CEM achieves superior performance in terms of concept prediction, and the proposed concept rectification effectively mitigates concept misalignments for label-efficient training. Our code is available at this https URL.

[CV-34] FDLite: A Single Stage Lightweight Face Detector Network

Link: https://arxiv.org/abs/2406.19107
Authors: Yogesh Aggarwal, Prithwijit Guha
Keywords: heavy pre-trained backbone, pre-trained backbone networks, detection is frequently, frequently attempted, heavy pre-trained
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages, 14 figures

Abstract:Face detection is frequently attempted by using heavy pre-trained backbone networks like ResNet-50/101/152 and VGG16/19. A few recent works have also proposed lightweight detectors with customized backbones, novel loss functions and efficient training strategies. The novelty of this work lies in the design of a lightweight detector while training with only the commonly used loss functions and learning strategies. The proposed face detector broadly follows the established RetinaFace architecture. The first contribution of this work is the design of a customized lightweight backbone network (BLite) having 0.167M parameters with 0.52 GFLOPs. The second contribution is the use of two independent multi-task losses. The proposed lightweight face detector (FDLite) has 0.26M parameters with 0.94 GFLOPs. The network is trained on the WIDER FACE dataset. FDLite is observed to achieve 92.3%, 89.8%, and 82.2% Average Precision (AP) on the easy, medium, and hard subsets of the WIDER FACE validation dataset, respectively.

[CV-35] DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Link: https://arxiv.org/abs/2406.19101
Authors: Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin
Keywords: Current multimodal large, large language models, multimodal large language, complex layouts typical, Current multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models’ ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. DocKylin utilizes an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, DocKylin incorporates a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to create a compressed, adaptive visual sequence. Experiments demonstrate DocKylin’s promising performance across various VDU benchmarks. Notably, both the proposed APS and DTS are parameter-free, facilitating easy integration into existing MLLMs, and our experiments indicate their potential for broader applications.

[CV-36] Dimensions underlying the representational alignment of deep neural networks with humans

Link: https://arxiv.org/abs/2406.19087
Authors: Florian P. Mahner, Lukas Muttenthaler, Umut Güçlü, Martin N. Hebart
Keywords: artificial intelligence, machine learning, Determining the similarities, Determining, DNN
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Determining the similarities and differences between humans and artificial intelligence is an important goal both in machine learning and cognitive neuroscience. However, similarities in representations only inform us about the degree of alignment, not the factors that determine it. Drawing upon recent developments in cognitive science, we propose a generic framework for yielding comparable representations in humans and deep neural networks (DNN). Applying this framework to humans and a DNN model of natural images revealed a low-dimensional DNN embedding of both visual and semantic dimensions. In contrast to humans, DNNs exhibited a clear dominance of visual over semantic features, indicating divergent strategies for representing images. While in-silico experiments showed seemingly-consistent interpretability of DNN dimensions, a direct comparison between human and DNN representations revealed substantial differences in how they process images. By making representations directly comparable, our results reveal important challenges for representational alignment, offering a means for improving their comparability.

[CV-37] FAGhead: Fully Animate Gaussian Head from Monocular Videos

Link: https://arxiv.org/abs/2406.19070
Authors: Yixin Xuan, Xinyang Li, Gongxin Yao, Shiwei Zhou, Donghui Sun, Xiaoxin Chen, Yu Pan
Keywords: visual reality, wild application, application in visual, Learnable Representation Field, Point-based Learnable Representation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:High-fidelity reconstruction of 3D human avatars has wide application in virtual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We make the traditional 3D morphable meshes (3DMM) explicit and optimize the neutral 3D Gaussians to reconstruct complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to effectively manage the edges of avatars, we introduce alpha rendering to supervise the alpha value of each pixel. Extensive experimental results on open-source datasets and our own captured datasets demonstrate that our approach is able to generate high-fidelity 3D head avatars and fully control the expression and pose of the virtual avatars, outperforming existing works.

[CV-38] Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO

Link: https://arxiv.org/abs/2406.19057
Authors: Fuseini Mumuni, Alhassan Mumuni
Keywords: Segment Anything Model, Grounding DINO, Model, achieved impressive performance, zero-shot object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have a great potential in revolutionizing zero-shot semantic segmentation or data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall in existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted severe limitation of the REC framework in this application setting owing to its tendency to make false positive predictions when the target is absent in the given image. And, while this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on eight publicly available datasets and reveal that these errors consistently follow a predictable pattern and can, thus, be mitigated by a simple strategy. Specifically, we show that these false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research in improving REC-based detection and automated segmentation. Using this technique, we evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvement in segmentation performance and annotation time savings over manual approaches.
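
The size-based filtering described above is easy to picture in code. The following is a minimal sketch, assuming boxes in `(x_min, y_min, x_max, y_max)` pixel coordinates; the function names and the `max_fraction` threshold of 0.8 are hypothetical and are not taken from the paper.

```python
def relative_area(box, image_width, image_height):
    """Fraction of the image covered by a (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    box_area = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)
    return box_area / (image_width * image_height)


def filter_oversized(boxes, image_width, image_height, max_fraction=0.8):
    """Drop detections that cover more than max_fraction of the image.

    max_fraction is an illustrative value; a real threshold would need tuning
    per dataset.
    """
    return [
        box for box in boxes
        if relative_area(box, image_width, image_height) <= max_fraction
    ]


detections = [(0, 0, 100, 100), (10, 10, 30, 30)]
print(filter_oversized(detections, image_width=100, image_height=100))
```

The surviving boxes would then be passed to SAM as prompts; that step is omitted here because it depends on the specific model interface.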

[CV-39] SimpleFusion: A Simple Fusion Framework for Infrared and Visible Images

Link: https://arxiv.org/abs/2406.19055
Authors: Ming Chen, Yuxuan Cheng, Xinwei He, Xinyue Wang, Yan Aze, Jinhai Xiang
Keywords: downstream vision tasks, Integrating visible, infrared image fusion, downstream vision, visible and infrared
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: code: this https URL

Abstract:Integrating visible and infrared images into one high-quality image, also known as visible and infrared image fusion, is a challenging yet critical task for many downstream vision tasks. Most existing works utilize pretrained deep neural networks or design sophisticated frameworks with strong priors for this task, which may be unsuitable or lack flexibility. This paper presents SimpleFusion, a simple yet effective framework for visible and infrared image fusion. Our framework follows the decompose-and-fusion paradigm, where the visible and the infrared images are decomposed into reflectance and illumination components via Retinex theory, followed by the fusion of these corresponding elements. The whole framework is designed with two plain convolutional neural networks without downsampling, which can perform image decomposition and fusion efficiently. Moreover, we introduce a decomposition loss and a detail-to-semantic loss to preserve the complementary information between the two modalities for fusion. We conduct extensive experiments on challenging benchmarks, verifying the superiority of our method over previous state-of-the-art methods. Code is available at this https URL.

[CV-40] BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

Link: https://arxiv.org/abs/2406.19048
Authors: Yang Song, Lin Wang
Keywords: camera features, features, Lidar features, autonomous driving, Image Enhancement Module
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:3D object detection is an important task that has been widely applied in autonomous driving. Recently, fusing multi-modal inputs, i.e., LiDAR and camera data, to perform this task has become a new trend. Existing methods, however, either ignore the sparsity of LiDAR features or fail to preserve the original spatial structure of LiDAR and the semantic density of camera features simultaneously due to the modality gap. To address these issues, this letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion, that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to mutually fuse the multi-modal features to enhance the semantics of LiDAR features and the spatial awareness of the camera features, and to adaptively select features from both modalities to build a unified 3D representation. Specifically, we introduce Pre-Fusion, consisting of a Voxel Enhancement Module (VEM) to enhance the semantics of voxel features from 2D camera features and an Image Enhancement Module (IEM) to enhance the spatial characteristics of camera features from 3D voxel features. Both VEM and IEM are bidirectionally updated to effectively reduce the modality gap. We then introduce Unified Fusion, which adaptively weights and selects features from the enhanced LiDAR and camera features to build a unified 3D representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion over prior art. Project page: this https URL.

[CV-41] Using diffusion model as constraint: Empower Image Restoration Network Training with Diffusion Model

Link: https://arxiv.org/abs/2406.19030
Authors: Jiangtong Tan, Feng Zhao
Keywords: made marvelous progress, deep learning, made marvelous, marvelous progress, advent of deep
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Image restoration has made marvelous progress with the advent of deep learning. Previous methods usually rely on designing powerful network architectures to elevate performance; however, the natural visual effect of the restored results is limited by color and texture distortions. Besides visual perceptual quality, semantic perception recovery is an important but often overlooked aspect of the restored image, which is crucial for deployment in high-level tasks. In this paper, we propose a new perspective to resolve these issues by introducing a naturalness-oriented and semantic-aware optimization mechanism, dubbed DiffLoss. Specifically, inspired by the powerful distribution coverage capability of the diffusion model for natural image generation, we exploit the Markov chain sampling property of the diffusion model and project the restored results of existing networks into the sampling space. Besides, we reveal that the bottleneck feature of diffusion models, also dubbed the h-space feature, is a natural high-level semantic space. We delve into this property and propose a semantic-aware loss to further unlock its potential for semantic perception recovery, which paves the way to connect the image restoration task and downstream high-level recognition tasks. With these two strategies, DiffLoss can endow existing restoration methods with both more natural and semantic-aware results. We verify the effectiveness of our method on a substantial number of common image restoration tasks and benchmarks. Code will be available at this https URL.

[CV-42] VideoMambaPro: A Leap Forward for Mamba in Video Understanding

Link: https://arxiv.org/abs/2406.19006
Authors: Hui Lu, Albert Ali Salah, Ronald Poppe
Keywords: rich spatio-temporal representations, Video understanding requires, spatio-temporal representations, understanding requires, requires the extraction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba’s successes do not trivially extend to computer vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba’s token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP) that solves the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro shows state-of-the-art video action recognition performance compared to transformer models, and surpasses VideoMamba by clear margins: 7.9% and 8.1% top-1 on Kinetics-400 and Something-Something V2, respectively. Our VideoMambaPro-M model achieves 91.9% top-1 on Kinetics-400, only 0.2% below InternVideo2-6B but with only 1.2% of its parameters. The combination of high performance and efficiency makes VideoMambaPro an interesting alternative for transformer models.

[CV-43] Improving Taxonomic Image-based Out-of-distribution Detection With DNA Barcodes

Link: https://arxiv.org/abs/2406.18999
Authors: Mikko Impiö, Jenni Raitoharju
Keywords: Image-based species identification, scaling biodiversity monitoring, global scale, species identification, scaling biodiversity
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to EUSIPCO 2024

Abstract:Image-based species identification could help scale biodiversity monitoring to a global scale. Many challenges still need to be solved in order to implement these systems in real-world applications. A reliable image-based monitoring system must detect out-of-distribution (OOD) classes it has not been presented with before. This is especially challenging with fine-grained classes. Emerging environmental monitoring techniques, DNA metabarcoding and eDNA, can help by providing information on OOD classes that are present in a sample. In this paper, we study whether DNA barcodes can also help find the outlier images based on the outlier DNA sequence’s similarity to the seen classes. We propose a re-ordering approach that can be easily applied to any pre-trained model and existing OOD detection method. We experimentally show that the proposed approach improves taxonomic OOD detection compared to all common baselines. We also show that the method works thanks to a correlation between visual similarity and DNA barcode proximity. The code and data are available at this https URL.

[CV-44] Zero-shot domain adaptation based on dual-level mix and contrast

Link: https://arxiv.org/abs/2406.18996
Authors: Yu Zhe, Jun Sakuma
Keywords: Zero-shot domain adaptation, learn domain-invariant features, Zero-shot domain, domain adaptation, task
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by IEEE conference on Artificial intelligence 2024

Abstract:Zero-shot domain adaptation (ZSDA) is a domain adaptation problem in the situation that labeled samples for a target task (task of interest) are only available from the source domain at training time, but for a task different from the task of interest (irrelevant task), labeled samples are available from both source and target domains. In this situation, classical domain adaptation techniques can only learn domain-invariant features in the irrelevant task. However, due to the difference in sample distribution between the two tasks, domain-invariant features learned in the irrelevant task are biased and not necessarily domain-invariant in the task of interest. To solve this problem, this paper proposes a new ZSDA method to learn domain-invariant features with low task bias. To this end, we propose (1) data augmentation with dual-level mixups in both task and domain to fill the absence of target task-of-interest data, (2) an extension of domain adversarial learning to learn domain-invariant features with less task bias, and (3) a new dual-level contrastive learning method that enhances domain-invariance and less task biasedness of features. Experimental results show that our proposal achieves good performance on several benchmarks.

[CV-45] Semi-supervised Concept Bottleneck Models

Link: https://arxiv.org/abs/2406.18992
Authors: Lijie Hu, Tianhao Huang, Huanyi Xie, Chenyang Ren, Zhengyu Hu, Lu Yu, Di Wang
Keywords: garnered increasing attention, increasing attention due, provide concept-based explanations, black-box deep learning, achieving high final
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features - an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We proposed a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 20% labeled data, we achieved 93.19% (96.39% in a fully supervised setting) concept accuracy and 75.51% (79.82% in a fully supervised setting) prediction accuracy.

[CV-46] RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

Link: https://arxiv.org/abs/2406.18977
Authors: Fanfan Liu, Feng Yan, Liming Zheng, Chengjian Feng, Yiyang Huang, Lin Ma
Keywords: Utilizing Vision-Language Models, Utilizing Vision-Language, robotic manipulation represents, unified view representation, aiming to enhance
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model’s ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform’s camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the D→D setting from 88.7% to 96.2%, and in the ABC→D setting from 82.4% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation at this https URL.

[CV-47] Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis

Link: https://arxiv.org/abs/2406.18967
Authors: Vu Minh Hieu Phan, Yutong Xie, Bowen Zhang, Yuankai Qi, Zhibin Liao, Antonios Perperidis, Son Lam Phung, Johan W. Verjans, Minh-Son To
Keywords: accurate clinical diagnostics, provide complementary information, obtaining aligned multi-modal, multi-modal medical scans, aligned multi-modal medical
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI2024 - Early Accept Top 11%

Abstract:Unpaired medical image synthesis aims to provide complementary information for accurate clinical diagnostics and to address challenges in obtaining aligned multi-modal medical scans. Transformer-based models excel in imaging translation tasks thanks to their ability to capture long-range dependencies. Although effective in supervised training settings, their performance falters in unpaired image synthesis, particularly in synthesizing structural details. This paper empirically demonstrates that, lacking strong inductive biases, Transformers can converge to non-optimal solutions in the absence of paired data. To address this, we introduce UNet Structured Transformer (UNest), a novel architecture incorporating structural inductive biases for unpaired medical image synthesis. We leverage the foundational Segment-Anything Model to precisely extract the foreground structure and perform structural attention within the main anatomy. This guides the model to learn key anatomical regions, thus improving structural synthesis despite the lack of supervision in unpaired training. Evaluated on two public datasets spanning three modalities, i.e., MR, CT, and PET, UNest improves on recent methods by up to 19.30% across six medical image synthesis tasks. Our code is released at this https URL.

[CV-48] AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

Link: https://arxiv.org/abs/2406.18958
Authors: Yanan Sun, Yanchen Liu, Yinhao Tang, Wenjie Pei, Kai Chen
Keywords: made significant progress, recent years, largely driven, made significant, significant progress
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The field of text-to-image (T2I) generation has made significant progress in recent years, largely driven by advancements in diffusion models. Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. This challenge has been explored, to a great extent, by incorporating additional user-supplied spatial conditions, such as depth maps and edge maps, into pre-trained T2I models through extra encoding. However, multi-control image synthesis still faces several challenges. Specifically, current approaches are limited in handling free combinations of diverse input control signals, overlook the complex relationships among multiple spatial conditions, and often fail to maintain semantic alignment with provided textual prompts. This can lead to suboptimal user experiences. To address these challenges, we propose AnyControl, a multi-control image synthesis framework that supports arbitrary combinations of diverse control signals. AnyControl develops a novel Multi-Control Encoder that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals, as demonstrated by extensive quantitative and qualitative evaluations. Our project page is available at this https URL.

[CV-49] Investigating and Defending Shortcut Learning in Personalized Diffusion Models

Link: https://arxiv.org/abs/2406.18944
Authors: Yixin Liu, Ruoxi Chen, Lichao Sun
Keywords: Personalized diffusion models, adapting pre-trained, gained popularity, popularity for adapting, specific topics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Preprint

Abstract:Personalized diffusion models have gained popularity for adapting pre-trained text-to-image models to generate images of specific topics with only a few images. However, recent studies find that these models are vulnerable to minor adversarial perturbations, and that fine-tuning performance is largely degraded on corrupted datasets. Such characteristics are further exploited to craft protective perturbations on sensitive images like portraits that prevent unauthorized generation. In response, diffusion-based purification methods have been proposed to remove these perturbations and retain generation performance. However, existing works lack detailed analysis of the fundamental shortcut learning vulnerability of personalized diffusion models and also tend to over-purify the images, causing information loss. In this paper, we take a closer look at the fine-tuning process of personalized diffusion models through the lens of shortcut learning and propose a hypothesis that could explain the underlying manipulation mechanisms of existing perturbation methods. Specifically, we find that the perturbed images are greatly shifted from their original paired prompt in the CLIP-based latent space. As a result, training with this mismatched image-prompt pair creates a construction that causes the models to dump their out-of-distribution noisy patterns to the identifier, thus causing serious performance degradation. Based on this observation, we propose a systematic approach to retain the training performance with purification that realigns the latent image and its semantic meaning and also introduces contrastive learning with a negative token to decouple the learning of the wanted clean identity from the unwanted noisy pattern. This approach shows strong potential against further adaptive perturbation.

[CV-50] CLIP3D-AD: Extending CLIP for 3D Few-Shot Anomaly Detection with Multi-View Images Generation

Link: https://arxiv.org/abs/2406.18941
Authors: Zuo Zuo, Jiahao Dong, Yao Wu, Yanyun Qu, Zongze Wu
Keywords: Few-shot anomaly detection, Few-shot anomaly, anomaly detection, few-shot anomaly classification, Few-shot
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract:Few-shot anomaly detection methods can effectively address the difficulty of data collection in industrial scenarios. Compared to 2D few-shot anomaly detection (2D-FSAD), 3D few-shot anomaly detection (3D-FSAD) is still an unexplored but essential task. In this paper, we propose CLIP3D-AD, an efficient 3D-FSAD method built on CLIP. We successfully transfer the strong generalization ability of CLIP to 3D-FSAD. Specifically, we synthesize anomalous images from given normal images as sample pairs to adapt CLIP for 3D anomaly classification and segmentation. For classification, we introduce an image adapter and a text adapter to fine-tune global visual features and text features. Meanwhile, we propose a coarse-to-fine decoder to fuse and facilitate intermediate multi-layer visual representations of CLIP. To benefit from the geometric information of the point cloud and eliminate modality and data discrepancies when it is processed by CLIP, we project and render the point cloud to multi-view normal and anomalous images. We then design a multi-view fusion module to fuse the features of multi-view images extracted by CLIP, which are used to facilitate visual representations and further enhance vision-language correlation. Extensive experiments demonstrate that our method achieves competitive performance in 3D few-shot anomaly classification and segmentation on the MVTec-3D AD dataset.

[CV-51] RoFIR: Robust Fisheye Image Rectification Framework Impervious to Optical Center Deviation

Link: https://arxiv.org/abs/2406.18927
Authors: Zhaokang Liao, Hao Feng, Shaokai Liu, Wengang Zhou, Houqiang Li
Keywords: optical center position, Fisheye images, deviated fisheye image, Fisheye, fisheye image rectification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Fisheye images are categorized into central and deviated based on the optical center position. Existing rectification methods are limited to central fisheye images, while this paper proposes a novel method that extends to deviated fisheye image rectification. The challenge lies in the variant global distortion distribution pattern caused by the random optical center position. To address this challenge, we propose a distortion vector map (DVM) that measures the degree and direction of local distortion. By learning the DVM, the model can independently identify local distortions at each pixel without relying on global distortion patterns. The model adopts a pre-training and fine-tuning training paradigm. In the pre-training stage, it predicts the distortion vector map and perceives the local distortion features of each pixel. In the fine-tuning stage, it predicts a pixel-wise flow map for deviated fisheye image rectification. We also propose a data augmentation method that mixes central, deviated, and distortion-free images. Such data augmentation improves model performance in rectifying both central and deviated fisheye images, compared with models trained on single-type fisheye images. Extensive experiments demonstrate the effectiveness and superiority of the proposed method.

[CV-52] Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

Link: https://arxiv.org/abs/2406.18925
Authors: Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu
Keywords: Visual, Visual arguments, advertising or social, persuade viewers, arguments
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures

Abstract:Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today’s AI capable of similar understanding? We collect and release VisArgs, an annotated corpus designed to make explicit the (usually implicit) structures underlying visual arguments. VisArgs includes 1,611 images accompanied by three types of textual annotations: 5,112 visual premises (with region annotations), 5,574 commonsense premises, and reasoning trees connecting them to a broader argument. We propose three tasks over VisArgs to probe machine capacity for visual argument understanding: localization of premises, identification of premises, and deduction of conclusions. Experiments demonstrate that 1) machines cannot fully identify the relevant visual cues. The top-performing model, GPT-4-O, achieved an accuracy of only 78.5%, whereas humans reached 98.0%. All models showed a performance drop, with an average decrease in accuracy of 19.5%, when the comparison set was changed from objects outside the image to irrelevant objects within the image. Furthermore, 2) this limitation is the greatest factor impacting their performance in understanding visual arguments. Most models improved the most when given relevant visual premises as additional inputs, compared to other inputs, for deducing the conclusion of the visual argument.

[CV-53] Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Link: https://arxiv.org/abs/2406.18915
Authors: Jiafei Duan, Wentao Yuan, Wilbert Pumacay, Yi Ru Wang, Kiana Ehsani, Dieter Fox, Ranjay Krishna
Keywords: widespread community efforts, Large-scale endeavors, robot demonstration data, robot demonstration, widespread community
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and are limited to interactions with few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information or hand-designed skills, and can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all 5 real-world and 12 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything’s demonstrations can train more robust behavior cloning policies than training with human demonstrations, or from data generated by VoxPoser and Code-As-Policies. We believe Manipulate-Anything can be the scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting.

[CV-54] A Universal Railway Obstacle Detection System based on Semi-supervised Segmentation And Optical Flow

Link: https://arxiv.org/abs/2406.18908
Authors: Qiushi Guo
Keywords: varying ambient conditions, Detecting obstacles, weather and light, obstacle categories, railway scenarios
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Detecting obstacles in railway scenarios is both crucial and challenging due to the wide range of obstacle categories and varying ambient conditions such as weather and light. Given the impossibility of encompassing all obstacle categories during the training stage, we address this out-of-distribution (OOD) issue with a semi-supervised segmentation approach guided by optical flow clues. We reformulate the task as a binary segmentation problem instead of the traditional object detection approach. To mitigate data shortages, we generate highly realistic synthetic images using Segment Anything (SAM) and YOLO, eliminating the need for manual annotation to produce abundant pixel-level annotations. Additionally, we leverage optical flow as prior knowledge to train the model effectively. Several experiments are conducted, demonstrating the feasibility and effectiveness of our approach.

[CV-55] Autoencoder based approach for the mitigation of spurious correlations

Link: https://arxiv.org/abs/2406.18901
Authors: Srinitish Srinivasan, Karthik Seemakurthy
Keywords: Deep neural networks, exhibited remarkable performance, Deep neural, spurious correlations poses, spurious correlations
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep neural networks (DNNs) have exhibited remarkable performance across various tasks, yet their susceptibility to spurious correlations poses a significant challenge for out-of-distribution (OOD) generalization. Spurious correlations refer to erroneous associations in data that do not reflect true underlying relationships but are instead artifacts of dataset characteristics or biases. These correlations can lead DNNs to learn patterns that are not robust across diverse datasets or real-world scenarios, hampering their ability to generalize beyond training data. In this paper, we propose an autoencoder-based approach to analyze the nature of spurious correlations that exist in the Global Wheat Head Detection (GWHD) 2021 dataset. We then use inpainting followed by Weighted Boxes Fusion (WBF) to achieve a 2% increase in the Average Domain Accuracy (ADA) over the YOLOv5 baseline and consistently show that our approach has the ability to suppress some of the spurious correlations in the GWHD 2021 dataset. The key advantage of our approach is that it is more suitable in scenarios where there is limited scope to adapt or fine-tune the trained model in unseen test environments.

[CV-56] 360 in the Wild: Dataset for Depth Prediction and View Synthesis

Link: https://arxiv.org/abs/2406.18898
Authors: Kibaek Park, Francois Rameau, Jaesik Park, In So Kweon
Keywords: abundance of perspective, facilitated the emergence, learning-based strategies, single image depth, image depth estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large-scale 360° video dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera’s pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.

[CV-57] AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models

Link: https://arxiv.org/abs/2406.18893
Authors: Aishwarya Agarwal, Srikrishna Karanam, Balaji Vasan Srinivasan
Keywords: user-supplied reference images, existing customization methods, reference images, customization methods, problem of customizing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures

Abstract:We consider the problem of customizing text-to-image diffusion models with user-supplied reference images. Given new prompts, the existing methods can capture the key concept from the reference images but fail to align the generated image with the prompt. In this work, we seek to address this key issue by proposing new methods that can easily be used in conjunction with existing customization methods that optimize the embeddings/weights at various intermediate stages of the text encoding process. The first contribution of this paper is a dissection of the various stages of the text encoding process leading up to the conditioning vector for text-to-image models. We take a holistic view of existing customization methods and notice that the key and value outputs from this process differ substantially from those of their corresponding baseline (non-customized) models (e.g., baseline stable diffusion). While this difference does not impact the concept being customized, it leads to other parts of the generated image not being aligned with the prompt (see first row in Fig 1). Further, we also observe that these keys and values allow independent control of various aspects of the final generation, enabling semantic manipulation of the output. Taken together, the features spanning these keys and values serve as the basis for our next contribution, where we fix the aforementioned issues with existing methods. We propose a new post-processing algorithm, AlignIT, that infuses the keys and values for the concept of interest while ensuring the keys and values for all other tokens in the input prompt are unchanged. Our proposed method can be plugged in directly to existing customization methods, leading to a substantial performance improvement in the alignment of the final result with the input prompt while retaining the customization quality.

[CV-58] Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models

Link: https://arxiv.org/abs/2406.18868
Authors: Yicheng Xu,Yuxin Chen,Jiahao Nie,Yusong Wang,Huiping Zhuang,Manabu Okumura
Keywords: previously encountered classes, Vision-Language Models, Continual learning, encountered classes, overcome the constraints
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Continual learning (CL) with Vision-Language Models (VLMs) has overcome the constraints of traditional CL, which only focuses on previously encountered classes. During the CL of VLMs, we need not only to prevent the catastrophic forgetting on incrementally learned knowledge but also to preserve the zero-shot ability of VLMs. However, existing methods require additional reference datasets to maintain such zero-shot ability and rely on domain-identity hints to classify images across different domains. In this study, we propose Regression-based Analytic Incremental Learning (RAIL), which utilizes a recursive ridge regression-based adapter to learn from a sequence of domains in a non-forgetting manner and decouple the cross-domain correlations by projecting features to a higher-dimensional space. Cooperating with a training-free fusion module, RAIL absolutely preserves the VLM’s zero-shot ability on unseen domains without any reference data. Additionally, we introduce Cross-domain Task-Agnostic Incremental Learning (X-TAIL) setting. In this setting, a CL learner is required to incrementally learn from multiple domains and classify test images from both seen and unseen domains without any domain-identity hint. We theoretically prove RAIL’s absolute memorization on incrementally learned domains. Experiment results affirm RAIL’s state-of-the-art performance in both X-TAIL and existing Multi-domain Task-Incremental Learning settings. The code will be released upon acceptance.
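The "recursive ridge regression" idea mentioned in this abstract can be illustrated with a generic textbook sketch. This is not the authors' RAIL code: the function names, the toy two-domain data, and the use of per-sample Sherman-Morrison updates are all assumptions made for illustration, and the higher-dimensional feature projection and the fusion module are omitted. What it shows is why such an adapter need not forget: after every update, the weights equal the batch ridge solution over all samples seen so far.

```python
# Recursive ridge regression via rank-1 (Sherman-Morrison) updates.
# Illustrative only: names, shapes and the toy data are assumptions.

def init_state(dim, n_out, lam=1.0):
    """P starts as (lam * I)^-1 and the weights W start at zero."""
    P = [[(1.0 / lam if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    W = [[0.0] * n_out for _ in range(dim)]
    return P, W

def update(P, W, x, y):
    """Fold one (feature, target) pair into P and W in place."""
    dim, n_out = len(x), len(y)
    Px = [sum(P[i][j] * x[j] for j in range(dim)) for i in range(dim)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(dim))
    gain = [v / denom for v in Px]
    # Prediction error of the current weights on the new sample.
    err = [y[k] - sum(x[i] * W[i][k] for i in range(dim)) for k in range(n_out)]
    for i in range(dim):
        for k in range(n_out):
            W[i][k] += gain[i] * err[k]
        for j in range(dim):
            P[i][j] -= gain[i] * Px[j]

lam = 0.5
# Two hypothetical "domains", each a list of (feature, one-hot target) pairs.
domains = [
    [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.2], [1.0, 0.0])],
    [([0.1, 1.0], [0.0, 1.0]), ([0.0, 0.8], [0.0, 1.0])],
]
P, W = init_state(dim=2, n_out=2, lam=lam)
for samples in domains:  # domains arrive one after another
    for x, y in samples:
        update(P, W, x, y)
```

Because only `P` and `W` are kept between domains, earlier samples never need to be revisited, yet the final `W` still satisfies the ridge normal equations for the full data.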

[CV-59] Learning Modality Knowledge Alignment for Cross-Modality Transfer

Link: https://arxiv.org/abs/2406.18864
Authors: Wenxuan Ma,Shuang Li,Lincan Cai,Jingxuan Kang
Keywords: leverage large pretrained, large pretrained models, Cross-modality transfer aims, aims to leverage, leverage large
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024

Abstract:Cross-modality transfer aims to leverage large pretrained models to complete tasks that may not belong to the modality of pretraining data. Existing works achieve certain success in extending classical finetuning to cross-modal scenarios, yet we still lack understanding about the influence of modality gap on the transfer. In this work, a series of experiments focusing on the source representation quality during transfer are conducted, revealing the connection between larger modality gap and lesser knowledge reuse, which means ineffective transfer. We then formalize the gap as the knowledge misalignment between modalities using conditional distribution P(Y|X). Towards this problem, we present Modality kNowledge Alignment (MoNA), a meta-learning approach that learns target data transformation to reduce the modality knowledge discrepancy ahead of the transfer. Experiments show that our method enables better reuse of source modality knowledge in cross-modality transfer, which leads to improvements upon existing finetuning methods.

[CV-60] Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

Link: https://arxiv.org/abs/2406.18849
Authors: Jie Zhang,Zhongqi Wang,Mengqi Lei,Zheng Yuan,Bei Yan,Shiguang Shan,Xilin Chen
Keywords: Large Vision-Language Models, Vision-Language Models, Large Vision-Language, Models, Large
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Currently many benchmarks have been proposed to evaluate the perception ability of the Large Vision-Language Models (LVLMs). However, most benchmarks construct questions by selecting images from existing datasets, resulting in potential data leakage. Besides, these benchmarks merely focus on evaluating LVLMs on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose a dynamic and scalable benchmark named Dysca for evaluating LVLMs by leveraging synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false and Free-form). Thanks to the generative paradigm, Dysca serves as a scalable benchmark for easily adding new subtasks and scenarios. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at this https URL.

[CV-61] Retain Blend and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition

Link: https://arxiv.org/abs/2406.18845
Authors: Lan Chen,Dong Li,Xiao Wang,Pengpeng Shao,Wei Zhang,Yaowei Wang,Yonghong Tian,Jin Tang
Keywords: Existing event stream-based, deep neural networks, stream-based pattern recognition, Existing event, event stream-based pattern
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: In Peer Review, Journal Extension of PRCV 2023

Abstract:Existing event stream-based pattern recognition models usually represent the event stream as the point cloud, voxel, image, etc., and design various deep neural networks to learn their features. Although considerable results can be achieved in simple cases, the model performance may be limited by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this paper, we propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be learned separately by utilizing Transformer and Graph Neural Network (GNN). We believe the features of each representation still contain both efficient and redundant features and a sub-optimal solution may be obtained if we directly fuse them without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features will be fed into the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on multiple widely used event stream-based classification datasets. Specifically, we achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second place by +2.21%. The source code of this paper has been released on this https URL.

[CV-62] Revisiting Backdoor Attacks against Large Vision-Language Models

Link: https://arxiv.org/abs/2406.18844
Authors: Siyuan Liang,Jiawei Liang,Tianyu Pang,Chao Du,Aishan Liu,Ee-Chien Chang,Xiaochun Cao
Keywords: enhances large vision-language, raises security risks, tuning enhances large, Instruction tuning enhances, large vision-language models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 8 figures

Abstract:Instruction tuning enhances large vision-language models (LVLMs) but raises security risks through potential backdoor attacks due to their openness. Previous backdoor studies focus on enclosed scenarios with consistent training and testing instructions, neglecting the practical domain gaps that could affect attack effectiveness. This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs for the first time, revealing certain limitations of most backdoor strategies in practical scenarios. We quantitatively evaluate the generalizability of six typical backdoor attacks on image caption benchmarks across multiple LVLMs, considering both visual and textual domain offsets. Our findings indicate that attack generalizability is positively correlated with the backdoor trigger’s irrelevance to specific images/models and the preferential correlation of the trigger pattern. Additionally, we modify existing backdoor attacks based on the above key observations, demonstrating significant improvements in cross-domain scenario generalizability (+86% attack success rate). Notably, even without access to the instruction datasets, a multimodal instruction set can be successfully poisoned with a very low poisoning rate (0.2%), achieving an attack success rate of over 97%. This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs, necessitating more attention and in-depth research.

[CV-63] Dense Monocular Motion Segmentation Using Optical Flow and Pseudo Depth Map: A Zero-Shot Approach

Link: https://arxiv.org/abs/2406.18837
Authors: Yuxiang Huang,Yuhao Chen,John Zelek
Keywords: single moving camera, moving camera presents, optical flow, computer vision, single moving
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: For the official publication, see this https URL

Abstract:Motion segmentation from a single moving camera presents a significant challenge in the field of computer vision. This challenge is compounded by the unknown camera movements and the lack of depth information of the scene. While deep learning has shown impressive capabilities in addressing these issues, supervised models require extensive training on massive annotated datasets, and unsupervised models also require training on large volumes of unannotated data, presenting significant barriers for both. In contrast, traditional methods based on optical flow do not require training data; however, they often fail to capture object-level information, leading to over-segmentation or under-segmentation. In addition, they also struggle in complex scenes with substantial depth variations and non-rigid motion, due to the overreliance on optical flow. To overcome these challenges, we propose an innovative hybrid approach that leverages the advantages of both deep learning methods and traditional optical flow based methods to perform dense motion segmentation without requiring any training. Our method begins by automatically generating object proposals for each frame using foundation models. These proposals are then clustered into distinct motion groups using both optical flow and relative depth maps as motion cues. The integration of depth maps derived from state-of-the-art monocular depth estimation models significantly enhances the motion cues provided by optical flow, particularly in handling motion parallax issues. Our method is evaluated on the DAVIS-Moving and YTVOS-Moving datasets, and the results demonstrate that our method outperforms the best unsupervised method and closely matches the state-of-the-art supervised methods.

[CV-64] Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

Link: https://arxiv.org/abs/2406.18836
Authors: Huaying Zhang,Rintaro Yanagi,Ren Togo,Takahiro Ogawa,Miki Haseyama
Keywords: composed image retrieval, textual inversion network, zero-shot composed image, inversion network, zero-shot CIR
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted as a conference paper in IEEE ICIP 2024

Abstract:This paper proposes a novel zero-shot composed image retrieval (CIR) method considering the query-target relationship by masked image-text pairs. The objective of CIR is to retrieve the target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text and use a pre-trained visual-language model to realize the retrieval. However, they do not consider the query-target relationship to train the textual inversion network to acquire information for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting the abundant image-text pairs that are convenient to obtain with a masking strategy for learning the query-target relationship, it is expected that accurate zero-shot CIR using a retrieval-focused textual inversion network can be realized. Experimental results show the effectiveness of the proposed method.

[CV-65] Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis

Link: https://arxiv.org/abs/2406.18817
Authors: Mingyang Zhao,Jingen Jiang,Lei Ma,Shiqing Xin,Gaofeng Meng,Dong-Ming Yan
Keywords: unsupervised clustering analysis, non-rigid point set, target point sets, paper presents, inspired by unsupervised
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: [CVPR 2024 Highlight] Project and code at: this https URL

Abstract:This paper presents a novel non-rigid point set registration method that is inspired by unsupervised clustering analysis. Unlike previous approaches that treat the source and target point sets as separate entities, we develop a holistic framework where they are formulated as clustering centroids and clustering members, respectively. We then adopt Tikhonov regularization with an $\ell_1$-induced Laplacian kernel instead of the commonly used Gaussian kernel to ensure smooth and more robust displacement fields. Our formulation delivers closed-form solutions, theoretical guarantees, independence from dimensions, and the ability to handle large deformations. Subsequently, we introduce a clustering-improved Nyström method to effectively reduce the computational complexity and storage of the Gram matrix to linear, while providing a rigorous bound for the low-rank approximation. Our method achieves high accuracy results across various scenarios and surpasses competitors by a significant margin, particularly on shapes with substantial deformations. Additionally, we demonstrate the versatility of our method in challenging tasks such as shape transfer and medical registration.
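To make the Gram-matrix remark concrete, here is a minimal sketch of a plain Nyström approximation for a Laplacian kernel built on the L1 distance. It is not the paper's clustering-improved variant: landmarks are picked uniformly at random, and the kernel bandwidth `sigma`, the function names, and the toy point set are assumptions. It only illustrates why storage becomes linear in the number of points: an n x m factor and an m x m factor are kept instead of the full n x n matrix.

```python
import numpy as np

def laplacian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||_1 / sigma) for every pair of rows."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-l1 / sigma)

def nystrom_factors(X, landmark_idx, sigma=1.0):
    """Return C (n x m) and pinv(W) (m x m) so that K is approx. C @ pinv(W) @ C.T."""
    Z = X[landmark_idx]
    C = laplacian_kernel(X, Z, sigma)
    W = laplacian_kernel(Z, Z, sigma)
    return C, np.linalg.pinv(W)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # toy 3D point set
landmarks = rng.choice(len(X), size=20, replace=False)
C, W_pinv = nystrom_factors(X, landmarks)

# The full matrices are formed here only to inspect the approximation error.
K_exact = laplacian_kernel(X, X)
K_approx = C @ W_pinv @ C.T
rel_err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
print(f"relative Frobenius error with 20 landmarks: {rel_err:.3f}")
```

In practice a matrix-vector product with the approximation can be evaluated as `C @ (W_pinv @ (C.T @ v))`, so the n x n matrix never has to be materialized.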

[CV-66] Divide Ensemble and Conquer: The Last Mile on Unsupervised Domain Adaptation for On-Board Semantic Segmentation

Link: https://arxiv.org/abs/2406.18809
Authors: Tao Lian,Jose L. Gómez,Antonio M. López
Keywords: unsupervised domain adaptation, challenge of solving, UDA, mile of unsupervised, UDA methods
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The last mile of unsupervised domain adaptation (UDA) for semantic segmentation is the challenge of solving the syn-to-real domain gap. Recent UDA methods have progressed significantly, yet they often rely on strategies customized for synthetic single-source datasets (e.g., GTA5), which limits their generalisation to multi-source datasets. Conversely, synthetic multi-source datasets hold promise for advancing the last mile of UDA but remain underutilized in current research. Thus, we propose DEC, a flexible UDA framework for multi-source datasets. Following a divide-and-conquer strategy, DEC simplifies the task by categorizing semantic classes, training models for each category, and fusing their outputs by an ensemble model trained exclusively on synthetic datasets to obtain the final segmentation mask. DEC can integrate with existing UDA methods, achieving state-of-the-art performance on Cityscapes, BDD100K, and Mapillary Vistas, significantly narrowing the syn-to-real domain gap.

[CV-67] MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Link: https://arxiv.org/abs/2406.18790
Authors: William Berman,Alexander Peysakhovich
Keywords: man man, dog dog, picture, prompts of interleaved, interleaved text
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:We train a model to generate images from multimodal prompts of interleaved text and images such as “a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style.” We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.

[CV-68] WV-Net: A foundation model for SAR WV-mode satellite imagery trained using contrastive self-supervised learning on 10 million images

Link: https://arxiv.org/abs/2406.18765
Authors: Yannik Glaser,Justin E. Stopa,Linnea M. Wolniewicz,Ralph Foster,Doug Vandemark,Alexis Mouche,Bertrand Chapron,Peter Sadowski
Keywords: Space Agency Copernicus, European Space Agency, C-band synthetic aperture, Agency Copernicus, European Space
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 9 figures, submitted to NeurIPS 2024

Abstract:The European Space Agency’s Copernicus Sentinel-1 (S-1) mission is a constellation of C-band synthetic aperture radar (SAR) satellites that provide unprecedented monitoring of the world’s oceans. S-1’s wave mode (WV) captures 20x20 km image patches at 5 m pixel resolution and is unaffected by cloud cover or time-of-day. The mission’s open data policy has made SAR data easily accessible for a range of applications, but the need for manual image annotations is a bottleneck that hinders the use of machine learning methods. This study uses nearly 10 million WV-mode images and contrastive self-supervised learning to train a semantic embedding model called WV-Net. In multiple downstream tasks, WV-Net outperforms a comparable model that was pre-trained on natural images (ImageNet) with supervised learning. Experiments show improvements for estimating wave height (0.50 vs 0.60 RMSE using linear probing), estimating near-surface air temperature (0.90 vs 0.97 RMSE), and performing multilabel-classification of geophysical and atmospheric phenomena (0.96 vs 0.95 micro-averaged AUROC). WV-Net embeddings are also superior in an unsupervised image-retrieval task and scale better in data-sparse settings. Together, these results demonstrate that WV-Net embeddings can support geophysical research by providing a convenient foundation model for a variety of data analysis and exploration tasks.

[CV-69] 3D Feature Distillation with Object-Centric Priors

Link: https://arxiv.org/abs/2406.18742
Authors: Georgios Tziafas,Yucheng Xu,Zhibin Li,Hamidreza Kasaei
Keywords: Grounding natural language, natural language, physical world, ubiquitous topic, wide range
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted CoRL-24

Abstract:Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, in terms of both grounding accuracy and segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.

[CV-70] Towards Open-World Grasping with Large Vision-Language Models

Link: https://arxiv.org/abs/2406.18722
Authors: Georgios Tziafas,Hamidreza Kasaei
Keywords: language instructions constitutes, instructions constitutes, constitutes a fundamental, fundamental challenge, open-ended language instructions
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted CoRL24

Abstract:The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic context, but rely on external vision and action models to ground such knowledge into the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM’s reasoning capacity is constrained by the quality of visual grounding, and b) LLMs do not contain low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of tackling such limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world grasping pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct extensive evaluation in cluttered indoor scene datasets to showcase OWG’s robustness in grounding from open-ended language, as well as open-world robotic grasping experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods.

[CV-71] Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Link: https://arxiv.org/abs/2406.18717
Authors: Colton Stearns,Adam Harley,Mikaela Uy,Florian Dubost,Federico Tombari,Gordon Wetzstein,Leonidas Guibas
Keywords: exhibiting clear strengths, Gaussian, Dynamic Gaussian Marbles, exhibiting clear, compositional editability
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we extend the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian “marbles”, reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to guide the optimization towards solutions with coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the Dycheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality, and is on-par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.

[CV-72] SpY: A Context-Based Approach to Spacecraft Component Detection

Link: https://arxiv.org/abs/2406.18709
Authors: Trupti Mahendrakar,Ryan T. White,Madhur Tiwari
Keywords: autonomous on-orbit servicing, active debris removal, aid autonomous on-orbit, body panels, resident space object
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures

Abstract:This paper focuses on autonomously characterizing components such as solar panels, body panels, antennas, and thrusters of an unknown resident space object (RSO) using camera feed to aid autonomous on-orbit servicing (OOS) and active debris removal. Significant research has been conducted in this area using convolutional neural networks (CNNs). While CNNs are powerful at learning patterns and performing object detection, they struggle with missed detections and misclassifications in environments different from the training data, making them unreliable for safety in high-stakes missions like OOS. Additionally, failures exhibited by CNNs are often easily rectifiable by humans using commonsense reasoning and contextual knowledge. Embedding such reasoning in an object detector could improve detection accuracy. To validate this hypothesis, this paper presents an end-to-end object detector called SpaceYOLOv2 (SpY), which leverages the generalizability of CNNs while incorporating contextual knowledge using traditional computer vision techniques. SpY consists of two main components: a shape detector and the SpaceYOLO classifier (SYC). The shape detector uses CNNs to detect primitive shapes of RSOs and SYC associates these shapes with contextual knowledge, such as color and texture, to classify them as spacecraft components or “unknown” if the detected shape is uncertain. SpY’s modular architecture allows customizable usage of contextual knowledge to improve detection performance, or SYC as a secondary fail-safe classifier with an existing spacecraft component detector. Performance evaluations on hardware-in-the-loop images of a mock-up spacecraft demonstrate that SpY is accurate and an ensemble of SpY with YOLOv5 trained for satellite component detection improved the performance by 23.4% in recall, demonstrating enhanced safety for vision-based navigation tasks.

[CV-73] Geometric Features Enhanced Human-Object Interaction Detection

Link: https://arxiv.org/abs/2406.18691
Authors: Manli Zhu,Edmond S. L. Ho,Shuang Chen,Longzhi Yang,Hubert P. H. Shum
Keywords: Cameras are essential, essential vision instruments, Transformer-style HOI detection, HOI detection, essential vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE TIM

Abstract:Cameras are essential vision instruments to capture images for pattern detection and measurement. Human-object interaction (HOI) detection is one of the most popular pattern detection approaches for captured human-centric visual scenes. Recently, Transformer-based models have become the dominant approach for HOI detection due to their advanced network architectures and thus promising results. However, most of them follow the one-stage design of vanilla Transformer, leaving rich geometric priors under-exploited and leading to compromised performance especially when occlusion occurs. Given that geometric features tend to outperform visual ones in occluded scenarios and offer information that complements visual cues, we propose a novel end-to-end Transformer-style HOI detection model, i.e., geometric features enhanced HOI detector (GeoHOI). One key part of the model is a new unified self-supervised keypoint learning method named UniPointNet that bridges the gap of consistent keypoint representation across diverse object categories, including humans. GeoHOI effectively upgrades a Transformer-based HOI detector benefiting from the keypoints similarities measuring the likelihood of human-object interactions as well as local keypoint patches to enhance interaction query representation, so as to boost HOI predictions. Extensive experiments show that the proposed method outperforms the state-of-the-art models on V-COCO and achieves competitive performance on HICO-DET. Case study results on the post-disaster rescue with vision-based instruments showcase the applicability of the proposed GeoHOI in real-world applications.

[CV-74] CSI4Free: GAN-Augmented mmWave CSI for Improved Pose Classification

Link: https://arxiv.org/abs/2406.18684
Authors: Nabeel Nisar Bhat,Rafael Berkvens,Jeroen Famaey
Keywords: Joint Communication, demonstrated significant success, COTS Wi-Fi sensing, gesture recognition, recent years
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent years, Joint Communication and Sensing (JCS) has demonstrated significant success, particularly in utilizing sub-6 GHz frequencies with commercial-off-the-shelf (COTS) Wi-Fi devices for applications such as localization, gesture recognition, and pose classification. Deep learning and the existence of large public datasets have been pivotal in achieving such results. However, at mmWave frequencies (30-300 GHz), which have shown potential for more accurate sensing performance, there is a noticeable lack of research in the domain of COTS Wi-Fi sensing. Challenges such as limited research hardware, the absence of large datasets, limited functionality in COTS hardware, and the complexities of data collection present obstacles to a comprehensive exploration of this field. In this work, we aim to address these challenges by developing a method that can generate synthetic mmWave channel state information (CSI) samples. In particular, we use a generative adversarial network (GAN) on an existing dataset to generate 30,000 additional CSI samples. The augmented samples exhibit a remarkable degree of consistency with the original data, as indicated by the notably high GAN-train and GAN-test scores. Furthermore, we integrate the augmented samples in training a pose classification model. We observe that the augmented samples complement the real data and improve the generalization of the classification model.

[CV-75] IDA-UIE: An Iterative Framework for Deep Network-based Degradation Aware Underwater Image Enhancement

Link: https://arxiv.org/abs/2406.18628
Authors: Pranjali Singh, Prithwijit Guha
Keywords: Underwater image, affected by fluorescence, degradation, UIEB and EUVP, network
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract: Underwater image quality is affected by fluorescence, low illumination, absorption, and scattering. Recent works in underwater image enhancement have proposed different deep network architectures to handle these problems. Most of these works have proposed a single network to handle all the challenges. We believe that deep networks trained for specific conditions deliver better performance than a single network learned from all degradation cases. Accordingly, the first contribution of this work lies in the proposal of an iterative framework where a single dominant degradation condition is identified and resolved. This proposal considers the following eight degradation conditions: low illumination, low contrast, haziness, blurred image, presence of noise, and color imbalance in three different channels. A deep network is designed to identify the dominant degradation condition. Accordingly, an appropriate deep network is selected for degradation condition-specific enhancement. The second contribution of this work is the construction of degradation condition-specific datasets from good quality images of two standard datasets (UIEB and EUVP). These datasets are used to learn the condition-specific enhancement networks. The proposed approach is found to outperform nine baseline methods on the UIEB and EUVP datasets.

[CV-76] Vox-UDA: Voxel-wise Unsupervised Domain Adaptation for Cryo-Electron Subtomogram Segmentation with Denoised Pseudo Labeling

Link: https://arxiv.org/abs/2406.18610
Authors: Haoran Li, Xingjian Li, Jiahua Shi, Huaming Chen, Bo Du, Daisuke Kihara, Johan Barthelemy, Jun Shen, Min Xu
Keywords: imaging technology facilitating, Cryo-Electron Tomography, imaging technology, near-atomic resolution, technology facilitating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract: Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology facilitating the study of macromolecular structures at near-atomic resolution. Recent volumetric segmentation approaches on cryo-ET images have drawn widespread interest in the biological sector. However, existing methods heavily rely on manually labeled data, which requires highly professional skills, thereby hindering the adoption of fully-supervised approaches for cryo-ET images. Some unsupervised domain adaptation (UDA) approaches have been designed to enhance the segmentation network performance using unlabeled data. However, applying these methods directly to cryo-ET image segmentation tasks remains challenging due to two main issues: 1) the source data, usually obtained through simulation, contain a certain level of noise, while the target data, collected directly as raw data from real-world scenarios, have unpredictable noise levels; 2) the source data used for training typically consist of known macromolecules, while the target domain data are often unknown, causing the model’s segmenter to be biased towards these known macromolecules and leading to a domain shift problem. To address these challenges, in this work, we introduce the first voxel-wise unsupervised domain adaptation approach, termed Vox-UDA, specifically for cryo-ET subtomogram segmentation. Vox-UDA incorporates a noise generation module to simulate target-like noise in the source dataset for cross-noise level adaptation. Additionally, we propose a denoised pseudo-labeling strategy based on an improved Bilateral Filter to alleviate the domain shift problem. Experimental results on both simulated and real cryo-ET subtomogram datasets demonstrate the superiority of our proposed approach compared to state-of-the-art UDA methods.

[CV-77] Realtime Dynamic Gaze Target Tracking and Depth-Level Estimation

Link: https://arxiv.org/abs/2406.18595
Authors: Esmaeil Seraj, Harsh Bhate, Walter Talamonti
Keywords: revolutionize user experiences, burgeoning field, poised to revolutionize, Transparent Displays, revolutionize user
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: The integration of Transparent Displays (TD) in various applications, such as Heads-Up Displays (HUDs) in vehicles, is a burgeoning field, poised to revolutionize user experiences. However, this innovation brings forth significant challenges in realtime human-device interaction, particularly in accurately identifying and tracking a user’s gaze on dynamically changing TDs. In this paper, we present a two-fold robust and efficient systematic solution for realtime gaze monitoring, comprising: (1) a tree-based algorithm for identifying and dynamically tracking gaze targets (i.e., moving, size-changing, and overlapping 2D content) projected on a transparent display, in realtime; (2) a multi-stream self-attention architecture to estimate the depth-level of human gaze from eye tracking data, to account for the display’s transparency and prevent undesired interactions with the TD. We collected a real-world eye-tracking dataset to train and test our gaze monitoring system. We present extensive results and ablation studies, including inference experiments on System on Chip (SoC) evaluation boards, demonstrating our model’s scalability, precision, and realtime feasibility in both static and dynamic contexts. Our solution marks a significant stride in enhancing next-generation user-device interaction and experience, setting a new benchmark for algorithmic gaze monitoring technology in dynamic transparent displays.

[CV-78] Neural Appearance Modeling From Single Images

Link: https://arxiv.org/abs/2406.18593
Authors: Jay Idema, Pieter Peers
Keywords: material appearance modeling, appearance modeling neural, single input photograph, modeling neural network, appearance estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 13 pages, 10 figures

Abstract:We propose a material appearance modeling neural network for visualizing plausible, spatially-varying materials under diverse view and lighting conditions, utilizing only a single photograph of a material under co-located light and view as input for appearance estimation. Our neural architecture is composed of two network stages: a network that infers learned per-pixel neural parameters of a material from a single input photograph, and a network that renders the material utilizing these neural parameters, similar to a BRDF. We train our model on a set of 312,165 synthetic spatially-varying exemplars. Since our method infers learned neural parameters rather than analytical BRDF parameters, our method is capable of encoding anisotropic and global illumination (inter-pixel interaction) information into individual pixel parameters. We demonstrate our model’s performance compared to prior work and demonstrate the feasibility of the render network as a BRDF by implementing it into the Mitsuba3 rendering engine. Finally, we briefly discuss the capability of neural parameters to encode global illumination information.

[CV-79] Composition Vision-Language Understanding via Segment and Depth Anything Model

Link: https://arxiv.org/abs/2406.18591
Authors: Mingxiao Huo, Pengliang Ji, Haotian Lin, Junchen Liu, Yixiao Wang, Yijun Chen
Keywords: augment neural comprehension, model zero-shot understanding, language-vision model zero-shot, pioneering unified library, zero-shot understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: We introduce a pioneering unified library that leverages the Depth Anything and Segment Anything models to augment neural comprehension in language-vision model zero-shot understanding. This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V, enhancing multimodal tasks such as vision-question-answering (VQA) and composition reasoning. Through the fusion of segmentation and depth analysis at the symbolic instance level, our library provides nuanced inputs for language models, significantly advancing image interpretation. Validated across a spectrum of in-the-wild real-world images, our findings showcase progress in vision-language models through neural-symbolic integration. This novel approach melds visual and language analysis in an unprecedented manner. Overall, our library opens new directions for future research aimed at decoding the complexities of the real world through advanced multimodal technologies. Our code is available at this https URL.

[CV-80] Text-Guided Alternative Image Clustering

Link: https://arxiv.org/abs/2406.18589
Authors: Andreas Stephan, Lukas Miklautz, Collin Leiber, Pedro Henrique Luz de Araujo, Dominik Répás, Claudia Plant, Benjamin Roth
Keywords: Traditional image clustering, Traditional image, alternative image clustering, image clustering techniques, alternative image
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Traditional image clustering techniques only find a single grouping within visual data. In particular, they do not provide a possibility to explicitly define multiple types of clustering. This work explores the potential of large vision-language models to facilitate alternative image clustering. We propose Text-Guided Alternative Image Consensus Clustering (TGAICC), a novel approach that leverages user-specified interests via prompts to guide the discovery of diverse clusterings. To achieve this, it generates a clustering for each prompt, groups them using hierarchical clustering, and then aggregates them using consensus clustering. TGAICC outperforms image- and text-based baselines on four alternative image clustering benchmark datasets. Furthermore, using count-based word statistics, we are able to obtain text-based explanations of the alternative clusterings. In conclusion, our research illustrates how contemporary large vision-language models can transform explanatory data analysis, enabling the generation of insightful, customizable, and diverse image clusterings.
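
The aggregation step mentioned in the abstract above can be illustrated with a small sketch. This is not the TGAICC implementation: it assumes one common form of consensus clustering (a co-association matrix followed by thresholded merging), and the function names and the `min_agreement` parameter are hypothetical.

```python
from itertools import combinations


def co_association(clusterings):
    """Fraction of clusterings in which each pair of items shares a cluster.

    `clusterings` is a list of label lists, one label per item.
    Returns a dict mapping (i, j) with i < j to a value in [0, 1].
    """
    n_items = len(clusterings[0])
    counts = {pair: 0 for pair in combinations(range(n_items), 2)}
    for labels in clusterings:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(clusterings) for pair, c in counts.items()}


def consensus_clusters(clusterings, min_agreement=0.5):
    """Merge items whose co-association exceeds `min_agreement`.

    Assumption: groups are the connected components of the thresholded
    co-association graph, found with a small union-find structure.
    """
    n_items = len(clusterings[0])
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), score in co_association(clusterings).items():
        if score > min_agreement:
            parent[find(i)] = find(j)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())
```

For example, three clusterings of four images that mostly agree on the split {0, 1} versus {2, 3} yield those two groups; raising `min_agreement` makes the consensus stricter.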

[CV-81] Varying Manifolds in Diffusion: From Time-varying Geometries to Visual Saliency

Link: https://arxiv.org/abs/2406.18588
Authors: Junhao Chen, Manyi Li, Zherong Pan, Xifeng Gao, Changhe Tu
Keywords: Deep generative models, Deep generative, generative models learn, generation, generation rate
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Deep generative models learn the data distribution, which is concentrated on a low-dimensional manifold. The geometric analysis of distribution transformation provides a better understanding of data structure and enables a variety of applications. In this paper, we study the geometric properties of the diffusion model, whose forward diffusion process and reverse generation process construct a series of distributions on manifolds which vary over time. Our key contribution is the introduction of generation rate, which corresponds to the local deformation of manifold over time around an image component. We show that the generation rate is highly correlated with intuitive visual properties, such as visual saliency, of the image component. Further, we propose an efficient and differentiable scheme to estimate the generation rate for a given image component over time, giving rise to a generation curve. The differentiable nature of our scheme allows us to control the shape of the generation curve via optimization. Using different loss functions, our generation curve matching algorithm provides a unified framework for a range of image manipulation tasks, including semantic transfer, object removal, saliency manipulation, image blending, etc. We conduct comprehensive analytical evaluations to support our findings and evaluate our framework on various manipulation tasks. The results show that our method consistently leads to better manipulation results, compared to recent baselines.

[CV-82] Nomic Embed Vision: Expanding the Latent Space

Link: https://arxiv.org/abs/2406.18587
Authors: Zach Nussbaum, Brandon Duderstadt, Andriy Mulyar
Keywords: open-weights image embedding, technical report describes, image embedding model, highly performant, open-weights image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This technical report describes the training of nomic-embed-vision, a highly performant, open-code, open-weights image embedding model that shares the same latent space as nomic-embed-text. Together, nomic-embed-vision and nomic-embed-text form the first unified latent space to achieve high performance across vision, language, and multimodal tasks.

[CV-83] Cut-and-Paste with Precision: a Content and Perspective-aware Data Augmentation for Road Damage Detection

Link: https://arxiv.org/abs/2406.18586
Authors: Punnawat Siripathitti, Florent Forest, Olga Fink
Keywords: issues posing significant, posing significant challenges, Road Damage Detection, Damage Detection Challenge, damage detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Extended abstract. 2 pages

Abstract:Damage to road pavement can develop into cracks, potholes, spallings, and other issues posing significant challenges to the integrity, safety, and durability of the road structure. Detecting and monitoring the evolution of these damages is crucial for maintaining the condition and structural health of road infrastructure. In recent years, researchers have explored various data-driven methods for image-based damage detection in road monitoring applications. The field gained attention with the introduction of the Road Damage Detection Challenge (RDDC2018), encouraging competition in developing object detectors on street-view images from various countries. Leading teams have demonstrated the effectiveness of ensemble models, mostly based on the YOLO and Faster R-CNN series. Data augmentations have also shown benefits in object detection within the computer vision field, including transformations such as random flipping, cropping, cutting out patches, as well as cut-and-pasting object instances. Applying cut-and-paste augmentation to road damages appears to be a promising approach to increase data diversity. However, the standard cut-and-paste technique, which involves sampling an object instance from a random image and pasting it at a random location onto the target image, has demonstrated limited effectiveness for road damage detection. This method overlooks the location of the road and disregards the difference in perspective between the sampled damage and the target image, resulting in unrealistic augmented images. In this work, we propose an improved Cut-and-Paste augmentation technique that is both content-aware (i.e. considers the true location of the road in the image) and perspective-aware (i.e. takes into account the difference in perspective between the injected damage and the target image).

[CV-84] Flexible ViG: Learning the Self-Saliency for Flexible Object Recognition

Link: https://arxiv.org/abs/2406.18585
Authors: Lin Zuo, Kunshan Yang, Xianlong Tian, Kunbin He, Yongqi Ding, Mengmeng Jing
Keywords: Existing computer vision, flexible objects, objects remains unexplored, flexible objects remains, Existing computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract: Existing computer vision methods mainly focus on the recognition of rigid objects, whereas the recognition of flexible objects remains unexplored. Recognizing flexible objects poses significant challenges due to their inherently diverse shapes and sizes, translucent attributes, ambiguous boundaries, and subtle inter-class differences. In this paper, we claim that these problems primarily arise from the lack of object saliency. To this end, we propose the Flexible Vision Graph Neural Network (FViG) to optimize the self-saliency and thereby improve the discrimination of the representations for flexible objects. Specifically, on one hand, we propose to maximize the channel-aware saliency by extracting the weight of neighboring nodes, which adapts to the shape and size variations in flexible objects. On the other hand, we maximize the spatial-aware saliency based on clustering to aggregate neighborhood information for the centroid nodes, which introduces local context information for the representation learning. To thoroughly verify the performance of flexible object recognition, we propose, for the first time, the Flexible Dataset (FDA), which consists of various images of flexible objects collected from real-world scenarios or online. Extensive experiments on our Flexible Dataset demonstrate the effectiveness of our method in enhancing the discrimination of flexible objects.

[CV-85] Assessment of Sentinel-2 spatial and temporal coverage based on the scene classification layer

Link: https://arxiv.org/abs/2406.18584
Authors: Cristhian Sanchez, Francisco Mena, Marcela Charfuelan, Marlon Nuske, Andreas Dengel
Keywords: diverse applications, coverage, SCL, high cloud coverage, SCL data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2024

Abstract: Since the launch of the Sentinel-2 (S2) satellites, many ML models have used the data for diverse applications. The scene classification layer (SCL) inside the S2 product provides rich information for training, such as filtering images with high cloud coverage. However, there is more potential in this. We propose a technique to assess the clean optical coverage of a region, expressed by a SITS and calculated with the S2-based SCL data. With a manual threshold and specific labels in the SCL, the proposed technique assigns a percentage of spatial and temporal coverage across the time series and a high/low assessment. By evaluating the AI4EO challenge for Enhanced Agriculture, we show that the assessment is correlated to the predictive results of ML models. The classification results in a region with low spatial and temporal coverage are worse than in a region with high coverage. Finally, we applied the technique across all continents of the global dataset LandCoverNet.
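
A minimal sketch of how such a coverage assessment might be computed from SCL frames follows. The abstract does not give the exact definitions, so everything here is an assumption: the set of "clean" labels (SCL codes 4 vegetation, 5 not vegetated, 6 water), the 0.5 threshold, the way temporal coverage is derived, and the high/low rule are all hypothetical choices for illustration.

```python
# Sentinel-2 SCL class codes treated as "clean" in this sketch (an assumption):
# 4 = vegetation, 5 = not vegetated, 6 = water.
CLEAN_LABELS = frozenset({4, 5, 6})


def spatial_coverage(scl_frame, clean_labels=CLEAN_LABELS):
    """Fraction of pixels in one SCL frame whose label counts as clean."""
    pixels = [label for row in scl_frame for label in row]
    if not pixels:
        return 0.0
    return sum(label in clean_labels for label in pixels) / len(pixels)


def assess_coverage(scl_series, threshold=0.5, clean_labels=CLEAN_LABELS):
    """Summarise a time series of SCL frames (each a 2D list of class codes).

    Returns (mean spatial coverage, temporal coverage, "high" or "low").
    Temporal coverage is taken to be the share of frames whose spatial
    coverage reaches the threshold; the rating rule is hypothetical.
    """
    per_frame = [spatial_coverage(frame, clean_labels) for frame in scl_series]
    if not per_frame:
        return 0.0, 0.0, "low"
    spatial = sum(per_frame) / len(per_frame)
    temporal = sum(c >= threshold for c in per_frame) / len(per_frame)
    rating = "high" if min(spatial, temporal) >= threshold else "low"
    return spatial, temporal, rating
```

A region whose frames are mostly cloud (codes 8 and 9) would receive a low rating under this rule, which matches the intuition in the abstract that low-coverage regions are harder for downstream models.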

[CV-86] Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Link: https://arxiv.org/abs/2406.18583
Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao
Keywords: Flow-based Large Diffusion, Flow-based Large, Large Diffusion Transformers, family of Flow-based, Large Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code at: this https URL

Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all code and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.

[CV-87] Canonical Consolidation Fields: Reconstructing Dynamic Shapes from Point Clouds

Link: https://arxiv.org/abs/2406.18582
Authors: Miaowei Wang, Changjian Li, Amir Vaxman
Keywords: single deforming coherent, deforming coherent shape, present Canonical Consolidation, independently-sampled point clouds, reconstructing a time
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We present Canonical Consolidation Fields (CanFields): a method for reconstructing a time series of independently-sampled point clouds into a single deforming coherent shape. Such input often comes from motion capture. Existing methods either couple the geometry and the deformation, where by doing so they smooth fine details and lose the ability to track moving points, or they track the deformation explicitly, but introduce topological and geometric artifacts. Our novelty lies in the consolidation of the point clouds into a single canonical shape in a way that reduces the effect of noise and outliers, and enables us to overcome missing regions. We simultaneously reconstruct the velocity fields that guide the deformation. This consolidation allows us to retain the high-frequency details of the geometry, while faithfully reproducing the low-frequency deformation. Our architecture comprises simple components, and fits any single input shape without using datasets. We demonstrate the robustness and accuracy of our methods on a diverse benchmark of dynamic point clouds, including missing regions, sparse frames, and noise.

[CV-88] Dream-in-Style: Text-to-3D Generation using Stylized Score Distillation

Link: https://arxiv.org/abs/2406.18581
Authors: Hubert Kompanowski, Binh-Son Hua
Keywords: reference image, stylized score distillation, text prompt, style reference image, stylized score
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We present a method to generate 3D objects in styles. Our method takes a text prompt and a style reference image as input and reconstructs a neural radiance field to synthesize a 3D model with the content aligning with the text prompt and the style following the reference image. To simultaneously generate the 3D object and perform style transfer in one go, we propose a stylized score distillation loss to guide a text-to-3D optimization process to output visually plausible geometry and appearance. Our stylized score distillation is based on a combination of an original pretrained text-to-image model and its modified sibling with the key and value features of self-attention layers manipulated to inject styles from the reference image. Comparisons with state-of-the-art methods demonstrated the strong visual performance of our method, further supported by the quantitative results from our user study.

[CV-89] Shedding Light on Large Generative Networks: Estimating Epistemic Uncertainty in Diffusion Models

Link: https://arxiv.org/abs/2406.18580
Authors: Lucas Berry, Axel Brando, David Meger
Keywords: pose significant challenges, Generative diffusion models, large parameter count, Generative diffusion, traditional uncertainty estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Generative diffusion models, notable for their large parameter count (exceeding 100 million) and operation within high-dimensional image spaces, pose significant challenges for traditional uncertainty estimation methods due to computational demands. In this work, we introduce an innovative framework, Diffusion Ensembles for Capturing Uncertainty (DECU), designed for estimating epistemic uncertainty for diffusion models. The DECU framework introduces a novel method that efficiently trains ensembles of conditional diffusion models by incorporating a static set of pre-trained parameters, drastically reducing the computational burden and the number of parameters that require training. Additionally, DECU employs Pairwise-Distance Estimators (PaiDEs) to accurately measure epistemic uncertainty by evaluating the mutual information between model outputs and weights in high-dimensional spaces. The effectiveness of this framework is demonstrated through experiments on the ImageNet dataset, highlighting its capability to capture epistemic uncertainty, specifically in under-sampled image classes.

[CV-90] Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching

Link: https://arxiv.org/abs/2406.18579
Authors: Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Jie Wang, Joemon M. Jose
Keywords: Image-text matching, computer vision, fundamental problem, problem in computer, explicit
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 22 pages, 5 figures, 6 tables; an extension of CMSEI in WACV23, submitted to ACM TIST. arXiv admin note: text overlap with arXiv:2210.08908

Abstract: Image-text matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representation to estimate their similarity accurately. Most existing methods focus on feature enhancement within modality or feature interaction across modalities, which, however, neglects the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects with salient spatial and semantic relational connectivities, guided by the explicit relationships of the objects’ spatial positions and their scene graph. We use implicit relationship modelling for potential relationship interactions before explicit modelling to improve the fault tolerance of explicit relationship detection. Then the visual and textual semantic representations are refined jointly via inter-modal interactive attention and cross-modal alignment. To correlate the context of objects with the textual context, we further refine the visual semantic representation via cross-level object-sentence and word-image-based interactive attention. Extensive experiments validate that the proposed hybrid-modal interaction with implicit and explicit modelling is more beneficial for image-text matching. And the proposed Hire obtains new state-of-the-art results on MS-COCO and Flickr30K benchmarks.

[CV-91] Negative Prototypes Guided Contrastive Learning for WSOD

Link: https://arxiv.org/abs/2406.18576
Authors: Yu Zhang, Chuang Zhu, Guoqing Yang, Siqi Chen
Keywords: Weakly Supervised Object, Supervised Object Detection, Weakly Supervised, Object Detection, Supervised Object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Weakly Supervised Object Detection (WSOD) with only image-level annotation has recently attracted wide attention. Many existing methods ignore the inter-image relationship of instances that share similar characteristics yet can certainly be determined not to belong to the same category. Therefore, in order to make full use of the weak label, we propose the Negative Prototypes Guided Contrastive learning (NPGC) architecture. Firstly, we define a Negative Prototype as the proposal with the highest confidence score misclassified for a category that does not appear in the label. Unlike other methods that only utilize category-positive features, we construct an online updated global feature bank to store both positive prototypes and negative prototypes. Meanwhile, we propose a pseudo label sampling module to mine reliable instances and discard the easily misclassified instances based on the feature similarity with corresponding prototypes in the global feature bank. Finally, we follow the contrastive learning paradigm to optimize the proposal’s feature representation by attracting same-class samples closer and pushing different-class samples away in the embedding space. Extensive experiments on the VOC07 and VOC12 datasets show that our proposed method achieves state-of-the-art performance.
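
The definition of a negative prototype given in the abstract can be read as a simple selection rule. The sketch below is one possible reading, not the NPGC code: it assumes per-proposal class scores are available as plain lists, and the function name and return format are hypothetical.

```python
def negative_prototypes(proposal_scores, image_labels):
    """Pick, for each class absent from the image label, the most confident proposal.

    proposal_scores: list of per-proposal score lists, one score per class.
    image_labels: set of class indices present in the image-level label.
    Returns {absent_class: (proposal_index, score)}.

    Assumption: "misclassified" is interpreted as "scored highly for a class
    the image-level label rules out"; the paper may apply further filtering.
    """
    if not proposal_scores:
        return {}
    n_classes = len(proposal_scores[0])
    result = {}
    for cls in range(n_classes):
        if cls in image_labels:
            continue  # classes present in the label cannot yield negatives
        best = max(range(len(proposal_scores)),
                   key=lambda p: proposal_scores[p][cls])
        result[cls] = (best, proposal_scores[best][cls])
    return result
```

Because the image-level label guarantees that the absent classes do not occur, any proposal selected this way is a known false positive, which is what makes it usable as a negative anchor.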

[CV-92] Research on Driver Facial Fatigue Detection Based on Yolov8 Model

Link: https://arxiv.org/abs/2406.18575
Authors: Chang Zhou, Yang Zhao, Shaobo Liu, Yi Zhao, Xingchen Li, Chiyu Cheng
Keywords: accidents frequently occur, frequently occur, grave issue, traffic accidents frequently, fatigue driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by the 5th International Conference on Information Science, Parallel and Distributed Systems (ISPDS 2024), 2024 IEEE

Abstract: In a society where traffic accidents frequently occur, fatigue driving has emerged as a grave issue. Fatigue driving detection technology, especially that based on the YOLOv8 deep learning model, has seen extensive research and application as an effective preventive measure. This paper discusses in depth the methods and technologies utilized in the YOLOv8 model to detect driver fatigue, elaborates on the current research status both domestically and internationally, and systematically introduces the processing methods and algorithm principles for various datasets. This study aims to provide a robust technical solution for preventing and detecting fatigue driving, thereby contributing significantly to reducing traffic accidents and safeguarding lives.

[CV-93] Unsupervised Few-Shot Continual Learning for Remote Sensing Image Scene Classification

Link: https://arxiv.org/abs/2406.18574
Authors: Muhammad Anwar Ma’sum, Mahardhika Pratama, Ramasamy Savitha, Lin Liu, Habibullah, Ryszard Kowalczyk
Keywords: varying camera parameters, remote sensing image, sensing image analysis, remote sensing, spectral ranges
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review for Publication in IEEE TGRS

Abstract: A continual learning (CL) model is desired for remote sensing image analysis because of varying camera parameters, spectral ranges, resolutions, etc. There exist some recent initiatives to develop CL techniques in this domain, but they still depend on massive labelled samples, which do not fully fit remote sensing applications because ground truths are often obtained via field-based surveys. This paper addresses this problem by proposing an unsupervised flat-wide learning approach (UNISA) for unsupervised few-shot continual learning in remote sensing image scene classification, which does not depend on any labelled samples for its model updates. UNISA is developed from the idea of prototype scattering and positive sampling for learning representations, while the catastrophic forgetting problem is tackled with the flat-wide learning approach combined with a ball generator to address the data scarcity problem. Our numerical study with remote sensing image scene datasets and a hyperspectral dataset confirms the advantages of our solution. Source code of UNISA is shared publicly at this https URL to allow convenient future studies and reproduction of our numerical results.

[CV-94] Generating grid maps via the snake model

Link: https://arxiv.org/abs/2406.18573
Authors: Zhiwei Wei, Nai Yang, Wenjia Xu, Su Ding
Keywords: possessing unique attributes, region centroids, displace region centroids, geospatial visualization, possessing unique
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Graphics (cs.GR)
Comments: 10 Pages, 8 Figures

Abstract:The grid map, often referred to as the tile map, stands as a vital tool in geospatial visualization, possessing unique attributes that differentiate it from more commonly known techniques such as choropleths and cartograms. It transforms geographic regions into grids, which requires the displacement of both region centroids and boundary nodes to establish a coherent grid arrangement. However, existing approaches typically displace region centroids and boundary nodes separately, potentially resulting in self-intersected boundaries and compromised relative orientation relations between regions. In this paper, we introduce a novel approach that leverages the Snake displacement algorithm from cartographic generalization to concurrently displace region centroids and boundary nodes. The revised Constrained Delaunay triangulation (CDT) is employed to represent the relations between regions and serves as a structural foundation for the Snake algorithm. Forces for displacing the region centroids into a grid-like pattern are then computed. These forces are iteratively applied within the Snake model until a satisfactory new boundary is achieved. Subsequently, the grid map is created by aligning the grids with the newly generated boundary, utilizing a one-to-one match algorithm to assign each region to a specific grid. Experimental results demonstrate that the proposed approach excels in maintaining the relative orientation and global shape of regions, albeit with a potential increase in local location deviations. We also present two strategies aligned with existing approaches to generate diverse grid maps for user preferences. Further details and resources are available on our project website: this https URL.

[CV-95] GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

Link: https://arxiv.org/abs/2406.18572
Authors: Ling Li, Yu Ye, Bingchuan Jiang, Wei Zeng
Keywords: large vision-language model, vision-language model, work tackles, tackles the problem, large vision-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at this https URL.

[CV-96] UltraCortex: Submillimeter Ultra-High Field 9.4 T1 Brain MR Image Collection and Manual Cortical Segmentations

Link: https://arxiv.org/abs/2406.18571
Authors: Lucas Mahler, Julius Steiglechner, Benjamin Bender, Tobias Lindig, Dana Ramadan, Jonas Bause, Florian Birk, Rahel Heule, Edyta Charyasz, Michael Erb, Vinod Jangir Kumar, Gisela E Hagberg, Pascal Martin, Gabriele Lohmann, Klaus Scheffler
Keywords: houses magnetic resonance, magnetic resonance imaging, human brain obtained, houses magnetic
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The UltraCortex repository (this https URL) houses magnetic resonance imaging data of the human brain obtained at an ultra-high field strength of 9.4 T. It contains 86 structural MR images with spatial resolutions ranging from 0.6 to 0.8 mm. Additionally, the repository includes segmentations of 12 brains into gray and white matter compartments. These segmentations have been independently validated by two expert neuroradiologists, thus establishing them as a reliable gold standard. This resource provides researchers with access to high-quality brain imaging data and validated segmentations, facilitating neuroimaging studies and advancing our understanding of brain structure and function. Existing repositories do not accommodate field strengths beyond 7 T, nor do they offer validated segmentations, underscoring the significance of this new resource.

[CV-97] It’s a Feature, Not a Bug: Measuring Creative Fluidity in Image Generators

Link: https://arxiv.org/abs/2406.18570
Authors: Aditi Ramaswamy, Melane Navaratnarajah, Hana Chockler
Keywords: AI-generated art, heated debates, rise of freely, concerns the concept, concept of human
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the rise of freely available image generators, AI-generated art has become the centre of a series of heated debates, one of which concerns the concept of human creativity. Can an image generation AI exhibit “creativity” of the same type that artists do, and if so, how does that manifest? Our paper attempts to define and empirically measure one facet of creative behavior in AI, by conducting an experiment to quantify the “fluidity of prompt interpretation”, or just “fluidity”, in a series of selected popular image generators. To study fluidity, we (1) introduce a clear definition for it, (2) create chains of auto-generated prompts and images seeded with an initial “ground-truth” image, (3) measure these chains’ breakage points using preexisting visual and semantic metrics, and (4) use both statistical tests and visual explanations to study these chains and determine whether the image generators used to produce them exhibit significant fluidity.

[CV-98] FLOW: Fusing and Shuffling Global and Local Views for Cross-User Human Activity Recognition with IMUs

Link: https://arxiv.org/abs/2406.18569
Authors: Qi Qiu, Tao Zhu, Furong Duan, Kevin I-Kai Wang, Liming Chen, Mingxing Nie
Keywords: Inertial Measurement Unit, Human Activity Recognition, Inertial Measurement, Measurement Unit, Activity Recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Inertial Measurement Unit (IMU) sensors are widely employed for Human Activity Recognition (HAR) due to their portability, energy efficiency, and growing research interest. However, a significant challenge for IMU-HAR models is achieving robust generalization performance across diverse users. This limitation stems from substantial variations in data distribution among individual users. One primary reason for this distribution disparity lies in the representation of IMU sensor data in the local coordinate system, which is susceptible to subtle user variations during IMU wearing. To address this issue, we propose a novel approach that extracts a global view representation based on the characteristics of IMU data, effectively alleviating the data distribution discrepancies induced by wearing styles. To validate the efficacy of the global view representation, we fed both global and local view data into the model for experiments. The results demonstrate that global view data significantly outperforms local view data in cross-user experiments. Furthermore, we propose a Multi-view Supervised Network (MVFNet) based on Shuffling to effectively fuse local view and global view data. It supervises the feature extraction of each view through view division and view shuffling, so that the model avoids ignoring important features as far as possible. Extensive experiments conducted on the OPPORTUNITY and PAMAP2 datasets demonstrate that the proposed algorithm outperforms the current state-of-the-art methods in cross-user HAR.

[CV-99] A Diagnostic Model for Acute Lymphoblastic Leukemia Using Metaheuristics and Deep Learning Methods

Link: https://arxiv.org/abs/2406.18568
Authors: M. Hosseinzadeh, P. Khoshaght, S. Sadeghi, P. Asghari, Z. Arabi, J. Lansky, P. Budinsky, A. Masoud Rahmani, S. W. Lee
Keywords: Acute lymphoblastic leukemia, abnormal white blood, white blood cells, Acute lymphoblastic, blast cell characteristics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Acute lymphoblastic leukemia (ALL) severity is determined by the presence and ratios of blast cells (abnormal white blood cells) in both bone marrow and peripheral blood. Manual diagnosis of this disease is a tedious and time-consuming operation, making it difficult for professionals to accurately examine blast cell characteristics. To address this difficulty, researchers use deep learning and machine learning. In this paper, a ResNet-based feature extractor is utilized to detect ALL, along with a variety of feature selectors and classifiers. To get the best results, a variety of transfer learning models, including the ResNet, VGG, EfficientNet, and DenseNet families, are used as deep feature extractors. Following extraction, different feature selectors are used, including Genetic algorithm, PCA, ANOVA, Random Forest, Univariate, Mutual information, Lasso, XGB, Variance, and Binary ant colony. After feature qualification, a variety of classifiers are used, with MLP outperforming the others. The recommended technique is used to categorize ALL and HEM in the selected dataset, C-NMC 2019. This technique achieved 90.71% accuracy and 95.76% sensitivity for the relevant classifications, and its metrics on this dataset outperformed others.

[CV-100] Research on Image Processing and Vectorization Storage Based on Garage Electronic Maps

Link: https://arxiv.org/abs/2406.18567
Authors: Nan Dou, Qi Shi, Zhigang Lian
Keywords: large underground parking, purpose of achieving, precise definition, large underground, rasterization storage
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

Abstract:To achieve a more precise definition and data analysis of images, this study conducted research on vectorization and rasterization storage of electronic maps, focusing on a large underground parking garage map. During the research, image processing, vectorization and rasterization storage were performed. The paper proposed a method for the vectorization classification storage of indoor two-dimensional map raster data. This method involves converting raster data into vector data and classifying elements such as parking spaces, pathways, and obstacles based on their coordinate positions with the grid indexing method, thereby facilitating efficient storage and rapid querying of indoor maps. Additionally, interpolation algorithms were employed to extract vector data and convert it into raster data. Navigation testing was conducted to validate the accuracy and reliability of the map model under this method, providing effective technical support for the digital storage and navigation of garage maps.
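
The grid indexing idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `GridIndex` class, the cell size, and the element names are all hypothetical, and each map element is reduced to a single point for brevity.

```python
from collections import defaultdict


class GridIndex:
    """Uniform grid index over classified map elements (hypothetical sketch)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        # Maps (column, row) of a grid cell to the elements stored in it.
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Integer cell coordinates of the cell containing point (x, y).
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, kind, x, y):
        # kind is the element class, e.g. "parking_space" or "obstacle".
        self.cells[self._cell(x, y)].append((kind, x, y))

    def query(self, x, y, kind=None):
        """Return elements in the cell containing (x, y), optionally by class."""
        items = self.cells.get(self._cell(x, y), [])
        return [it for it in items if kind is None or it[0] == kind]


index = GridIndex(cell_size=5.0)
index.insert("parking_space", 2.5, 3.0)
index.insert("obstacle", 4.0, 1.0)
index.insert("pathway", 12.0, 3.0)
nearby_obstacles = index.query(3.0, 2.0, kind="obstacle")
```

A query only inspects one cell, which is why this kind of index supports fast lookups; a real map store would also need to handle elements that span several cells.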

[CV-101] Memorized Images in Diffusion Models share a Subspace that can be Located and Deleted

Link: https://arxiv.org/abs/2406.18566
Authors: Ruchika Chavhan, Ondrej Bohdal, Yongshuo Zong, Da Li, Timothy Hospedales
Keywords: samples raising copyright, raising copyright infringement, generating high-quality images, replicate exact training, training samples raising
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large-scale text-to-image diffusion models excel in generating high-quality images from textual inputs, yet concerns arise as research indicates their tendency to memorize and replicate exact training samples, raising copyright infringement and privacy issues. Efforts within the text-to-image community to address memorization explore causes such as data duplication, replicated captions, or trigger tokens, proposing per-prompt inference-time or training-time mitigation strategies. In this paper, we focus on the feed-forward layers and begin by contrasting neuron activations of a set of memorized and non-memorized prompts. Experiments reveal a surprising finding: many different sets of memorized prompts significantly activate a common subspace in the model, demonstrating, for the first time, that memorization in the diffusion models lies in a special subspace. Subsequently, we introduce a novel post-hoc method for editing pre-trained models, whereby memorization is mitigated through the straightforward pruning of weights in specialized subspaces, avoiding the need to disrupt the training or inference process as seen in prior research. Finally, we demonstrate the robustness of the pruned model against training data extraction attacks, thereby unveiling new avenues for a practical and one-for-all solution to memorization.
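
The "contrast activations, then prune" idea can be sketched in a few lines. This is a simplified illustration and not the paper's method: it selects individual neurons by a mean-activation gap rather than locating a subspace, and the function names, the threshold, and the toy activation values are all assumptions.

```python
def neurons_to_prune(mem_acts, non_mem_acts, threshold):
    """Indices of neurons that are much more active on memorized prompts.

    mem_acts / non_mem_acts: lists of per-prompt activation vectors.
    Assumption: a neuron is selected when its mean |activation| on memorized
    prompts exceeds the non-memorized mean by more than `threshold`.
    """
    def mean_abs(rows):
        return [sum(abs(v) for v in col) / len(rows) for col in zip(*rows)]

    mem, ref = mean_abs(mem_acts), mean_abs(non_mem_acts)
    return [i for i, (m, r) in enumerate(zip(mem, ref)) if m - r > threshold]


def prune(weights, neuron_ids):
    """Zero the weight rows that produce the selected neurons."""
    drop = set(neuron_ids)
    return [[0.0] * len(row) if i in drop else list(row)
            for i, row in enumerate(weights)]


# Toy activations for a layer with three neurons (made-up numbers).
mem_acts = [[0.1, 2.0, 0.2], [0.2, 1.8, 0.1]]
non_mem_acts = [[0.1, 0.3, 0.2], [0.3, 0.1, 0.2]]
selected = neurons_to_prune(mem_acts, non_mem_acts, threshold=1.0)
pruned = prune([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], selected)
```

Because the edit is applied once to the stored weights, it needs no change to training or to the inference loop, which matches the post-hoc framing in the abstract.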

[CV-102] Pseudo-label Based Domain Adaptation for Zero-Shot Text Steganalysis

Link: https://arxiv.org/abs/2406.18565
Authors: Yufei Luo, Zhen Yang, Ru Zhang, Jianyi Liu
Keywords: deep neural networks, deep neural, text steganalysis, neural networks, domain
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 30th International Conference on Computational Experimental Engineering and Sciences (ICCES2024)

Abstract:Currently, most methods for text steganalysis are based on deep neural networks (DNNs). However, in real-life scenarios, obtaining a sufficient amount of labeled stego-text for correctly training networks using a large number of parameters is often challenging and costly. Additionally, due to a phenomenon known as dataset bias or domain shift, recognition models trained on a large dataset exhibit poor generalization performance on novel datasets and tasks. Therefore, to address the issues of missing labeled data and inadequate model generalization in text steganalysis, this paper proposes a cross-domain stego-text analysis method (PDTS) based on pseudo-labeling and domain adaptation (unsupervised learning). Specifically, we propose a model architecture combining pre-trained BERT with a single-layer Bi-LSTM to learn and extract generic features across tasks and generate task-specific representations. Considering the differential contributions of different features to steganalysis, we further design a feature filtering mechanism to achieve selective feature propagation, thereby enhancing classification performance. We train the model using labeled source domain data and adapt it to target domain data distribution using pseudo-labels for unlabeled target domain data through self-training. In the label estimation step, instead of using a static sampling strategy, we propose a progressive sampling strategy to gradually increase the number of selected pseudo-label candidates. Experimental results demonstrate that our method performs well in zero-shot text steganalysis tasks, achieving high detection accuracy even in the absence of labeled data in the target domain, and outperforms current zero-shot text steganalysis methods.
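
The progressive sampling strategy described in this abstract can be sketched as follows. The abstract does not give the schedule, so the linear growth rule, the function name, and the toy predictions below are assumptions made purely for illustration.

```python
def select_pseudo_labels(predictions, round_idx, total_rounds):
    """Keep the most confident predictions; the kept share grows each round.

    predictions: list of (sample_id, label, confidence) tuples.
    Assumption: a linear schedule that keeps (round_idx + 1) / total_rounds
    of the pool, with at least one candidate per round.
    """
    fraction = min(1.0, (round_idx + 1) / total_rounds)
    keep = max(1, int(len(predictions) * fraction))
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    return ranked[:keep]


# Toy target-domain predictions (made-up confidences).
preds = [("a", 1, 0.95), ("b", 0, 0.60), ("c", 1, 0.80), ("d", 0, 0.55)]
first_round = select_pseudo_labels(preds, round_idx=0, total_rounds=4)
last_round = select_pseudo_labels(preds, round_idx=3, total_rounds=4)
```

Early rounds train only on the most confident pseudo-labels, and later rounds admit less certain ones once the model has adapted further, in contrast to a static strategy that uses one fixed cut-off throughout.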

[CV-103] Rotation Averaging: A Primal-Dual Method and Closed-Forms in Cycle Graphs

Link: https://arxiv.org/abs/2406.18564
Authors: Gabriel Moreira, Manuel Marques, João Paulo Costeira
Keywords: measured relative orientations, geometric reconstruction, rotation averaging seeks, cornerstone of geometric, optimally explains
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: arXiv admin note: text overlap with arXiv:2109.08046

Abstract:A cornerstone of geometric reconstruction, rotation averaging seeks the set of absolute rotations that optimally explains a set of measured relative orientations between them. In addition to being an integral part of bundle adjustment and structure-from-motion, the problem of synchronizing rotations also finds applications in visual simultaneous localization and mapping, where it is used as an initialization for iterative solvers, and camera network calibration. Nevertheless, this optimization problem is both non-convex and high-dimensional. In this paper, we address it from a maximum likelihood estimation standpoint and make a twofold contribution. Firstly, we set forth a novel primal-dual method, motivated by the widely accepted spectral initialization. Further, we characterize stationary points of rotation averaging in cycle graph topologies and contextualize this result within spectral graph theory. We benchmark the proposed method in multiple settings and certify our solution via duality theory, achieving a significant gain in precision and performance.
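
To make the problem statement concrete, here is a toy planar version on a single cycle graph. It is neither the paper's primal-dual method nor its closed-form result: rotations are reduced to 2D angles, and the cycle-closure error is spread evenly over the edges, which is the least-squares correction under a squared angular-residual cost when errors are small. All names and numbers are illustrative assumptions.

```python
import math


def wrap(angle):
    """Wrap an angle to the interval (-pi, pi]."""
    a = math.fmod(angle + math.pi, 2 * math.pi)
    if a <= 0:
        a += 2 * math.pi
    return a - math.pi


def average_cycle(relative):
    """Estimate absolute angles on a cycle from relative measurements.

    relative[i] is the measured angle from node i to node (i + 1) % n.
    Node 0 is fixed at 0 to remove the global rotation ambiguity.
    """
    n = len(relative)
    closure = wrap(sum(relative))  # zero for noise-free measurements
    corrected = [r - closure / n for r in relative]
    absolute = [0.0]
    for r in corrected[:-1]:
        absolute.append(absolute[-1] + r)
    return absolute, closure


truth = [0.0, 0.5, 1.2, 2.0]
n = len(truth)
# Each relative measurement carries the same small bias of 0.01 rad.
noisy = [truth[(i + 1) % n] - truth[i] + 0.01 for i in range(n)]
estimate, closure = average_cycle(noisy)
```

In 3D the angles become rotation matrices and the problem is non-convex, which is the setting the abstract addresses.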

[CV-104] Interdisciplinary Expertise to Advance Equitable Explainable AI

Link: https://arxiv.org/abs/2406.18563
Authors: Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles
Keywords: widespread structural oppression, face widespread structural, poor performance persists, rapidly influencing health, artificial intelligence
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The field of artificial intelligence (AI) is rapidly influencing health and healthcare, but bias and poor performance persist for populations who face widespread structural oppression. Previous work has clearly outlined the need for more rigorous attention to data representativeness and model performance to advance equity and reduce bias. However, there is an opportunity to also improve the explainability of AI by leveraging best practices of social epidemiology and health equity to help us develop hypotheses for associations found. In this paper, we focus on explainable AI (XAI) and describe a framework for interdisciplinary expert panel review to discuss and critically assess AI model explanations from multiple perspectives and identify areas of bias and directions for future research. We emphasize the importance of the interdisciplinary expert panel to produce more accurate, equitable interpretations which are historically and contextually informed. Interdisciplinary panel discussions can help reduce bias, identify potential confounders, and identify opportunities for additional research where there are gaps in the literature. In turn, these insights can suggest opportunities for AI model improvement.

[CV-105] Views Can Be Deceiving: Improved SSL Through Feature Space Augmentation

Link: https://arxiv.org/abs/2406.18562
Authors: Kimia Hamidieh, Haoran Zhang, Swami Sankaranarayanan, Marzyeh Ghassemi
Keywords: exhibit inductive biases, inductive biases favoring, biases favoring simpler, Supervised learning methods, favoring simpler features
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Supervised learning methods have been found to exhibit inductive biases favoring simpler features. When such features are spuriously correlated with the label, this can result in suboptimal performance on minority subgroups. Despite the growing popularity of methods which learn from unlabeled data, the extent to which these representations rely on spurious features for prediction is unclear. In this work, we explore the impact of spurious features on Self-Supervised Learning (SSL) for visual representation learning. We first empirically show that commonly used augmentations in SSL can cause undesired invariances in the image space, and illustrate this with a simple example. We further show that classical approaches in combating spurious correlations, such as dataset re-sampling during SSL, do not consistently lead to invariant representations. Motivated by these findings, we propose LateTVG to remove spurious information from these representations during pre-training, by regularizing later layers of the encoder via pruning. We find that our method produces representations which outperform the baselines on several benchmarks, without the need for group or label information during SSL.

[CV-106] SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching

Link: https://arxiv.org/abs/2406.18561
Authors: Yongmin Lee, Hye Won Chung
Keywords: minimal performance loss, full dataset training, approximate full dataset, IPC, images per class
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:Dataset distillation aims to synthesize a small number of images per class (IPC) from a large dataset to approximate full dataset training with minimal performance loss. While effective in very small IPC ranges, many distillation methods become less effective, even underperforming random sample selection, as IPC increases. Our examination of state-of-the-art trajectory-matching based distillation methods across various IPC scales reveals that these methods struggle to incorporate the complex, rare features of harder samples into the synthetic dataset even with the increased IPC, resulting in a persistent coverage gap between easy and hard test samples. Motivated by such observations, we introduce SelMatch, a novel distillation method that effectively scales with IPC. SelMatch uses selection-based initialization and partial updates through trajectory matching to manage the synthetic dataset’s desired difficulty level tailored to IPC scales. When tested on CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods across subset ratios from 5% to 30%.

[CV-107] Revision Matters: Generative Design Guided by Revision Edits

Link: https://arxiv.org/abs/2406.18559
Authors: Tao Li, Chin-Yi Cheng, Amber Xie, Gang Li, Yang Li
Keywords: user interface, interface or graphical, Layout, graphical layout, iterative revision process
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Layout design, such as user interface or graphical layout in general, is fundamentally an iterative revision process. Through revising a design repeatedly, the designer converges on an ideal layout. In this paper, we investigate how revision edits from human designers can benefit a multimodal generative model. To do so, we curate an expert dataset that traces how human designers iteratively edit and improve a layout generation with a prompted language goal. Based on such data, we explore various supervised fine-tuning task setups on top of a Gemini multimodal backbone, a large multimodal model. Our results show that human revision plays a critical role in iterative layout refinement. While being noisy, expert revision edits lead our model to a surprisingly strong design FID score of ~10, which is close to human performance (~6). In contrast, self-revisions that fully rely on the model’s own judgement lead to an echo chamber that prevents iterative improvement, and sometimes lead to generative degradation. Fortunately, we found that providing human guidance at an early stage plays a critical role in final generation. In such a human-in-the-loop scenario, our work paves the way for iterative design revision based on pre-trained large multimodal models.

[CV-108] BAISeg: Boundary Assisted Weakly Supervised Instance Segmentation

Link: https://arxiv.org/abs/2406.18558
Authors: Tengbo Wang, Yu Bai
Keywords: extract instance-level masks, weakly supervised instance, supervised instance segmentation, extract instance-level, instance-level masks
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) via learning inter-pixel relations and perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across different clustering algorithms. In this paper, we propose Boundary-Assisted Instance Segmentation (BAISeg), a novel paradigm for WSIS that realizes instance segmentation with pixel-level annotations. BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch identifies instances by predicting class-agnostic instance boundaries rather than instance centroids, and therefore differs from previous DF-based approaches. In particular, we propose the Cascade Fusion Module (CFM) and the Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During the training phase, we employ Pixel-to-Pixel Contrast to enhance the discriminative capacity of the IABD branch. This further strengthens the continuity and closedness of the instance boundaries. Extensive experiments on PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of our approach, and we achieve considerable performance with only pixel-level annotations. The code will be available at this https URL.

[CV-109] Planted: a dataset for planted forest identification from multi-satellite time series

Link: https://arxiv.org/abs/2406.18554
Authors: Luis Miguel Pazos-Outón, Cristina Nader Vasconcelos, Anton Raichuk, Anurag Arnab, Dan Morris, Maxim Neumann
Keywords: Protecting and restoring, restoring forest ecosystems, carbon sequestration, ecosystems is critical, critical for biodiversity
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Protecting and restoring forest ecosystems is critical for biodiversity conservation and carbon sequestration. Forest monitoring on a global scale is essential for prioritizing and assessing conservation efforts. Satellite-based remote sensing is the only viable solution for providing global coverage, but to date, large-scale forest monitoring is limited to single modalities and single time points. In this paper, we present a dataset consisting of data from five public satellites for recognizing forest plantations and planted tree species across the globe. Each satellite modality consists of a multi-year time series. The dataset, named Planted, includes over 2M examples of 64 tree label classes (46 genera and 40 species), distributed among 41 countries. This dataset is released to foster research in forest monitoring using multimodal, multi-scale, multi-temporal data sources. Additionally, we present initial baseline results and evaluate modality fusion and data augmentation approaches for this dataset.

[CV-110] A PST Algorithm for FPs Suppression in Two-stage CNN Detection Methods

Link: https://arxiv.org/abs/2406.18553
Authors: Qiang Guo
Keywords: two-stage CNN detection, CNN detection methods, Convolutional Neural Network-based, Neural Network-based detection, past decades due
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pedestrian detection has been a hot spot in computer vision over the past decades due to its wide spectrum of promising applications; its major challenge is the False Positives (FPs) that occur during detection. The emergence of various Convolutional Neural Network-based detection strategies has substantially enhanced pedestrian detection accuracy but has not solved this problem well. This paper analyzes in depth the detection framework of two-stage CNN detection methods and finds that false positives in the detection results arise because the training strategy misclassifies some false proposals, which weakens the classification capability of the subsequent subnetwork and makes it hard to suppress false ones. To solve this problem, this paper proposes a pedestrian-sensitive training algorithm that helps two-stage CNN detection methods learn to distinguish pedestrian from non-pedestrian samples and suppress the false positives in the final detection results. The core of the proposed training algorithm is to redesign the training proposal generation pipeline of two-stage CNN detection methods, which avoids a certain number of false proposals that mislead the training process. With the help of the proposed algorithm, the detection accuracy of MetroNext, a smaller and accurate metro passenger detector, is further improved, further decreasing the false ones in its metro passenger detection results. Experimental results on various challenging benchmark datasets demonstrate the feasibility of the proposed algorithm for improving pedestrian detection accuracy by removing false positives. Compared with its competitors, MetroNext-PST demonstrates better overall prediction performance in accuracy, total number of parameters, and inference time, and can thus become a practical pedestrian detection solution tailored for mobile and edge devices.

[CV-111] Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery

Link: https://arxiv.org/abs/2406.18552
Authors: Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang
Keywords: early disease detection, discerning the rationale, crucial for evaluating, medical imaging, medical prognosis task
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In medical imaging, particularly in early disease detection and prognosis tasks, discerning the rationale behind an AI model’s predictions is crucial for evaluating the reliability of its decisions. Conventional explanation methods face challenges in identifying discernible decisive features in medical image classifications, where discriminative features are subtle or not immediately apparent. To bridge this gap, we propose an explainable model that is equipped with both decision reasoning and feature identification capabilities. Our approach not only detects influential image patterns but also uncovers the decisive features that drive the model’s final predictions. By implementing our method, we can efficiently identify and visualise class-specific features leveraged by the data-driven model, providing insights into the decision-making processes of deep learning models. We validated our model on demanding medical prognosis tasks, demonstrating its efficacy and potential in enhancing the reliability of AI in healthcare and in discovering new knowledge in diseases where prognostic understanding is limited.

[CV-112] GFFE: G-buffer Free Frame Extrapolation for Low-latency Real-time Rendering

Link: https://arxiv.org/abs/2406.18551
Authors: Songyin Wu, Deepak Vembar, Anton Sochenov, Selvakumar Panneer, Sungye Kim, Anton Kaplanyan, Ling-Qi Yan
Keywords: embracing ever-demanding effects, ray tracing, embracing ever-demanding, ever-demanding effects, frame
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Real-time rendering has been embracing ever-demanding effects, such as ray tracing. However, rendering such effects in high resolution and high frame rate remains challenging. Frame extrapolation methods, which don’t introduce additional latency as opposed to frame interpolation methods such as DLSS 3 and FSR 3, boost the frame rate by generating future frames based on previous frames. However, it is a more challenging task because of the lack of information in the disocclusion regions, and recent methods also have a high engine integration cost due to requiring G-buffers as input. We propose a G-buffer free frame extrapolation, GFFE, with a novel heuristic framework and an efficient neural network, to plausibly generate new frames in real-time without introducing additional latency. We analyze the motion of dynamic fragments and different types of disocclusions, and design the corresponding modules of the extrapolation block to handle them. After filling disocclusions, a light-weight shading correction network is used to correct shading and improve overall quality. GFFE achieves comparable or better results compared to previous interpolation as well as G-buffer-dependent extrapolation methods, with more efficient performance and easier game integration.

[CV-113] Pre-Trained Vision-Language Models as Partial Annotators

Link: https://arxiv.org/abs/2406.18550
Authors: Qian-Wei Wang,Yuqiu Xie,Letian Zhang,Zimo Liu,Shu-Tao Xia
Keywords: learn massive data, vision-language models learn, models learn massive, Pre-trained vision-language models, machine learning tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Pre-trained vision-language models learn massive data to model unified representations of images and natural languages, which can be widely applied to downstream machine learning tasks. In addition to zero-shot inference, in order to better adapt pre-trained models to the requirements of downstream tasks, people usually use methods such as few-shot or parameter-efficient fine-tuning and knowledge distillation. However, annotating samples is laborious, while a large number of unlabeled samples can be easily obtained. In this paper, we investigate a novel “pre-trained annotating - weakly-supervised learning” paradigm for pre-trained model application and experiment on image classification tasks. Specifically, based on CLIP, we annotate image samples with multiple prompt templates to obtain multiple candidate labels to form the noisy partial label dataset, and design a collaborative consistency regularization algorithm to solve this problem. Our method simultaneously trains two neural networks, which collaboratively purify training labels for each other and obtain pseudo-labels for self-training, while adopting prototypical similarity alignment and noisy supervised contrastive learning to optimize model representation. In experiments, our method achieves performances far beyond zero-shot inference without introducing additional label information, and outperforms other weakly supervised learning and few-shot fine-tuning methods, and obtains smaller deployed models. Our code is available at: https://anonymous.4open.science/r/Co-Reg-8CF9.
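
As a rough illustration of the annotation step described in this abstract, the sketch below builds a candidate label set for one image by taking the top-scoring class under each prompt template and collecting the results. The top-1-per-template rule, the function name `candidate_label_set`, and the similarity scores are assumptions made for illustration; the abstract does not specify how candidates are selected.

```python
from typing import Dict, List, Set


def candidate_label_set(scores_per_template: List[Dict[str, float]]) -> Set[str]:
    """Collect the top-1 label predicted under each prompt template.

    Assumption: one candidate per template, merged into a set. Templates
    that disagree therefore yield a multi-label (partial) annotation.
    """
    candidates = set()
    for scores in scores_per_template:
        # Pick the class with the highest similarity score for this template.
        candidates.add(max(scores, key=scores.get))
    return candidates


# Hypothetical image-text similarity scores for one image under three templates.
scores = [
    {"cat": 0.61, "dog": 0.30, "fox": 0.09},
    {"cat": 0.40, "dog": 0.45, "fox": 0.15},
    {"cat": 0.55, "dog": 0.35, "fox": 0.10},
]
print(sorted(candidate_label_set(scores)))
```

Because the templates disagree on this image, the resulting set holds more than one label, which is what makes the dataset a partial-label one.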

[CV-114] Application of Multimodal Fusion Deep Learning Model in Disease Recognition

Link: https://arxiv.org/abs/2406.18546
Authors: Xiaoyi Liu,Hongjie Qiu,Muqing Li,Zhou Yu,Yutian Yang,Yafeng Yan
Keywords: single-modal recognition techniques, innovative multi-modal fusion, traditional single-modal recognition, deep learning approach, multi-modal fusion deep
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper introduces an innovative multi-modal fusion deep learning approach to overcome the drawbacks of traditional single-modal recognition techniques. These drawbacks include incomplete information and limited diagnostic accuracy. During the feature extraction stage, cutting-edge deep learning models including convolutional neural networks (CNN), recurrent neural networks (RNN), and transformers are applied to distill advanced features from image-based, temporal, and structured data sources. The fusion strategy component seeks to determine the optimal fusion mode tailored to the specific disease recognition task. In the experimental section, a comparison is made between the performance of the proposed multi-mode fusion model and existing single-mode recognition methods. The findings demonstrate significant advantages of the multimodal fusion model across multiple evaluation metrics.

[CV-115] Visual Analysis of Prediction Uncertainty in Neural Networks for Deep Image Synthesis

Link: https://arxiv.org/abs/2406.18545
Authors: Soumya Dutta,Faheem Nizar,Ahmad Amaan,Ayan Acharya
Keywords: artificial intelligence systems, Deep neural networks, solving challenging visualization, challenging visualization problems, Ubiquitous applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Ubiquitous applications of Deep neural networks (DNNs) in different artificial intelligence systems have led to their adoption in solving challenging visualization problems in recent years. While sophisticated DNNs offer an impressive generalization, it is imperative to comprehend the quality, confidence, robustness, and uncertainty associated with their prediction. A thorough understanding of these quantities produces actionable insights that help application scientists make informed decisions. Unfortunately, the intrinsic design principles of the DNNs cannot beget prediction uncertainty, necessitating separate formulations for robust uncertainty-aware models for diverse visualization applications. To that end, this contribution demonstrates how the prediction uncertainty and sensitivity of DNNs can be estimated efficiently using various methods and then interactively compared and contrasted for deep image synthesis tasks. Our inspection suggests that uncertainty-aware deep visualization models generate illustrations of informative and superior quality and diversity. Furthermore, prediction uncertainty improves the robustness and interpretability of deep visualization models, making them practical and convenient for various scientific domains that thrive on visual analyses.

[CV-116] GS-ROR: 3D Gaussian Splatting for Reflective Object Relighting via SDF Priors

Link: https://arxiv.org/abs/2406.18544
Authors: Zuo-Liang Zhu,Beibei Wang,Jian Yang
Keywords: view synthesis due, detailed expressive ability, Gaussian Splatting, efficient rendering speed, highly efficient rendering
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:3D Gaussian Splatting (3DGS) has shown a powerful capability for novel view synthesis due to its detailed expressive ability and highly efficient rendering speed. Unfortunately, creating relightable 3D assets with 3DGS is still problematic, particularly for reflective objects, as its discontinuous representation raises difficulties in constraining geometries. Inspired by previous works, the signed distance field (SDF) can serve as an effective way for geometry regularization. However, a direct incorporation between Gaussians and SDF significantly slows training. To this end, we propose GS-ROR for reflective objects relighting with 3DGS aided by SDF priors. At the core of our method is the mutual supervision of the depth and normal between deferred Gaussians and SDF, which avoids the expensive volume rendering of SDF. Thanks to this mutual supervision, the learned deferred Gaussians are well-constrained with a minimal time cost. As the Gaussians are rendered in a deferred shading mode, while the alpha-blended Gaussians are smooth, individual Gaussians may still be outliers, yielding floater artifacts. Therefore, we further introduce an SDF-aware pruning strategy to remove Gaussian outliers, which are located distant from the surface defined by SDF, avoiding the floater issue. Consequently, our method outperforms the existing Gaussian-based inverse rendering methods in terms of relighting quality. Our method also exhibits competitive relighting quality compared to NeRF-based methods with at most 25% of training time and allows rendering at 200+ frames per second on an RTX4090.
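
The SDF-aware pruning idea in this abstract can be sketched in a few lines: evaluate the signed distance at each Gaussian center and drop centers that lie too far from the zero level set. This is a minimal sketch, not the authors' implementation; the analytic sphere SDF, the `max_distance` threshold, and the function names are hypothetical stand-ins for a learned SDF and the paper's actual criterion.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Signed distance from `point` to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


def prune_far_gaussians(centers, sdf, max_distance):
    """Keep only centers whose absolute SDF value is within max_distance."""
    return [c for c in centers if abs(sdf(c)) <= max_distance]


# Hypothetical Gaussian centers: two near the surface, two far from it.
centers = [(1.0, 0.0, 0.0), (0.0, 1.05, 0.0), (0.0, 0.0, 2.0), (0.1, 0.0, 0.0)]
kept = prune_far_gaussians(centers, sphere_sdf, max_distance=0.1)
print(kept)
```

Centers well outside the surface and centers deep inside it are both removed, since the test uses the absolute distance.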

[CV-117] A Set-based Approach for Feature Extraction of 3D CAD Models

Link: https://arxiv.org/abs/2406.18543
Authors: Peng Xu,Qi Gao,Ying-Jie Wu
Keywords: product life cycles, Feature extraction, Feature, life cycles, geometric information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages

Abstract:Feature extraction is a critical technology to realize the automatic transmission of feature information throughout product life cycles. As CAD models primarily capture the 3D geometry of products, feature extraction heavily relies on geometric information. However, existing feature extraction methods often yield inaccurate outcomes due to the diverse interpretations of geometric information. This report presents a set-based feature extraction approach to address this uncertainty issue. Unlike existing methods that seek accurate feature results, our approach aims to transform the uncertainty of geometric information into a set of feature subgraphs. First, we define the convexity of basic geometric entities and introduce the concept of two-level attributed adjacency graphs. Second, a feature extraction workflow is designed to determine feature boundaries and identify feature subgraphs from CAD models. This set of feature subgraphs can be used for further feature recognition. A feature extraction system is programmed using C++ and UG/Open to demonstrate the feasibility of our proposed approach.

[CV-118] Generative AI Empowered LiDAR Point Cloud Generation with Multimodal Transformer

Link: https://arxiv.org/abs/2406.18542
Authors: Mohammad Farzanullah,Han Zhang,Akram Bin Sediq,Ali Afana,Melike Erol-Kantarci
Keywords: Integrated sensing, wireless communication systems, key enabler, wireless communication, communication systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 6 pages, 4 figures, conference

Abstract:Integrated sensing and communications is a key enabler for the 6G wireless communication systems. The multiple sensing modalities will allow the base station to have a more accurate representation of the environment, leading to context-aware communications. Some widely equipped sensors such as cameras and RADAR sensors can provide some environmental perceptions. However, they are not enough to generate precise environmental representations, especially in adverse weather conditions. On the other hand, the LiDAR sensors provide more accurate representations, however, their widespread adoption is hindered by their high cost. This paper proposes a novel approach to enhance the wireless communication systems by synthesizing LiDAR point clouds from images and RADAR data. Specifically, it uses a multimodal transformer architecture and pre-trained encoding models to enable an accurate LiDAR generation. The proposed framework is evaluated on the DeepSense 6G dataset, which is a real-world dataset curated for context-aware wireless applications. Our results demonstrate the efficacy of the proposed approach in accurately generating LiDAR point clouds. We achieve a modified mean squared error of 10.3931. Visual examination of the images indicates that our model can successfully capture the majority of structures present in the LiDAR point cloud for diverse environments. This will enable the base stations to achieve more precise environmental sensing. By integrating LiDAR synthesis with existing sensing modalities, our method can enhance the performance of various wireless applications, including beam and blockage prediction.

[CV-119] Refining 3D Point Cloud Normal Estimation via Sample Selection

Link: https://arxiv.org/abs/2406.18541
Authors: Jun Zhou,Yaoshun Li,Hongchen Tan,Mingjie Wang,Nannan Li,Xiuping Liu
Keywords: point cloud normal, cloud normal estimation, garnered extensive attention, current Neural Network-based, geometric processing
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:In recent years, point cloud normal estimation, as a classical and foundational algorithm, has garnered extensive attention in the field of 3D geometric processing. Despite the remarkable performance achieved by current Neural Network-based methods, their robustness is still influenced by the quality of training data and the models’ performance. In this study, we designed a fundamental framework for normal estimation, enhancing existing model through the incorporation of global information and various constraint mechanisms. Additionally, we employed a confidence-based strategy to select the reasonable samples for fair and robust network training. The introduced sample confidence can be integrated into the loss function to balance the influence of different samples on model training. Finally, we utilized existing orientation methods to correct estimated non-oriented normals, achieving state-of-the-art performance in both oriented and non-oriented tasks. Extensive experimental results demonstrate that our method works well on the widely used benchmarks.
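
One simple way to read "sample confidence integrated into the loss function" is a confidence-weighted mean of per-sample losses. The sketch below assumes exactly that; the weighting scheme and the function name `confidence_weighted_loss` are illustrative guesses, as the abstract does not give the actual formula.

```python
def confidence_weighted_loss(per_sample_losses, confidences):
    """Weighted mean of per-sample losses, using confidences as weights.

    Assumption: a sample with confidence 0 contributes nothing, so an
    unreliable sample cannot dominate the training signal.
    """
    total_weight = sum(confidences)
    if total_weight == 0:
        return 0.0
    weighted = sum(loss * conf for loss, conf in zip(per_sample_losses, confidences))
    return weighted / total_weight


# Hypothetical values: the third sample is an outlier with zero confidence.
losses = [0.2, 0.4, 3.0]
confidences = [1.0, 1.0, 0.0]
print(confidence_weighted_loss(losses, confidences))
```

With these weights the outlier loss of 3.0 is ignored, and the result is the mean of the two trusted samples.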

[CV-120] Fully Exploiting Every Real Sample: SuperPixel Sample Gradient Model Stealing

Link: https://arxiv.org/abs/2406.18540
Authors: Yunlong Zhao,Xiaoheng Deng,Yijing Liu,Xinjun Pei,Jiazhi Xia,Wei Chen
Keywords: machine learning model, steal its capabilities, Model, observing the output, machine learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: Accepted to CVPR 2024

Abstract:Model stealing (MS) involves querying and observing the output of a machine learning model to steal its capabilities. The quality of queried data is crucial, yet obtaining a large amount of real data for MS is often challenging. Recent works have reduced reliance on real data by using generative models. However, when high-dimensional query data is required, these methods are impractical due to the high costs of querying and the risk of model collapse. In this work, we propose using sample gradients (SG) to enhance the utility of each real sample, as SG provides crucial guidance on the decision boundaries of the victim model. However, utilizing SG in the model stealing scenario faces two challenges: 1. Pixel-level gradient estimation requires extensive query volume and is susceptible to defenses. 2. The estimation of sample gradients has a significant variance. This paper proposes Superpixel Sample Gradient stealing (SPSG) for model stealing under the constraint of limited real samples. With the basic idea of imitating the victim model’s low-variance patch-level gradients instead of pixel-level gradients, SPSG achieves efficient sample gradient estimation through two steps. First, we perform patch-wise perturbations on query images to estimate the average gradient in different regions of the image. Then, we filter the gradients through a threshold strategy to reduce variance. Exhaustive experiments demonstrate that, with the same number of real samples, SPSG achieves accuracy, agreements, and adversarial success rate significantly surpassing the current state-of-the-art MS methods. Codes are available at this https URL.
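
The two steps named in this abstract, patch-wise perturbation followed by threshold filtering, can be illustrated with a plain finite-difference estimator. This is a sketch under stated assumptions and not the SPSG algorithm itself: square grid patches stand in for superpixels, `toy_victim` is a hypothetical linear model, and the rule "zero out gradients below `tau`" is one guess at what the threshold strategy does.

```python
def patch_gradients(query, image, patch, eps=1e-3):
    """Estimate the mean gradient of a black-box scalar `query` over each patch."""
    height, width = len(image), len(image[0])
    base = query(image)
    grads = {}
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            perturbed = [row[:] for row in image]
            count = 0
            for i in range(top, min(top + patch, height)):
                for j in range(left, min(left + patch, width)):
                    perturbed[i][j] += eps  # shift the whole patch at once
                    count += 1
            # One extra query per patch instead of one per pixel.
            grads[(top, left)] = (query(perturbed) - base) / (eps * count)
    return grads


def threshold_filter(grads, tau):
    """Zero out patch gradients whose magnitude is below tau (assumed rule)."""
    return {key: (g if abs(g) >= tau else 0.0) for key, g in grads.items()}


def toy_victim(image):
    """Hypothetical stand-in for a victim model's scalar output."""
    return sum(
        (2.0 if (i < 2 and j < 2) else 0.01) * value
        for i, row in enumerate(image)
        for j, value in enumerate(row)
    )


image = [[0.0] * 4 for _ in range(4)]
filtered = threshold_filter(patch_gradients(toy_victim, image, patch=2), tau=0.1)
print(filtered)
```

For a 4x4 image with 2x2 patches this needs five queries rather than seventeen, which is the query saving the patch-level view is meant to provide.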

[CV-121] TexPainter: Generative Mesh Texturing with Multi-view Consistency

Link: https://arxiv.org/abs/2406.18539
Authors: Hongkun Zhang,Zherong Pan,Congyi Zhang,Lifeng Zhu,Xifeng Gao
Keywords: diffusion models unlocks, recent success, unlocks the possibility, automatic generation, Diffusion Implicit Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by SIGGRAPH 2024

Abstract:The recent success of pre-trained diffusion models unlocks the possibility of the automatic generation of textures for arbitrary 3D meshes in the wild. However, these models are trained in the screen space, while converting them to a multi-view consistent texture image poses a major obstacle to the output quality. In this paper, we propose a novel method to enforce multi-view consistency. Our method is based on the observation that latent space in a pre-trained diffusion model is noised separately for each camera view, making it difficult to achieve multi-view consistency by directly manipulating the latent codes. Based on the celebrated Denoising Diffusion Implicit Models (DDIM) scheme, we propose to use an optimization-based color-fusion to enforce consistency and indirectly modify the latent codes by gradient back-propagation. Our method further relaxes the sequential dependency assumption among the camera views. By evaluating on a series of general 3D models, we find our simple approach improves consistency and overall quality of the generated textures as compared to competing state-of-the-arts. Our implementation is available at: this https URL

[CV-122] VideoQA-SC: Adaptive Semantic Communication for Video Question Answering

Link: https://arxiv.org/abs/2406.18538
Authors: Jiangyuan Guo,Wei Chen,Yuxuan Sun,Jialong Xu,Bo Ai
Keywords: efficiently transmitting multi-modal, transmitting multi-modal data, speeches and images, efficiently transmitting, transmitting multi-modal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:Although semantic communication (SC) has shown its potential in efficiently transmitting multi-modal data such as text, speeches and images, SC for videos has focused primarily on pixel-level reconstruction. However, these SC systems may be suboptimal for downstream intelligent tasks. Moreover, SC systems without pixel-level video reconstruction present advantages by achieving higher bandwidth efficiency and real-time performance of various intelligent tasks. The difficulty in such system design lies in the extraction of task-related compact semantic representations and their accurate delivery over noisy channels. In this paper, we propose an end-to-end SC system for video question answering (VideoQA) tasks called VideoQA-SC. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels, bypassing the need for video reconstruction at the receiver. To this end, we develop a spatiotemporal semantic encoder for effective video semantic extraction, and a learning-based bandwidth-adaptive deep joint source-channel coding (DJSCC) scheme for efficient and robust video semantic transmission. Experiments demonstrate that VideoQA-SC outperforms traditional and advanced DJSCC-based SC systems that rely on video reconstruction at the receiver under a wide range of channel conditions and bandwidth constraints. In particular, when the signal-to-noise ratio is low, VideoQA-SC can improve the answer accuracy by 5.17% while saving almost 99.5% of the bandwidth at the same time, compared with the advanced DJSCC-based SC system. Our results show the great potential of task-oriented SC system design for video applications.

[CV-123] AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale

Link: https://arxiv.org/abs/2406.18537
Authors: Keenon Werling,Janelle Kaneda,Alan Tan,Rishi Agarwal,Six Skov,Tom Van Wouwe,Scott Uhlrich,Nicholas Bianco,Carmichael Ong,Antoine Falisse,Shardul Sapkota,Aidan Chandra,Joshua Carter,Ezio Preatoni,Benjamin Fregly,Jennifer Hicks,Scott Delp,C. Karen Liu
Keywords: muscle-generated joint torques, remains a challenge, recent years, including the muscle-generated, reconstructing human poses
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
Comments: 15 pages, 6 figures, 4 tables

Abstract:While reconstructing human poses in 3D from inexpensive sensors has advanced significantly in recent years, quantifying the dynamics of human motion, including the muscle-generated joint torques and external forces, remains a challenge. Prior attempts to estimate physics from reconstructed human poses have been hampered by a lack of datasets with high-quality pose and force data for a variety of movements. We present the AddBiomechanics Dataset 1.0, which includes physically accurate human dynamics of 273 human subjects, over 70 hours of motion and force plate data, totaling more than 24 million frames. To construct this dataset, novel analytical methods were required, which are also reported here. We propose a benchmark for estimating human dynamics from motion using this dataset, and present several baseline results. The AddBiomechanics Dataset is publicly available at this https URL.

[CV-124] LiverUSRecon: Automatic 3D Reconstruction and Volumetry of the Liver with a Few Partial Ultrasound Scans

Link: https://arxiv.org/abs/2406.19336
Authors: Kaushalya Sivayogaraj,Sahan T. Guruge,Udari Liyanage,Jeevani Udupihille,Saroj Jayasinghe,Gerard Fernando,Ranga Rodrigo,M. Rukshani Liyanaarachchi
Keywords: scans, liver, disease diagnosis, important for qualitative, qualitative analysis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, Accepted to MICCAI 2024

Abstract:3D reconstruction of the liver for volumetry is important for qualitative analysis and disease diagnosis. Liver volumetry using ultrasound (US) scans, although advantageous due to less acquisition time and safety, is challenging due to the inherent noisiness in US scans, blurry boundaries, and partial liver visibility. We address these challenges by using the segmentation masks of a few incomplete sagittal-plane US scans of the liver in conjunction with a statistical shape model (SSM) built using a set of CT scans of the liver. We compute the shape parameters needed to warp this canonical SSM to fit the US scans through a parametric regression network. The resulting 3D liver reconstruction is accurate and leads to automatic liver volume calculation. We evaluate the accuracy of the estimated liver volumes with respect to CT segmentation volumes using RMSE. Our volume computation is statistically much closer to the volume estimated using CT scans than the volume computed using Childs’ method by radiologists: p-value of 0.094 (> 0.05) says that there is no significant difference between CT segmentation volumes and ours in contrast to Childs’ method. We validate our method using investigations (ablation studies) on the US image resolution, the number of CT scans used for SSM, the number of principal components, and the number of input US scans. To the best of our knowledge, this is the first automatic liver volumetry system using a few incomplete US scans given a set of CT scans of livers for SSM.
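
The RMSE comparison mentioned in this abstract is straightforward to reproduce for any pair of volume lists. In the sketch below the volumes are invented for illustration and are not taken from the paper; only the RMSE formula itself is standard.

```python
import math


def rmse(estimated, reference):
    """Root-mean-square error between two equally long lists of values."""
    if len(estimated) != len(reference):
        raise ValueError("inputs must have the same length")
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical liver volumes in cubic centimeters.
us_volumes = [1500.0, 1320.0, 1710.0]  # estimated from ultrasound
ct_volumes = [1480.0, 1350.0, 1700.0]  # reference from CT segmentation
print(rmse(us_volumes, ct_volumes))
```

A lower value means the ultrasound-based estimates track the CT reference more closely, in the same units as the volumes.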

[CV-125] ALMA: a mathematics-driven approach for determining tuning parameters in generalized LASSO problems with applications to MRI

Link: https://arxiv.org/abs/2406.19239
Authors: Gianluca Giacchi,Isidoros Iakovidis,Bastien Milani,Matthias Stuber,Micah Murray,Benedetta Franceschiello
Keywords: Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, internal structures, Total Variation-regularized LASSO
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)

Abstract:Magnetic Resonance Imaging (MRI) is a powerful technique employed for non-invasive in vivo visualization of internal structures. Sparsity is often deployed to accelerate the signal acquisition or overcome the presence of motion artifacts, improving the quality of image reconstruction. Image reconstruction algorithms use TV-regularized LASSO (Total Variation-regularized LASSO) to retrieve the missing information of undersampled signals, by cleaning the data of noise and while optimizing sparsity. A tuning parameter moderates the balance between these two aspects; its choice affecting the quality of the reconstructions. Currently, there is a lack of general deterministic techniques to choose these parameters, which are oftentimes manually selected and thus hinder the reliability of the reconstructions. Here, we present ALMA (Algorithm for Lagrange Multipliers Approximation), an iterative mathematics-inspired technique that computes tuning parameters for generalized LASSO problems during MRI reconstruction. We analyze quantitatively the performance of these parameters for imaging reconstructions via TV-LASSO in an MRI context on phantoms. Although our study concentrates on TV-LASSO, the techniques developed here hold significant promise for a wide array of applications. ALMA is not only adaptable to more generalized LASSO problems but is also robust to accommodate other forms of regularization beyond total variation. Moreover, it extends effectively to handle non-Cartesian sampling trajectories, broadening its utility in complex data reconstruction scenarios. More generally, ALMA provides a powerful tool for numerically solving constrained optimization problems across various disciplines, offering a versatile and impactful solution for advanced computational challenges.
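
For readers unfamiliar with the problem class, a TV-regularized LASSO reconstruction is commonly written in the form below. The notation is an assumption and is not taken from the paper: $A$ denotes the undersampled measurement operator, $b$ the acquired data, $\nabla$ a discrete gradient, and $\lambda > 0$ the tuning parameter that trades data fidelity against sparsity of the image gradient.

```latex
\hat{x} \;=\; \arg\min_{x} \; \frac{1}{2}\,\lVert A x - b \rVert_2^2 \;+\; \lambda\,\lVert \nabla x \rVert_1
```

A small $\lambda$ favors fitting the noisy measurements, while a large $\lambda$ favors piecewise-constant images, which is why its choice affects reconstruction quality.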

[CV-126] Unsupervised Latent Stain Adaption for Digital Pathology

Link: https://arxiv.org/abs/2406.19081
Authors: Daniel Reisenbüchler,Lucas Luttner,Nadine S. Schaadt,Friedrich Feuerhake,Dorit Merhof
Keywords: domain shifts due, suffer from domain, domain shifts, shifts due, Stain
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in MICCAI2024

Abstract:In digital pathology, deep learning (DL) models for tasks such as segmentation or tissue classification are known to suffer from domain shifts due to different staining techniques. Stain adaptation aims to reduce the generalization error between different stains by training a model on source stains that generalizes to target stains. Despite the abundance of target stain data, a key challenge is the lack of annotations. To address this, we propose a joint training between artificially labeled and unlabeled data including all available stained images called Unsupervised Latent Stain Adaption (ULSA). Our method uses stain translation to enrich labeled source images with synthetic target images in order to increase supervised signals. Moreover, we leverage unlabeled target stain images using stain-invariant feature consistency learning. With ULSA we present a semi-supervised strategy for efficient stain adaption without access to annotated target stain data. Remarkably, ULSA is task agnostic in patch-level analysis for whole slide images (WSIs). Through extensive evaluation on external datasets, we demonstrate that ULSA achieves state-of-the-art (SOTA) performance in kidney tissue segmentation and breast cancer classification across a spectrum of staining variations. Our findings suggest that ULSA is an important framework towards stain adaption in digital pathology.

[CV-127] CMRxRecon2024: A Multi-Modality Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI

Link: https://arxiv.org/abs/2406.19043
Authors: Zi Wang,Fanwen Wang,Chen Qin,Jun Lyu,Ouyang Cheng,Shuo Wang,Yan Li,Mengyao Yu,Haoyu Zhang,Kunyuan Guo,Zhang Shi,Qirong Li,Ziqiang Xu,Yajing Zhang,Hao Li,Sha Hua,Binghua Chen,Longyu Sun,Mengting Sun,Qin Li,Ying-Hua Chu,Wenjia Bai,Jing Qin,Xiahai Zhuang,Claudia Prieto,Alistair Young,Michael Markl,He Wang,Lianming Wu,Guang Yang,Xiaobo Qu,Chengyan Wang
Keywords: magnetic resonance imaging, cardiac MRI, diagnosing cardiac diseases, clinically gold-standard technique, Cardiac magnetic resonance
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments: 19 pages, 3 figures, 2 tables

Abstract:Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation.

[CV-128] MMR-Mamba: Multi-Contrast MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion

Link: https://arxiv.org/abs/2406.18950
Authors: Jing Zou,Lanqing Liu,Qi Chen,Shujun Wang,Xiaohan Xing,Jing Qin
Keywords: under-sampled k-space data, fully-sampled auxiliary modality, Multi-contrast MRI acceleration, auxiliary modality, target modality information
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:Multi-contrast MRI acceleration has become prevalent in MR imaging, enabling the reconstruction of high-quality MR images from under-sampled k-space data of the target modality, using guidance from a fully-sampled auxiliary modality. The main crux lies in efficiently and comprehensively integrating complementary information from the auxiliary modality. Existing methods either suffer from quadratic computational complexity or fail to capture long-range correlated features comprehensively. In this work, we propose MMR-Mamba, a novel framework that achieves comprehensive integration of multi-contrast features through Mamba and spatial-frequency information fusion. Firstly, we design the Target modality-guided Cross Mamba (TCM) module in the spatial domain, which maximally restores the target modality information by selectively absorbing useful information from the auxiliary modality. Secondly, leveraging global properties of the Fourier domain, we introduce the Selective Frequency Fusion (SFF) module to efficiently integrate global information in the frequency domain and recover high-frequency signals for the reconstruction of structure details. Additionally, we present the Adaptive Spatial-Frequency Fusion (ASFF) module, which enhances fused features by supplementing less informative features from one domain with corresponding features from the other domain. These innovative strategies ensure efficient feature fusion across spatial and frequency domains, avoiding the introduction of redundant information and facilitating the reconstruction of high-quality target images. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed MMR-Mamba over state-of-the-art MRI reconstruction methods.

[CV-129] Classification of Carotid Plaque with Jellyfish Sign Through Convolutional and Recurrent Neural Networks Utilizing Plaque Surface Edges

Link: https://arxiv.org/abs/2406.18919
Authors: Takeshi Yoshidomi,Shinji Kume,Hiroaki Aizawa,Akira Furui
Keywords: localized elevated lesions, Jellyfish sign, elevated lesions, develop as localized, localized elevated
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures, accepted at IEEE EMBC 2024

Abstract:In carotid arteries, plaque can develop as localized elevated lesions. The Jellyfish sign, marked by fluctuating plaque surfaces with blood flow pulsation, is a dynamic characteristic of these plaques that has recently attracted attention. Detecting this sign is vital, as it is often associated with cerebral infarction. This paper proposes an ultrasound video-based classification method for the Jellyfish sign, using deep neural networks. The proposed method first preprocesses carotid ultrasound videos to separate the movement of the vascular wall from plaque movements. These preprocessed videos are then combined with plaque surface information and fed into a deep learning model comprising convolutional and recurrent neural networks, enabling the efficient classification of the Jellyfish sign. The proposed method was verified using ultrasound video images from 200 patients. Ablation studies demonstrated the effectiveness of each component of the proposed method.

[CV-130] Renal digital pathology visual knowledge search platform based on language large model and book knowledge

Link: https://arxiv.org/abs/2406.18556
Authors: Xiaomin Lv,Chong Lai,Liya Ding,Maode Lai,Qingrong Sun
Keywords: Large models, require exploration, applications in digital, renal pathology, renal pathology images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures

Abstract:Large models have become mainstream, yet their applications in digital pathology still require exploration. Meanwhile, renal pathology images play an important role in the diagnosis of renal diseases. We performed image segmentation and paired the images with corresponding text descriptions drawn from 60 renal pathology books, ran clustering analysis on all image and text description features using large models, and ultimately built a retrieval system based on the semantic features of large models. Based on the above analysis, we established a knowledge base of 10,317 renal pathology images with paired text descriptions, and then evaluated the semantic feature capabilities of 4 large models (GPT2, gemma, LLaMA, and Qwen) and the image-based feature capabilities of the dinov2 large model. Furthermore, we built a semantic retrieval system, named RppD (this http URL), to retrieve pathological images based on text descriptions.

[CV-131] Using a Convolutional Neural Network and Explainable AI to Diagnose Dementia Based on MRI Scans

Link: https://arxiv.org/abs/2406.18555
Authors: Tyler Morris,Ziming Liu,Longjian Liu,Xiaopeng Zhao
Keywords: diagnostic procedures rises, accurate diagnostic procedures, dementia patients rises, patients rises, procedures rises
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 4 figures

Abstract:As the number of dementia patients rises, the need for accurate diagnostic procedures rises as well. Current methods, like using an MRI scan, rely on human input, which can be inaccurate. However, the decision logic behind machine learning algorithms and their outputs cannot be explained, as most operate in black-box models. Therefore, to increase the accuracy of diagnosing dementia through MRIs, a convolution neural network has been developed and trained using an open-source database of 6400 MRI scans divided into 4 dementia classes. The model, which attained a 98 percent validation accuracy, was shown to be well fit and able to generalize to new data. Furthermore, to aid in the visualization of the model output, an explainable AI algorithm was developed by visualizing the outputs of individual filters in each convolution layer, which highlighted regions of interest in the scan. These outputs do a great job of identifying the image features that contribute most to the model classification, thus allowing users to visualize and understand the results. Altogether, this combination of the convolution neural network and explainable AI algorithm creates a system that can be used in the medical field to not only aid in the proper classification of dementia but also allow everyone involved to visualize and understand the results.

[CV-132] Advancements in Feature Extraction Recognition of Medical Imaging Systems Through Deep Learning Technique

Link: https://arxiv.org/abs/2406.18549
Authors: Qishi Zhan,Dan Sun,Erdi Gao,Yuhan Ma,Yaxin Liang,Haowei Yang
Keywords: employs spatial stratification, unsupervised medical image, spatial stratification techniques, study introduces, unsupervised medical
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: conference

Abstract:This study introduces a novel unsupervised medical image feature extraction method that employs spatial stratification techniques. An objective function based on weight is proposed to achieve the purpose of fast image recognition. The algorithm divides the pixels of the image into multiple subdomains and uses a quadtree to access the image. A technique for threshold optimization utilizing a simplex algorithm is presented. Aiming at the nonlinear characteristics of hyperspectral images, a generalized discriminant analysis algorithm based on kernel function is proposed. In this project, a hyperspectral remote sensing image is taken as the object, and we investigate its mathematical modeling, solution methods, and feature extraction techniques. It is found that different types of objects are independent of each other and compact in image processing. Compared with the traditional linear discrimination method, the result of image segmentation is better. This method can not only overcome the disadvantage of the traditional method which is easy to be affected by light, but also extract the features of the object quickly and accurately. It has important reference significance for clinical diagnosis.

[CV-133] Exploration of Multi-Scale Image Fusion Systems in Intelligent Medical Image Analysis

Link: https://arxiv.org/abs/2406.18548
Authors: Yuxiang Hu,Haowei Yang,Ting Xu,Shuyao He,Jiajie Yuan,Haozhang Deng
Keywords: medical imaging techniques, cancer relies heavily, brain cancer relies, relies heavily, heavily on medical
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The diagnosis of brain cancer relies heavily on medical imaging techniques, with MRI being the most commonly used. It is necessary to perform automatic segmentation of brain tumors on MRI images. This project intends to build an MRI algorithm based on U-Net. The residual network and the module used to enhance the context information are combined, and the void space convolution pooling pyramid is added to the network for processing. The brain glioma MRI image dataset provided by cancer imaging archives was experimentally verified. A multi-scale segmentation method based on a weighted least squares filter was used to complete the 3D reconstruction of brain tumors. Thus, the accuracy of three-dimensional reconstruction is further improved. Experiments show that the local texture features obtained by the proposed algorithm are similar to those obtained by laser scanning. The algorithm is improved by using the U-Net method and an accuracy of 0.9851 is obtained. This approach significantly enhances the precision of image segmentation and boosts the efficiency of image classification.

[CV-134] Enhancing Medical Imaging with GANs Synthesizing Realistic Images from Limited Data

Link: https://arxiv.org/abs/2406.18547
Authors: Yinqiu Feng,Bo Zhang,Lingxi Xiao,Yutian Yang,Tana Gegen,Zexi Chen
Keywords: synthesizing medical images, introduce an innovative, generative adversarial networks, medical image data, medical images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this research, we introduce an innovative method for synthesizing medical images using generative adversarial networks (GANs). Our proposed GANs method demonstrates the capability to produce realistic synthetic images even when trained on a limited quantity of real medical image data, showcasing commendable generalization prowess. To achieve this, we devised a generator and discriminator network architecture founded on deep convolutional neural networks (CNNs), leveraging the adversarial training paradigm for model optimization. Through extensive experimentation across diverse medical image datasets, our method exhibits robust performance, consistently generating synthetic images that closely emulate the structural and textural attributes of authentic medical images.

Machine Learning

[LG-0] The Remarkable Robustness of LLMs: Stages of Inference?

Link: https://arxiv.org/abs/2406.19384
Authors: Vedang Lad,Wes Gurnee,Max Tegmark
Keywords: Large Language Models, Large Language, swapping adjacent layers, Language Models, deleting and swapping
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95% of the original model’s prediction accuracy without fine-tuning, whereas models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesize the existence of four universal stages of inference across eight different models: detokenization, feature engineering, prediction ensembling, and residual sharpening. The first stage integrates local information, lifting raw token representations into higher-level contextual representations. Next is the iterative refinement of task and entity-specific features. Then, the second half of the model begins with a phase transition, where hidden representations align more with the vocabulary space due to specialized model components. Finally, the last layer sharpens the following token distribution by eliminating obsolete features that add noise to the prediction.
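
The two interventions named in the abstract, deleting a layer and swapping two adjacent layers, can be illustrated on a toy residual stack. This is only a sketch under loud assumptions: `make_layer`, `run`, `delete_layer`, and `swap_adjacent` are hypothetical helpers, and a scalar stands in for the hidden state of a real transformer. It is not the authors' code.

```python
import math
from typing import Callable, List

Layer = Callable[[float], float]

def make_layer(scale: float) -> Layer:
    """Hypothetical residual layer: input plus a small nonlinear update."""
    return lambda x: x + scale * math.tanh(x)

def run(layers: List[Layer], x: float) -> float:
    """Apply the layers in order to a scalar 'hidden state'."""
    for layer in layers:
        x = layer(x)
    return x

def delete_layer(layers: List[Layer], i: int) -> List[Layer]:
    """Return a copy of the stack with layer i removed."""
    return layers[:i] + layers[i + 1:]

def swap_adjacent(layers: List[Layer], i: int) -> List[Layer]:
    """Return a copy of the stack with layers i and i+1 exchanged."""
    out = list(layers)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

# Toy 8-layer stack; the scales are arbitrary illustrative values.
layers = [make_layer(0.01 * (k + 1)) for k in range(8)]
baseline = run(layers, 1.0)
deleted = run(delete_layer(layers, 3), 1.0)
swapped = run(swap_adjacent(layers, 3), 1.0)
print(baseline, deleted, swapped)
```

Because each toy layer only adds a small residual update, both interventions perturb the output mildly, which is the intuition behind studying robustness this way. The paper's accuracy figures come from real language models, not from anything like this toy.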

[LG-1] TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Link: https://arxiv.org/abs/2406.19380
Authors: Ivan Rubachev,Nikolay Kartashev,Yury Gorishniy,Artem Babenko
Keywords: closely reflect downstream, reflect downstream application, tabular machine learning, downstream application scenarios, machine learning
Categories: Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Benchmarks that closely reflect downstream application scenarios are essential for the streamlined adoption of new research in tabular machine learning (ML). In this work, we examine existing tabular benchmarks and find two common characteristics of industry-grade tabular data that are underrepresented in the datasets available to the academic community. First, tabular data often changes over time in real-world deployment scenarios. This impacts model performance and requires time-based train and test splits for correct model evaluation. Yet, existing academic tabular datasets often lack timestamp metadata to enable such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. For each specific dataset, this can have a different impact on the absolute and relative number of predictive, uninformative, and correlated features, which in turn can affect model selection. To fill the aforementioned gaps in academic benchmarks, we introduce TabReD – a collection of eight industry-grade tabular datasets covering a wide range of domains from finance to food delivery services. We assess a large number of tabular ML models in the feature-rich, temporally-evolving data setting facilitated by TabReD. We demonstrate that evaluation on time-based data splits leads to different methods ranking, compared to evaluation on random splits more common in academic benchmarks. Furthermore, on the TabReD datasets, MLP-like architectures and GBDT show the best results, while more sophisticated DL models are yet to prove their effectiveness.
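
The difference between a time-based split and a random split, which the abstract says changes method rankings, can be made concrete with a short sketch. The `(timestamp, feature)` row schema, the 80/20 fraction, and both function names are assumptions for illustration. They are not taken from the TabReD code.

```python
import random
from typing import List, Tuple

Row = Tuple[int, float]  # (timestamp, feature): hypothetical schema

def time_based_split(rows: List[Row], train_fraction: float = 0.8):
    """Train on the earliest rows, test on the latest ones."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def random_split(rows: List[Row], train_fraction: float = 0.8, seed: int = 0):
    """Shuffle first, so train and test both span the whole time range."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Synthetic rows with increasing timestamps, for illustration only.
rows = [(t, float(t % 7)) for t in range(100)]
train_t, test_t = time_based_split(rows)
train_r, test_r = random_split(rows)

# In the time-based split, every test row is newer than every train row.
print(max(r[0] for r in train_t) < min(r[0] for r in test_t))
```

With the time-based split the model is always evaluated on data from after its training period, which mirrors deployment. A random split lets later rows leak into training.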

[LG-2] Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space

Link: https://arxiv.org/abs/2406.19370
Authors: Core Francisco Park,Maya Okawa,Andrew Lee,Ekdeep Singh Lubana,Hidenori Tanaka
Keywords: Modern generative models, models demonstrate impressive, demonstrate impressive capabilities, manipulate abstract concepts, Modern generative
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Modern generative models demonstrate impressive capabilities, likely stemming from an ability to identify and manipulate abstract concepts underlying their training data. However, fundamental questions remain: what determines the concepts a model learns, the order in which it learns them, and its ability to manipulate those concepts? To address these questions, we propose analyzing a model’s learning dynamics via a framework we call the concept space, where each axis represents an independent concept underlying the data generating process. By characterizing learning dynamics in this space, we identify how the speed at which a concept is learned, and hence the order of concept learning, is controlled by properties of the data we term concept signal. Further, we observe moments of sudden turns in the direction of a model’s learning dynamics in concept space. Surprisingly, these points precisely correspond to the emergence of hidden capabilities, i.e., where latent interventions show the model possesses the capability to manipulate a concept, but these capabilities cannot yet be elicited via naive input prompting. While our results focus on synthetically defined toy datasets, we hypothesize a general claim on emergence of hidden capabilities may hold: generative models possess latent capabilities that emerge suddenly and consistently during training, though a model might not exhibit these capabilities under naive input prompting.

[LG-3] DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions

Link: https://arxiv.org/abs/2406.19356
Authors: Nigel Fernandez,Alexander Scarlatos,Simon Woodhead,Andrew Lan
Keywords: anticipate knowledge deficiencies, High-quality distractors, assessment and pedagogical, manually crafting, anticipate knowledge
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.

[LG-4] Subtractive Training for Music Stem Insertion using Latent Diffusion Models

Link: https://arxiv.org/abs/2406.19328
Authors: Ivan Villa-Renteria,Mason L. Wang,Zachary Shah,Zhe Li,Soohyun Kim,Neelesh Ramachandran,Mert Pilanci
Keywords: synthesizing individual musical, present Subtractive Training, individual musical instrument, Subtractive Training, musical instrument stems
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training’s efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while keeping the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.

[LG-5] Efficient World Models with Context-Aware Tokenization

Link: https://arxiv.org/abs/2406.19320
Authors: Vincent Micheli,Eloi Alonso,François Fleuret
Keywords: deep Reinforcement Learning, Reinforcement Learning, Scaling up deep, deep Reinforcement, methods presents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024

Abstract:Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose Δ-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, Δ-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at this https URL.

[LG-6] Jump Starting Bandits with LLM-Generated Prior Knowledge

Link: https://arxiv.org/abs/2406.19317
Authors: Parand A. Alamdari,Yanshuai Cao,Kevin H. Wilson
Keywords: integrating Large Language, Large Language Models, Large Language, present substantial evidence, substantial evidence demonstrating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.

[LG-7] LiveBench: A Challenging Contamination-Free LLM Benchmark

Link: https://arxiv.org/abs/2406.19314
Authors: Colin White,Samuel Dooley,Manley Roberts,Arka Pal,Ben Feuer,Siddhartha Jain,Ravid Shwartz-Ziv,Neel Jain,Khalid Saifullah,Siddartha Naidu,Chinmay Hegde,Yann LeCun,Tom Goldstein,Willie Neiswanger,Micah Goldblum
Keywords: Test set contamination, fair LLM evaluation, render benchmarks obsolete, quickly render benchmarks, newer model training
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Test set contamination, wherein test data from a benchmark ends up in a newer model’s training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

[LG-8] Mapping Land Naturalness from Sentinel-2 using Deep Contextual and Geographical Priors

Link: https://arxiv.org/abs/2406.19302
Authors: Burak Ekim,Michael Schmitt
Keywords: recent decades, affecting our planet, unprecedented scale, climate change, combating climate change
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, ICLR 2024 Tackling Climate Change with Machine Learning Workshop

Abstract:In recent decades, the causes and consequences of climate change have accelerated, affecting our planet on an unprecedented scale. This change is closely tied to the ways in which humans alter their surroundings. As our actions continue to impact natural areas, using satellite images to observe and measure these effects has become crucial for understanding and combating climate change. Aiming to map land naturalness on the continuum of modern human pressure, we have developed a multi-modal supervised deep learning framework that addresses the unique challenges of satellite data and the task at hand. We incorporate contextual and geographical priors, represented by corresponding coordinate information and broader contextual information, including and surrounding the immediate patch to be predicted. Our framework improves the model’s predictive performance in mapping land naturalness from Sentinel-2 data, a type of multi-spectral optical satellite imagery. Recognizing that our protective measures are only as effective as our understanding of the ecosystem, quantifying naturalness serves as a crucial step toward enhancing our environmental stewardship.

[LG-9] MCNC: Manifold Constrained Network Compression

Link: https://arxiv.org/abs/2406.19301
Authors: Chayne Thrash,Ali Abbasi,Parsa Nooralinejad,Soroush Abbasi Koohpayegani,Reed Andreas,Hamed Pirsiavash,Soheil Kolouri
Keywords: large foundational models, diverse tasks-from computer, processing-has significantly increased, increased their demand, outstanding performance
Categories: Machine Learning (cs.LG)

Abstract:The outstanding performance of large foundational models across diverse tasks, from computer vision to speech and natural language processing, has significantly increased their demand. However, storing and transmitting these models pose significant challenges due to their massive size (e.g., 350GB for GPT-3). Recent literature has focused on compressing the original weights or reducing the number of parameters required for fine-tuning these models. These compression methods typically involve constraining the parameter space, for example, through low-rank reparametrization (e.g., LoRA) or quantization (e.g., QLoRA) during model training. In this paper, we present MCNC as a novel model compression method that constrains the parameter space to low-dimensional pre-defined and frozen nonlinear manifolds, which effectively cover this space. Given the prevalence of good solutions in over-parameterized deep neural networks, we show that by constraining the parameter space to our proposed manifold, we can identify high-quality solutions while achieving unprecedented compression rates across a wide variety of tasks. Through extensive experiments in computer vision and natural language processing tasks, we demonstrate that our method, MCNC, significantly outperforms state-of-the-art baselines in terms of compression, accuracy, and/or model reconstruction time.
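
The storage figure and the low-rank baseline mentioned in the abstract come down to simple arithmetic, sketched below. This illustrates model size and LoRA-style parameter counts only, not the MCNC manifold method itself. The 16-bit (2 bytes per parameter) storage format, the 4096x4096 layer shape, and rank 8 are assumptions chosen for illustration.

```python
def model_size_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Storage in GB (10^9 bytes), assuming a fixed width per parameter."""
    return num_params * bytes_per_param / 1e9

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters when a d x k update is factored as (d x r)(r x k)."""
    return r * (d + k)

# GPT-3 has 175 billion parameters; at 2 bytes each that is 350 GB.
gpt3_gb = model_size_gb(175_000_000_000)

# Hypothetical 4096 x 4096 weight matrix with a rank-8 update.
full = 4096 * 4096
low_rank = lora_params(4096, 4096, 8)
print(gpt3_gb, full, low_rank, full // low_rank)
```

The ratio between `full` and `low_rank` shows why constraining the parameter space reduces what has to be stored or transmitted per fine-tuned model.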

[LG-10] scTree: Discovering Cellular Hierarchies in the Presence of Batch Effects in scRNA-seq Data

Link: https://arxiv.org/abs/2406.19300
Authors: Moritz Vandenhirtz,Florian Barkmann,Laura Manduchi,Julia E. Vogt,Valentina Boeva
Keywords: Tree Variational Autoencoders, single-cell Tree Variational, single-cell RNA sequencing, RNA sequencing data, Variational Autoencoders
Categories: Machine Learning (cs.LG)

Abstract:We propose a novel method, scTree, for single-cell Tree Variational Autoencoders, extending a hierarchical clustering approach to single-cell RNA sequencing data. scTree corrects for batch effects while simultaneously learning a tree-structured data representation. This VAE-based method allows for a more in-depth understanding of complex cellular landscapes independently of the biasing effects of batches. We show empirically on seven datasets that scTree discovers the underlying clusters of the data and the hierarchical relations between them, as well as outperforms established baseline methods across these datasets. Additionally, we analyze the learned hierarchy to understand its biological relevance, thus underpinning the importance of integrating batch correction directly into the clustering procedure.

[LG-11] Compositional Image Decomposition with Diffusion Models

Link: https://arxiv.org/abs/2406.19298
Authors: Jocelin Su,Nan Liu,Yanbo Wang,Joshua B. Tenenbaum,Yilun Du
Keywords: scene, set, quickly decompose, components, natural scene
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2024, Webpage: this https URL

Abstract:Given an image of a natural scene, we are able to quickly decompose it into a set of components such as objects, lighting, shadows, and foreground. We can then envision a scene where we combine certain components with those from other images, for instance a set of objects from our bedroom and animals from a zoo under the lighting conditions of a forest, even if we have never encountered such a scene before. In this paper, we present a method to decompose an image into such compositional components. Our approach, Decomp Diffusion, is an unsupervised method which, when given a single image, infers a set of different components in the image, each represented by a diffusion model. We demonstrate how components can capture different factors of the scene, ranging from global scene descriptors like shadows or facial expression to local scene descriptors like constituent objects. We further illustrate how inferred factors can be flexibly composed, even with factors inferred from other models, to generate a variety of scenes sharply different than those seen in training time. Website and code at this https URL.

[LG-12] From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Link: https://arxiv.org/abs/2406.19292
Authors: Zheyang Xiong,Vasilis Papageorgiou,Kangwook Lee,Dimitris Papailiopoulos
Keywords: Large Language Models, Large Language, Recent studies, shown that Large, accurately retrieve information
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs’ information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., 10.5% improvement on 20 documents MDQA at position 10 for GPT-3.5 Turbo). We also find that finetuned LLMs’ performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data causes no performance drop while other baseline data can cause a drop that ranges from 2.33% to 6.19%). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
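
A numerical key-value retrieval task of the kind the abstract describes could look roughly like the sketch below. The abstract does not give the prompt format, so the wording of the prompt, the five-digit key and value ranges, and the `make_kv_task` helper are all hypothetical.

```python
import json
import random
from typing import Tuple

def make_kv_task(num_pairs: int = 20, seed: int = 0) -> Tuple[str, str]:
    """Build one synthetic prompt and its gold answer (hypothetical format)."""
    rng = random.Random(seed)
    # Distinct five-digit keys, each mapped to a random five-digit value.
    keys = rng.sample(range(10_000, 100_000), num_pairs)
    kv = {str(k): rng.randrange(10_000, 100_000) for k in keys}
    target = rng.choice(list(kv))
    prompt = (
        "Here is a JSON dictionary:\n"
        f"{json.dumps(kv)}\n"
        f"What is the value for key {target}?"
    )
    return prompt, str(kv[target])

prompt, answer = make_kv_task()
print(prompt)
print(answer)
```

Because keys and values are random numbers, the answer can only be found by reading the context, and the position of the target pair can be varied to probe different parts of a long input.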

[LG-13] HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Link: https://arxiv.org/abs/2406.19280
Authors: Junying Chen,Ruyi Ouyang,Anningzhe Gao,Shunian Chen,Guiming Hardy Chen,Xidong Wang,Ruifei Zhang,Zhenyang Cai,Ke Ji,Guangjun Yu,Xiang Wan,Benyou Wang
Keywords: large language models, multimodal large language, rapid development, large language, medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

[LG-14] Stochastic Concept Bottleneck Models

Link: https://arxiv.org/abs/2406.19272
Authors: Moritz Vandenhirtz,Sonia Laguna,Ričards Marcinkevičs,Julia E. Vogt
Keywords: promising interpretable method, Concept Bottleneck Models, Bottleneck Models, Stochastic Concept Bottleneck, Concept Bottleneck
Categories: Machine Learning (cs.LG)

Abstract:Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on intermediate, human-understandable concepts rather than the raw input. Through time-consuming manual interventions, a user can correct wrongly predicted concept values to enhance the model’s downstream performance. We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that models concept dependencies. In SCBMs, a single-concept intervention affects all correlated concepts, thereby improving intervention effectiveness. Unlike previous approaches that model the concept relations via an autoregressive structure, we introduce an explicit, distributional parameterization that allows SCBMs to retain the CBMs’ efficient training and inference procedure. Additionally, we leverage the parameterization to derive an effective intervention strategy based on the confidence region. We show empirically on synthetic tabular and natural image datasets that our approach improves intervention effectiveness significantly. Notably, we showcase the versatility and usability of SCBMs by examining a setting with CLIP-inferred concepts, alleviating the need for manual concept annotations.

[LG-15] Leveraging Contrastive Learning for Enhanced Node Representations in Tokenized Graph Transformers

Link: https://arxiv.org/abs/2406.19258
Authors: Jinsong Chen, Hanpeng Liu, John E. Hopcroft, Kun He
Keywords: demonstrated strong performance, high similarity scores, tokenized graph Transformers, fully harness graph, graph Transformers
Categories: Machine Learning (cs.LG)

Abstract:While tokenized graph Transformers have demonstrated strong performance in node classification tasks, their reliance on a limited subset of nodes with high similarity scores for constructing token sequences overlooks valuable information from other nodes, hindering their ability to fully harness graph information for learning optimal node representations. To address this limitation, we propose a novel graph Transformer called GCFormer. Unlike previous approaches, GCFormer develops a hybrid token generator to create two types of token sequences, positive and negative, to capture diverse graph information. A tailored Transformer-based backbone is then adopted to learn meaningful node representations from these generated token sequences. Additionally, GCFormer introduces contrastive learning to extract valuable information from both positive and negative token sequences, enhancing the quality of learned node representations. Extensive experimental results across various datasets, including homophily and heterophily graphs, demonstrate the superiority of GCFormer in node classification when compared to representative graph neural networks (GNNs) and graph Transformers.

[LG-16] Advection Augmented Convolutional Neural Networks

Link: https://arxiv.org/abs/2406.19253
Authors: Niloufar Zakariaei, Siddharth Rout, Eldad Haber, Moshe Eliasof
Keywords: space-time sequences, physical sciences, sciences are characterized, Convolution Neural Networks, prediction
Categories: Machine Learning (cs.LG)

Abstract:Many problems in physical sciences are characterized by the prediction of space-time sequences. Such problems range from weather prediction to the analysis of disease propagation and video prediction. Modern techniques for the solution of these problems typically combine a Convolutional Neural Network (CNN) architecture with a time prediction mechanism. However, such approaches often underperform in the long-range propagation of information and lack explainability. In this work, we introduce a physically inspired architecture for the solution of such problems. Namely, we propose to augment CNNs with advection by designing a novel semi-Lagrangian push operator. We show that the proposed operator allows for the non-local transformation of information compared with standard convolutional kernels. We then complement it with Reaction and Diffusion neural components to form a network that mimics the Reaction-Advection-Diffusion equation in high dimensions. We demonstrate the effectiveness of our network on a number of spatio-temporal datasets.

[LG-17] NTFormer: A Composite Node Tokenized Graph Transformer for Node Classification

Link: https://arxiv.org/abs/2406.19249
Authors: Jinsong Chen, Siyu Jiang, Kun He
Keywords: made significant advancements, emerging graph Transformers, graph Transformers, made significant, significant advancements
Categories: Machine Learning (cs.LG)

Abstract:Recently, the emerging graph Transformers have made significant advancements for node classification on graphs. In most graph Transformers, a crucial step involves transforming the input graph into token sequences as the model input, enabling the Transformer to effectively learn the node representations. However, we observe that existing methods only express partial graph information of nodes through single-type token generation. Consequently, they require tailored strategies to encode additional graph-specific features into the Transformer to ensure the quality of node representation learning, limiting the model flexibility to handle diverse graphs. To this end, we propose a new graph Transformer called NTFormer to address this issue. NTFormer introduces a novel token generator called Node2Par, which constructs various token sequences using different token elements for each node. This flexibility allows Node2Par to generate valuable token sequences from different perspectives, ensuring comprehensive expression of rich graph features. Benefiting from the merits of Node2Par, NTFormer leverages only a Transformer-based backbone to learn node representations, eliminating the need for graph-specific modifications. Extensive experiments conducted on various benchmark datasets containing homophily and heterophily graphs with different scales demonstrate the superiority of NTFormer over representative graph Transformers and graph neural networks for node classification.

[LG-18] Improving the Expressiveness of K-hop Message-Passing GNNs by Injecting Contextualized Substructure Information

Link: https://arxiv.org/abs/2406.19244
Authors: Tianjun Yao, Yiongxu Wang, Kun Zhang, Shangsong Liang
Keywords: Graph neural networks, Graph neural, hop message-passing GNNs, expressive power, hop graph neural
Categories: Machine Learning (cs.LG)
Comments: 13 pages, published in Research track of KDD2023

Abstract:Graph neural networks (GNNs) have become the de facto standard for representational learning in graphs, and have achieved state-of-the-art performance in many graph-related tasks; however, it has been shown that the expressive power of standard GNNs is at most equivalent to that of the 1-dimensional Weisfeiler-Lehman (1-WL) test. Recently, a line of work has aimed to enhance the expressive power of graph neural networks. One such line of work aims at developing K-hop message-passing GNNs, where a node representation is updated by aggregating information not only from direct neighbors but from all neighbors within K hops of the node. Another line of work leverages subgraph information to enhance the expressive power, which is proven to be strictly more powerful than the 1-WL test. In this work, we discuss the limitation of K-hop message-passing GNNs and propose a substructure encoding function to uplift the expressive power of any K-hop message-passing GNN. We further inject contextualized substructure information to enhance the expressiveness of K-hop message-passing GNNs. Our method is provably more powerful than previous works on K-hop graph neural networks and 1-WL subgraph GNNs, which are a specific type of subgraph-based GNN model, and not less powerful than 3-WL. Empirically, our proposed method sets new state-of-the-art performance or achieves comparable performance on a variety of datasets. Our code is available at this https URL.
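
The K-hop aggregation idea mentioned in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes an unweighted graph given as an adjacency dictionary, scalar node features, and plain mean aggregation, whereas real K-hop GNNs use learned, typically per-hop, aggregation. The names `k_hop_neighbors` and `k_hop_mean` are hypothetical.

```python
from collections import deque
from typing import Dict, List


def k_hop_neighbors(adj: Dict[int, List[int]], source: int, k: int) -> Dict[int, int]:
    """Return {node: hop distance} for every node within k hops of `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:          # do not expand beyond k hops
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]              # the node itself is not its own neighbor
    return dist


def k_hop_mean(adj: Dict[int, List[int]], feats: Dict[int, float], source: int, k: int) -> float:
    """Mean feature over all neighbors within k hops (illustrative aggregator)."""
    nbrs = k_hop_neighbors(adj, source, k)
    if not nbrs:
        return feats[source]
    return sum(feats[v] for v in nbrs) / len(nbrs)


# Path graph 0 - 1 - 2 - 3: with k=1 node 0 sees only node 1, with k=2 it also sees node 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
two_hop = k_hop_neighbors(path, 0, 2)
```

Setting `k=1` recovers ordinary one-hop message passing, which is the baseline whose expressive power the abstract compares against.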

[LG-19] Revealing Fine-Grained Values and Opinions in Large Language Models

Link: https://arxiv.org/abs/2406.19238
Authors: Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, Isabelle Augenstein
Keywords: mitigate potential harm, Uncovering latent, potential harm, biases and mitigate, mitigate potential
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 28 pages, 20 figures, 7 tables

Abstract:Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we propose to address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT) generated by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of their generated stances and fine-grained analysis of the plain text justifications for those stances. For fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, as well as disparities between the results of tests when eliciting closed-form vs. open domain responses. Additionally, patterns in the plain text rationales via tropes show that similar justifications are repeatedly generated across models and prompts even with disparate stances.

[LG-20] FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

Link: https://arxiv.org/abs/2406.19237
Authors: Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth
Keywords: question answering lack, spatial reasoning skills, visual question answering, evaluating spatial reasoning, Existing benchmarks
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Existing benchmarks for visual question answering lack visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark’s potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.

[LG-21] Tools Fail: Detecting Silent Errors in Faulty Tools

Link: https://arxiv.org/abs/2406.19228
Authors: Jimin Sun, So Yeon Min, Yingshan Chang, Yonatan Bisk
Keywords: control robots, retrieve knowledge, perform tasks, mainstay of LLMs, Abstract
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 12 figures

Abstract:Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model’s ability to detect “silent” tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.

[LG-22] T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Link: https://arxiv.org/abs/2406.19223
Authors: Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
Keywords: Large Language Models, Tokenizers are crucial, Language Models, recently stagnated, inherent weaknesses
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
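
The idea of activating embedding rows from character triplets can be sketched as below. This is a rough illustration rather than the authors' implementation: the whitespace padding, the SHA-256 hash, the single hash per triplet, and the table size of 1024 are all assumptions made for the example, and the names `char_trigrams` and `active_indices` are hypothetical.

```python
import hashlib
from typing import List, Set


def char_trigrams(word: str) -> List[str]:
    """Split a word into overlapping character triplets (padded with spaces)."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def active_indices(word: str, table_size: int = 1024) -> Set[int]:
    """Map each triplet to one row of a fixed-size embedding table.

    The returned set is the sparse activation pattern for the word; a word
    embedding could then be the sum of the selected rows. No vocabulary or
    reference corpus is needed to compute it.
    """
    indices = set()
    for tri in char_trigrams(word):
        digest = hashlib.sha256(tri.encode("utf-8")).digest()
        indices.add(int.from_bytes(digest[:8], "big") % table_size)
    return indices


# Morphologically related words share triplets and therefore share active rows.
shared = active_indices("cat") & active_indices("cats")
```

Because the table size is fixed independently of any vocabulary, the embedding layer can be much smaller than a per-token lookup table, which is the compression effect the abstract refers to.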

[LG-23] Estimating Long-term Heterogeneous Dose-response Curve: Generalization Bound Leveraging Optimal Transport Weights

Link: https://arxiv.org/abs/2406.19195
Authors: Zeqin Yang, Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, Zhipeng Yu, Zhichao Zou, Zhen Peng, Jiecheng Guo
Keywords: causal effect estimation, Long-term causal effect, long-term average effects, significant but challenging, Long-term causal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Long-term causal effect estimation is a significant but challenging problem in many applications. Existing methods rely on ideal assumptions to estimate long-term average effects, e.g., no unobserved confounders or a binary treatment, while in numerous real-world applications, these assumptions could be violated and average effects are unable to provide individual-level this http URL this paper, we address a more general problem of estimating the long-term heterogeneous dose-response curve (HDRC) while accounting for unobserved confounders. Specifically, to remove unobserved confounding in observational data, we introduce an optimal transport weighting framework to align the observational data to the experimental data with theoretical guarantees. Furthermore, to accurately predict the heterogeneous effects of continuous treatment, we establish a generalization bound on counterfactual prediction error by leveraging the reweighted distribution induced by optimal transport. Finally, we develop an HDRC estimator building upon the above theoretical foundations. Extensive experimental studies conducted on multiple synthetic and semi-synthetic datasets demonstrate the effectiveness of our proposed method.

[LG-24] BISeizuRe: BERT-Inspired Seizure Data Representation to Improve Epilepsy Monitoring

Link: https://arxiv.org/abs/2406.19189
Authors: Luca Benfenati, Thorir Mar Ingolfsson, Andrea Cossettini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini
Keywords: Hospital EEG Corpus, University Hospital EEG, Temple University Hospital, Scalp EEG Database, study presents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 tables, 2 figures

Abstract:This study presents a novel approach for EEG-based seizure detection leveraging a BERT-based model. The model, BENDR, undergoes a two-phase training process. Initially, it is pre-trained on the extensive Temple University Hospital EEG Corpus (TUEG), a 1.5 TB dataset comprising over 10,000 subjects, to extract common EEG data patterns. Subsequently, the model is fine-tuned on the CHB-MIT Scalp EEG Database, consisting of 664 EEG recordings from 24 pediatric patients, of which 198 contain seizure events. Key contributions include optimizing fine-tuning on the CHB-MIT dataset, where the impact of model architecture, pre-processing, and post-processing techniques is thoroughly examined to enhance sensitivity and reduce false positives per hour (FP/h). We also explored custom training strategies to ascertain the most effective setup. The model undergoes a novel second pre-training phase before subject-specific fine-tuning, enhancing its generalization capabilities. The optimized model demonstrates substantial performance enhancements, achieving as low as 0.23 FP/h, 2.5× lower than the baseline model, with a lower but still acceptable sensitivity rate, showcasing the effectiveness of applying a BERT-based approach on EEG-based seizure detection.

[LG-25] Averaging log-likelihoods in direct alignment

Link: https://arxiv.org/abs/2406.19188
Authors: Nathan Grinsztajn, Yannis Flet-Berliac, Mohammad Gheshlaghi Azar, Florian Strub, Bill Wu, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Olivier Pietquin, Matthieu Geist
Keywords: align Large Language, Large Language Models, Reinforcement Learning, Large Language, Human Feedback
Categories: Machine Learning (cs.LG)

Abstract:To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model directly from a preference dataset without computing a proxy reward function. These methods are built upon contrastive losses involving the log-likelihood of (dis)preferred completions according to the trained model. However, completions have various lengths, and the log-likelihood is not length-invariant. On the other hand, the cross-entropy loss used in supervised training is length-invariant, as batches are typically averaged token-wise. To reconcile these approaches, we introduce a principled approach for making direct alignment length-invariant. Formally, we introduce a new averaging operator, to be composed with the optimality operator giving the best policy for the underlying RL problem. It translates into averaging the log-likelihood within the loss. We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.
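
The length-invariance point in the abstract can be made concrete with a small sketch. It only contrasts a summed and a token-averaged sequence log-likelihood; it does not reproduce the paper's averaging operator or any specific alignment loss. The function names are hypothetical, and the per-token probabilities are made-up illustrative values.

```python
import math
from typing import Sequence


def summed_logprob(token_logprobs: Sequence[float]) -> float:
    """Sequence log-likelihood: grows in magnitude with completion length."""
    return sum(token_logprobs)


def averaged_logprob(token_logprobs: Sequence[float]) -> float:
    """Token-averaged log-likelihood: comparable across completion lengths."""
    if not token_logprobs:
        raise ValueError("completion must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


# Two completions whose tokens are all equally likely (p = 0.5) but differ in length.
short_completion = [math.log(0.5)] * 2
long_completion = [math.log(0.5)] * 8

# The summed score penalises the longer completion; the averaged score does not.
gap_summed = summed_logprob(short_completion) - summed_logprob(long_completion)
gap_averaged = averaged_logprob(short_completion) - averaged_logprob(long_completion)
```

In a contrastive loss that compares preferred and dispreferred completions, swapping the summed score for the averaged one removes the dependence on length from the comparison, which is the effect the abstract describes.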

[LG-26] Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion

Link: https://arxiv.org/abs/2406.19185
Authors: Yannis Flet-Berliac, Nathan Grinsztajn, Florian Strub, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Mohammad Gheshlaghi Azar, Olivier Pietquin, Matthieu Geist
Keywords: Large Language Models, finetune Large Language, Reinforcement Learning, Large Language, Language Models
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement Learning (RL) has been used to finetune Large Language Models (LLMs) using a reward model trained from preference data, to better align with human judgment. The recently introduced direct alignment methods, which are often simpler, more stable, and computationally lighter, can more directly achieve this. However, these approaches cannot optimize arbitrary rewards, and the preference-based ones are not the only rewards of interest for LLMs (e.g., unit tests for code generation or textual entailment for summarization, among others). RL-finetuning is usually done with a variation of policy gradient, which calls for on-policy or near-on-policy samples, requiring costly generations. We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data. It can be seen as an off-policy policy gradient approach that does not rely on importance sampling techniques and highlights the importance of using (the right) state baseline. We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient. We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task, using a learned reward function considered as ground truth for the purpose of the experiments.

[LG-27] Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data

Link: https://arxiv.org/abs/2406.19175
Authors: Lukas Malte Kemeter, Rasmus Hvingelby, Paulina Sierak, Tobias Schön, Bishwajit Gosswam
Keywords: machine learning, data, manufacturing settings, vision is costly, synthetic data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for effectively training models that perform well on real-world data, primarily due to domain shifts between the synthetic and real-world data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels. Using both simulated and real-world X-ray images, we train an object detection model with different strategies to identify the training approach that generates the best detection results while minimising the demand for annotated real-world training samples. Our preliminary findings suggest that the sim-2-real domain adaptation approach is more cost-efficient than a fully supervised oracle - if the total number of available annotated samples is fixed. Given a certain number of labeled real-world samples, training on a mix of synthetic and unlabeled real-world data achieved comparable or even better detection results at significantly lower cost. We argue that future research into the cost-efficiency of different training strategies is important for a better understanding of how to allocate budget in applied machine learning projects.

[LG-28] Heterogeneous Causal Metapath Graph Neural Network for Gene-Microbe-Disease Association Prediction

Link: https://arxiv.org/abs/2406.19156
Authors: Kexin Zhang, Feng Huang, Luotao Liu, Zhankun Xiong, Hongyu Zhang, Yuan Quan, Wen Zhang
Keywords: human medicine highlights, GMD associations, GMD, recent focus, human medicine
Categories: Machine Learning (cs.LG)

Abstract:The recent focus on microbes in human medicine highlights their potential role in the genetic framework of diseases. To decode the complex interactions among genes, microbes, and diseases, computational predictions of gene-microbe-disease (GMD) associations are crucial. Existing methods primarily address gene-disease and microbe-disease associations, but the more intricate triple-wise GMD associations remain less explored. In this paper, we propose a Heterogeneous Causal Metapath Graph Neural Network (HCMGNN) to predict GMD associations. HCMGNN constructs a heterogeneous graph linking genes, microbes, and diseases through their pairwise associations, and utilizes six predefined causal metapaths to extract directed causal subgraphs, which facilitate the multi-view analysis of causal relations among three entity types. Within each subgraph, we employ a causal semantic sharing message passing network for node representation learning, coupled with an attentive fusion method to integrate these representations for predicting GMD associations. Our extensive experiments show that HCMGNN effectively predicts GMD associations and addresses the association sparsity issue by enhancing the graph’s semantics and structure.

[LG-29] Advancing operational PM2.5 forecasting with dual deep neural networks (D-DNet)

Link: https://arxiv.org/abs/2406.19154
Authors: Shengjuan Cai, Fangxin Fang, Vincent-Henri Peuch, Mihai Alexe, Ionel Michael Navon, Yanghua Wang
Keywords: air quality management, public health, air quality, quality management, policy development
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:PM2.5 forecasting is crucial for public health, air quality management, and policy development. Traditional physics-based models are computationally demanding and slow to adapt to real-time conditions. Deep learning models show potential in efficiency but still suffer from accuracy loss over time due to error accumulation. To address these challenges, we propose a dual deep neural network (D-DNet) prediction and data assimilation system that efficiently integrates real-time observations, ensuring reliable operational forecasting. D-DNet excels in global operational forecasting for PM2.5 and AOD550, maintaining consistent accuracy throughout the entire year of 2019. It demonstrates notably higher efficiency than the Copernicus Atmosphere Monitoring Service (CAMS) 4D-Var operational forecasting system while maintaining comparable accuracy. This efficiency benefits ensemble forecasting, uncertainty analysis, and large-scale tasks.

[LG-30] Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Link: https://arxiv.org/abs/2406.19146
Authors: Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
Keywords: laws yield substantially, developed influential scaling, Kaplan scaling law, influential scaling laws, compute budget
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., “Chinchilla”) scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW β₂ parameter is essential at lower batch sizes.

[LG-31] YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention

Link: https://arxiv.org/abs/2406.19136
Authors: Chenxu Wang, Haowei Ming, Jian He, Yao Lu
Keywords: drug ADME processes, ADME processes, effectiveness and safety, essential for determining, determining their therapeutic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures, 6 tables

Abstract:The accurate prediction of drug molecule solubility is essential for determining their therapeutic effectiveness and safety, influencing the drug’s ADME processes. Traditional solubility prediction techniques often fail to capture the complex nature of molecular structures, leading to notable deviations between predictions and actual results. For example, in a discussion of advanced drug-like compound structures, Lusci highlighted issues in capturing crucial cyclic structural information in molecules with ring structures. To overcome this issue, our research introduces a novel deep learning framework combining attention-based transformers, Long Short-Term Memory (LSTM) networks, and Graph Convolutional Networks (GCN), aimed at enhancing the precision of solubility predictions. Utilizing a training set of 9,943 compounds and testing on an anticancer compound dataset, our method achieved a correlation coefficient (R²) of 0.55 and a Root Mean Square Error (RMSE) of 0.59, which outperforms the benchmark models’ scores of 0.52 (R²) and 0.61 (RMSE). Importantly, in an additional independent test, our model significantly outperformed the baseline with an RMSE of 1.05 compared to 1.28, a relative accuracy improvement of 45.9%. This research not only demonstrates the vast potential of deep learning for improving solubility prediction accuracy but also offers novel insights for drug design and selection in the future. Continued efforts will be directed towards optimizing the model architecture and extending its application to better support the drug development process, underscoring the pivotal role of deep learning in drug discovery.

[LG-32] Towards Learning Abductive Reasoning using VSA Distributed Representations

Link: https://arxiv.org/abs/2406.19121
Authors: Giacomo Camposampiero, Michael Hersche, Aleksandar Terzić, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Keywords: Abductive Rule Learner, Learner with Context-awareness, solves abstract reasoning, abstract reasoning tasks, reasoning tasks based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy) 2024

Abstract:We introduce the Abductive Rule Learner with Context-awareness (ARLC), a model that solves abstract reasoning tasks based on Learn-VRF. ARLC features a novel and more broadly applicable training objective for abductive reasoning, resulting in better interpretability and higher accuracy when solving Raven’s progressive matrices (RPM). ARLC allows both programming domain knowledge and learning the rules underlying a data distribution. We evaluate ARLC on the I-RAVEN dataset, showcasing state-of-the-art accuracy across both in-distribution and out-of-distribution (unseen attribute-rule pairs) tests. ARLC surpasses neuro-symbolic and connectionist baselines, including large language models, despite having orders of magnitude fewer parameters. We show ARLC’s robustness to post-programming training by incrementally learning from examples on top of programmed knowledge, which only improves its performance and does not result in catastrophic forgetting of the programmed solution. We validate ARLC’s seamless transfer learning from a 2x2 RPM constellation to unseen constellations. Our code is available at this https URL.

[LG-33] CHEW: A Dataset of CHanging Events in Wikipedia

Link: https://arxiv.org/abs/2406.19116
Authors: Hsuvas Borkakoty, Luis Espinosa-Anke
Keywords: naturally occurring text, occurring text, introduce CHEW, Wikipedia expressed, dataset of changing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Short Paper

Abstract:We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their timeline understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.

[LG-34] A Teacher Is Worth A Million Instructions

Link: https://arxiv.org/abs/2406.19112
Authors: Nikhil Kothari, Ravindra Nayak, Shreyas Shetty, Amey Patil, Nikesh Garera
Keywords: shown exceptional abilities, Large Language Models, Large Language, exceptional abilities, shown exceptional
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 4 figures

Abstract:Large Language Models (LLMs) have shown exceptional abilities, yet training these models can be quite challenging. There is a strong dependence on the quality of data and finding the best instruction tuning set. Further, the inherent limitations in training methods create substantial difficulties in training relatively smaller models with 7B and 13B parameters. In our research, we suggest an improved training method for these models by utilising knowledge from larger models, such as mixture-of-experts (8x7B) architectures. The scale of these larger models allows them to capture a wide range of variations from data alone, making them effective teachers for smaller models. Moreover, we implement a novel post-training domain alignment phase that employs domain-specific expert models to boost domain-specific knowledge during training while preserving the model’s ability to generalise. Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 7.9 in MT-Bench and 93.04% on AlpacaEval.

[LG-35] Adaptive Stochastic Weight Averaging

Link: https://arxiv.org/abs/2406.19092
Authors: Caglar Demir, Arnab Sharma, Axel-Cyrille Ngonga Ngomo
Keywords: Stochastic Weight Averaging, underlying running model, Weight Averaging, Adaptive Stochastic Weight, Stochastic Weight
Categories: Machine Learning (cs.LG)

Abstract:Ensemble models often improve generalization performance in challenging tasks. Yet, traditional techniques based on prediction averaging incur three well-known disadvantages: the computational overhead of training multiple models, increased latency, and memory requirements at test time. To address these issues, the Stochastic Weight Averaging (SWA) technique maintains a running average of model parameters from a specific epoch onward. Despite its potential benefits, maintaining a running average of parameters can hinder generalization, as an underlying running model begins to overfit. Conversely, an inadequately chosen starting point can render SWA more susceptible to underfitting compared to an underlying running model. In this work, we propose the Adaptive Stochastic Weight Averaging (ASWA) technique, which updates a running average of model parameters only when generalization performance is improved on the validation dataset. Hence, ASWA can be seen as a combination of SWA with the early stopping technique, where the former accepts all updates on a parameter ensemble model and the latter rejects any update on an underlying running model. We conducted extensive experiments ranging from image classification to multi-hop reasoning over knowledge graphs. Our experiments over 11 benchmark datasets with 7 baseline models suggest that ASWA leads to a statistically better generalization across models and datasets.
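The acceptance rule described in this abstract can be illustrated with a minimal sketch. It is a simplified reading of the abstract only, not the authors' implementation: the function names `fold_in` and `adaptive_swa`, the flat list-of-floats parameter format, and the "higher is better" validation score are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def fold_in(avg: Sequence[float], n_avg: int, params: Sequence[float]) -> List[float]:
    """Running mean of n_avg vectors after adding one more parameter vector."""
    return [(a * n_avg + p) / (n_avg + 1) for a, p in zip(avg, params)]


def adaptive_swa(
    checkpoints: Sequence[Sequence[float]],
    val_score: Callable[[Sequence[float]], float],
) -> Tuple[List[float], int]:
    """Average per-epoch checkpoints, keeping an update only if validation improves.

    checkpoints: hypothetical per-epoch parameter vectors of the running model.
    val_score:   hypothetical validation metric, higher is better.
    """
    avg = list(checkpoints[0])
    n_avg = 1
    best = val_score(avg)
    for params in checkpoints[1:]:
        trial = fold_in(avg, n_avg, params)
        score = val_score(trial)
        if score > best:  # accept the update (plain SWA would always accept)
            avg, n_avg, best = trial, n_avg + 1, score
        # otherwise reject: the averaged model is left unchanged
    return avg, n_avg
```

Plain SWA corresponds to always taking the accept branch; the rejection branch is what makes the scheme resemble early stopping applied to the averaged model.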

[LG-36] Dimensions underlying the representational alignment of deep neural networks with humans

Link: https://arxiv.org/abs/2406.19087
Authors: Florian P. Mahner, Lukas Muttenthaler, Umut Güçlü, Martin N. Hebart
Keywords: artificial intelligence, machine learning, Determining the similarities, Determining, DNN
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Determining the similarities and differences between humans and artificial intelligence is an important goal both in machine learning and cognitive neuroscience. However, similarities in representations only inform us about the degree of alignment, not the factors that determine it. Drawing upon recent developments in cognitive science, we propose a generic framework for yielding comparable representations in humans and deep neural networks (DNN). Applying this framework to humans and a DNN model of natural images revealed a low-dimensional DNN embedding of both visual and semantic dimensions. In contrast to humans, DNNs exhibited a clear dominance of visual over semantic features, indicating divergent strategies for representing images. While in-silico experiments showed seemingly-consistent interpretability of DNN dimensions, a direct comparison between human and DNN representations revealed substantial differences in how they process images. By making representations directly comparable, our results reveal important challenges for representational alignment, offering a means for improving their comparability.

[LG-37] Dancing in the Shadows: Harnessing Ambiguity for Fairer Classifiers

Link: https://arxiv.org/abs/2406.19066
Authors: Ainhize Barrainkua, Paula Gordaliza, Jose A. Lozano, Novi Quadrianto
Keywords: bolster algorithmic fairness, paper introduces, approach to bolster, bolster algorithmic, sensitive information
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:This paper introduces a novel approach to bolster algorithmic fairness in scenarios where sensitive information is only partially known. In particular, we propose to leverage instances with uncertain identity with regards to the sensitive attribute to train a conventional machine learning classifier. The enhanced fairness observed in the final predictions of this classifier highlights the promising potential of prioritizing ambiguity (i.e., non-normativity) as a means to improve fairness guarantees in real-world classification tasks.

[LG-38] Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO

Link: https://arxiv.org/abs/2406.19057
Authors: Fuseini Mumuni, Alhassan Mumuni
Keywords: Segment Anything Model, Grounding DINO, Model, achieved impressive performance, zero-shot object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have great potential to revolutionize zero-shot semantic segmentation or data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall under existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted a severe limitation of the REC framework in this application setting owing to its tendency to make false positive predictions when the target is absent in the given image. And, while this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on eight publicly available datasets and reveal that these errors consistently follow a predictable pattern and can, thus, be mitigated by a simple strategy. Specifically, we show that these false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research in improving REC-based detection and automated segmentation. Using this technique, we evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvement in segmentation performance and annotation time savings over manual approaches.
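The size-based filtering idea mentioned in this abstract can be sketched as follows. This is an illustrative assumption about how such a filter might look, not the paper's code: the box format, the helper names, and the `max_rel_area=0.5` threshold are hypothetical, and no Grounding DINO or SAM API is used.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def relative_area(box: Box, image_w: float, image_h: float) -> float:
    """Fraction of the image covered by the box."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) / (image_w * image_h)


def filter_large_boxes(
    boxes: List[Box], image_w: float, image_h: float, max_rel_area: float = 0.5
) -> List[Box]:
    """Drop detections whose relative area exceeds the (assumed) threshold."""
    return [b for b in boxes if relative_area(b, image_w, image_h) <= max_rel_area]
```

The surviving boxes would then be passed on as prompts to the segmentation model; a suitable threshold would have to be chosen per dataset.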

[LG-39] A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)

Link: https://arxiv.org/abs/2406.19054
Authors: Daniel Sonntag, Michael Barz, Thiago Gouvêa
Keywords: German Federal Ministry, German Federal, Federal Ministry, Ministry of Education, reveals deeper insights
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: DFKI Technical Report

Abstract:This DFKI technical report presents the anatomy of the No-IDLE prototype system (funded by the German Federal Ministry of Education and Research) that provides not only basic and fundamental research in interactive machine learning, but also reveals deeper insights into users’ behaviours, needs, and goals. Machine learning and deep learning should become accessible to millions of end users. No-IDLE’s goals and scientific challenges centre around the desire to increase the reach of interactive deep learning solutions for non-experts in machine learning. One of the key innovations described in this technical report is a methodology for interactive machine learning combined with multimodal interaction which will become central when we start interacting with semi-intelligent machines in the upcoming area of neural networks and large language models.

[LG-40] FedMap: Iterative Magnitude-Based Pruning for Communication-Efficient Federated Learning

Link: https://arxiv.org/abs/2406.19050
Authors: Alexander Herzog, Robbie Southam, Ioannis Mavromatis, Aftab Khan
Keywords: distributed machine learning, Federated Learning, preserving privacy, distributed machine, enables training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE Transactions on Neural Networks and Learning Systems

Abstract:Federated Learning (FL) is a distributed machine learning approach that enables training on decentralized data while preserving privacy. However, FL systems often involve resource-constrained client devices with limited computational power, memory, storage, and bandwidth. This paper introduces FedMap, a novel method that aims to enhance the communication efficiency of FL deployments by collaboratively learning an increasingly sparse global model through iterative, unstructured pruning. Importantly, FedMap trains a global model from scratch, unlike other methods reported in the literature, making it ideal for privacy-critical use cases such as in the medical and finance domains, where suitable pre-training data is often limited. FedMap adapts iterative magnitude-based pruning to the FL setting, ensuring all clients prune and refine the same subset of the global model parameters, therefore gradually reducing the global model size and communication overhead. The iterative nature of FedMap, forming subsequent models as subsets of predecessors, avoids parameter reactivation issues seen in prior work, resulting in stable performance. In this paper we provide an extensive evaluation of FedMap across diverse settings, datasets, model architectures, and hyperparameters, assessing performance in both IID and non-IID environments. Comparative analysis against the baseline approach demonstrates FedMap’s ability to achieve more stable client model performance. For IID scenarios, FedMap achieves over 90% pruning without significant performance degradation. In non-IID settings, it achieves at least ~80% pruning while maintaining accuracy. FedMap offers a promising solution to alleviate communication bottlenecks in FL systems while retaining model accuracy.
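The nested-mask property described in this abstract (each pruned model is a subset of its predecessor) can be illustrated with a generic iterative magnitude pruning step. This is a sketch under assumptions, not FedMap itself: the function name `prune_step`, the flat weight list, and the per-round `prune_fraction` are hypothetical, and the federated aggregation step is omitted.

```python
from typing import List, Sequence


def prune_step(
    weights: Sequence[float], mask: Sequence[bool], prune_fraction: float
) -> List[bool]:
    """Deactivate the smallest-magnitude fraction of the still-active weights.

    Already-pruned positions stay pruned, so every new mask is a subset
    of the previous one (no parameter reactivation).
    """
    active = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(active) * prune_fraction)
    smallest = sorted(active, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in smallest:
        new_mask[i] = False
    return new_mask
```

In a federated setting, every client would apply the same mask to the shared global parameters, so only the active entries need to be communicated in each round.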

[LG-41] Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation

Link: https://arxiv.org/abs/2406.19049
Authors: Amartya Sanyal, Yaxi Hu, Yaodong Yu, Yian Ma, Yixin Wang, Bernhard Schölkopf
Keywords: widely observed phenomenon, machine learning, widely observed, OOD, data configurations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:“Accuracy-on-the-line” is a widely observed phenomenon in machine learning, where a model’s accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to “Accuracy-on-the-wrong-line”. This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on Out-of-distribution (OOD) error in a linear classification model, characterizing the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.

[LG-42] On Convex Optimization with Semi-Sensitive Features

Link: https://arxiv.org/abs/2406.19040
Authors: Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Raghu Meka, Chiyuan Zhang
Keywords: empirical risk minimization, differentially private, sensitive domain size, study the differentially, semi-sensitive DP setting
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Comments: To appear in COLT 2024

Abstract:We study the differentially private (DP) empirical risk minimization (ERM) problem under the semi-sensitive DP setting where only some features are sensitive. This generalizes the Label DP setting where only the label is sensitive. We give improved upper and lower bounds on the excess risk for DP-ERM. In particular, we show that the error only scales polylogarithmically in terms of the sensitive domain size, improving upon previous results that scale polynomially in the sensitive domain size (Ghazi et al., 2021).

[LG-43] Lithium-Ion Battery System Health Monitoring and Fault Analysis from Field Data Using Gaussian Processes

Link: https://arxiv.org/abs/2406.19015
Authors: Joachim Schaeffer, Eric Lenz, Duncan Gulla, Martin Z. Bazant, Richard D. Braatz, Rolf Findeisen
Keywords: Health monitoring, safe and sustainable, sustainable operation, battery systems, Gaussian process resistance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Applications (stat.AP)

Abstract:Health monitoring, fault analysis, and detection are critical for the safe and sustainable operation of battery systems. We apply Gaussian process resistance models on lithium iron phosphate battery field data to effectively separate the time-dependent and operating point-dependent resistance. The data set contains 29 battery systems returned to the manufacturer for warranty, each with eight cells in series, totaling 232 cells and 131 million data rows. We develop probabilistic fault detection rules using recursive spatiotemporal Gaussian processes. These processes allow the quick processing of over a million data points, enabling advanced online monitoring and furthering the understanding of battery pack failure in the field. The analysis underlines that often, only a single cell shows abnormal behavior or a knee point, consistent with weakest-link failure for cells connected in series, amplified by local resistive heating. The results further the understanding of how batteries degrade and fail in the field and demonstrate the potential of efficient online monitoring based on data. We open-source the code and publish the large data set upon completion of the review of this article.

[LG-44] Zero-shot domain adaptation based on dual-level mix and contrast

Link: https://arxiv.org/abs/2406.18996
Authors: Yu Zhe, Jun Sakuma
Keywords: Zero-shot domain adaptation, learn domain-invariant features, Zero-shot domain, domain adaptation, task
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by IEEE conference on Artificial intelligence 2024

Abstract:Zero-shot domain adaptation (ZSDA) is a domain adaptation problem in the situation that labeled samples for a target task (task of interest) are only available from the source domain at training time, but for a task different from the task of interest (irrelevant task), labeled samples are available from both source and target domains. In this situation, classical domain adaptation techniques can only learn domain-invariant features in the irrelevant task. However, due to the difference in sample distribution between the two tasks, domain-invariant features learned in the irrelevant task are biased and not necessarily domain-invariant in the task of interest. To solve this problem, this paper proposes a new ZSDA method to learn domain-invariant features with low task bias. To this end, we propose (1) data augmentation with dual-level mixups in both task and domain to fill the absence of target task-of-interest data, (2) an extension of domain adversarial learning to learn domain-invariant features with less task bias, and (3) a new dual-level contrastive learning method that enhances domain-invariance and less task biasedness of features. Experimental results show that our proposal achieves good performance on several benchmarks.

[LG-45] FedMLP: Federated Multi-Label Medical Image Classification under Task Heterogeneity

Link: https://arxiv.org/abs/2406.18995
Authors: Zhaobin Sun (1), Nannan Wu (1), Junjie Shi (1), Li Yu (1), Xin Yang (1), Kwang-Ting Cheng (2), Zengqiang Yan (1) ((1) School of Electronic Information and Communications, Huazhong University of Science and Technology, (2) School of Engineering, Hong Kong University of Science and Technology)
Keywords: enables decentralized organizations, preserving data privacy, made significant progress, collaboratively train models, Cross-silo federated learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Early accepted by MICCAI 2024

Abstract:Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy and has made significant progress in medical image classification. One common assumption is task homogeneity where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, constrained by the level of medical knowledge and the prevalence of diseases, each institution may diagnose only partial categories, resulting in task heterogeneity. How to pursue effective multi-label medical image classification under task heterogeneity is under-explored. In this paper, we first formulate such a realistic label missing setting in the multi-label FL domain and propose a two-stage method FedMLP to combat class missing from two aspects: pseudo label tagging and global knowledge learning. The former utilizes a warmed-up model to generate class prototypes and select samples with high confidence to supplement missing labels, while the latter uses a global model as a teacher for consistency regularization to prevent forgetting missing class knowledge. Experiments on two publicly-available medical datasets validate the superiority of FedMLP against both state-of-the-art federated semi-supervised and noisy label learning approaches under task heterogeneity. Code is available at this https URL.

[LG-46] Semi-supervised Concept Bottleneck Models

Link: https://arxiv.org/abs/2406.18992
Authors: Lijie Hu, Tianhao Huang, Huanyi Xie, Chenyang Ren, Zhengyu Hu, Lu Yu, Di Wang
Keywords: garnered increasing attention, increasing attention due, provide concept-based explanations, black-box deep learning, achieving high final
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features - an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We proposed a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 20% labeled data, we achieved 93.19% (96.39% in a fully supervised setting) concept accuracy and 75.51% (79.82% in a fully supervised setting) prediction accuracy.

[LG-47] A Fast Learning-Based Surrogate of Electrical Machines using a Reduced Basis

Link: https://arxiv.org/abs/2406.18990
Authors: Alejandro Ribés, Nawfal Benchekroun, Théo Delagnes
Keywords: Partial Differential Equations, Differential Equations, Partial Differential, surrogate model approximates, solver of Partial
Categories: Machine Learning (cs.LG)

Abstract:A surrogate model approximates the outputs of a solver of Partial Differential Equations (PDEs) with a low computational cost. In this article, we propose a method to build learning-based surrogates in the context of parameterized PDEs, which are PDEs that depend on a set of parameters but are also temporal and spatial processes. Our contribution is a method hybridizing the Proper Orthogonal Decomposition and several Support Vector Regression machines. This method is conceived to work in real-time and is thus intended for use in the context of digital twins, where a user can perform an interactive analysis of results based on the proposed surrogate. We present promising results on two use cases concerning electrical machines. These use cases are not toy examples but are produced by an industrial computational code; they use meshes representing non-trivial geometries and contain non-linearities.

[LG-48] Alignment For Performance Improvement in Conversation Bots

Link: https://arxiv.org/abs/2406.18954
Authors: Raghav Garg, Kapil Sharma, Shrey Singla
Keywords: Identity Preference Optimization, achieve superior adherence, paper shows, achieve superior, predefined guidelines
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper shows that alignment methods can achieve superior adherence to guardrails compared to instruction fine-tuning alone in conversational agents, also known as bots, within predefined guidelines or ‘guardrails’. It examines traditional training approaches such as instruction fine-tuning and the recent advancements in direct alignment methods like Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO). The effectiveness of alignment techniques both pre and post-instruction tuning is highlighted, illustrating their potential to optimize conversational bots in domains that require strict adherence to specified rules, such as customer care.

[LG-49] Evaluating AI Group Fairness: a Fuzzy Logic Perspective

Link: https://arxiv.org/abs/2406.18939
Authors: Emmanouil Krasanakis, Symeon Papadopoulos
Keywords: Artificial intelligence systems, Artificial intelligence, address fairness concerns, genders or races, concerns by evaluating
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: preprint, 32 pages, 7 figures, 2 theorems, 6 appendices

Abstract:Artificial intelligence systems often address fairness concerns by evaluating and mitigating measures of group discrimination, for example that indicate biases against certain genders or races. However, what constitutes group fairness depends on who is asked and the social context, whereas definitions are often relaxed to accept small deviations from the statistical constraints they set out to impose. Here we decouple definitions of group fairness both from the context and from relaxation-related uncertainty by expressing them in the axiomatic system of Basic fuzzy Logic (BL) with loosely understood predicates, like encountering group members. We then evaluate the definitions in subclasses of BL, such as Product or Lukasiewicz logics. Evaluation produces continuous instead of binary truth values by choosing the logic subclass and truth values for predicates that reflect uncertain context-specific beliefs, such as stakeholder opinions gathered through questionnaires. Internally, it follows logic-specific rules to compute the truth values of definitions. We show that commonly held propositions standardize the resulting mathematical formulas and we transcribe logic and truth value choices to layperson terms, so that anyone can answer them. We also use our framework to study several literature definitions of algorithmic fairness, for which we rationalize previous expedient practices that are non-probabilistic and show how to re-interpret their formulas and parameters in new contexts.

[LG-50] Federated Graph Semantic and Structural Learning

Link: https://arxiv.org/abs/2406.18937
Authors: Wenke Huang, Guancheng Wan, Mang Ye, Bo Du
Keywords: Federated graph learning, learning collaboratively learns, identically distributed property, graph learning collaboratively, Federated graph
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Federated graph learning collaboratively learns a global graph neural network with distributed graphs, where the non-independent and identically distributed property is one of the major challenges. Most related work focuses on traditional distributed tasks such as images and voices and cannot handle graph structures. This paper first reveals that local client distortion is brought by both node-level semantics and graph-level structure. First, for node-level semantics, we find that contrasting nodes from distinct classes is beneficial to provide a well-performing discrimination. We pull the local node towards the global node of the same class and push it away from the global node of different classes. Second, we postulate that a well-structural graph neural network possesses similarity for neighbors due to the inherent adjacency relationships. However, aligning each node with adjacent nodes hinders discrimination due to the potential class inconsistency. We transform the adjacency relationships into the similarity distribution and leverage the global model to distill the relation knowledge into the local model, which preserves the structural information and discriminability of the local model. Empirical results on three graph datasets manifest the superiority of the proposed method over its counterparts.

[LG-51] Semi-adaptive Synergetic Two-way Pseudoinverse Learning System

Link: https://arxiv.org/abs/2406.18931
Authors: Binghong Liu, Ziqi Zhao, Shupan Li, Ke Wang
Keywords: Deep learning, crucial technology, technology for making, making breakthroughs, learning
Categories: Machine Learning (cs.LG)

Abstract:Deep learning has become a crucial technology for making breakthroughs in many fields. Nevertheless, it still faces two important challenges in theoretical and applied aspects. The first lies in the shortcomings of gradient descent based learning schemes, which are time-consuming and whose learning control hyperparameters are difficult to determine. Next, the architectural design of the model is usually tricky. In this paper, we propose a semi-adaptive synergetic two-way pseudoinverse learning system, wherein each subsystem encompasses forward learning, backward learning, and feature concatenation modules. The whole system is trained using a non-gradient descent learning algorithm. It simplifies the hyperparameter tuning while improving the training efficiency. The architecture of the subsystems is designed using a data-driven approach that enables automated determination of the depth of the subsystems. We compare our method with the baselines of mainstream non-gradient descent based methods and the results demonstrate the effectiveness of our proposed method. The source code for this paper is available at this http URL.

[LG-52] Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network

Link: https://arxiv.org/abs/2406.18928
Authors: Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet
Keywords: automatic speech recognition, ASR models, ASR, speech recognition, significant challenge
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted for publication at INTERSPEECH 2024

Abstract:In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper’s criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal effect on the Whisper model’s foundational performance, underscoring our method’s practicality and potential in enhancing ASR models in challenging acoustic environments.

[LG-53] Fine-tuned network relies on generic representation to solve unseen cognitive task

Link: https://arxiv.org/abs/2406.18926
Authors: Dongyan Lin
Keywords: shown promising results, generic pretrained representation, shown promising, wide range, Fine-tuning pretrained language
Categories: Machine Learning (cs.LG)

Abstract:Fine-tuning pretrained language models has shown promising results on a wide range of tasks, but when encountering a novel task, do they rely more on generic pretrained representation, or develop brand new task-specific solutions? Here, we fine-tuned GPT-2 on a context-dependent decision-making task, novel to the model but adapted from neuroscience literature. We compared its performance and internal mechanisms to a version of GPT-2 trained from scratch on the same task. Our results show that fine-tuned models depend heavily on pretrained representations, particularly in later layers, while models trained from scratch develop different, more task-specific mechanisms. These findings highlight the advantages and limitations of pretraining for task generalization and underscore the need for further investigation into the mechanisms underpinning task-specific fine-tuning in LLMs.

[LG-54] Learning Pareto Set for Multi-Objective Continuous Robot Control

Link: https://arxiv.org/abs/2406.18924
Authors: Tianye Shu, Ke Shang, Cheng Gong, Yang Nan, Hisao Ishibuchi
Keywords: multiple conflicting objectives, Pareto-optimal policies called, Pareto set, Pareto-optimal deep policies, called the Pareto
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the least training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.

[LG-55] Time Matters: Scaling Laws for Any Budget

Link: https://arxiv.org/abs/2406.18922
Authors: Itay Inbar, Luke Sernau
Keywords: primary cost driver, primary cost, cost driver, wall-clock training time, training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:A primary cost driver for training large models is wall-clock training time. We show that popular time estimates based on FLOPs are poor estimates, and construct a more accurate proxy based on memory copies. We show that with some simple accounting, we can estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law curve like Chinchilla, this lets us estimate the final loss of the model. We fit our estimate to real data with a linear regression, and apply the result to rewrite Chinchilla in terms of a model’s estimated training time as opposed to the amount of training data. This gives an expression for the loss in terms of the model’s hyperparameters alone. We show that this expression is accurate across a wide range of model hyperparameter values, enabling us to analytically make architectural decisions and train models more efficiently.
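For orientation, the Chinchilla-style scaling curve that this abstract builds on can be written as a small function. The sketch below uses the parametric form and fitted constants reported by Hoffmann et al. (2022); it does not reproduce this paper's time-based rewrite, whose coefficients are not given in the abstract. The function name `chinchilla_loss` is an assumption for illustration.

```python
def chinchilla_loss(
    n_params: float,
    n_tokens: float,
    e: float = 1.69,      # irreducible loss term
    a: float = 406.4,     # model-size coefficient
    b: float = 410.7,     # data-size coefficient
    alpha: float = 0.34,  # model-size exponent
    beta: float = 0.28,   # data-size exponent
) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta.

    Default constants are the fitted values reported in the Chinchilla paper
    (Hoffmann et al., 2022), used here only as an illustration.
    """
    return e + a / n_params**alpha + b / n_tokens**beta


# Example: compare two token budgets for a 1B-parameter model.
loss_small_budget = chinchilla_loss(1e9, 2e10)
loss_large_budget = chinchilla_loss(1e9, 4e10)
```

According to the abstract, the paper replaces the data term with an estimate of training time derived from the model's hyperparameters, so the loss becomes a function of the architecture alone.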

[LG-56] LearnedKV: Integrating LSM and Learned Index for Superior Performance on SSD

Link: https://arxiv.org/abs/2406.18892
Authors: Wenlong Wang, David Hung-Chang Du
Keywords: Log-Structured Merge, Learned Index, LSM tree, tiered Learned Index, store that seamlessly
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 17 pages, 13 figures

Abstract:In this paper, we introduce LearnedKV, a novel tiered key-value (KV) store that seamlessly integrates a Log-Structured Merge (LSM) tree with a Learned Index. This integration yields superior read and write performance compared to standalone indexing structures on SSDs. Our design capitalizes on the LSM tree’s high write/update throughput and the Learned Index’s fast read capabilities, enabling each component to leverage its strengths. We analyze the impact of size on LSM tree performance and demonstrate how the tiered Learned Index significantly mitigates the LSM tree’s size-related performance degradation, particularly by reducing the intensive I/O operations resulting from re-insertions after Garbage Collection (GC). To maintain rapid read performance for newly inserted keys, we introduce a non-blocking conversion mechanism that efficiently transforms the existing LSM tree into a new Learned Index with minimal overhead during GC. Our experimental results, conducted across diverse workloads, show that LearnedKV outperforms state-of-the-art solutions by up to 1.32x in read requests and 1.31x in write performance.

[LG-57] From Biased Selective Labels to Pseudo-Labels: An Expectation-Maximization Framework for Learning from Biased Decisions

Link: https://arxiv.org/abs/2406.18865
Authors: Trenton Chang, Jenna Wiens
Keywords: Selective labels occur, decision-making process, diagnoses that depend, disparate censorship, observations are subject
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 39 pages, 33 figures. ICML 2024 conference paper

Abstract:Selective labels occur when label observations are subject to a decision-making process; e.g., diagnoses that depend on the administration of laboratory tests. We study a clinically-inspired selective label problem called disparate censorship, where labeling biases vary across subgroups and unlabeled individuals are imputed as “negative” (i.e., no diagnostic test = no illness). Machine learning models naively trained on such labels could amplify labeling bias. Inspired by causal models of selective labels, we propose Disparate Censorship Expectation-Maximization (DCEM), an algorithm for learning in the presence of disparate censorship. We theoretically analyze how DCEM mitigates the effects of disparate censorship on model performance. We validate DCEM on synthetic data, showing that it improves bias mitigation (area between ROC curves) without sacrificing discriminative performance (AUC) compared to baselines. We achieve similar results in a sepsis classification task using clinical data.

[LG-58] Predicting the duration of traffic incidents for Sydney greater metropolitan area using machine learning methods

Link: https://arxiv.org/abs/2406.18861
Authors: Artur Grigorev, Sajjad Shafiei, Hanna Grzybowska, Adriana-Simona Mihaita
Keywords: Sydney Metropolitan Area, Metropolitan Area, Sydney Metropolitan, Gradient Boosted Decision, Boosted Decision Trees
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:This research presents a comprehensive approach to predicting the duration of traffic incidents and classifying them as short-term or long-term across the Sydney Metropolitan Area. Leveraging a dataset that encompasses detailed records of traffic incidents, road network characteristics, and socio-economic indicators, we train and evaluate a variety of advanced machine learning models including Gradient Boosted Decision Trees (GBDT), Random Forest, LightGBM, and XGBoost. The models are assessed using Root Mean Square Error (RMSE) for regression tasks and F1 score for classification tasks. Our experimental results demonstrate that XGBoost and LightGBM outperform conventional models with XGBoost achieving the lowest RMSE of 33.7 for predicting incident duration and highest classification F1 score of 0.62 for a 30-minute duration threshold. For classification, the 30-minute threshold balances performance with 70.84% short-term duration classification accuracy and 62.72% long-term duration classification accuracy. Feature importance analysis, employing both tree split counts and SHAP values, identifies the number of affected lanes, traffic volume, and types of primary and secondary vehicles as the most influential features. The proposed methodology not only achieves high predictive accuracy but also provides stakeholders with vital insights into factors contributing to incident durations. These insights enable more informed decision-making for traffic management and response strategies. The code is available at this https URL.
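
For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of how RMSE and a thresholded F1 score could be computed. The function names, the choice of "long-term" as the positive class, the strict `>` comparison at 30 minutes, and the sample durations are all assumptions made for illustration; none of them are taken from the paper or its code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted durations (minutes)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def f1_long_term(y_true, y_pred, threshold=30.0):
    """F1 score for the 'long-term' class (assumed here: duration above the threshold)."""
    true_long = [t > threshold for t in y_true]
    pred_long = [p > threshold for p in y_pred]
    tp = sum(t and p for t, p in zip(true_long, pred_long))
    fp = sum((not t) and p for t, p in zip(true_long, pred_long))
    fn = sum(t and (not p) for t, p in zip(true_long, pred_long))
    if tp == 0:
        return 0.0  # no true positives: define F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up incident durations in minutes, for illustration only.
observed = [12.0, 45.0, 28.0, 90.0]
predicted = [15.0, 40.0, 35.0, 60.0]

print("RMSE:", rmse(observed, predicted))
print("F1 (long-term, 30 min):", f1_long_term(observed, predicted))
```

The same regression output feeds both metrics here, which is one simple way to get a classifier from a duration model; the paper may well train separate classifiers.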

[LG-59] What Is Missing In Homophily? Disentangling Graph Homophily For Graph Neural Networks

Link: https://arxiv.org/abs/2406.18854
Authors: Yilun Zheng, Sitao Luan, Lihui Chen
Keywords: Graph Neural Networks, share similar characteristics, connected nodes tend, Graph homophily refers, effective Graph Neural
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Graph homophily refers to the phenomenon that connected nodes tend to share similar characteristics. Understanding this concept and its related metrics is crucial for designing effective Graph Neural Networks (GNNs). The most widely used homophily metrics, such as edge or node homophily, quantify such “similarity” as label consistency across the graph topology. These metrics are believed to be able to reflect the performance of GNNs, especially on node-level tasks. However, many recent studies have empirically demonstrated that the performance of GNNs does not always align with homophily metrics, and how homophily influences GNNs still remains unclear and controversial. Then, a crucial question arises: What is missing in our current understanding of homophily? To figure out the missing part, in this paper, we disentangle the graph homophily into 3 aspects: label, structural, and feature homophily, providing a more comprehensive understanding of GNN performance. To investigate their synergy, we propose a Contextual Stochastic Block Model with 3 types of Homophily (CSBM-3H), where the topology and feature generation are controlled by the 3 metrics. Based on the theoretical analysis of CSBM-3H, we derive a new composite metric, named Tri-Hom, that considers all 3 aspects and overcomes the limitations of conventional homophily metrics. The theoretical conclusions and the effectiveness of Tri-Hom have been verified through synthetic experiments on CSBM-3H. In addition, we conduct experiments on 31 real-world benchmark datasets and calculate the correlations between homophily metrics and model performance. Tri-Hom has significantly higher correlation values than 17 existing metrics that only focus on a single homophily aspect, demonstrating its superiority and the importance of homophily synergy. Our code is available at this https URL.
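
The abstract refers to edge and node homophily as the conventional label-based metrics. As a rough illustration, and not the Tri-Hom metric proposed in the paper, the sketch below computes both on a tiny undirected graph. The function names and the toy graph are assumptions for the example.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def node_homophily(edges, labels):
    """Mean, over nodes with at least one neighbour, of the share of same-label neighbours."""
    neighbours = {}
    for u, v in edges:  # treat each edge as undirected
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    ratios = [
        sum(labels[n] == labels[u] for n in nbrs) / len(nbrs)
        for u, nbrs in neighbours.items()
    ]
    return sum(ratios) / len(ratios)

# Toy 4-cycle with two classes; purely illustrative data.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "a", 1: "a", 2: "b", 3: "b"}

print("edge homophily:", edge_homophily(edges, labels))
print("node homophily:", node_homophily(edges, labels))
```

Both metrics look only at labels, which is the limitation the abstract points to: structural and feature similarity are invisible to them.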

[LG-60] Decoding-Time Language Model Alignment with Multiple Objectives

Link: https://arxiv.org/abs/2406.18853
Authors: Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Noah Smith, Hannaneh Hajishirzi, Simon Du
Keywords: Aligning language models, Aligning language, serve diverse user, critical pursuit, serve diverse
Categories: Machine Learning (cs.LG)

Abstract:Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions of all base models, for any given weightings over different objectives. We exploit a common form among a family of f-divergence regularized alignment approaches (such as PPO, DPO, and their variants) to identify a closed-form solution by Legendre transform, and derive an efficient decoding strategy. Theoretically, we show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method. Empirical results demonstrate the effectiveness of the algorithm. For example, compared to a parameter-merging baseline, MOD achieves 12.8% overall reward improvement when equally optimizing towards 3 objectives. Moreover, we experiment with MOD on combining three fully-finetuned LLMs of different model sizes, each aimed at different objectives such as safety, coding, and general user preference. Unlike traditional methods that require careful curation of a mixture of datasets to achieve comprehensive improvement, we can quickly experiment with preference weightings using MOD to find the best combination of models. Our best combination reduces toxicity on Toxigen to nearly 0% and achieves 7.9–33.3% improvement across the other three metrics (i.e., Codex@1, GSM-COT, BBH-COT).
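
To make the phrase "a linear combination of predictions of all base models" concrete, here is a small sketch under loud assumptions: the "predictions" are taken to be per-token logits, the combination is a plain weighted sum, and the next token is picked greedily. The exact rule in the paper depends on the chosen f-divergence and is not reproduced here; `combine_next_token` and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_next_token(logits_per_model, weights):
    """Weighted sum of per-model next-token logits, followed by a greedy pick."""
    vocab = len(logits_per_model[0])
    mixed = [
        sum(w * logits[i] for w, logits in zip(weights, logits_per_model))
        for i in range(vocab)
    ]
    probs = softmax(mixed)
    token = max(range(vocab), key=probs.__getitem__)
    return token, probs

# Two hypothetical base models over a 3-token vocabulary.
safety_logits = [2.0, 0.0, 1.0]
coding_logits = [0.0, 3.0, 1.0]

balanced, _ = combine_next_token([safety_logits, coding_logits], [0.5, 0.5])
safety_heavy, _ = combine_next_token([safety_logits, coding_logits], [0.9, 0.1])
print("balanced pick:", balanced, "| safety-heavy pick:", safety_heavy)
```

Changing only the weights changes the chosen token without retraining anything, which is the practical appeal of a decoding-time approach described in the abstract.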

[LG-61] LICO: Large Language Models for In-Context Molecular Optimization

Link: https://arxiv.org/abs/2406.18851
Authors: Tung Nguyen, Aditya Grover
Keywords: Optimizing black-box functions, Optimizing black-box, science and engineering, fundamental problem, Optimizing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)

Abstract:Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over 20 objective functions.

[LG-62] Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation

Link: https://arxiv.org/abs/2406.18848
Authors: Hui Wei, Maxwell A. Xu, Colin Samplawski, James M. Rehg, Santosh Kumar, Benjamin M. Marlin
Keywords: enable health researchers, sensors enable health, Wearable sensors enable, continuously collect data, collect data pertaining
Categories: Machine Learning (cs.LG)
Comments: Accepted by Conference on Health, Inference, and Learning (CHIL) 2024

Abstract:Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of factors. In this work, we study the problem of imputation of missing step count data, one of the most ubiquitous forms of wearable sensor data. We construct a novel and large scale data set consisting of a training set with over 3 million hourly step count observations and a test set with over 2.5 million hourly step count observations. We propose a domain knowledge-informed sparse self-attention model for this task that captures the temporal multi-scale nature of step-count data. We assess the performance of the model relative to baselines and conduct ablation studies to verify our specific model designs.

[LG-63] Learning Retrieval Augmentation for Personalized Dialogue Generation

Link: https://arxiv.org/abs/2406.18847
Authors: Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
Keywords: generating highly tailored, gained significant attention, persona dialogue generation, Personalized dialogue generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP-2023

Abstract:Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose Learning Retrieval Augmentation for Personalized DialOgue Generation (LAPDOG), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration at this https URL.

[LG-64] Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

Link: https://arxiv.org/abs/2406.18820
Authors: Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang
Keywords: Existing checkpointing approaches, Existing checkpointing, Universal Checkpointing, limitations make model, hardware limitations make
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques. 
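
The core idea in this abstract, saving in a distributed form and loading from a consolidated one, can be pictured with a toy example. The sketch below is not the universal checkpoint format or language from the paper; `consolidate`, `reshard`, and the contiguous near-equal split are hypothetical stand-ins, and real parallelism schemes shard tensors in more intricate ways.

```python
def consolidate(fragments):
    """Merge per-rank fragments of one flattened parameter into a single list."""
    merged = []
    for rank in sorted(fragments):  # rank order defines the layout
        merged.extend(fragments[rank])
    return merged

def reshard(values, num_ranks):
    """Split a consolidated parameter into near-equal contiguous fragments."""
    base, extra = divmod(len(values), num_ranks)
    shards, start = {}, 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        shards[rank] = values[start:start + size]
        start += size
    return shards

# A parameter saved by a 3-rank run, then resumed on 2 ranks.
saved = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}
full = consolidate(saved)
resumed = reshard(full, 2)
print(resumed)
```

Because the consolidated list does not depend on how many ranks wrote it, the resuming run is free to pick a different rank count, which is the flexibility the abstract emphasizes.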

[LG-65] MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation

Link: https://arxiv.org/abs/2406.18815
Authors: Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, Mohsen Imani
Keywords: escalating safety concerns, evidence investigation, violence alerting, context of escalating, escalating safety
Categories: Machine Learning (cs.LG)

Abstract:In the context of escalating safety concerns across various domains, the tasks of Video Anomaly Detection (VAD) and Video Anomaly Recognition (VAR) have emerged as critically important for applications in intelligent surveillance, evidence investigation, violence alerting, etc. These tasks, aimed at identifying and classifying deviations from normal behavior in video data, face significant challenges due to the rarity of anomalies which leads to extremely imbalanced data and the impracticality of extensive frame-level data annotation for supervised learning. This paper introduces a novel hierarchical graph neural network (GNN) based model MissionGNN that addresses these challenges by leveraging a state-of-the-art large language model and a comprehensive knowledge graph for efficient weakly supervised learning in VAR. Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models and enabling fully frame-level training without fixed video segmentation. Utilizing automated, mission-specific knowledge graph generation, our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches. Experimental validation on benchmark datasets demonstrates our model’s performance in VAD and VAR, highlighting its potential to redefine the landscape of anomaly detection and recognition in video surveillance systems.

[LG-66] Online Stackelberg Optimization via Nonlinear Control

Link: https://arxiv.org/abs/2406.18805
Authors: William Brown, Christos Papadimitriou, Tim Roughgarden
Keywords: objective often requires, requires anticipating, anticipating and optimizing, repeated interaction problems
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: COLT 2024

Abstract:In repeated interaction problems with adaptive agents, our objective often requires anticipating and optimizing over the space of possible agent responses. We show that many problems of this form can be cast as instances of online (nonlinear) control which satisfy local controllability, with convex losses over a bounded state space which encodes agent behavior, and we introduce a unified algorithmic framework for tractable regret minimization in such cases. When the instance dynamics are known but otherwise arbitrary, we obtain oracle-efficient O(\sqrt{T}) regret by reduction to online convex optimization, which can be made computationally efficient if dynamics are locally action-linear. In the presence of adversarial disturbances to the state, we give tight bounds in terms of either the cumulative or per-round disturbance magnitude (for strongly or weakly locally controllable dynamics, respectively). Additionally, we give sublinear regret results for the cases of unknown locally action-linear dynamics as well as for the bandit feedback setting. Finally, we demonstrate applications of our framework to well-studied problems including performative prediction, recommendations for adaptive agents, adaptive pricing of real-valued goods, and repeated gameplay against no-regret learners, directly yielding extensions beyond prior results in each case.

[LG-67] All Random Features Representations are Equivalent

Link: https://arxiv.org/abs/2406.18802
Authors: Luke Sernau, Silvano Bonacina, Rif A. Saurous
Keywords: infinite-dimensional dot products, rewrite positive-definite kernels, random feature representations, dot products, Random features
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Random features are an important technique that make it possible to rewrite positive-definite kernels as infinite-dimensional dot products. Over time, increasingly elaborate random feature representations have been developed in pursuit of finite approximations with ever lower error. We resolve this arms race by deriving an optimal sampling policy, and show that under this policy all random features representations have the same approximation error. This establishes a lower bound that holds across all random feature representations, and shows that we are free to choose whatever representation we please, provided we sample optimally.
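
As background for this abstract, the sketch below shows one classic random feature representation: random Fourier features for the RBF kernel, with frequencies drawn from a Gaussian. It illustrates the general technique of replacing a kernel with a finite dot product and does not implement the optimal sampling policy derived in the paper. The function names, `gamma`, the feature count, and the test points are assumptions for the example.

```python
import math
import random

def rbf_kernel(x, y, gamma=0.5):
    """Exact kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def make_feature_map(dim, num_features, gamma=0.5, seed=0):
    """Sample frequencies w ~ N(0, 2*gamma*I) and phases b ~ U[0, 2*pi)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)
    ws = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    norm = math.sqrt(2.0 / num_features)

    def phi(x):
        # Each feature is a scaled cosine of a random projection of x.
        return [norm * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]
    return phi

phi = make_feature_map(dim=2, num_features=2000)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))  # finite dot product
exact = rbf_kernel(x, y)
print("exact:", exact, "| random-feature estimate:", approx)
```

The estimate is a Monte Carlo average, so its error shrinks roughly with the square root of the number of features; the abstract's claim concerns how the sampling distribution, rather than the representation, governs that error.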

[LG-68] Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think

Link: https://arxiv.org/abs/2406.18800
Authors: Luke Sernau
Keywords: Neural Tangent Kernels, Common infinite-width architectures, Tangent Kernels, Neural Tangent, Common infinite-width
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Common infinite-width architectures such as Neural Tangent Kernels (NTKs) have historically shown weak performance compared to finite models. This has been attributed to the absence of feature learning. We show that this is not the case. In fact, we show that infinite width NTK models are able to access richer features than finite models by selecting relevant subfeatures from their (infinite) feature vector. In fact, we show experimentally that NTKs under-perform traditional finite models even when feature learning is artificially disabled. Instead, weak performance is due to the fact that existing constructions depend on weak optimizers like SGD. We provide an infinite width limit based on ADAM-like learning dynamics and demonstrate empirically that the resulting models erase this performance gap.

[LG-69] Operator Learning of Lipschitz Operators: An Information-Theoretic Perspective

Link: https://arxiv.org/abs/2406.18794
Authors: Samuel Lanthaler
Keywords: infinite-dimensional Banach spaces, Banach spaces, infinite-dimensional Banach, Operator learning based, mapping between infinite-dimensional
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Operator learning based on neural operators has emerged as a promising paradigm for the data-driven approximation of operators, mapping between infinite-dimensional Banach spaces. Despite significant empirical progress, our theoretical understanding regarding the efficiency of these approximations remains incomplete. This work addresses the parametric complexity of neural operator approximations for the general class of Lipschitz continuous operators. Motivated by recent findings on the limitations of specific architectures, termed curse of parametric complexity, we here adopt an information-theoretic perspective. Our main contribution establishes lower bounds on the metric entropy of Lipschitz operators in two approximation settings: uniform approximation over a compact set of input functions, and approximation in expectation, with input functions drawn from a probability measure. It is shown that these entropy bounds imply that, regardless of the activation function used, neural operator architectures attaining an approximation accuracy \epsilon must have a size that is exponentially large in \epsilon^{-1}. The size of architectures is here measured by counting the number of encoded bits necessary to store the given model in computational memory. The results of this work elucidate fundamental trade-offs and limitations in

[LG-70] Unified Uncertainties: Combining Input Data and Model Uncertainty into a Single Formulation

Link: https://arxiv.org/abs/2406.18787
Authors: Matias Valdenegro-Toro, Ivo Pascal de Jong, Marco Zullich
Keywords: Machine Learning models, Machine Learning, Modelling uncertainty, Learning models, uncertainty
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 4 pages, 3 figures, with appendix. LatinX in AI Research Workshop @ ICML 2024 Camera Ready

Abstract:Modelling uncertainty in Machine Learning models is essential for achieving safe and reliable predictions. Most research on uncertainty focuses on output uncertainty (predictions), but minimal attention is paid to uncertainty at inputs. We propose a method for propagating uncertainty in the inputs through a Neural Network that is simultaneously able to estimate input, data, and model uncertainty. Our results show that this propagation of input uncertainty results in a more stable decision boundary even under large amounts of input noise than comparatively simple Monte Carlo sampling. Additionally, we discuss and demonstrate that input uncertainty, when propagated through the model, results in model uncertainty at the outputs. The explicit incorporation of input uncertainty may be beneficial in situations where the amount of input uncertainty is known, though good datasets for this are still needed.

[LG-71] Psychological Profiling in Cybersecurity: A Look at LLMs and Psycholinguistic Features

Link: https://arxiv.org/abs/2406.18783
Authors: Jean Marie Tshimula, D’Jeff K. Nkashama, Jean Tshibangu Muabila, René Manassé Galekwa, Hugues Kanda, Maximilien V. Dialufuma, Mbuyi Mukendi Didier, Kalala Kalonji, Serge Mundele, Patience Kinshie Lenye, Tighana Wenge Basele, Aristarque Ilunga, Christian N. Mayemba, Nathanaël M. Kasoro, Selain K. Kasereka, Hardy Mikese, Pierre-Martin Tardif, Marc Frappier, Froduald Kabanza, Belkacem Chikhaoui, Shengrui Wang, Ali Mulenda Sumbu, Xavier Ndona, Raoul Kienge-Kienge Intudi
Keywords: necessitates innovative approaches, Large Language Models, cyber threats necessitates, threats necessitates innovative, increasing sophistication
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The increasing sophistication of cyber threats necessitates innovative approaches to cybersecurity. In this paper, we explore the potential of psychological profiling techniques, particularly focusing on the utilization of Large Language Models (LLMs) and psycholinguistic features. We investigate the intersection of psychology and cybersecurity, discussing how LLMs can be employed to analyze textual data for identifying psychological traits of threat actors. We explore the incorporation of psycholinguistic features, such as linguistic patterns and emotional cues, into cybersecurity frameworks. Our research underscores the importance of integrating psychological perspectives into cybersecurity practices to bolster defense mechanisms against evolving threats.

[LG-72] Aligning Model Properties via Conformal Risk Control

Link: https://arxiv.org/abs/2406.18777
Authors: William Overman, Jacqueline Jil Vallon, Mohsen Bayati
Keywords: meet end-user requirements, excellent test set, test set metrics, end-user requirements, crucial due
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:AI model alignment is crucial due to inadvertent biases in training data and the underspecified pipeline in modern machine learning, where numerous models with excellent test set metrics can be produced, yet they may not meet end-user requirements. Recent advances demonstrate that post-training model alignment via human feedback can address some of these challenges. However, these methods are often confined to settings (such as generative AI) where humans can interpret model outputs and provide feedback. In traditional non-generative settings, where model outputs are numerical values or classes, detecting misalignment through single-sample outputs is highly challenging. In this paper we consider an alternative strategy. We propose interpreting model alignment through property testing, defining an aligned model f as one belonging to a subset \mathcal{P} of functions that exhibit specific desired behaviors. We focus on post-processing a pre-trained model f to better align with \mathcal{P} using conformal risk control. Specifically, we develop a general procedure for converting queries for a given property \mathcal{P} to a collection of loss functions suitable for use in a conformal risk control algorithm. We prove a probabilistic guarantee that the resulting conformal interval around f contains a function approximately satisfying \mathcal{P}. Given the capabilities of modern AI models with extensive parameters and training data, one might assume alignment issues will resolve naturally. However, increasing training data or parameters in a random feature model doesn’t eliminate the need for alignment techniques when pre-training data is biased. We demonstrate our alignment methodology on supervised learning datasets for properties like monotonicity and concavity. Our flexible procedure can be applied to various desired properties.

[LG-73] ADO-LLM: Analog Design Bayesian Optimization with In-Context Learning of Large Language Models

Link: https://arxiv.org/abs/2406.18770
Authors: Yuxuan Yin, Yu Wang, Boxun Xu, Peng Li
Keywords: requires substantial human, substantial human expertise, design requires substantial, design, circuit design requires
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures

Abstract:Analog circuit design requires substantial human expertise and involvement, which is a significant roadblock to design productivity. Bayesian Optimization (BO), a popular machine learning based optimization strategy, has been leveraged to automate analog design given its applicability across various circuit topologies and technologies. Traditional BO methods employ black box Gaussian Process surrogate models and optimized labeled data queries to find optimization solutions by trading off between exploration and exploitation. However, the search for the optimal design solution in BO can be expensive from both a computational and data usage point of view, particularly for high dimensional optimization problems. This paper presents ADO-LLM, the first work integrating large language models (LLMs) with Bayesian Optimization for analog design optimization. ADO-LLM leverages the LLM’s ability to infuse domain knowledge to rapidly generate viable design points to remedy BO’s inefficiency in finding high value design areas specifically under the limited design space coverage of the BO’s probabilistic surrogate model. In the meantime, sampling of design points evaluated in the iterative BO process provides quality demonstrations for the LLM to generate high quality design points while leveraging infused broad design knowledge. Furthermore, the diversity brought by BO’s exploration enriches the contextual understanding of the LLM and allows it to more broadly search in the design space and prevent repetitive and redundant suggestions. We evaluate the proposed framework on two different types of analog circuits and demonstrate notable improvements in design efficiency and effectiveness.

[LG-74] WV-Net: A foundation model for SAR WV-mode satellite imagery trained using contrastive self-supervised learning on 10 million images

Link: https://arxiv.org/abs/2406.18765
Authors: Yannik Glaser, Justin E. Stopa, Linnea M. Wolniewicz, Ralph Foster, Doug Vandemark, Alexis Mouche, Bertrand Chapron, Peter Sadowski
Keywords: Space Agency Copernicus, European Space Agency, C-band synthetic aperture, Agency Copernicus, European Space
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 9 figures, submitted to NeurIPS 2024

Abstract:The European Space Agency’s Copernicus Sentinel-1 (S-1) mission is a constellation of C-band synthetic aperture radar (SAR) satellites that provide unprecedented monitoring of the world’s oceans. S-1’s wave mode (WV) captures 20x20 km image patches at 5 m pixel resolution and is unaffected by cloud cover or time-of-day. The mission’s open data policy has made SAR data easily accessible for a range of applications, but the need for manual image annotations is a bottleneck that hinders the use of machine learning methods. This study uses nearly 10 million WV-mode images and contrastive self-supervised learning to train a semantic embedding model called WV-Net. In multiple downstream tasks, WV-Net outperforms a comparable model that was pre-trained on natural images (ImageNet) with supervised learning. Experiments show improvements for estimating wave height (0.50 vs 0.60 RMSE using linear probing), estimating near-surface air temperature (0.90 vs 0.97 RMSE), and performing multilabel-classification of geophysical and atmospheric phenomena (0.96 vs 0.95 micro-averaged AUROC). WV-Net embeddings are also superior in an unsupervised image-retrieval task and scale better in data-sparse settings. Together, these results demonstrate that WV-Net embeddings can support geophysical research by providing a convenient foundation model for a variety of data analysis and exploration tasks.

[LG-75] Conformalized Link Prediction on Graph Neural Networks

Link: https://arxiv.org/abs/2406.18763
Authors: Tianyi Zhao, Jian Kang, Lu Cheng
Keywords: Graph Neural Networks, Neural Networks, excel in diverse, Graph Neural, high-stakes domains
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Neural Networks (GNNs) excel in diverse tasks, yet their applications in high-stakes domains are often hampered by unreliable predictions. Although numerous uncertainty quantification methods have been proposed to address this limitation, they often lack rigorous uncertainty estimates. This work makes the first attempt to introduce a distribution-free and model-agnostic uncertainty quantification approach to construct a predictive interval with a statistical guarantee for GNN-based link prediction. We term it conformalized link prediction. Our approach builds upon conformal prediction (CP), a framework that promises to construct statistically robust prediction sets or intervals. We first theoretically and empirically establish a permutation invariance condition for the application of CP in link prediction tasks, along with an exact test-time coverage. Leveraging the important structural information in graphs, we then identify a novel and crucial connection between a graph’s adherence to the power law distribution and the efficiency of CP. This insight leads to the development of a simple yet effective sampling-based method to align the graph structure with a power law distribution prior to the standard CP procedure. Extensive experiments demonstrate that for conformalized link prediction, our approach achieves the desired marginal coverage while significantly improving the efficiency of CP compared to baseline methods.
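
As a rough illustration of the conformal prediction (CP) framework this abstract builds on, the sketch below implements generic split conformal prediction for a scalar prediction. It is not the paper's link-prediction procedure and does not include its sampling-based power-law alignment step. The function name `split_conformal_interval`, the absolute-residual nonconformity score, and the toy numbers in the usage note are assumptions made for illustration.

```python
import math


def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Return a (lo, hi) interval around test_pred.

    Assumption: nonconformity is the absolute residual |prediction - label|
    measured on a held-out calibration set that is exchangeable with the
    test point. Under that assumption the interval has marginal coverage
    of at least 1 - alpha.
    """
    scores = sorted(abs(p - y) for p, y in zip(cal_preds, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (test_pred - q, test_pred + q)
```

For example, with calibration residuals of 1, 2, and 3 and `alpha=0.5`, the corrected rank is 2, so the interval around a test prediction of 10 is (8, 12). A smaller calibration quantile means a tighter interval, which is the sense of "efficiency" used in the abstract.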

[LG-76] The Impact of Feature Representation on the Accuracy of Photonic Neural Networks

Link: https://arxiv.org/abs/2406.18757
Authors: Mauricio Gomes de Queiroz, Paul Jimenez, Raphael Cardoso, Mateus Vidaletti da Costa, Mohab Abdalla, Ian O’Connor, Alberto Bosio, Fabio Pavanello
Keywords: Photonic Neural Networks, gaining significant interest, research community due, Photonic Neural, Neural Networks
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Photonic Neural Networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network’s handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical framework for analyzing feature combination. Through this framework, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network’s ability to learn from the data. Given some prior knowledge of the data, however, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3% improvement in accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding to the accuracy and decision-making strategies of PNNs, particularly in size or power constrained applications.

[LG-77] Competitive Algorithms for Online Knapsack with Succinct Predictions

Link: https://arxiv.org/abs/2406.18752
Authors: Mohammadreza Daneshvaramoli, Helia Karisani, Adam Lechowicz, Bo Sun, Cameron Musco, Mohammad Hajiesmaili
Keywords: pack items arriving, items arriving online, online knapsack problem, online knapsack, online
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: 29 pages, 10 figures, Submitted to NeurIPS 2024

Abstract:In the online knapsack problem, the goal is to pack items arriving online with different values and weights into a capacity-limited knapsack to maximize the total value of the accepted items. We study learning-augmented algorithms for this problem, which aim to use machine-learned predictions to move beyond pessimistic worst-case guarantees. Existing learning-augmented algorithms for online knapsack consider relatively complicated prediction models that give an algorithm substantial information about the input, such as the total weight of items at each value. In practice, such predictions can be error-sensitive and difficult to learn. Motivated by this limitation, we introduce a family of learning-augmented algorithms for online knapsack that use succinct predictions. In particular, the machine-learned prediction given to the algorithm is just a single value or interval that estimates the minimum value of any item accepted by an offline optimal solution. By leveraging a relaxation to online fractional knapsack, we design algorithms that can leverage such succinct predictions in both the trusted setting (i.e., with perfect prediction) and the untrusted setting, where we prove that a simple meta-algorithm achieves a nearly optimal consistency-robustness trade-off. Empirically, we show that our algorithms significantly outperform baselines that do not use predictions and often outperform algorithms based on more complex prediction models.
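
To make the idea of a succinct prediction concrete, the sketch below shows a naive rule that fully trusts a single predicted threshold. It is not the algorithm from the paper, which also handles untrusted predictions and uses a fractional relaxation. The function name `threshold_knapsack`, the reading of "value" as value per unit weight, and the example items are assumptions for illustration only.

```python
def threshold_knapsack(items, capacity, predicted_min_value):
    """Naive online rule that trusts the prediction completely.

    Each arriving item is a (value, weight) pair. The item is accepted,
    irrevocably, if its value per unit weight is at least the predicted
    threshold and it still fits in the remaining capacity.
    """
    accepted, used, total = [], 0.0, 0.0
    for value, weight in items:  # items are seen one at a time, in order
        density = value / weight
        if density >= predicted_min_value and used + weight <= capacity:
            accepted.append((value, weight))
            used += weight
            total += value
    return accepted, total
```

With items `[(1, 1), (6, 2), (2, 2), (9, 3)]` and capacity 5, a threshold of 3 keeps only the two dense items for a total value of 15, whereas a threshold of 0 (accept anything that fits) fills the knapsack early and ends with a total of 9.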

[LG-78] A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

Link: https://arxiv.org/abs/2406.18747
Authors: Karn N. Watcharasupat, Alexander Lerch
Keywords: significant recent progress, source separation, four-stem vocals, audio source separation, significant recent
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

Abstract:Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at this https URL.

[LG-79] QBI: Quantile-based Bias Initialization for Efficient Private Data Reconstruction in Federated Learning

Link: https://arxiv.org/abs/2406.18745
Authors: Micha V. Nowak, Tim P. Bott, David Khachaturov, Frank Puppe, Adrian Krenzer, Amar Hekalo
Keywords: compromising user privacy, shared model updates, machine learning models, user privacy, model updates
Categories: Machine Learning (cs.LG)

Abstract:Federated learning enables the training of machine learning models on distributed data without compromising user privacy, as data remains on personal devices and only model updates, such as gradients, are shared with a central coordinator. However, recent research has shown that the central entity can perfectly reconstruct private data from shared model updates by maliciously initializing the model’s parameters. In this paper, we propose QBI, a novel bias initialization method that significantly enhances reconstruction capabilities. This is accomplished by directly solving for bias values yielding sparse activation patterns. Further, we propose PAIRS, an algorithm that builds on QBI. PAIRS can be deployed when a separate dataset from the target domain is available to further increase the percentage of data that can be fully recovered. Measured by the percentage of samples that can be perfectly reconstructed from batches of various sizes, our approach achieves significant improvements over previous methods with gains of up to 50% on ImageNet and up to 60% on the IMDB sentiment analysis text dataset. Furthermore, we establish theoretical limits for attacks leveraging stochastic gradient sparsity, providing a foundation for understanding the fundamental constraints of these attacks. We empirically assess these limits using synthetic datasets. Finally, we propose and evaluate AGGP, a defensive framework designed to prevent gradient sparsity attacks, contributing to the development of more secure and private federated learning systems.

[LG-80] Decentralized Semantic Traffic Control in AVs Using RL and DQN for Dynamic Roadblocks

Link: https://arxiv.org/abs/2406.18741
Authors: Emanuel Figetakis, Yahuza Bello, Ahmed Refaey, Abdallah Shami
Keywords: execute intelligent maneuvers, capturing essential vehicle, essential vehicle dynamics, Autonomous Vehicles, furnished with sensors
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Abstract:Autonomous Vehicles (AVs), furnished with sensors capable of capturing essential vehicle dynamics such as speed, acceleration, and precise location, possess the capacity to execute intelligent maneuvers, including lane changes, in anticipation of approaching roadblocks. Nevertheless, the sheer volume of sensory data and the processing necessary to derive informed decisions can often overwhelm the vehicles, rendering them unable to handle the task independently. Consequently, a common approach in traffic scenarios involves transmitting the data to servers for processing, a practice that introduces challenges, particularly in situations demanding real-time processing. In response to this challenge, we present a novel DL-based semantic traffic control system that entrusts semantic encoding responsibilities to the vehicles themselves. This system processes driving decisions obtained from a Reinforcement Learning (RL) agent, streamlining the decision-making process. Specifically, our framework envisions scenarios where abrupt roadblocks materialize due to factors such as road maintenance, accidents, or vehicle repairs, necessitating vehicles to make determinations concerning lane-keeping or lane-changing actions to navigate past these obstacles. To formulate this scenario mathematically, we employ a Markov Decision Process (MDP) and harness the Deep Q Learning (DQN) algorithm to unearth viable solutions.
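
The MDP formulation mentioned at the end of this abstract can be illustrated with the tabular Q-learning update, which is the rule that DQN approximates with a neural network. The sketch below is a toy, not the paper's system: the state names, the action names, the reward value, and the function name `q_update` are all hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step.

    Moves Q(state, action) toward reward + gamma * max over a' of
    Q(next_state, a') and returns the updated value.
    """
    best_next = max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Hypothetical two-state, two-action lane-change problem.
ACTIONS = ("keep_lane", "change_lane")
q_table = {s: {a: 0.0 for a in ACTIONS} for s in ("clear", "roadblock_ahead")}
```

Starting from an all-zero table, rewarding `change_lane` in the `roadblock_ahead` state with 1.0 raises that entry to 0.5. A later zero-reward `keep_lane` step in `clear` that leads to `roadblock_ahead` then picks up a discounted share of that value, which is how the anticipated benefit of changing lanes propagates backward.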

[LG-81] RetroGFN: Diverse and Feasible Retrosynthesis using GFlowNets

Link: https://arxiv.org/abs/2406.18739
Authors: Piotr Gaiński, Michał Koziarski, Krzysztof Maziarz, Marwin Segler, Jacek Tabor, Marek Śmieja
Keywords: Single-step retrosynthesis aims, target molecule, molecular discovery, aims to predict, crucial task
Categories: Machine Learning (cs.LG)

Abstract:Single-step retrosynthesis aims to predict a set of reactions that lead to the creation of a target molecule, which is a crucial task in molecular discovery. Although a target molecule can often be synthesized with multiple different reactions, it is not clear how to verify the feasibility of a reaction, because the available datasets cover only a tiny fraction of the possible solutions. Consequently, the existing models are not encouraged to explore the space of possible reactions sufficiently. In this paper, we propose a novel single-step retrosynthesis model, RetroGFN, that can explore outside the limited dataset and return a diverse set of feasible reactions by leveraging a feasibility proxy model during the training. We show that RetroGFN achieves competitive results on standard top-k accuracy while outperforming existing methods on round-trip accuracy. Moreover, we provide empirical arguments in favor of using round-trip accuracy which expands the notion of feasibility with respect to the standard top-k accuracy metric.

[LG-82] Data-driven identification of port-Hamiltonian DAE systems by Gaussian processes

Link: https://arxiv.org/abs/2406.18726
Authors: Peter Zaspel, Michael Günther
Keywords: modeling of dynamical, pHS, structure-preserving modeling, Gaussian processes, Abstract
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Port-Hamiltonian systems (pHS) allow for a structure-preserving modeling of dynamical systems. Coupling pHS via linear relations between input and output defines an overall pHS, which is structure preserving. However, in multiphysics applications, some subsystems do not allow for a physical pHS description, as (a) this is not available or (b) too expensive. Here, data-driven approaches can be used to deliver a pHS for such subsystems, which can then be coupled to the other subsystems in a structure-preserving way. In this work, we derive a data-driven identification approach for port-Hamiltonian differential algebraic equation (DAE) systems. The approach uses input and state space data to estimate nonlinear effort functions of pH-DAEs. As the underlying technique, we use (multi-task) Gaussian processes. This work thereby extends the current state of the art, in which only port-Hamiltonian ordinary differential equation systems could be identified via Gaussian processes. We apply this approach successfully to two applications from network design and constrained multibody system dynamics, based on pH-DAE systems of index one and three, respectively.

[LG-83] Jailbreaking LLMs with Arabic Transliteration and Arabizi

Link: https://arxiv.org/abs/2406.18725
Authors: Mansour Al Ghanim, Saleh Almohaimeed, Mengxin Zheng, Yan Solihin, Qian Lou
Keywords: Large Language Models, vulnerabilities of Large, Large Language, Arabic language, specifically focusing
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 4 figures

Abstract:This study identifies the potential vulnerabilities of Large Language Models (LLMs) to ‘jailbreak’ attacks, specifically focusing on the Arabic language and its various forms. While most research has concentrated on English-based prompt manipulation, our investigation broadens the scope to investigate the Arabic language. We initially tested the AdvBench benchmark in Standardized Arabic, finding that even with prompt manipulation techniques like prefix injection, it was insufficient to provoke LLMs into generating unsafe content. However, when using Arabic transliteration and chatspeak (or arabizi), we found that unsafe content could be produced on platforms like OpenAI GPT-4 and Anthropic Claude 3 Sonnet. Our findings suggest that using Arabic and its various forms could expose information that might remain hidden, potentially increasing the risk of jailbreak attacks. We hypothesize that this exposure could be due to the model’s learned connection to specific words, highlighting the need for more comprehensive safety training across all language forms.

[LG-84] Learn it or Leave it: Module Composition and Pruning for Continual Learning

Link: https://arxiv.org/abs/2406.18708
Authors: Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen, Hinrich Schütze
Keywords: real-world environments, machine learning models, continual learning, essential for machine, learning
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facilitating knowledge transfer, and maintaining parameter efficiency. In this paper, we introduce MoCL-P, a novel lightweight continual learning method that addresses these challenges simultaneously. Unlike traditional approaches that continuously expand parameters for newly arriving tasks, MoCL-P integrates task representation-guided module composition with adaptive pruning, effectively balancing knowledge integration and computational overhead. Our evaluation across three continual learning benchmarks with up to 176 tasks shows that MoCL-P achieves state-of-the-art performance and improves parameter efficiency by up to three times, demonstrating its potential for practical applications where resource requirements are constrained.

[LG-85] Fast Optimizer Benchmark

Link: https://arxiv.org/abs/2406.18701
Authors: Simon Blauth, Tobias Bürger, Zacharias Häringer, Jörg Franke, Frank Hutter
Keywords: Fast Optimizer Benchmark, present the Fast, evaluating deep learning, Fast Optimizer, designed for evaluating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages + 12 appendix pages, submitted to AutoML Conf 2024 Workshop Track

Abstract:In this paper, we present the Fast Optimizer Benchmark (FOB), a tool designed for evaluating deep learning optimizers during their development. The benchmark supports tasks from multiple domains such as computer vision, natural language processing, and graph learning. The focus is on convenient usage, featuring human-readable YAML configurations, SLURM integration, and plotting utilities. FOB can be used together with existing hyperparameter optimization (HPO) tools as it handles training and resuming of runs. The modular design enables integration into custom pipelines, using it simply as a collection of tasks. We showcase an optimizer comparison as a usage example of our tool. FOB can be found on GitHub: this https URL.

[LG-86] Learning to Correct for QA Reasoning with Black-box LLMs

Link: https://arxiv.org/abs/2406.18695
Authors: Jaehyung Kim, Dongyoung Kim, Yiming Yang
Keywords: output token probabilities, recent machine learning, large language models, token probabilities, open challenge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint, 18 pages

Abstract:An open challenge in recent machine learning is about how to improve the reasoning capability of large language models (LLMs) in a black-box setting, i.e., without access to detailed information such as output token probabilities. Existing approaches either rely on accessibility (which is often unrealistic) or involve significantly increased train- and inference-time costs. This paper addresses those limitations or shortcomings by proposing a novel approach, namely CoBB (Correct for improving QA reasoning of Black-Box LLMs). It uses a trained adaptation model to perform a seq2seq mapping from the often-imperfect reasonings of the original black-box LLM to the correct or improved reasonings. Specifically, the adaptation model is initialized with a relatively small open-source LLM and adapted over a collection of sub-sampled training pairs. To select the representative pairs of correct and incorrect reasonings, we formulated the dataset construction as an optimization problem that minimizes the statistical divergence between the sampled subset and the entire collection, and solved it via a genetic algorithm. We then train the adaptation model over the sampled pairs by contrasting the likelihoods of correct and incorrect reasonings. Our experimental results demonstrate that CoBB significantly improves reasoning accuracy across various QA benchmarks, compared to the best-performing adaptation baselines.

[LG-87] Petal-X: Human-Centered Visual Explanations to Improve Cardiovascular Risk Communication

Link: https://arxiv.org/abs/2406.18690
Authors: Diego Rojo, Houda Lamqaddam, Lucija Gosak, Katrien Verbert
Keywords: Cardiovascular diseases, CVD risk, death worldwide, behavioral interventions, risk
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Cardiovascular diseases (CVDs), the leading cause of death worldwide, can be prevented in most cases through behavioral interventions. Therefore, effective communication of CVD risk and projected risk reduction by risk factor modification plays a crucial role in reducing CVD risk at the individual level. However, despite interest in refining risk estimation with improved prediction models such as SCORE2, the guidelines for presenting these risk estimations in clinical practice remained essentially unchanged in the last few years, with graphical score charts (GSCs) continuing to be one of the prevalent systems. This work describes the design and implementation of Petal-X, a novel tool to support clinician-patient shared decision-making by explaining the CVD risk contributions of different factors and facilitating what-if analysis. Petal-X relies on a novel visualization, Petal Product Plots, and a tailor-made global surrogate model of SCORE2, whose fidelity is comparable to that of the GSCs used in clinical practice. We evaluated Petal-X compared to GSCs in a controlled experiment with 88 healthcare students, all but one with experience with chronic patients. The results show that Petal-X outperforms GSC in critical tasks, such as comparing the contribution to the patient’s 10-year CVD risk of each modifiable risk factor, without a significant loss of perceived transparency, trust, or intent to use. Our study provides an innovative approach to the visualization and explanation of risk in clinical practice that, due to its model-agnostic nature, could continue to support next-generation artificial intelligence risk assessment models.

[LG-88] The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm

Link: https://arxiv.org/abs/2406.18682
Authors: Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker
Keywords: key concern, implicit question, alignment, languages, Abstract
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:A key concern with the concept of “alignment” is the implicit question of “alignment to what?”. AI systems are increasingly used across the world, yet safety alignment is often focused on homogeneous monolingual settings. Additionally, preference training and safety measures often overfit to harms common in Western-centric datasets. Here, we explore the viability of different alignment approaches when balancing dual objectives: addressing and optimizing for a non-homogeneous set of languages and cultural preferences while minimizing both global and local harms. We collect the first set of human annotated red-teaming prompts in different languages distinguishing between global and local harm, which serve as a laboratory for understanding the reliability of alignment techniques when faced with preference distributions that are non-stationary across geographies and languages. While this setting is seldom covered by the literature to date, which primarily centers on English harm mitigation, it captures real-world interactions with AI systems around the world. We establish a new precedent for state-of-the-art alignment techniques across 6 languages with minimal degradation in general performance. Our work provides important insights into cross-lingual transfer and novel optimization approaches to safeguard AI systems designed to serve global populations.

[LG-89] Few-shot Personalization of LLMs with Mis-aligned Responses

Link: https://arxiv.org/abs/2406.18678
Authors: Jaehyung Kim, Yiming Yang
Keywords: large language models, providing personalized responses, language models, increasingly important, capability of providing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint, 30 pages

Abstract:As the diversity of users increases, the capability of providing personalized responses by large language models (LLMs) has become increasingly important. Existing approaches have only limited successes in LLM personalization, due to the absence of personalized learning or the reliance on shared personal data. This paper proposes a new approach for a few-shot personalization of LLMs with their mis-aligned responses (Fermi). Our key idea is to learn a set of personalized prompts for each user by progressively improving the prompts using LLMs, based on user profile (e.g., demographic information) and a few examples of previous opinions. During an iterative process of prompt improvement, we incorporate the contexts of mis-aligned responses by LLMs, which are especially crucial for the effective personalization of LLMs. In addition, we develop an effective inference method to further leverage the context of the test query and the personalized prompts. Our experimental results demonstrate that Fermi significantly improves performance across various benchmarks, compared to the best-performing baselines.

[LG-90] Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2406.18676
Authors: Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, Ji-Rong Wen
Keywords: large language models, Retrieval-augmented generation, RAG systems, reliable RAG system, RAG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Retrieval-augmented generation (RAG) has demonstrated effectiveness in mitigating the hallucination problem of large language models (LLMs). However, the difficulty of aligning the retriever with the diverse LLMs’ knowledge preferences inevitably poses a challenge in developing a reliable RAG system. To address this issue, we propose DPA-RAG, a universal framework designed to align diverse knowledge preferences within RAG systems. Specifically, we initially introduce a preference knowledge construction pipeline and incorporate five novel query augmentation strategies to alleviate preference data scarcity. Based on preference data, DPA-RAG accomplishes both external and internal preference alignment: 1) It jointly integrates pair-wise, point-wise, and contrastive preference alignment abilities into the reranker, achieving external preference alignment among RAG components. 2) It further introduces a pre-aligned stage before vanilla Supervised Fine-tuning (SFT), enabling LLMs to implicitly capture knowledge aligned with their reasoning preferences, achieving LLMs’ internal alignment. Experimental results across four knowledge-intensive QA datasets demonstrate that DPA-RAG outperforms all baselines and seamlessly integrates both black-box and open-sourced LLM readers. Further qualitative analysis and discussions also provide empirical guidance for achieving reliable RAG systems. Our code is publicly available at this https URL.

[LG-91] A Zero Auxiliary Knowledge Membership Inference Attack on Aggregate Location Data

Link: https://arxiv.org/abs/2406.18671
Authors: Vincent Guan, Florent Guépin, Ana-Maria Cretu, Yves-Alexandre de Montjoye
Keywords: Location data, decision making, form to guide, guide policy, policy and decision
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: To be published in PETS 2024

Abstract:Location data is frequently collected from populations and shared in aggregate form to guide policy and decision making. However, the prevalence of aggregated data also raises the privacy concern of membership inference attacks (MIAs). MIAs infer whether an individual’s data contributed to the aggregate release. Although effective MIAs have been developed for aggregate location data, these require access to an extensive auxiliary dataset of individual traces over the same locations, which are collected from a similar population. This assumption is often impractical given common privacy practices surrounding location data. To measure the risk of an MIA performed by a realistic adversary, we develop the first Zero Auxiliary Knowledge (ZK) MIA on aggregate location data, which eliminates the need for an auxiliary dataset of real individual traces. Instead, we develop a novel synthetic approach, such that suitable synthetic traces are generated from the released aggregate. We also develop methods to correct for bias and noise, to show that our synthetic-based attack is still applicable when privacy mechanisms are applied prior to release. Using two large-scale location datasets, we demonstrate that our ZK MIA matches the state-of-the-art Knock-Knock (KK) MIA across a wide range of settings, including popular implementations of differential privacy (DP) and suppression of small counts. Furthermore, we show that ZK MIA remains highly effective even when the adversary only knows a small fraction (10%) of their target’s location history. This demonstrates that effective MIAs can be performed by realistic adversaries, highlighting the need for strong DP protection.

[LG-92] RouteLLM: Learning to Route LLMs with Preference Data

Link: https://arxiv.org/abs/2406.18665
Authors: Isaac Ong,Amjad Almahairi,Vincent Wu,Wei-Lin Chiang,Tianhao Wu,Joseph E. Gonzalez,M Waleed Kadous,Ion Stoica
Keywords: Large language models, Large language, exhibit impressive capabilities, exhibit impressive, range of tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs (by over 2 times in certain cases) without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
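
The routing idea can be sketched as a threshold on a predicted preference score. This is only an illustration under stated assumptions: the `Router` class, the `win_probability` scorer, and the `toy_scorer` heuristic are hypothetical and are not taken from the paper, whose routers are learned from human preference data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    # Hypothetical scorer: estimated probability that the strong model's
    # answer would be preferred over the weak model's for this query.
    win_probability: Callable[[str], float]
    # A higher threshold sends fewer queries to the (costlier) strong model.
    threshold: float

    def route(self, query: str) -> str:
        """Return which model should serve the query."""
        if self.win_probability(query) >= self.threshold:
            return "strong"
        return "weak"


def toy_scorer(query: str) -> float:
    # Stand-in for a learned router: longer queries are treated as harder.
    return min(len(query.split()) / 20.0, 1.0)


router = Router(win_probability=toy_scorer, threshold=0.5)
```

Raising or lowering `threshold` trades response quality against cost, which is the balance the abstract describes.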

[LG-93] Evaluating Copyright Takedown Methods for Language Models

Link: https://arxiv.org/abs/2406.18664
Authors: Boyi Wei,Weijia Shi,Yangsibo Huang,Noah A. Smith,Chiyuan Zhang,Luke Zettlemoyer,Kai Li,Peter Henderson
Keywords: potentially copyrighted material, including potentially copyrighted, Language models, derive their capabilities, copyrighted material
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 31 pages, 9 figures, 14 tables

Abstract:Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure copyright takedowns for LMs, noting the conceptual similarity to (but legal distinction from) the DMCA takedown. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework to assess the effectiveness of copyright takedown methods, the impact on the model’s ability to retain uncopyrightable factual knowledge from the training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.

[LG-94] Improving Hyperparameter Optimization with Checkpointed Model Weights

Link: https://arxiv.org/abs/2406.18630
Authors: Nikhil Mehta,Jonathan Lorraine,Steve Masson,Ramanathan Arunachalam,Zaid Pervaiz Bhat,James Lucas,Arun George Zachariah
Keywords: performance depends largely, training deep learning, performance depends, depends largely, deep learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: See the project website at this https URL

Abstract:When training deep learning models, the performance depends largely on the selected hyperparameters. However, hyperparameter optimization (HPO) is often one of the most expensive parts of model design. Classical HPO methods treat this as a black-box optimization problem. However, gray-box HPO methods, which incorporate more information about the setup, have emerged as a promising direction for more efficient optimization, for example by using intermediate loss evaluations to terminate bad selections. In this work, we propose an HPO method for neural networks using logged checkpoints of the trained weights to guide future hyperparameter selections. Our method, Forecasting Model Search (FMS), embeds weights into a Gaussian process deep kernel surrogate model, using a permutation-invariant graph metanetwork to be data-efficient with the logged network weights. To facilitate reproducibility and further research, we open-source our code at this https URL.
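
The gray-box idea of using intermediate loss evaluations to terminate bad selections can be illustrated with successive halving, a generic early-termination scheme. This sketch is not FMS, the method proposed in the paper; the function names, the `loss_at` callback, and the toy loss are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List


def successive_halving(
    configs: List[Dict[str, float]],
    loss_at: Callable[[Dict[str, float], int], float],
    budgets: List[int],
) -> Dict[str, float]:
    """Keep the better half of the configurations at each training budget."""
    survivors = list(configs)
    for budget in budgets:
        # Rank by the intermediate loss observed after `budget` units of training.
        ranked = sorted(survivors, key=lambda c: loss_at(c, budget))
        # Terminate the worse half; always keep at least one configuration.
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors[0]


def toy_loss(config: Dict[str, float], budget: int) -> float:
    # Hypothetical loss curve: best at lr = 0.1, improving with more budget.
    return (config["lr"] - 0.1) ** 2 + 1.0 / budget


candidates = [{"lr": 0.001}, {"lr": 0.01}, {"lr": 0.1}, {"lr": 1.0}]
best = successive_halving(candidates, toy_loss, budgets=[1, 2, 4])
```

Only the surviving configurations are trained to the larger budgets, which is where the savings over black-box search come from.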

[LG-95] Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Link: https://arxiv.org/abs/2406.18629
Authors: Xin Lai,Zhuotao Tian,Yukang Chen,Senqiao Yang,Xiangru Peng,Jiaya Jia
Keywords: Large Language Models, Large Language, challenge for Large, Mathematical reasoning presents, Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code, data, and models are available at this https URL

Abstract:Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter’s out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at this https URL.
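
As a rough sketch of step-level preference optimization, the snippet below applies the standard DPO loss to a single reasoning step instead of a whole answer. The abstract does not give the exact objective, so this form is an assumption, as are the function name, the argument names, and the default `beta`. Each log-probability is taken to be that of one candidate step, conditioned on the prompt and the preceding steps.

```python
import math


def step_dpo_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preferred/dispreferred pair of reasoning steps."""
    # Implicit reward margin between the preferred and dispreferred step,
    # each measured relative to the frozen reference model.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy favors the preferred step more than the reference does, so the
# loss falls below log(2), the value at zero margin.
loss = step_dpo_loss(logp_win=-1.0, logp_lose=-3.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)
```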

[LG-96] AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation

Link: https://arxiv.org/abs/2406.18627
Authors: Vaishnavi Pulavarthi,Deeksha Nandal,Soham Dan,Debjit Pal
Keywords: Assertions, hardware designs, facto collateral, collateral for simulation-based, simulation-based and formal
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures, NIPS 2024

Abstract:Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. There has been a considerable amount of research leveraging a blend of data-driven statistical analysis and static analysis to generate high-quality assertions from hardware design source code and design execution trace data. Despite such concerted effort, all prior research struggles to scale to industrial-scale large designs, generates too many low-quality assertions, often fails to capture subtle and non-trivial design functionality, and does not produce any easy-to-comprehend explanations of the generated assertions to understand assertions’ suitability to different downstream validation tasks. Recently, with the advent of Large-Language Models (LLMs), there has been a widespread effort to leverage prompt engineering to generate assertions. However, there is little effort to quantitatively establish the effectiveness and suitability of various LLMs for assertion generation. In this paper, we present AssertionBench, a novel benchmark to evaluate LLMs’ effectiveness for assertion generation quantitatively. AssertionBench contains 100 curated Verilog hardware designs from OpenCores and formally verified assertions for each design generated from GoldMine and HARM. We use AssertionBench to compare state-of-the-art LLMs to assess their effectiveness in inferring functionally correct assertions for hardware designs. Our experiments demonstrate how LLMs perform relative to each other, the benefits of using more in-context exemplars in generating a higher fraction of functionally correct assertions, and the significant room for improvement for LLM-based assertion generators.

[LG-97] Realtime Dynamic Gaze Target Tracking and Depth-Level Estimation

Link: https://arxiv.org/abs/2406.18595
Authors: Esmaeil Seraj,Harsh Bhate,Walter Talamonti
Keywords: revolutionize user experiences, burgeoning field, poised to revolutionize, Transparent Displays, revolutionize user
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The integration of Transparent Displays (TD) in various applications, such as Heads-Up Displays (HUDs) in vehicles, is a burgeoning field, poised to revolutionize user experiences. However, this innovation brings forth significant challenges in realtime human-device interaction, particularly in accurately identifying and tracking a user’s gaze on dynamically changing TDs. In this paper, we present a two-fold robust and efficient systematic solution for realtime gaze monitoring, comprised of: (1) a tree-based algorithm for identifying and dynamically tracking gaze targets (i.e., moving, size-changing, and overlapping 2D content) projected on a transparent display, in realtime; (2) a multi-stream self-attention architecture to estimate the depth-level of human gaze from eye tracking data, to account for the display’s transparency and prevent undesired interactions with the TD. We collected a real-world eye-tracking dataset to train and test our gaze monitoring system. We present extensive results and ablation studies, including inference experiments on System on Chip (SoC) evaluation boards, demonstrating our model’s scalability, precision, and realtime feasibility in both static and dynamic contexts. Our solution marks a significant stride in enhancing next-generation user-device interaction and experience, setting a new benchmark for algorithmic gaze monitoring technology in dynamic transparent displays.

[LG-98] Composition Vision-Language Understanding via Segment and Depth Anything Model

Link: https://arxiv.org/abs/2406.18591
Authors: Mingxiao Huo,Pengliang Ji,Haotian Lin,Junchen Liu,Yixiao Wang,Yijun Chen
Keywords: augment neural comprehension, model zero-shot understanding, language-vision model zero-shot, pioneering unified library, zero-shot understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We introduce a pioneering unified library that leverages the Depth Anything and Segment Anything models to augment neural comprehension in the zero-shot understanding of language-vision models. This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V, enhancing multimodal tasks such as vision-question-answering (VQA) and composition reasoning. Through the fusion of segmentation and depth analysis at the symbolic instance level, our library provides nuanced inputs for language models, significantly advancing image interpretation. Validated across a spectrum of in-the-wild real-world images, our findings showcase progress in vision-language models through neural-symbolic integration. This novel approach melds visual and language analysis in an unprecedented manner. Overall, our library opens new directions for future research aimed at decoding the complexities of the real world through advanced multimodal technologies. Our code is available at this https URL.

[LG-99] Text-Guided Alternative Image Clustering

Link: https://arxiv.org/abs/2406.18589
Authors: Andreas Stephan,Lukas Miklautz,Collin Leiber,Pedro Henrique Luz de Araujo,Dominik Répás,Claudia Plant,Benjamin Roth
Keywords: Traditional image clustering, Traditional image, alternative image clustering, image clustering techniques, alternative image
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Traditional image clustering techniques only find a single grouping within visual data. In particular, they do not provide a possibility to explicitly define multiple types of clustering. This work explores the potential of large vision-language models to facilitate alternative image clustering. We propose Text-Guided Alternative Image Consensus Clustering (TGAICC), a novel approach that leverages user-specified interests via prompts to guide the discovery of diverse clusterings. To achieve this, it generates a clustering for each prompt, groups them using hierarchical clustering, and then aggregates them using consensus clustering. TGAICC outperforms image- and text-based baselines on four alternative image clustering benchmark datasets. Furthermore, using count-based word statistics, we are able to obtain text-based explanations of the alternative clusterings. In conclusion, our research illustrates how contemporary large vision-language models can transform explanatory data analysis, enabling the generation of insightful, customizable, and diverse image clusterings.
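
The aggregation step can be illustrated with a simple co-association form of consensus clustering: items that share a cluster in most input clusterings end up together. This is a generic sketch, not the TGAICC procedure itself; the function name `consensus`, the `threshold` parameter, and the union-find grouping are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List


def consensus(clusterings: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Merge items that share a cluster in more than `threshold` of the clusterings."""
    n = len(clusterings[0])
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Co-association: fraction of clusterings that place i and j together.
        together = sum(1 for labels in clusterings if labels[i] == labels[j])
        if together / len(clusterings) > threshold:
            parent[find(j)] = find(i)

    # Relabel the union-find roots as consecutive cluster ids.
    ids: Dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Three clusterings of four images (for example, one per prompt) that mostly agree.
labels = consensus([[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]])
```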

[LG-100] Varying Manifolds in Diffusion: From Time-varying Geometries to Visual Saliency

Link: https://arxiv.org/abs/2406.18588
Authors: Junhao Chen,Manyi Li,Zherong Pan,Xifeng Gao,Changhe Tu
Keywords: Deep generative models, Deep generative, generative models learn, generation, generation rate
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Deep generative models learn the data distribution, which is concentrated on a low-dimensional manifold. The geometric analysis of distribution transformation provides a better understanding of data structure and enables a variety of applications. In this paper, we study the geometric properties of the diffusion model, whose forward diffusion process and reverse generation process construct a series of distributions on manifolds which vary over time. Our key contribution is the introduction of generation rate, which corresponds to the local deformation of manifold over time around an image component. We show that the generation rate is highly correlated with intuitive visual properties, such as visual saliency, of the image component. Further, we propose an efficient and differentiable scheme to estimate the generation rate for a given image component over time, giving rise to a generation curve. The differentiable nature of our scheme allows us to control the shape of the generation curve via optimization. Using different loss functions, our generation curve matching algorithm provides a unified framework for a range of image manipulation tasks, including semantic transfer, object removal, saliency manipulation, image blending, etc. We conduct comprehensive analytical evaluations to support our findings and evaluate our framework on various manipulation tasks. The results show that our method consistently leads to better manipulation results, compared to recent baselines.

[LG-101] Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Link: https://arxiv.org/abs/2406.18583
Authors: Le Zhuo,Ruoyi Du,Han Xiao,Yangguang Li,Dongyang Liu,Rongjie Huang,Wenze Liu,Lirui Zhao,Fu-Yun Wang,Zhanyu Ma,Xu Luo,Zehan Wang,Kaipeng Zhang,Xiangyang Zhu,Si Liu,Xiangyu Yue,Dingning Liu,Wanli Ouyang,Ziwei Liu,Yu Qiao,Hongsheng Li,Peng Gao
Keywords: Flow-based Large Diffusion, Flow-based Large, Large Diffusion Transformers, family of Flow-based, Large Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code at: this https URL

Abstract:Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduced a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.

[LG-102] Shedding Light on Large Generative Networks: Estimating Epistemic Uncertainty in Diffusion Models

Link: https://arxiv.org/abs/2406.18580
Authors: Lucas Berry,Axel Brando,David Meger
Keywords: pose significant challenges, Generative diffusion models, large parameter count, Generative diffusion, traditional uncertainty estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Generative diffusion models, notable for their large parameter count (exceeding 100 million) and operation within high-dimensional image spaces, pose significant challenges for traditional uncertainty estimation methods due to computational demands. In this work, we introduce an innovative framework, Diffusion Ensembles for Capturing Uncertainty (DECU), designed for estimating epistemic uncertainty for diffusion models. The DECU framework introduces a novel method that efficiently trains ensembles of conditional diffusion models by incorporating a static set of pre-trained parameters, drastically reducing the computational burden and the number of parameters that require training. Additionally, DECU employs Pairwise-Distance Estimators (PaiDEs) to accurately measure epistemic uncertainty by evaluating the mutual information between model outputs and weights in high-dimensional spaces. The effectiveness of this framework is demonstrated through experiments on the ImageNet dataset, highlighting its capability to capture epistemic uncertainty, specifically in under-sampled image classes.

[LG-103] Research on Driver Facial Fatigue Detection Based on Yolov8 Model

Link: https://arxiv.org/abs/2406.18575
Authors: Chang Zhou,Yang Zhao,Shaobo Liu,Yi Zhao,Xingchen Li,Chiyu Cheng
Keywords: accidents frequently occur, frequently occur, grave issue, traffic accidents frequently, fatigue driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by the 5th International Conference on Information Science, Parallel and Distributed Systems (ISPDS 2024), 2024 IEEE

Abstract:In a society where traffic accidents frequently occur, fatigue driving has emerged as a grave issue. Fatigue driving detection technology, especially that based on the YOLOv8 deep learning model, has seen extensive research and application as an effective preventive measure. This paper discusses in depth the methods and technologies utilized in the YOLOv8 model to detect driver fatigue, elaborates on the current research status both domestically and internationally, and systematically introduces the processing methods and algorithm principles for various datasets. This study aims to provide a robust technical solution for preventing and detecting fatigue driving, thereby contributing significantly to reducing traffic accidents and safeguarding lives.

[LG-104] Unsupervised Few-Shot Continual Learning for Remote Sensing Image Scene Classification

Link: https://arxiv.org/abs/2406.18574
Authors: Muhammad Anwar Ma’sum,Mahardhika Pratama,Ramasamy Savitha,Lin Liu,Habibullah,Ryszard Kowalczyk
Keywords: varying camera parameters, remote sensing image, sensing image analysis, remote sensing, spectral ranges
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review for Publication in IEEE TGRS

Abstract:A continual learning (CL) model is desired for remote sensing image analysis because of varying camera parameters, spectral ranges, resolutions, etc. There exist some recent initiatives to develop CL techniques in this domain, but they still depend on massive labelled samples, which do not fully fit remote sensing applications because ground truths are often obtained via field-based surveys. This paper addresses this problem by proposing an unsupervised flat-wide learning approach (UNISA) for unsupervised few-shot continual learning in remote sensing image scene classification, which does not depend on any labelled samples for its model updates. UNISA is developed from the idea of prototype scattering and positive sampling for learning representations, while the catastrophic forgetting problem is tackled with the flat-wide learning approach combined with a ball generator to address the data scarcity problem. Our numerical study with remote sensing image scene datasets and a hyperspectral dataset confirms the advantages of our solution. Source codes of UNISA are shared publicly at this https URL to allow convenient future studies and reproductions of our numerical results.

[LG-105] GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

Link: https://arxiv.org/abs/2406.18572
Authors: Ling Li,Yu Ye,Bingchuan Jiang,Wei Zeng
Keywords: large vision-language model, vision-language model, work tackles, tackles the problem, large vision-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at this https URL.

[LG-106] A Diagnostic Model for Acute Lymphoblastic Leukemia Using Metaheuristics and Deep Learning Methods

Link: https://arxiv.org/abs/2406.18568
Authors: M. Hosseinzadeh,P. Khoshaght,S. Sadeghi,P. Asghari,Z. Arabi,J. Lansky,P. Budinsky,A. Masoud Rahmani,S. W. Lee
Keywords: Acute lymphoblastic leukemia, abnormal white blood, white blood cells, Acute lymphoblastic, blast cell characteristics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Acute lymphoblastic leukemia (ALL) severity is determined by the presence and ratios of blast cells (abnormal white blood cells) in both bone marrow and peripheral blood. Manual diagnosis of this disease is a tedious and time-consuming operation, making it difficult for professionals to accurately examine blast cell characteristics. To address this difficulty, researchers use deep learning and machine learning. In this paper, a ResNet-based feature extractor is utilized to detect ALL, along with a variety of feature selectors and classifiers. To get the best results, a variety of transfer learning models, including the ResNet, VGG, EfficientNet, and DenseNet families, are used as deep feature extractors. Following extraction, different feature selectors are used, including Genetic algorithm, PCA, ANOVA, Random Forest, Univariate, Mutual information, Lasso, XGB, Variance, and Binary ant colony. After feature qualification, a variety of classifiers are used, with MLP outperforming the others. The recommended technique is used to categorize ALL and HEM in the selected dataset, C-NMC 2019. This technique achieved 90.71% accuracy and 95.76% sensitivity for the relevant classifications, and its metrics on this dataset outperformed others.

[LG-107] Memorized Images in Diffusion Models share a Subspace that can be Located and Deleted

Link: https://arxiv.org/abs/2406.18566
Authors: Ruchika Chavhan,Ondrej Bohdal,Yongshuo Zong,Da Li,Timothy Hospedales
Keywords: samples raising copyright, raising copyright infringement, generating high-quality images, replicate exact training, training samples raising
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large-scale text-to-image diffusion models excel in generating high-quality images from textual inputs, yet research indicates a tendency to memorize and replicate exact training samples, raising copyright infringement and privacy issues. Efforts within the text-to-image community to address memorization explore causes such as data duplication, replicated captions, or trigger tokens, proposing per-prompt inference-time or training-time mitigation strategies. In this paper, we focus on the feed-forward layers and begin by contrasting neuron activations of a set of memorized and non-memorized prompts. Experiments reveal a surprising finding: many different sets of memorized prompts significantly activate a common subspace in the model, demonstrating, for the first time, that memorization in the diffusion models lies in a special subspace. Subsequently, we introduce a novel post-hoc method for editing pre-trained models, whereby memorization is mitigated through the straightforward pruning of weights in specialized subspaces, avoiding the need to disrupt the training or inference process as seen in prior research. Finally, we demonstrate the robustness of the pruned model against training data extraction attacks, thereby unveiling new avenues for a practical and one-for-all solution to memorization.

[LG-108] Interdisciplinary Expertise to Advance Equitable Explainable AI

Link: https://arxiv.org/abs/2406.18563
Authors: Chloe R. Bennett,Heather Cole-Lewis,Stephanie Farquhar,Naama Haamel,Boris Babenko,Oran Lang,Mat Fleck,Ilana Traynis,Charles Lau,Ivor Horn,Courtney Lyles
Keywords: widespread structural oppression, face widespread structural, poor performance persists, rapidly influencing health, artificial intelligence
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The field of artificial intelligence (AI) is rapidly influencing health and healthcare, but bias and poor performance persist for populations who face widespread structural oppression. Previous work has clearly outlined the need for more rigorous attention to data representativeness and model performance to advance equity and reduce bias. However, there is an opportunity to also improve the explainability of AI by leveraging best practices of social epidemiology and health equity to help us develop hypotheses for associations found. In this paper, we focus on explainable AI (XAI) and describe a framework for interdisciplinary expert panel review to discuss and critically assess AI model explanations from multiple perspectives and identify areas of bias and directions for future research. We emphasize the importance of the interdisciplinary expert panel to produce more accurate, equitable interpretations which are historically and contextually informed. Interdisciplinary panel discussions can help reduce bias, identify potential confounders, and identify opportunities for additional research where there are gaps in the literature. In turn, these insights can suggest opportunities for AI model improvement.

[LG-109] Views Can Be Deceiving: Improved SSL Through Feature Space Augmentation

Link: https://arxiv.org/abs/2406.18562
Authors: Kimia Hamidieh,Haoran Zhang,Swami Sankaranarayanan,Marzyeh Ghassemi
Keywords: exhibit inductive biases, inductive biases favoring, biases favoring simpler, Supervised learning methods, favoring simpler features
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Supervised learning methods have been found to exhibit inductive biases favoring simpler features. When such features are spuriously correlated with the label, this can result in suboptimal performance on minority subgroups. Despite the growing popularity of methods which learn from unlabeled data, the extent to which these representations rely on spurious features for prediction is unclear. In this work, we explore the impact of spurious features on Self-Supervised Learning (SSL) for visual representation learning. We first empirically show that commonly used augmentations in SSL can cause undesired invariances in the image space, and illustrate this with a simple example. We further show that classical approaches in combating spurious correlations, such as dataset re-sampling during SSL, do not consistently lead to invariant representations. Motivated by these findings, we propose LateTVG to remove spurious information from these representations during pre-training, by regularizing later layers of the encoder via pruning. We find that our method produces representations which outperform the baselines on several benchmarks, without the need for group or label information during SSL.

[LG-110] SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching

Link: https://arxiv.org/abs/2406.18561
Authors: Yongmin Lee,Hye Won Chung
Keywords: minimal performance loss, full dataset training, approximate full dataset, IPC, images per class
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:Dataset distillation aims to synthesize a small number of images per class (IPC) from a large dataset to approximate full dataset training with minimal performance loss. While effective in very small IPC ranges, many distillation methods become less effective, even underperforming random sample selection, as IPC increases. Our examination of state-of-the-art trajectory-matching based distillation methods across various IPC scales reveals that these methods struggle to incorporate the complex, rare features of harder samples into the synthetic dataset even with the increased IPC, resulting in a persistent coverage gap between easy and hard test samples. Motivated by such observations, we introduce SelMatch, a novel distillation method that effectively scales with IPC. SelMatch uses selection-based initialization and partial updates through trajectory matching to manage the synthetic dataset’s desired difficulty level tailored to IPC scales. When tested on CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods across subset ratios from 5% to 30%.

[LG-111] Revision Matters: Generative Design Guided by Revision Edits

Link: https://arxiv.org/abs/2406.18559
Authors: Tao Li,Chin-Yi Cheng,Amber Xie,Gang Li,Yang Li
Keywords: user interface, interface or graphical, Layout, graphical layout, iterative revision process
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Layout design, such as user interface or graphical layout in general, is fundamentally an iterative revision process. Through revising a design repeatedly, the designer converges on an ideal layout. In this paper, we investigate how revision edits from human designers can benefit a multimodal generative model. To do so, we curate an expert dataset that traces how human designers iteratively edit and improve a layout generation with a prompted language goal. Based on such data, we explore various supervised fine-tuning task setups on top of a Gemini multimodal backbone, a large multimodal model. Our results show that human revision plays a critical role in iterative layout refinement. While being noisy, expert revision edits lead our model to a surprisingly strong design FID score of ~10, which is close to human performance (~6). In contrast, self-revisions that fully rely on the model's own judgement lead to an echo chamber that prevents iterative improvement, and sometimes lead to generative degradation. Fortunately, we found that providing human guidance at an early stage plays a critical role in the final generation. In such a human-in-the-loop scenario, our work paves the way for iterative design revision based on pre-trained large multimodal models.

[LG-112] Planted: a dataset for planted forest identification from multi-satellite time series

Link: https://arxiv.org/abs/2406.18554
Authors: Luis Miguel Pazos-Outón,Cristina Nader Vasconcelos,Anton Raichuk,Anurag Arnab,Dan Morris,Maxim Neumann
Keywords: Protecting and restoring, restoring forest ecosystems, carbon sequestration, ecosystems is critical, critical for biodiversity
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Protecting and restoring forest ecosystems is critical for biodiversity conservation and carbon sequestration. Forest monitoring on a global scale is essential for prioritizing and assessing conservation efforts. Satellite-based remote sensing is the only viable solution for providing global coverage, but to date, large-scale forest monitoring is limited to single modalities and single time points. In this paper, we present a dataset consisting of data from five public satellites for recognizing forest plantations and planted tree species across the globe. Each satellite modality consists of a multi-year time series. The dataset, named Planted, includes over 2M examples of 64 tree label classes (46 genera and 40 species), distributed among 41 countries. This dataset is released to foster research in forest monitoring using multimodal, multi-scale, multi-temporal data sources. Additionally, we present initial baseline results and evaluate modality fusion and data augmentation approaches for this dataset.

[LG-113] Visual Analysis of Prediction Uncertainty in Neural Networks for Deep Image Synthesis

Link: https://arxiv.org/abs/2406.18545
Authors: Soumya Dutta,Faheem Nizar,Ahmad Amaan,Ayan Acharya
Keywords: artificial intelligence systems, Deep neural networks, solving challenging visualization, challenging visualization problems, Ubiquitous applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Ubiquitous applications of Deep neural networks (DNNs) in different artificial intelligence systems have led to their adoption in solving challenging visualization problems in recent years. While sophisticated DNNs offer an impressive generalization, it is imperative to comprehend the quality, confidence, robustness, and uncertainty associated with their prediction. A thorough understanding of these quantities produces actionable insights that help application scientists make informed decisions. Unfortunately, the intrinsic design principles of the DNNs cannot beget prediction uncertainty, necessitating separate formulations for robust uncertainty-aware models for diverse visualization applications. To that end, this contribution demonstrates how the prediction uncertainty and sensitivity of DNNs can be estimated efficiently using various methods and then interactively compared and contrasted for deep image synthesis tasks. Our inspection suggests that uncertainty-aware deep visualization models generate illustrations of informative and superior quality and diversity. Furthermore, prediction uncertainty improves the robustness and interpretability of deep visualization models, making them practical and convenient for various scientific domains that thrive on visual analyses.

[LG-114] Self-Supervised Time-Series Anomaly Detection Using Learnable Data Augmentation

Link: https://arxiv.org/abs/2406.12260
Authors: Kukjin Choi,Jihun Yi,Jisoo Mok,Sungroh Yoon
Keywords: Continuous efforts, advance anomaly detection, industrial sites, anomaly detection, made to advance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, IEEE Transactions on Emerging Topics in Computational Intelligence

Abstract:Continuous efforts are being made to advance anomaly detection in various manufacturing processes to increase the productivity and safety of industrial sites. Deep learning replaced rule-based methods and recently emerged as a promising method for anomaly detection in diverse industries. However, in the real world, the scarcity of abnormal data and difficulties in obtaining labeled data create limitations in the training of detection models. In this study, we addressed these shortcomings by proposing a learnable data augmentation-based time-series anomaly detection (LATAD) technique that is trained in a self-supervised manner. LATAD extracts discriminative features from time-series data through contrastive learning. At the same time, learnable data augmentation produces challenging negative samples to enhance learning efficiency. We measured anomaly scores of the proposed technique based on latent feature similarities. As per the results, LATAD exhibited comparable or improved performance to the state-of-the-art anomaly detection assessments on several benchmark datasets and provided a gradient-based diagnosis technique to help identify root causes.

[LG-115] Stochastic Gradient Piecewise Deterministic Monte Carlo Samplers

Link: https://arxiv.org/abs/2406.19051
Authors: Paul Fearnhead,Sebastiano Grazzi,Chris Nemeth,Gareth O. Roberts
Keywords: piecewise deterministic Markov, Monte Carlo methods, deterministic Markov processes, Monte Carlo, suggested using Monte
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments:

Abstract:Recent work has suggested using Monte Carlo methods based on piecewise deterministic Markov processes (PDMPs) to sample from target distributions of interest. PDMPs are non-reversible continuous-time processes endowed with momentum, and hence can mix better than standard reversible MCMC samplers. Furthermore, they can incorporate exact sub-sampling schemes which only require access to a single (randomly selected) data point at each iteration, yet without introducing bias to the algorithm’s stationary distribution. However, the range of models for which PDMPs can be used, particularly with sub-sampling, is limited. We propose approximate simulation of PDMPs with sub-sampling for scalable sampling from posterior distributions. The approximation takes the form of an Euler approximation to the true PDMP dynamics, and involves using an estimate of the gradient of the log-posterior based on a data sub-sample. We thus call this class of algorithms stochastic-gradient PDMPs. Importantly, the trajectories of stochastic-gradient PDMPs are continuous and can leverage recent ideas for sampling from measures with continuous and atomic components. We show these methods are easy to implement, present results on their approximation error and demonstrate numerically that this class of algorithms has similar efficiency to, but is more robust than, stochastic gradient Langevin dynamics.

[LG-116] Statistical Test for Data Analysis Pipeline by Selective Inference

Link: https://arxiv.org/abs/2406.18902
Authors: Tomohiro Shiraishi,Tatsuya Matsukawa,Shuichi Nishino,Ichiro Takeuchi
Keywords: data analysis, data analysis pipelines, transforms raw data, analysis, data
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:A data analysis pipeline is a structured sequence of processing steps that transforms raw data into meaningful insights by effectively integrating various analysis algorithms. In this paper, we propose a novel statistical test designed to assess the statistical significance of data analysis pipelines. Our approach allows for the systematic development of valid statistical tests applicable to any data analysis pipeline configuration composed of a set of data analysis components. We have developed this framework by adapting selective inference, which has gained recent attention as a new statistical inference technique for data-driven hypotheses. The proposed statistical test is theoretically designed to control the type I error at the desired significance level in finite samples. As examples, we consider a class of pipelines composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We confirm the validity of our statistical test through experiments with both synthetic and real data for this class of data analysis pipelines. Additionally, we present an implementation framework that facilitates testing across any configuration of data analysis pipelines in this class without extra implementation costs.

[LG-117] Length Optimization in Conformal Prediction

Link: https://arxiv.org/abs/2406.18814
Authors: Shayan Kiyani,George Pappas,Hamed Hassani
Keywords: Conditional validity, crucial aspects, prediction, conformal prediction, Achieving conditional validity
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Conditional validity and length efficiency are two crucial aspects of conformal prediction (CP). Achieving conditional validity ensures accurate uncertainty quantification for data subpopulations, while proper length efficiency ensures that the prediction sets remain informative and non-trivial. Despite significant efforts to address each of these issues individually, a principled framework that reconciles these two objectives has been missing in the CP literature. In this paper, we develop Conformal Prediction with Length-Optimization (CPL) - a novel framework that constructs prediction sets with (near-) optimal length while ensuring conditional validity under various classes of covariate shifts, including the key cases of marginal and group-conditional coverage. In the infinite sample regime, we provide strong duality results which indicate that CPL achieves conditional validity and length optimality. In the finite sample regime, we show that CPL constructs conditionally valid prediction sets. Our extensive empirical evaluations demonstrate the superior prediction set size performance of CPL compared to state-of-the-art methods across diverse real-world and synthetic datasets in classification, regression, and text-related settings.
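
For readers new to conformal prediction, the following is a minimal sketch of standard split conformal prediction, which targets marginal coverage only. It is not the CPL method from the paper; the function names, the absolute-residual score, and the toy calibration values are all assumptions made for illustration.

```python
import math


def split_conformal_quantile(calibration_scores, alpha):
    """Return the conformal threshold from held-out nonconformity scores."""
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_interval(point_prediction, threshold):
    """Symmetric interval around a point prediction (absolute-residual score)."""
    return (point_prediction - threshold, point_prediction + threshold)


# Hypothetical absolute residuals |y - f(x)| measured on a calibration set.
residuals = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7]
q = split_conformal_quantile(residuals, alpha=0.25)
interval = prediction_interval(2.0, q)
print(q, interval)
```

The interval width here is the same for every input, which is exactly the kind of length inefficiency that motivates optimizing prediction-set length under a coverage constraint.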

[LG-118] Density Ratio Estimation via Sampling along Generalized Geodesics on Statistical Manifolds

Link: https://arxiv.org/abs/2406.18806
Authors: Masanari Kimura,Howard Bondell
Keywords: density ratio estimation, density ratio, ratio estimation, incremental density ratio, ratio
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract: The density ratio of two probability distributions is one of the fundamental tools in mathematical and computational statistics and machine learning, and it has a variety of known applications. Therefore, density ratio estimation from finite samples is a very important task, but it is known to be unstable when the distributions are distant from each other. One approach to address this problem is density ratio estimation using incremental mixtures of the two distributions. We geometrically reinterpret existing methods for density ratio estimation based on incremental mixtures. We show that these methods can be regarded as iterating on the Riemannian manifold along a particular curve between the two probability distributions. Making use of the geometry of the manifold, we propose to consider incremental density ratio estimation along generalized geodesics on this manifold. Achieving such a method requires Monte Carlo sampling along geodesics via transformations of the two distributions. We show how to implement an iterative algorithm to sample along these geodesics and show how changing the distances along the geodesic affects the variance and accuracy of the estimation of the density ratio. Our experiments demonstrate that the proposed approach outperforms the existing approaches using incremental mixtures that do not take the geometry of the

[LG-119] Learning to Remove Cuts in Integer Linear Programming

Link: https://arxiv.org/abs/2406.18781
Authors: Pol Puigdemont,Stratis Skoulakis,Grigorios Chrysos,Volkan Cevher
Keywords: Cutting plane methods, integer linear programs, solving integer linear, Cutting plane, linear programs
Categories: Optimization and Control (math.OC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments: International Conference on Machine Learning

Abstract: Cutting plane methods are a fundamental approach for solving integer linear programs (ILPs). In each iteration of such methods, additional linear constraints (cuts) are introduced to the constraint set with the aim of excluding the previous fractional optimal solution while not affecting the optimal integer solution. In this work, we explore a novel approach within cutting plane methods: instead of only adding new cuts, we also consider the removal of previous cuts introduced at any of the preceding iterations of the method under a learnable parametric criterion. We demonstrate that in fundamental combinatorial optimization settings such cut removal policies can lead to significant improvements over both human-based and machine learning-guided cut addition policies even when implemented with simple models.

[LG-120] Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

Link: https://arxiv.org/abs/2406.18679
Authors: Xiang Li,Vivek Govindan,Rohit Paturi,Sundararajan Srinivasan
Keywords: embedding-based Speaker Diarization, neural diarization, traditional embedding-based Speaker, models offer significant, Speaker Diarization
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at INTERSPEECH 2024

Abstract: End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but fall short on generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reductions of 13% and 10% over the conventional 1-pass EEND on the Callhome American English and RT03-CTS datasets respectively, and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.

[LG-121] A simple and improved algorithm for noisy convex zeroth-order optimisation

Link: https://arxiv.org/abs/2406.18672
Authors: Alexandra Carpentier
Keywords: bounded convex set, convex set, bounded convex, zeroth order optimisation, bar
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: In this paper, we study the problem of noisy, convex, zeroth order optimisation of a function $f$ over a bounded convex set $\bar{\mathcal X}\subset \mathbb{R}^d$. Given a budget $n$ of noisy queries to the function $f$ that can be allocated sequentially and adaptively, our aim is to construct an algorithm that returns a point $\hat x\in \bar{\mathcal X}$ such that $f(\hat x)$ is as small as possible. We provide a conceptually simple method inspired by the textbook center of gravity method, but adapted to the noisy and zeroth order setting. We prove that this method is such that $f(\hat x) - \min_{x\in \bar{\mathcal X}} f(x)$ is of smaller order than $d^2/\sqrt{n}$ up to poly-logarithmic terms. We slightly improve upon existing literature, where to the best of our knowledge the best known rate, in [Lattimore, 2024], is of order $d^{2.5}/\sqrt{n}$, albeit for a more challenging problem. Our main contribution is however conceptual, as we believe that our algorithm and its analysis bring novel ideas and are significantly simpler than existing approaches.

[LG-122] Contraction of Private Quantum Channels and Private Quantum Hypothesis Testing

Link: https://arxiv.org/abs/2406.18651
Authors: Theshani Nuradha,Mark M. Wilde
Keywords: contraction coefficient, privacy constraints, quantum generalized divergence, relative decrease, contraction
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 36 pages; See independent work titled “Sample Complexity of Locally Differentially Private Quantum Hypothesis Testing” by Hao-Chung Cheng, Christoph Hirche, and Cambyse Rouzé

Abstract:A quantum generalized divergence by definition satisfies the data-processing inequality; as such, the relative decrease in such a divergence under the action of a quantum channel is at most one. This relative decrease is formally known as the contraction coefficient of the channel and the divergence. Interestingly, there exist combinations of channels and divergences for which the contraction coefficient is strictly less than one. Furthermore, understanding the contraction coefficient is fundamental for the study of statistical tasks under privacy constraints. To this end, here we establish upper bounds on contraction coefficients for the hockey-stick divergence under privacy constraints, where privacy is quantified with respect to the quantum local differential privacy (QLDP) framework, and we fully characterize the contraction coefficient for the trace distance under privacy constraints. With the machinery developed, we also determine an upper bound on the contraction of both the Bures distance and quantum relative entropy relative to the normalized trace distance, under QLDP constraints. Next, we apply our findings to establish bounds on the sample complexity of quantum hypothesis testing under privacy constraints. Furthermore, we study various scenarios in which the sample complexity bounds are tight, while providing order-optimal quantum channels that achieve those bounds. Lastly, we show how private quantum channels provide fairness and Holevo information stability in quantum learning settings.

[LG-123] Robust Low-Cost Drone Detection and Classification in Low SNR Environments

Link: https://arxiv.org/abs/2406.18624
Authors: Stefan Glüge,Matthias Nyfeler,Ahmad Aghaebrahimian,Nicola Ramagnano,Christof Schüpbach
Keywords: unmanned aerial vehicles, raised significant safety, significant safety concerns, safety concerns due, aerial vehicles
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 11 pages, submitted to IEEE Open Journal of Signal Processing

Abstract:The proliferation of drones, or unmanned aerial vehicles (UAVs), has raised significant safety concerns due to their potential misuse in activities such as espionage, smuggling, and infrastructure disruption. This paper addresses the critical need for effective drone detection and classification systems that operate independently of UAV cooperation. We evaluate various convolutional neural networks (CNNs) for their ability to detect and classify drones using spectrogram data derived from consecutive Fourier transforms of signal components. The focus is on model robustness in low signal-to-noise ratio (SNR) environments, which is critical for real-world applications. A comprehensive dataset is provided to support future model development. In addition, we demonstrate a low-cost drone detection system using a standard computer, software-defined radio (SDR) and antenna, validated through real-world field testing. On our development dataset, all models consistently achieved an average balanced classification accuracy of = 85% at SNR -12dB. In the field test, these models achieved an average balance accuracy of 80%, depending on transmitter distance and antenna direction. Our contributions include: a publicly available dataset for model development, a comparative analysis of CNN for drone detection under low SNR conditions, and the deployment and field evaluation of a practical, low-cost detection system.

[LG-124] Unbiased least squares regression via averaged stochastic gradient descent

Link: https://arxiv.org/abs/2406.18623
Authors: Nabil Kahalé
Keywords: stochastic gradient descent, time-average stochastic gradient, gradient descent estimator, squares regression problem, Hessian matrix
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 33 pages, 4 figures

Abstract: We consider an on-line least squares regression problem with optimal solution $\theta^*$ and Hessian matrix $H$, and study a time-average stochastic gradient descent estimator of $\theta^*$. For $k\ge2$, we provide an unbiased estimator of $\theta^*$ that is a modification of the time-average estimator, runs with an expected number of time-steps of order $k$, with $O(1/k)$ expected excess risk. The constant behind the $O$ notation depends on parameters of the regression and is a poly-logarithmic function of the smallest eigenvalue of $H$. We provide both a biased and unbiased estimator of the expected excess risk of the time-average estimator and of its unbiased counterpart, without requiring knowledge of either $H$ or $\theta^*$. We describe an “average-start” version of our estimators with similar properties. Our approach is based on randomized multilevel Monte Carlo. Our numerical experiments confirm our theoretical findings.
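
As background, the following is a minimal sketch of the plain time-average SGD estimator for a one-dimensional least squares problem. It does not implement the paper's unbiased multilevel Monte Carlo modification; the step size, the noise level, the true parameter, and the function name `averaged_sgd` are assumptions chosen for illustration.

```python
import random


def averaged_sgd(samples, step_size, theta0=0.0):
    """Run SGD on 0.5 * (x * theta - y)**2 and return the average of the iterates."""
    theta = theta0
    average = 0.0
    for t, (x, y) in enumerate(samples, start=1):
        gradient = (x * theta - y) * x      # stochastic gradient at the current iterate
        theta -= step_size * gradient       # plain SGD update
        average += (theta - average) / t    # running mean of theta_1, ..., theta_t
    return average


# Synthetic stream: y = true_theta * x + small Gaussian noise (hypothetical setup).
rng = random.Random(0)
true_theta = 3.0
stream = []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)
    stream.append((x, true_theta * x + rng.gauss(0.0, 0.1)))

estimate = averaged_sgd(stream, step_size=0.05)
print(estimate)
```

Because the average includes the early iterates that still depend on the starting point, this estimator is biased for any finite number of steps, which is the issue the abstract's unbiased variant addresses.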

[LG-125] Inducing Riesz and orthonormal bases in L2 via composition operators

Link: https://arxiv.org/abs/2406.18613
Authors: Yahya Saleh,Armin Iske
Keywords: composition operator, investigate perturbations, Riesz bases, orthonormal bases, Abstract
Categories: Functional Analysis (math.FA); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:We investigate perturbations of orthonormal bases of L^2 via a composition operator C_h induced by a mapping h . We provide a comprehensive characterization of the mapping h required for the perturbed sequence to form an orthonormal or Riesz basis. Restricting our analysis to differentiable mappings, we reveal that all Riesz bases of the given form are induced by bi-Lipschitz mappings. In addition, we discuss implications of these results for approximation theory, highlighting the potential of using bijective neural networks to construct complete sequences with favorable approximation properties.

[LG-126] Optimal spanning tree reconstruction in symbolic regression

Link: https://arxiv.org/abs/2406.18612
Authors: Radoslav G. Neychev,Innokentiy A. Shibaev,Vadim V. Strijov
Keywords: regression model generation, investigates the problem, problem of regression, model generation, regression model
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract: This paper investigates the problem of regression model generation. A model is a superposition of primitive functions. The model structure is described by a weighted colored graph. Each graph vertex corresponds to some primitive function. An edge assigns a superposition of two functions. The weight of an edge equals the probability of superposition. To generate an optimal model one has to reconstruct its structure from its graph adjacency matrix. The proposed algorithm reconstructs the minimum spanning tree from the weighted colored graph. This paper presents a novel solution based on the prize-collecting Steiner tree algorithm. This algorithm is compared with its alternatives.
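
To make the reconstruction step concrete, here is a minimal sketch of Kruskal's algorithm recovering a minimum spanning tree from a weighted adjacency matrix. This is a textbook baseline, not the prize-collecting Steiner tree method the paper proposes; the example matrix is hypothetical, and how edge costs would be derived from superposition probabilities is left as an assumption.

```python
def minimum_spanning_tree(weights):
    """Kruskal's algorithm on a symmetric weight matrix; None marks a missing edge."""
    n = len(weights)
    edges = sorted(
        (weights[i][j], i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if weights[i][j] is not None
    )
    parent = list(range(n))  # union-find forest over the vertices

    def find(v):
        # Follow parent links to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it cannot close a cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree


# Hypothetical 4-vertex graph given as an adjacency matrix of edge costs.
W = [
    [None, 1, 4, None],
    [1, None, 2, 5],
    [4, 2, None, 3],
    [None, 5, 3, None],
]
tree = minimum_spanning_tree(W)
print(tree)
```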

[LG-127] Confidence interval estimation of mixed oil length with conditional diffusion model

Link: https://arxiv.org/abs/2406.18603
Authors: Yanfeng Yang,Lihong Zhang,Ziqi Chen,Miaomiao Yu,Lei Chen
Keywords: mixed oil length, mixed oil, oil length plays, Accurately estimating, oil length
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments:

Abstract:Accurately estimating the mixed oil length plays a big role in the economic benefit for oil pipeline network. While various proposed methods have tried to predict the mixed oil length, they often exhibit an extremely high probability (around 50%) of underestimating it. This is attributed to their failure to consider the statistical variability inherent in the estimated length of mixed oil. To address such issues, we propose to use the conditional diffusion model to learn the distribution of the mixed oil length given pipeline features. Subsequently, we design a confidence interval estimation for the length of the mixed oil based on the pseudo-samples generated by the learned diffusion model. To our knowledge, we are the first to present an estimation scheme for confidence interval of the oil-mixing length that considers statistical variability, thereby reducing the possibility of underestimating it. When employing the upper bound of the interval as a reference for excluding the mixed oil, the probability of underestimation can be as minimal as 5%, a substantial reduction compared to 50%. Furthermore, utilizing the mean of the generated pseudo samples as the estimator for the mixed oil length enhances prediction accuracy by at least 10% compared to commonly used methods.
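
The interval construction from generated pseudo-samples can be sketched as follows, assuming the samples have already been drawn from some learned conditional model. The nearest-rank quantile rule, the 5%/95% levels, the function name, and the sample values are all assumptions for illustration and are not taken from the paper.

```python
import math
import statistics


def interval_from_samples(samples, lower_q=0.05, upper_q=0.95):
    """Return (lower bound, upper bound, mean) computed from pseudo-samples."""
    ordered = sorted(samples)
    n = len(ordered)

    def quantile(q):
        # Nearest-rank empirical quantile; the small offset guards against
        # floating-point noise when q * n is meant to be an integer.
        rank = max(1, math.ceil(q * n - 1e-12))
        return ordered[rank - 1]

    return quantile(lower_q), quantile(upper_q), statistics.mean(ordered)


# Hypothetical pseudo-samples of mixed oil length for one pipeline configuration.
pseudo_samples = [
    102.0, 98.5, 110.2, 95.1, 101.3, 99.8, 104.6, 97.2, 108.9, 100.4,
    103.7, 96.6, 105.5, 99.1, 107.3, 94.8, 102.9, 100.9, 98.0, 106.1,
]
lower, upper, mean_length = interval_from_samples(pseudo_samples)
print(lower, upper, mean_length)
```

Using the upper bound rather than the mean as the cut-off is what lowers the chance of underestimating the length, at the cost of a more conservative estimate.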

[LG-128] Multi-level Phenotypic Models of Cardiovascular Disease and Obstructive Sleep Apnea Comorbidities: A Longitudinal Wisconsin Sleep Cohort Study

Link: https://arxiv.org/abs/2406.18602
Authors: Duy Nguyen,Ca Hoang,Phat K. Huynh,Tien Truong,Dang Nguyen,Abhay Sharma,Trung Q. Le
Keywords: posing unique challenges, obstructive sleep apnea, posing unique, interactions of comorbidities, CVD progression due
Categories: Applications (stat.AP); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 30 pages, 5 figures, 5 tables

Abstract:Cardiovascular diseases (CVDs) are notably prevalent among patients with obstructive sleep apnea (OSA), posing unique challenges in predicting CVD progression due to the intricate interactions of comorbidities. Traditional models typically lack the necessary dynamic and longitudinal scope to accurately forecast CVD trajectories in OSA patients. This study introduces a novel multi-level phenotypic model to analyze the progression and interplay of these conditions over time, utilizing data from the Wisconsin Sleep Cohort, which includes 1,123 participants followed for decades. Our methodology comprises three advanced steps: (1) Conducting feature importance analysis through tree-based models to underscore critical predictive variables like total cholesterol, low-density lipoprotein (LDL), and diabetes. (2) Developing a logistic mixed-effects model (LGMM) to track longitudinal transitions and pinpoint significant factors, which displayed a diagnostic accuracy of 0.9556. (3) Implementing t-distributed Stochastic Neighbor Embedding (t-SNE) alongside Gaussian Mixture Models (GMM) to segment patient data into distinct phenotypic clusters that reflect varied risk profiles and disease progression pathways. This phenotypic clustering revealed two main groups, with one showing a markedly increased risk of major adverse cardiovascular events (MACEs), underscored by the significant predictive role of nocturnal hypoxia and sympathetic nervous system activity from sleep data. Analysis of transitions and trajectories with t-SNE and GMM highlighted different progression rates within the cohort, with one cluster progressing more slowly towards severe CVD states than the other. This study offers a comprehensive understanding of the dynamic relationship between CVD and OSA, providing valuable tools for predicting disease onset and tailoring treatment approaches.

[LG-129] A Multi-resolution Low-rank Tensor Decomposition

Link: https://arxiv.org/abs/2406.18560
Authors: Sergio Rozada,Antonio G. Marques
Keywords: efficient and parsimonious, variety of fields, fundamental problem, problem with numerous, numerous applications
Categories: General Mathematics (math.GM); Machine Learning (cs.LG)
Comments:

Abstract: The (efficient and parsimonious) decomposition of higher-order tensors is a fundamental problem with numerous applications in a variety of fields. Several methods have been proposed in the literature to that end, with the Tucker and PARAFAC decompositions being the most prominent ones. Inspired by the latter, in this work we propose a multi-resolution low-rank tensor decomposition to describe (approximate) a tensor in a hierarchical fashion. The central idea of the decomposition is to recast the tensor into multiple lower-dimensional tensors to exploit the structure at different levels of resolution. The method is first explained, an alternating least squares algorithm is discussed, and preliminary simulations illustrating the potential practical relevance are provided.

[LG-130] Renal digital pathology visual knowledge search platform based on language large model and book knowledge

Link: https://arxiv.org/abs/2406.18556
Authors: Xiaomin Lv,Chong Lai,Liya Ding,Maode Lai,Qingrong Sun
Keywords: Large models, require exploration, applications in digital, renal pathology, renal pathology images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures

Abstract: Large models have become mainstream, yet their applications in digital pathology still require exploration. Meanwhile, renal pathology images play an important role in the diagnosis of renal diseases. We conducted image segmentation and paired the images with corresponding text descriptions based on 60 books on renal pathology, ran a clustering analysis of all image and text description features based on large models, and ultimately built a retrieval system based on the semantic features of large models. Based on the above analysis, we established a knowledge base of 10,317 renal pathology images with paired text descriptions, and then evaluated the semantic feature capabilities of 4 large models, including GPT2, gemma, LLma and Qwen, and the image-based feature capabilities of the dinov2 large model. Furthermore, we built a semantic retrieval system to retrieve pathological images based on text descriptions, and named it RppD (this http URL).

Information Retrieval

[IR-0] Which Neurons Matter in IR? Applying Integrated Gradients-based Methods to Understand Cross-Encoders

Link: https://arxiv.org/abs/2406.19309
Authors: Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski
Keywords: Information Retrieval, Retrieval-Augmented Generation, importance of Information, recent addition, addition of Retrieval-Augmented
Categories: Information Retrieval (cs.IR)
Comments: Accepted at ICTIR 2024

Abstract:With the recent addition of Retrieval-Augmented Generation (RAG), the scope and importance of Information Retrieval (IR) has expanded. As a result, the importance of a deeper understanding of IR models also increases. However, interpretability in IR remains under-explored, especially when it comes to the models’ inner mechanisms. In this paper, we explore the possibility of adapting Integrated Gradient-based methods in an IR context to identify the role of individual neurons within the model. In particular, we provide new insights into the role of what we call “relevance” neurons, as well as how they deal with unseen data. Finally, we carry out an in-depth pruning study to validate our findings.
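
For readers unfamiliar with Integrated Gradients, the following is a minimal sketch of the generic attribution technique, not the authors' implementation. The toy `model`, its weights, and the all-zero baseline are assumptions chosen for illustration; a real cross-encoder would use automatic differentiation instead of the hand-written `grad`.

```python
import math

# Hypothetical weights of a toy scorer standing in for a relevance model.
WEIGHTS = [0.8, -0.5, 0.3]

def model(x):
    """Toy differentiable scorer: sigmoid of a weighted sum of the inputs."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad(x):
    """Analytic gradient of `model` with respect to each input feature."""
    s = model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral of gradients from baseline to x
    with a midpoint Riemann sum, then scale by (x - baseline)."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum approximately to `model(x) - model(baseline)`.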

[IR-1] Grounded and Transparent Response Generation for Conversational Information-Seeking Systems

Link: https://arxiv.org/abs/2406.19281
Authors: Weronika Łajewska
Keywords: coherent responses remains, query rewriting, previous conversational information-seeking, CIS, synthesizing retrieved information
Categories: Information Retrieval (cs.IR)
Comments: Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM '24), 2024

Abstract:While previous conversational information-seeking (CIS) research has focused on passage retrieval, reranking, and query rewriting, the challenge of synthesizing retrieved information into coherent responses remains. The proposed research delves into the intricacies of response generation in CIS systems. Open-ended information-seeking dialogues introduce multiple challenges that may lead to potential pitfalls in system responses. The study focuses on generating responses grounded in the retrieved passages and being transparent about the system’s limitations. Specific research questions revolve around obtaining confidence-enriched information nuggets, automatic detection of incomplete or incorrect responses, generating responses communicating the system’s limitations, and evaluating enhanced responses. By addressing these research tasks the study aspires to contribute to the advancement of conversational response generation, fostering more trustworthy interactions in CIS dialogues, and paving the way for grounded and transparent systems to meet users’ needs in an information-driven world.

[IR-2] FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

Link: https://arxiv.org/abs/2406.19237
Authors: Shubhankar Singh,Purvi Chaurasia,Yerram Varun,Pranshu Pandya,Vatsal Gupta,Vivek Gupta,Dan Roth
Keywords: question answering lack, spatial reasoning skills, visual question answering, evaluating spatial reasoning, Existing benchmarks
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark’s potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.

[IR-3] RAVEN: Multitask Retrieval Augmented Vision-Language Learning

Link: https://arxiv.org/abs/2406.19150
Authors: Varun Nagaraj Rao,Siddharth Choudhary,Aditya Deshpande,Ravi Kumar Satzoda,Srikar Appalaraju
Keywords: exacerbated resource barriers, large language models, scaling of large, large language, world knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:The scaling of large language models to encode all the world’s knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they’re limited by the need for resource-intensive pre-training, additional parameter requirements, unaddressed modality prioritization and lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly +3% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.

[IR-4] Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Link: https://arxiv.org/abs/2406.19102
Authors: Lokesh Mishra,Sohayl Dhibi,Yusik Kim,Cesar Berrospi Ramis,Shubham Gupta,Michele Dolfi,Peter Staar
Keywords: greenhouse gas emissions, water consumption, waste management, KPIs assess, climate change
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

Abstract:Environment, Social, and Governance (ESG) KPIs assess an organization’s performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.

[IR-5] Efficient course recommendations with T5-based ranking and summarization

Link: https://arxiv.org/abs/2406.19018
Authors: Thijmen Bijl,Niels van Weeren,Suzan Verberne
Keywords: recommender system, skill-occupation pairs, recommender system BrightFit, in-production recommender system, retrieval
Categories: Information Retrieval (cs.IR)
Comments: ReNeuIR 2024 (at SIGIR 2024) - 3rd Workshop on Reaching Efficiency in Neural Information Retrieval, 18 July, 2024, Washington D.C, USA

Abstract:In this paper, we implement and evaluate a two-stage retrieval pipeline for a course recommender system that ranks courses for skill-occupation pairs. The in-production recommender system BrightFit provides course recommendations from multiple sources. Some of the course descriptions are long and noisy, while retrieval and ranking in an online system have to be highly efficient. We developed a two-step retrieval pipeline with RankT5 finetuned on MSMARCO as re-ranker. We compare two summarizers for course descriptions: a LongT5 model that we finetuned for the task, and a generative LLM (Vicuna) with in-context learning. We experiment with quantization to reduce the size of the ranking model and increase inference speed. We evaluate our rankers on two newly labelled datasets, with an A/B test, and with a user questionnaire. On the two labelled datasets, our proposed two-stage ranking with automatic summarization achieves a substantial improvement over the in-production (BM25) ranker: nDCG@10 scores improve from 0.482 to 0.684 and from 0.447 to 0.844 on the two datasets. We also achieve a 40% speed-up by using a quantized version of RankT5. The improved quality of the ranking was confirmed by the questionnaire completed by 29 respondents, but not by the A/B test. In the A/B test, a higher clickthrough rate was observed for the BM25-ranking than for the proposed two-stage retrieval. We conclude that T5-based re-ranking and summarization for online course recommendation can obtain much better effectiveness than single-step lexical retrieval, and that quantization has a large effect on RankT5. In the online evaluation, however, other factors than relevance play a role (such as speed and interpretability of the retrieval results), as well as individual preferences.
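
The nDCG@10 metric reported above can be sketched as follows. This is a generic textbook formulation with linear gain, not the authors' evaluation code; some toolkits use an exponential gain (2^rel - 1) instead, and the sketch assumes the ranked list contains every judged item so that the ideal ordering can be derived from it.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels, listed in the order a ranker returned them.
good_ranking = [3, 2, 1, 0]
poor_ranking = [0, 1, 2, 3]
```

A ranking that places the most relevant courses first scores 1.0, and any other ordering of the same labels scores lower.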

[IR-6] Towards a Formal Characterization of User Simulation Objectives in Conversational Information Access

Link: https://arxiv.org/abs/2406.19007
Authors: Nolwenn Bernard,Krisztian Balog
Keywords: facilitating reproducible experiments, evaluating conversational information, conversational information access, information access agents, enabling the generation
Categories: Information Retrieval (cs.IR)
Comments: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington DC, DC, USA

Abstract:User simulation is a promising approach for automatically training and evaluating conversational information access agents, enabling the generation of synthetic dialogues and facilitating reproducible experiments at scale. However, the objectives of user simulation for the different uses remain loosely defined, hindering the development of effective simulators. In this work, we formally characterize the distinct objectives for user simulators: training aims to maximize behavioral similarity to real users, while evaluation focuses on the accurate prediction of real-world conversational agent performance. Through an empirical study, we demonstrate that optimizing for one objective does not necessarily lead to improved performance on the other. This finding underscores the need for tailored design considerations depending on the intended use of the simulator. By establishing clear objectives and proposing concrete measures to evaluate user simulators against those objectives, we pave the way for the development of simulators that are specifically tailored to their intended use, ultimately leading to more effective conversational agents.

[IR-7] Amplify Graph Learning for Recommendation via Sparsity Completion

Link: https://arxiv.org/abs/2406.18984
Authors: Peng Yuan,Haojie Li,Minying Fang,Xu Yu,Yongjing Hao,Junwei Du
Keywords: Graph learning models, based recommendation systems, collaborative filtering, Graph, widely deployed
Categories: Information Retrieval (cs.IR)

Abstract:Graph learning models have been widely deployed in collaborative filtering (CF) based recommendation systems. Due to the issue of data sparsity, the graph structure of the original input lacks potential positive preference edges, which significantly reduces the performance of recommendations. In this paper, we study how to enhance the graph structure for CF more effectively, thereby optimizing the representation of graph nodes. Previous works introduced matrix completion techniques into CF, proposing the use of either stochastic completion methods or superficial structure completion to address this issue. However, most of these approaches employ random numerical filling that lack control over noise perturbations and limit the in-depth exploration of higher-order interaction features of nodes, resulting in biased graph representations. In this paper, we propose an Amplify Graph Learning framework based on Sparsity Completion (called AGL-SC). First, we utilize graph neural network to mine direct interaction features between user and item nodes, which are used as the inputs of the encoder. Second, we design a factorization-based method to mine higher-order interaction features. These features serve as perturbation factors in the latent space of the hidden layer to facilitate generative enhancement. Finally, by employing the variational inference, the above multi-order features are integrated to implement the completion and enhancement of missing graph structures. We conducted benchmark and strategy experiments on four real-world datasets related to recommendation tasks. The experimental results demonstrate that AGL-SC significantly outperforms the state-of-the-art methods.

[IR-8] Multi-modal Food Recommendation using Clustering and Self-supervised Learning

Link: https://arxiv.org/abs/2406.18962
Authors: Yixin Zhang,Xin Zhou,Qianwen Meng,Fanglin Zhu,Yonghui Xu,Zhiqi Shen,Lizhen Cui
Keywords: digital lifestyle services, unique dietary predilections, recommendation systems serve, lifestyle services, designed to assist
Categories: Information Retrieval (cs.IR)
Comments: Working paper

Abstract:Food recommendation systems serve as pivotal components in the realm of digital lifestyle services, designed to assist users in discovering recipes and food items that resonate with their unique dietary predilections. Typically, multi-modal descriptions offer an exhaustive profile for each recipe, thereby ensuring recommendations that are both personalized and accurate. Our preliminary investigation of two datasets indicates that pre-trained multi-modal dense representations might precipitate a deterioration in performance compared to ID features when encapsulating interactive relationships. This observation implies that ID features possess a relative superiority in modeling interactive collaborative signals. Consequently, contemporary cutting-edge methodologies augment ID features with multi-modal information as supplementary features, overlooking the latent semantic relations between recipes. To rectify this, we present CLUSSL, a novel food recommendation framework that employs clustering and self-supervised learning. Specifically, CLUSSL formulates a modality-specific graph tailored to each modality with discrete/continuous features, thereby transforming semantic features into structural representation. Furthermore, CLUSSL procures recipe representations pertinent to different modalities via graph convolutional operations. A self-supervised learning objective is proposed to foster independence between recipe representations derived from different unimodal graphs. Comprehensive experiments on real-world datasets substantiate that CLUSSL consistently surpasses state-of-the-art recommendation benchmarks in performance.

[IR-9] A Surprisingly Simple yet Effective Multi-Query Rewriting Method for Conversational Passage Retrieval

Link: https://arxiv.org/abs/2406.18960
Authors: Ivica Kostric,Krisztian Balog
Keywords: Conversational passage retrieval, Conversational passage, natural language, coreference and ellipsis, requires the resolution
Categories: Information Retrieval (cs.IR)
Comments: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

Abstract:Conversational passage retrieval is challenging as it often requires the resolution of references to previous utterances and needs to deal with the complexities of natural language, such as coreference and ellipsis. To address these challenges, pre-trained sequence-to-sequence neural query rewriters are commonly used to generate a single de-contextualized query based on conversation history. Previous research shows that combining multiple query rewrites for the same user utterance has a positive effect on retrieval performance. We propose the use of a neural query rewriter to generate multiple queries and show how to integrate those queries in the passage retrieval pipeline efficiently. The main strength of our approach lies in its simplicity: it leverages how the beam search algorithm works and can produce multiple query rewrites at no additional cost. Our contributions further include devising ways to utilize multi-query rewrites in both sparse and dense first-pass retrieval. We demonstrate that applying our approach on top of a standard passage retrieval pipeline delivers state-of-the-art performance without sacrificing efficiency.
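
The abstract does not say how the rankings obtained from several query rewrites are merged. As one possible illustration, the sketch below uses reciprocal rank fusion (RRF), a generic rank-fusion technique substituted here for whatever the paper actually does. The passage identifiers and the constant `k=60` are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of passage ids into one list.

    Each passage earns 1 / (k + rank) from every list it appears in,
    and passages are returned by descending total score, with ties
    broken by passage id.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical retrieval results for three rewrites of one utterance.
rankings = [
    ["p1", "p2", "p3"],
    ["p2", "p1", "p4"],
    ["p2", "p3", "p1"],
]
fused = reciprocal_rank_fusion(rankings)
```

Passages that rank well for several rewrites rise to the top, which is the intuition behind combining multiple rewrites of the same utterance.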

[IR-10] Towards Personalized Federated Multi-scenario Multi-task Recommendation

Link: https://arxiv.org/abs/2406.18938
Authors: Yue Ding,Yanbiao Ji,Xun Cai,Xin Xin,Xiaofeng Gao,Hongtao Lu
Keywords: recommender system applications, click-through rate, conversion rate, modern recommender system, Multi-task recommender systems
Categories: Information Retrieval (cs.IR)

Abstract:In modern recommender system applications, such as e-commerce, predicting multiple targets like click-through rate (CTR) and post-view click-through conversion rate (CTCVR) is common. Multi-task recommender systems are gaining traction in research and practical use. Existing multi-task recommender systems tackle diverse business scenarios; merging and modeling these scenarios unlocks shared knowledge to boost overall performance. As new and more complex real-world recommendation scenarios have emerged, data privacy issues make it difficult to train a single global multi-task recommendation model that processes multiple separate scenarios. In this paper, we propose a novel framework for personalized federated multi-scenario multi-task recommendation, called PF-MSMTrec. We assign each scenario to a dedicated client, with each client utilizing the Multi-gate Mixture-of-Experts (MMoE) structure. Our proposed method aims to tackle the unique challenge posed by multiple optimization conflicts in this setting. We introduce a bottom-up joint learning mechanism. Firstly, we design a parameter template to decouple the parameters of the expert network. Thus, scenario parameters are shared knowledge for federated parameter aggregation, while task-specific parameters are personalized local parameters. Secondly, we conduct personalized federated learning for the parameters of each expert network through a federated communication round, utilizing three modules: federated batch normalization, conflict coordination, and personalized aggregation. Finally, we perform another round of personalized federated parameter aggregation on the task tower network to obtain the prediction results for multiple tasks. We conduct extensive experiments on two public datasets, and the results demonstrate that our proposed method surpasses state-of-the-art methods.

[IR-11] Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

Link: https://arxiv.org/abs/2406.18836
Authors: Huaying Zhang,Rintaro Yanagi,Ren Togo,Takahiro Ogawa,Miki Haseyama
Keywords: composed image retrieval, textual inversion network, zero-shot composed image, inversion network, zero-shot CIR
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted as a conference paper in IEEE ICIP 2024

Abstract:This paper proposes a novel zero-shot composed image retrieval (CIR) method considering the query-target relationship by masked image-text pairs. The objective of CIR is to retrieve the target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text and use a pre-trained visual-language model to realize the retrieval. However, they do not consider the query-target relationship to train the textual inversion network to acquire information for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting the abundant image-text pairs that are convenient to obtain with a masking strategy for learning the query-target relationship, it is expected that accurate zero-shot CIR using a retrieval-focused textual inversion network can be realized. Experimental results show the effectiveness of the proposed method.

[IR-12] ELCoRec: Enhance Language Understanding with Co-Propagation of Numerical and Categorical Features for Recommendation

Link: https://arxiv.org/abs/2406.18825
Authors: Jizheng Chen,Kounianhua Du,Jianghao Lin,Bo Chen,Ruiming Tang,Weinan Zhang
Keywords: Large language models, natural language processing, Large language, paid much attention, potential for recommendation
Categories: Information Retrieval (cs.IR)

Abstract:Large language models have been flourishing in the natural language processing (NLP) domain, and their potential for recommendation has been paid much attention to. Despite the intelligence shown by the recommendation-oriented finetuned models, LLMs struggle to fully understand the user behavior patterns due to their innate weakness in interpreting numerical features and the overhead for long context, where the temporal relations among user behaviors, subtle quantitative signals among different ratings, and various side features of items are not well explored. Existing works only fine-tune a sole LLM on given text data without introducing that important information to it, leaving these problems unsolved. In this paper, we propose ELCoRec to Enhance Language understanding with CoPropagation of numerical and categorical features for Recommendation. Concretely, we propose to inject the preference understanding capability into LLM via a GAT expert model where the user preference is better encoded by parallelly propagating the temporal relations, and rating signals as well as various side information of historical items. The parallel propagation mechanism could stabilize heterogeneous features and offer an informative user preference encoding, which is then injected into the language models via soft prompting at the cost of a single token embedding. To further obtain the user’s recent interests, we proposed a novel Recent interaction Augmented Prompt (RAP) template. Experiment results over three datasets against strong baselines validate the effectiveness of ELCoRec. The code is available at https://anonymous.4open.science/r/CIKM_Code_Repo-E6F5/README.md.

[IR-13] A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

Link: https://arxiv.org/abs/2406.18747
Authors: Karn N. Watcharasupat,Alexander Lerch
Keywords: significant recent progress, source separation, four-stem vocals, audio source separation, significant recent
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

Abstract:Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at this https URL.

[IR-14] Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models

Link: https://arxiv.org/abs/2406.18740
Authors: Baharan Nouriinanloo,Maxime Lamothe
Keywords: natural language processing, language processing tasks, Large Language Models, natural language, language processing
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities. Indeed, existing work has shown that LLMs can be used to great effect for many tasks, such as information retrieval (IR), and passage ranking. However, current state-of-the-art results heavily lean on the capabilities of the LLM being used. Currently, proprietary, and very large LLMs such as GPT-4 are the highest performing passage re-rankers. Hence, users without the resources to leverage top of the line LLMs, or ones that are closed source, are at a disadvantage. In this paper, we investigate the use of a pre-filtering step before passage re-ranking in IR. Our experiments show that by using a small number of human generated relevance scores, coupled with LLM relevance scoring, it is effectively possible to filter out irrelevant passages before re-ranking. Our experiments also show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task. Indeed, our results show that smaller models such as Mixtral can become competitive with much larger proprietary models (e.g., ChatGPT and GPT-4).
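
The pre-filtering idea can be sketched roughly as below. This is a hypothetical illustration, not the paper's procedure: it assumes each passage already has a numeric LLM relevance score, and it picks a cutoff by maximizing F1 on a small human-labelled sample. The function names and the F1 criterion are assumptions.

```python
def pick_threshold(llm_scores, human_labels):
    """Choose the LLM-score cutoff that maximizes F1 on a small
    human-labelled sample (labels are True for relevant passages)."""
    best_t, best_f1 = None, -1.0
    pairs = list(zip(llm_scores, human_labels))
    for t in sorted(set(llm_scores)):
        tp = sum(1 for s, y in pairs if s >= t and y)
        fp = sum(1 for s, y in pairs if s >= t and not y)
        fn = sum(1 for s, y in pairs if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

def prefilter(passages, scores, threshold):
    """Keep only passages whose LLM score reaches the threshold."""
    return [p for p, s in zip(passages, scores) if s >= threshold]

# Hypothetical labelled sample and unlabelled candidates.
threshold = pick_threshold([0.1, 0.4, 0.6, 0.9], [False, False, True, True])
kept = prefilter(["a", "b", "c"], [0.2, 0.7, 0.6], threshold)
```

Only the passages in `kept` would then be handed to the LLM re-ranker, shortening its input list.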

[IR-15] Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching

Link: https://arxiv.org/abs/2406.18579
Authors: Xuri Ge,Fuhai Chen,Songpei Xu,Fuxiang Tao,Jie Wang,Joemon M. Jose
Keywords: Image-text matching, computer vision, fundamental problem, problem in computer, explicit
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 22pages, 5 Figures, 6 tables, the extension of CMSEI in WACV23, and submitted to ACM TIST. arXiv admin note: text overlap with arXiv:2210.08908

Abstract:Image-text matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representation to estimate their similarity accurately. Most existing methods focus on feature enhancement within modality or feature interaction across modalities, which, however, neglects the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects with salient spatial and semantic relational connectivities, guided by the explicit relationships of the objects’ spatial positions and their scene graph. We use implicit relationship modelling for potential relationship interactions before explicit modelling to improve the fault tolerance of explicit relationship detection. Then the visual and textual semantic representations are refined jointly via inter-modal interactive attention and cross-modal alignment. To correlate the context of objects with the textual context, we further refine the visual semantic representation via cross-level object-sentence and word-image-based interactive attention. Extensive experiments validate that the proposed hybrid-modal interaction with implicit and explicit modelling is more beneficial for image-text matching. And the proposed Hire obtains new state-of-the-art results on MS-COCO and Flickr30K benchmarks.

[IR-16] DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs

Link: https://arxiv.org/abs/2406.18535
Authors: Jinzhe Liu,Xiangsheng Huang,Zhuo Chen,Yin Fang
Keywords: Large Language Models, Large Language, Language Models, encounter challenges, unique syntax
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Ongoing work; 11 pages, 6 Figures, 2 Tables

Abstract:Large Language Models (LLMs) encounter challenges with the unique syntax of specific domains, such as biomolecules. Existing fine-tuning or modality alignment techniques struggle to bridge the domain knowledge gap and understand complex molecular data, limiting LLMs’ progress in specialized fields. To overcome these limitations, we propose an expandable and adaptable non-parametric knowledge injection framework named Domain-specific Retrieval-Augmented Knowledge (DRAK), aimed at enhancing reasoning capabilities in specific domains. Utilizing knowledge-aware prompts and gold label-induced reasoning, DRAK has developed profound expertise in the molecular domain and the capability to handle a broad spectrum of analysis tasks. We evaluated two distinct forms of DRAK variants, proving that DRAK exceeds previous benchmarks on six molecular tasks within the Mol-Instructions dataset. Extensive experiments have underscored DRAK’s formidable performance and its potential to unlock molecular insights, offering a unified paradigm for LLMs to tackle knowledge-intensive tasks in specific domains. Our code will be available soon.

Artificial Intelligence

[AI-0] The Remarkable Robustness of LLMs: Stages of Inference?

Link: https://arxiv.org/abs/2406.19384
Authors: Vedang Lad,Wes Gurnee,Max Tegmark
Keywords: Large Language Models, Large Language, swapping adjacent layers, Language Models, deleting and swapping
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95% of the original model’s prediction accuracy without fine-tuning, whereas models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesize the existence of four universal stages of inference across eight different models: detokenization, feature engineering, prediction ensembling, and residual sharpening. The first stage integrates local information, lifting raw token representations into higher-level contextual representations. Next is the iterative refinement of task and entity-specific features. Then, the second half of the model begins with a phase transition, where hidden representations align more with the vocabulary space due to specialized model components. Finally, the last layer sharpens the following token distribution by eliminating obsolete features that add noise to the prediction.
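
The two interventions described above, deleting a layer and swapping adjacent layers, can be illustrated on a toy stack of residual-style functions. This is only a structural sketch under the assumption that a model is an ordered list of layers; `make_residual_layer` and its scales are hypothetical and unrelated to the paper's models.

```python
import math

def make_residual_layer(scale):
    """Toy residual block: adds a small nonlinear update to its input."""
    def layer(x):
        return x + scale * math.tanh(x)
    return layer

def run(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x

def delete_layer(layers, i):
    """Return a copy of the stack without layer i."""
    return layers[:i] + layers[i + 1:]

def swap_adjacent(layers, i):
    """Return a copy with layers i and i+1 exchanged (i must not be last)."""
    out = list(layers)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

layers = [make_residual_layer(s) for s in (0.1, 0.2, 0.3, 0.4)]
original = run(layers, 1.0)
after_delete = run(delete_layer(layers, 1), 1.0)
after_swap = run(swap_adjacent(layers, 1), 1.0)
```

Comparing `after_delete` and `after_swap` against `original` mirrors, in miniature, how such a study measures how much of the unmodified output survives each intervention.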

[AI-1] Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space

Link: https://arxiv.org/abs/2406.19370
Authors: Core Francisco Park, Maya Okawa, Andrew Lee, Ekdeep Singh Lubana, Hidenori Tanaka
Keywords: Modern generative models, models demonstrate impressive, demonstrate impressive capabilities, manipulate abstract concepts, Modern generative
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Modern generative models demonstrate impressive capabilities, likely stemming from an ability to identify and manipulate abstract concepts underlying their training data. However, fundamental questions remain: what determines the concepts a model learns, the order in which it learns them, and its ability to manipulate those concepts? To address these questions, we propose analyzing a model’s learning dynamics via a framework we call the concept space, where each axis represents an independent concept underlying the data generating process. By characterizing learning dynamics in this space, we identify how the speed at which a concept is learned, and hence the order of concept learning, is controlled by properties of the data we term concept signal. Further, we observe moments of sudden turns in the direction of a model’s learning dynamics in concept space. Surprisingly, these points precisely correspond to the emergence of hidden capabilities, i.e., where latent interventions show the model possesses the capability to manipulate a concept, but these capabilities cannot yet be elicited via naive input prompting. While our results focus on synthetically defined toy datasets, we hypothesize a general claim on emergence of hidden capabilities may hold: generative models possess latent capabilities that emerge suddenly and consistently during training, though a model might not exhibit these capabilities under naive input prompting.

[AI-2] Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?

Link: https://arxiv.org/abs/2406.19354
Authors: Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, Mohit Bansal
Keywords: model editing, model editing problem, editing, model, editing problem concerns
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures

Abstract:The model editing problem concerns how language models should learn new facts about the world over time. While empirical research on model editing has drawn widespread attention, the conceptual foundations of model editing remain shaky – perhaps unsurprisingly, since model editing is essentially belief revision, a storied problem in philosophy that has eluded succinct solutions for decades. Model editing nonetheless demands a solution, since we need to be able to control the knowledge within language models. With this goal in mind, this paper critiques the standard formulation of the model editing problem and proposes a formal testbed for model editing research. We first describe 12 open problems with model editing, based on challenges with (1) defining the problem, (2) developing benchmarks, and (3) assuming LLMs have editable beliefs in the first place. Many of these challenges are extremely difficult to address, e.g. determining far-reaching consequences of edits, labeling probabilistic entailments between facts, and updating beliefs of agent simulators. Next, we introduce a semi-synthetic dataset for model editing based on Wikidata, where we can evaluate edits against labels given by an idealized Bayesian agent. This enables us to say exactly how belief revision in language models falls short of a desirable epistemic standard. We encourage further research exploring settings where such a gold standard can be compared against. Our code is publicly available at: this https URL

[AI-3] IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Link: https://arxiv.org/abs/2406.19349
Authors: Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya
Keywords: Hate speech poses, Hate speech, Indonesian hate speech, social harmony, poses a significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.

[AI-4] Efficient World Models with Context-Aware Tokenization

Link: https://arxiv.org/abs/2406.19320
Authors: Vincent Micheli, Eloi Alonso, François Fleuret
Keywords: deep Reinforcement Learning, Reinforcement Learning, Scaling up deep, deep Reinforcement, methods presents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024

Abstract:Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose Δ-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, Δ-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at this https URL.

[AI-5] Jump Starting Bandits with LLM-Generated Prior Knowledge

Link: https://arxiv.org/abs/2406.19317
Authors: Parand A. Alamdari, Yanshuai Cao, Kevin H. Wilson
Keywords: integrating Large Language, Large Language Models, Large Language, present substantial evidence, substantial evidence demonstrating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
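
The warm-start idea in this abstract can be illustrated with a small sketch. It is a deliberately simplified, non-contextual bandit, not the authors' algorithm: the function names (`warm_start`, `greedy_arm`, `update`) and the optimistic default for unseen arms are assumptions made for illustration, and the "LLM-generated" preferences are just hand-written (arm, reward) pairs.

```python
def warm_start(arms, synthetic_prefs):
    """Turn (arm, reward) pairs, e.g. simulated by an LLM, into prior statistics."""
    stats = {a: {"n": 0, "total": 0.0} for a in arms}
    for arm, reward in synthetic_prefs:
        stats[arm]["n"] += 1
        stats[arm]["total"] += reward
    return stats


def greedy_arm(stats):
    """Pick the arm with the best mean; unseen arms get an optimistic mean of 1.0."""
    def mean(a):
        return stats[a]["total"] / stats[a]["n"] if stats[a]["n"] else 1.0
    return max(stats, key=mean)


def update(stats, arm, reward):
    """Fold one real (online) observation into the running statistics."""
    stats[arm]["n"] += 1
    stats[arm]["total"] += reward


# Hypothetical synthetic preferences standing in for LLM output.
prior = [("email", 1.0), ("email", 1.0), ("sms", 0.0)]
stats = warm_start(["email", "sms"], prior)
first_choice = greedy_arm(stats)  # chosen before any real user feedback arrives
```

Because the prior counts are ordinary observations, real feedback passed to `update` gradually outweighs them if the synthetic preferences turn out to be wrong.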

[AI-6] LiveBench: A Challenging Contamination-Free LLM Benchmark

Link: https://arxiv.org/abs/2406.19314
Authors: Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum
Keywords: Test set contamination, fair LLM evaluation, render benchmarks obsolete, quickly render benchmarks, newer model training
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Test set contamination, wherein test data from a benchmark ends up in a newer model’s training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

[AI-7] From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Link: https://arxiv.org/abs/2406.19292
Authors: Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos
Keywords: Large Language Models, Large Language, Recent studies, shown that Large, accurately retrieve information
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs’ information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., 10.5% improvement on 20 documents MDQA at position 10 for GPT-3.5 Turbo). We also find that finetuned LLMs’ performance on general benchmarks remains almost constant, while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data causes no performance drop, while other baseline data can cause a drop that ranges from 2.33% to 6.19%). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
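
To make "numerical key-value retrieval task" concrete, here is a minimal sketch of how one such synthetic sample could be generated. The prompt wording, the dictionary size, and the function name `make_kv_sample` are illustrative assumptions; the abstract does not specify the authors' actual data format.

```python
import random


def make_kv_sample(num_pairs=20, key_digits=5, seed=0):
    """Build one synthetic prompt: random integer key/value pairs followed by
    a question about a single key (hypothetical format)."""
    rng = random.Random(seed)
    low, high = 10 ** (key_digits - 1), 10 ** key_digits - 1
    keys = rng.sample(range(low, high + 1), num_pairs)  # distinct keys
    pairs = {k: rng.randint(low, high) for k in keys}
    target = rng.choice(keys)
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n\nWhat is the value for key {target}?"
    return prompt, str(pairs[target])


prompt, answer = make_kv_sample()
```

Since the keys and values are random numbers, the correct answer can only come from reading the context, which is what makes such data a clean probe of retrieval rather than memorized knowledge.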

[AI-8] HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Link: https://arxiv.org/abs/2406.19280
Authors: Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
Keywords: large language models, multimodal large language, rapid development, large language, medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

[AI-9] Commodification of Compute

Link: https://arxiv.org/abs/2406.19261
Authors: Jesper Kristensen, David Wender, Carl Anthony
Keywords: big data analytics, artificial intelligence, big data, data analytics, rapid advancements
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); General Economics (econ.GN)

Abstract:The rapid advancements in artificial intelligence, big data analytics, and cloud computing have precipitated an unprecedented demand for computational resources. However, the current landscape of computational resource allocation is characterized by significant inefficiencies, including underutilization and price volatility. This paper addresses these challenges by introducing a novel global platform for the commodification of compute hours, termed the Global Compute Exchange (GCX) (Patent Pending). The GCX leverages blockchain technology and smart contracts to create a secure, transparent, and efficient marketplace for buying and selling computational power. The GCX is built in a layered fashion, comprising Market, App, Clearing, Risk Management, Exchange (Offchain), and Blockchain (Onchain) layers, each ensuring a robust and efficient operation. This platform aims to revolutionize the computational resource market by fostering a decentralized, efficient, and transparent ecosystem that ensures equitable access to computing power, stimulates innovation, and supports diverse user needs on a global scale. By transforming compute hours into a tradable commodity, the GCX seeks to optimize resource utilization, stabilize pricing, and democratize access to computational resources. This paper explores the technological infrastructure, market potential, and societal impact of the GCX, positioning it as a pioneering solution poised to drive the next wave of innovation in commodities and compute.

[AI-10] AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI

Link: https://arxiv.org/abs/2406.19256
Authors: Kaveen Hiniduma, Suren Byna, Jean Luca Bez, Ravi Madduri
Keywords: including Artificial Intelligence, Artificial Intelligence, universally agreed quote, including Artificial, Garbage In Garbage
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures, Accepted to SSDBM 2024

Abstract:“Garbage In Garbage Out” is a universally agreed quote by computer scientists from various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest a considerable amount of time and effort in preparing the data for AI. However, there are no standard methods or frameworks for assessing the “readiness” of data for AI. To provide a quantifiable assessment of the readiness of data for AI processes, we define parameters of AI data readiness and introduce AIDRIN (AI Data Readiness Inspector). AIDRIN is a framework covering a broad range of readiness dimensions available in the literature that aid in evaluating the readiness of data quantitatively and qualitatively. AIDRIN uses metrics in traditional data quality assessment such as completeness, outliers, and duplicates for data evaluation. Furthermore, AIDRIN uses metrics specific to assess data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and FAIR (Findability, Accessibility, Interoperability, and Reusability) principle compliance. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data. The AIDRIN framework enhances the efficiency of the machine learning pipeline to make informed decisions on data readiness for AI applications.

[AI-11] AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2406.19251
Authors: Jia Fu, Xiaoting Qin, Fangkai Yang, Lu Wang, Jue Zhang, Qingwei Lin, Yubo Chen, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Keywords: Large Language Models, Language Models, Recent advancements, Retrieval-Augmented Generation, Models have transformed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in Large Language Models have transformed ML/AI development, necessitating a reevaluation of AutoML principles for Retrieval-Augmented Generation (RAG) systems. To address the challenges of hyper-parameter optimization and online adaptation in RAG, we propose the AutoRAG-HP framework, which formulates hyper-parameter tuning as an online multi-armed bandit (MAB) problem and introduces a novel two-level Hierarchical MAB (Hier-MAB) method for efficient exploration of large search spaces. We conduct extensive experiments on tuning hyper-parameters, such as top-k retrieved documents, prompt compression ratio, and embedding methods, using the ALCE-ASQA and Natural Questions datasets. Our evaluation of jointly optimizing all three hyper-parameters demonstrates that MAB-based online learning methods can achieve Recall@5 ≈ 0.8 for scenarios with prominent gradients in search space, using only ~20% of the LLM API calls required by the Grid Search approach. Additionally, the proposed Hier-MAB approach outperforms other baselines in more challenging optimization scenarios. The code will be made available at this https URL.
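
As a rough illustration of treating hyper-parameter tuning as a multi-armed bandit, the sketch below runs plain UCB1 over candidate top-k values. It is not the paper's Hier-MAB method: the reward is a simulated stand-in for "run the RAG pipeline and score the answer", and the names `ucb1_tune`, `fake_reward`, and the `TRUE_MEAN` table are hypothetical.

```python
import math
import random


def ucb1_tune(arms, reward_fn, rounds=300, seed=0):
    """Choose among candidate hyper-parameter values with UCB1.
    reward_fn(arm, rng) must return a reward in [0, 1]."""
    rng = random.Random(seed)
    counts = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}
    for t in range(1, rounds + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            arm = untried[0]  # play every arm once before using the bound
        else:
            arm = max(
                arms,
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        counts[arm] += 1
        totals[arm] += reward_fn(arm, rng)
    best = max(arms, key=lambda a: totals[a] / counts[a])
    return best, counts


# Simulated success rates per top-k value (made-up numbers for the demo).
TRUE_MEAN = {1: 0.2, 3: 0.5, 5: 0.8, 10: 0.4}


def fake_reward(top_k, rng):
    return 1.0 if rng.random() < TRUE_MEAN[top_k] else 0.0


best_k, pulls = ucb1_tune(list(TRUE_MEAN), fake_reward)
```

Each round corresponds to one scored pipeline call, so the pull counts show how the budget concentrates on promising settings instead of being spread evenly as in grid search.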

[AI-12] Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models

Link: https://arxiv.org/abs/2406.19243
Authors: Borodin Kirill Nikolayevich, Kudryavtsev Vasiliy Dmitrievich, Mkrtchian Grach Maratovich, Gorodnichev Mikhail Genadievich, Korzh Dmitrii Sergeevich
Keywords: crucial components, automatic speaker verification, voice, biometric security, speaker
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:One of the most crucial components in the field of biometric security is the automatic speaker verification system, which is based on the speaker’s voice. It is possible to utilise ASVs in isolation or in conjunction with other AI models. In the contemporary era, the quality and quantity of neural networks are increasing exponentially. Concurrently, there is a growing number of systems that aim to manipulate data through the use of voice conversion and text-to-speech models. The field of voice biometrics forgery is aided by a number of challenges, including SSTC, ASVSpoof, and SingFake. This paper presents a system for automatic speaker verification. The primary objective of our model is the extraction of embeddings from the target speaker’s audio in order to obtain information about important characteristics of his voice, such as pitch, energy, and the duration of phonemes. This information is used in our multivoice TTS pipeline, which is currently under development. However, this model was employed within the SSTC challenge to verify users whose voice had undergone voice conversion, where it demonstrated an EER of 20.669.

[AI-13] Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Link: https://arxiv.org/abs/2406.19236
Authors: Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann
Keywords: aims to develop, navigate based, dynamic human activities, current VLN frameworks, VLN
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 30 pages, 18 figures, Project Page: this https URL

Abstract:Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN’s unique challenges, underscores the need for further research to enhance HA-VLN agents’ real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.

[AI-14] Seeing Is Believing: Black-Box Membership Inference Attacks Against Retrieval Augmented Generation

Link: https://arxiv.org/abs/2406.19234
Authors: Yuying Li, Gaoyang Liu, Yang Yang, Chen Wang
Keywords: enhances Large Language, Large Language Models, Large Language, Retrieval-Augmented Generation, enhances Large
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Retrieval-Augmented Generation (RAG) is a state-of-the-art technique that enhances Large Language Models (LLMs) by retrieving relevant knowledge from an external, non-parametric database. This approach aims to mitigate common LLM issues such as hallucinations and outdated knowledge. Although existing research has demonstrated security and privacy vulnerabilities within RAG systems, making them susceptible to attacks like jailbreaks and prompt injections, the security of the RAG system’s external databases remains largely underexplored. In this paper, we employ Membership Inference Attacks (MIA) to determine whether a sample is part of the knowledge database of a RAG system, using only black-box API access. Our core hypothesis posits that if a sample is a member, it will exhibit significant similarity to the text generated by the RAG system. To test this, we compute the cosine similarity and the model’s perplexity to establish a membership score, thereby building robust features. We then introduce two novel attack strategies: a Threshold-based Attack and a Machine Learning-based Attack, designed to accurately identify membership. Experimental validation of our methods has achieved a ROC AUC of 82%.
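
The threshold-based attack described here can be sketched in a few lines. This is only a toy under stated assumptions: bag-of-words cosine similarity stands in for embedding similarity, the perplexity feature is omitted, the prompting strategy (querying with the first half of the sample) and the 0.7 threshold are guesses for illustration, and `toy_rag` is a hypothetical stand-in for a black-box RAG API.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def is_member(sample, rag_generate, threshold=0.7):
    """Guess membership: query with the first half of the sample and check
    how similar the response is to the full sample."""
    words = sample.split()
    prompt = " ".join(words[: len(words) // 2])
    response = rag_generate(prompt)
    return cosine_similarity(sample, response) >= threshold


# Toy black box: echoes a stored document when the prompt matches its start.
DOCS = ["the quick brown fox jumps over the lazy dog near the river bank"]


def toy_rag(prompt):
    for doc in DOCS:
        if doc.startswith(prompt):
            return doc
    return "i do not have information about that topic"
```

In practice the threshold would be calibrated on samples with known membership, which is also where a learned classifier over several features could replace the single cut-off.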

[AI-15] Tools Fail: Detecting Silent Errors in Faulty Tools

Link: https://arxiv.org/abs/2406.19228
Authors: Jimin Sun, So Yeon Min, Yingshan Chang, Yonatan Bisk
Keywords: control robots, retrieve knowledge, perform tasks, mainstay of LLMs, Abstract
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 12 figures

Abstract:Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model’s ability to detect “silent” tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.

[AI-16] T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Link: https://arxiv.org/abs/2406.19223
Authors: Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
Keywords: Large Language Models, Tokenizers are crucial, Language Models, recently stagnated, inherent weaknesses
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
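
The phrase "sparse activation patterns over character triplets" can be illustrated with a short sketch. It assumes a simplified scheme in which each trigram of an underscore-padded word is hashed to a single row of a fixed-size embedding table; the padding character, the single hash per trigram, the table size of 8192, and the function names are all assumptions rather than details taken from the paper.

```python
import hashlib


def trigrams(word):
    """Character trigrams of a word padded with '_' on both sides."""
    padded = f"_{word.lower()}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def active_rows(word, num_rows=8192):
    """Map each trigram to one embedding-table row via a stable hash."""
    rows = set()
    for tri in trigrams(word):
        digest = hashlib.sha256(tri.encode("utf-8")).digest()
        rows.add(int.from_bytes(digest[:8], "big") % num_rows)
    return rows


def overlap(a, b):
    """Jaccard overlap of the active rows of two words."""
    ra, rb = active_rows(a), active_rows(b)
    union = ra | rb
    return len(ra & rb) / len(union) if union else 0.0


# Morphologically related words share trigrams, hence embedding rows.
related = overlap("walking", "walked")
unrelated = overlap("walking", "zebra")
```

A word's embedding would then be an aggregate (for example, the sum) of the selected rows, so no learned vocabulary or reference corpus is needed to decide which rows a new word activates.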

[AI-17] Hack Me If You Can: Aggregating AutoEncoders for Countering Persistent Access Threats Within Highly Imbalanced Data

Link: https://arxiv.org/abs/2406.19220
Authors: Sidahmed Benabderrahmane, Ngoc Hoang, Petko Valtchev, James Cheney, Talal Rahwan
Keywords: Advanced Persistent Threats, Advanced Persistent, Persistent Threats, gain unauthorized access, targeted cyberattacks designed
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To appear in Future Generation Computer Systems

Abstract:Advanced Persistent Threats (APTs) are sophisticated, targeted cyberattacks designed to gain unauthorized access to systems and remain undetected for extended periods. To evade detection, APT cyberattacks deceive defense layers with breaches and exploits, thereby complicating exposure by traditional anomaly detection-based security methods. The challenge of detecting APTs with machine learning is compounded by the rarity of relevant datasets and the significant imbalance in the data, which makes the detection process highly burdensome. We present AE-APT, a deep learning-based tool for APT detection that features a family of AutoEncoder methods ranging from a basic one to a Transformer-based one. We evaluated our tool on a suite of provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The outcomes showed that AE-APT has significantly higher detection rates compared to its competitors, indicating superior performance in detecting and ranking anomalies.

[AI-18] Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

Link: https://arxiv.org/abs/2406.19217
Authors: Zhimin Shao, Jialang Xu, Danail Stoyanov, Evangelos B. Mazomenos, Yueming Jin
Keywords: minimally invasive surgery, robot-assisted minimally invasive, surgical data science, data science, ensuring safe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 4 figures

Abstract:Despite significant advancements in robotic systems and surgical data science, ensuring safe and optimal execution in robot-assisted minimally invasive surgery (RMIS) remains a complex challenge. Current surgical error detection methods involve two parts: identifying surgical gestures and then detecting errors within each gesture clip. These methods seldom consider the rich contextual and semantic information inherent in surgical videos, limiting their performance due to reliance on accurate gesture identification. Motivated by the chain-of-thought prompting in natural language processing, this letter presents a novel and real-time end-to-end error detection framework, Chain-of-Gesture (COG) prompting, leveraging contextual information from surgical videos. This encompasses two reasoning modules designed to mimic the decision-making processes of expert surgeons. Concretely, we first design a Gestural-Visual Reasoning module, which utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction. We extensively validate our method on the public benchmark RMIS dataset JIGSAWS. Our method encapsulates the reasoning processes inherent to surgical activities, enabling it to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on average, demonstrating the great potential of our approach in enhancing the safety and efficacy of RMIS procedures and surgical education. The code will be available.

[AI-19] Estimating Long-term Heterogeneous Dose-response Curve: Generalization Bound Leveraging Optimal Transport Weights

Link: https://arxiv.org/abs/2406.19195
Authors: Zeqin Yang, Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, Zhipeng Yu, Zhichao Zou, Zhen Peng, Jiecheng Guo
Keywords: causal effect estimation, Long-term causal effect, long-term average effects, significant but challenging, Long-term causal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Long-term causal effect estimation is a significant but challenging problem in many applications. Existing methods rely on ideal assumptions to estimate long-term average effects, e.g., no unobserved confounders or a binary treatment, while in numerous real-world applications, these assumptions could be violated and average effects are unable to provide individual-level this http URL this paper, we address a more general problem of estimating the long-term heterogeneous dose-response curve (HDRC) while accounting for unobserved confounders. Specifically, to remove unobserved confounding in observational data, we introduce an optimal transport weighting framework to align the observational data to the experimental data with theoretical guarantees. Furthermore, to accurately predict the heterogeneous effects of continuous treatment, we establish a generalization bound on counterfactual prediction error by leveraging the reweighted distribution induced by optimal transport. Finally, we develop an HDRC estimator building upon the above theoretical foundations. Extensive experimental studies conducted on multiple synthetic and semi-synthetic datasets demonstrate the effectiveness of our proposed method.

[AI-20] BISeizuRe: BERT-Inspired Seizure Data Representation to Improve Epilepsy Monitoring

Link: https://arxiv.org/abs/2406.19189
Authors: Luca Benfenati, Thorir Mar Ingolfsson, Andrea Cossettini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini
Keywords: Hospital EEG Corpus, University Hospital EEG, Temple University Hospital, Scalp EEG Database, study presents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 tables, 2 figures

Abstract:This study presents a novel approach for EEG-based seizure detection leveraging a BERT-based model. The model, BENDR, undergoes a two-phase training process. Initially, it is pre-trained on the extensive Temple University Hospital EEG Corpus (TUEG), a 1.5 TB dataset comprising over 10,000 subjects, to extract common EEG data patterns. Subsequently, the model is fine-tuned on the CHB-MIT Scalp EEG Database, consisting of 664 EEG recordings from 24 pediatric patients, of which 198 contain seizure events. Key contributions include optimizing fine-tuning on the CHB-MIT dataset, where the impact of model architecture, pre-processing, and post-processing techniques is thoroughly examined to enhance sensitivity and reduce false positives per hour (FP/h). We also explored custom training strategies to ascertain the most effective setup. The model undergoes a novel second pre-training phase before subject-specific fine-tuning, enhancing its generalization capabilities. The optimized model demonstrates substantial performance enhancements, achieving as low as 0.23 FP/h, 2.5× lower than the baseline model, with a lower but still acceptable sensitivity rate, showcasing the effectiveness of applying a BERT-based approach to EEG-based seizure detection.

[AI-21] RAVEN: Multitask Retrieval Augmented Vision-Language Learning

Link: https://arxiv.org/abs/2406.19150
Authors: Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju
Keywords: exacerbated resource barriers, large language models, scaling of large, large language, world knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:The scaling of large language models to encode all the world’s knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they’re limited by the need for resource-intensive pre-training, additional parameter requirements, unaddressed modality prioritization, and a lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly +3% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.

[AI-22] BackMix: Mitigating Shortcut Learning in Echocardiography with Minimal Supervision

Link: https://arxiv.org/abs/2406.19148
Authors: Kit Mills Bransby, Arian Beqiri, Woo-Jin Cho Kim, Jorge Oliveira, Agisilaos Chartsias, Alberto Gomez
Keywords: learn spurious correlations, Neural networks, Clever Hans effect, correct prediction, wrong reason
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at MICCAI 2024 (Pre-print)

Abstract:Neural networks can learn spurious correlations that lead to the correct prediction in a validation set, but generalise poorly because the predictions are right for the wrong reason. This undesired learning of naive shortcuts (Clever Hans effect) can happen for example in echocardiogram view classification when background cues (e.g. metadata) are biased towards a class and the model learns to focus on those background features instead of on the image content. We propose a simple, yet effective random background augmentation method called BackMix, which samples random backgrounds from other examples in the training set. By enforcing the background to be uncorrelated with the outcome, the model learns to focus on the data within the ultrasound sector and becomes invariant to the regions outside this. We extend our method in a semi-supervised setting, finding that the positive effects of BackMix are maintained with as few as 5% of segmentation labels. A loss weighting mechanism, wBackMix, is also proposed to increase the contribution of the augmented examples. We validate our method on both in-distribution and out-of-distribution datasets, demonstrating significant improvements in classification accuracy, region focus and generalisability. Our source code is available at: this https URL
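
The background-swapping idea described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: images are small nested lists, every image shares one boolean ultrasound-sector mask, and the names `mix_background` and `augment` are hypothetical.

```python
import random
from typing import List

Image = List[List[int]]
Mask = List[List[bool]]


def mix_background(image: Image, sector: Mask, donor: Image) -> Image:
    """Keep pixels inside the sector; take every other pixel from the donor."""
    return [
        [img_px if inside else donor_px
         for img_px, inside, donor_px in zip(img_row, mask_row, donor_row)]
        for img_row, mask_row, donor_row in zip(image, sector, donor)
    ]


def augment(index: int, images: List[Image], sector: Mask,
            rng: random.Random) -> Image:
    """Use a randomly chosen *other* training image as the background donor.

    Assumption for this sketch: all images share the same sector mask.
    """
    donor_index = rng.choice([i for i in range(len(images)) if i != index])
    return mix_background(images[index], sector, images[donor_index])
```

Because the donor is drawn independently of the class label, the pixels outside the sector carry no information about the outcome, which is the property the abstract relies on.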

[AI-23] YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention

Link: https://arxiv.org/abs/2406.19136
Authors: Chenxu Wang, Haowei Ming, Jian He, Yao Lu
Keywords: drug ADME processes, ADME processes, effectiveness and safety, essential for determining, determining their therapeutic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures, 6 tables

Abstract:The accurate prediction of drug molecule solubility is essential for determining their therapeutic effectiveness and safety, influencing the drug’s ADME processes. Traditional solubility prediction techniques often fail to capture the complex nature of molecular structures, leading to notable deviations between predictions and actual results. For example, in a discussion of advanced drug-like compound structures, Lusci highlighted issues in capturing crucial cyclic structural information in molecules with ring structures. To overcome this issue, our research introduces a novel deep learning framework combining attention-based transformers, Long Short-Term Memory (LSTM) networks, and Graph Convolutional Networks (GCN), aimed at enhancing the precision of solubility predictions. Utilizing a training set of 9,943 compounds and testing on an anticancer compound dataset, our method achieved a correlation coefficient (R^2) of 0.55 and a Root Mean Square Error (RMSE) of 0.59, which outperforms the benchmark models’ scores of 0.52 (R^2) and 0.61 (RMSE). Importantly, in an additional independent test, our model significantly outperformed the baseline with an RMSE of 1.05 compared to 1.28, a relative accuracy improvement of 45.9%. This research not only demonstrates the vast potential of deep learning for improving solubility prediction accuracy but also offers novel insights for drug design and selection in the future. Continued efforts will be directed towards optimizing the model architecture and extending its application to better support the drug development process, underscoring the pivotal role of deep learning in drug discovery.

[AI-24] Towards Learning Abductive Reasoning using VSA Distributed Representations

Link: https://arxiv.org/abs/2406.19121
Authors: Giacomo Camposampiero, Michael Hersche, Aleksandar Terzić, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Keywords: Abductive Rule Learner, Learner with Context-awareness, solves abstract reasoning, abstract reasoning tasks, reasoning tasks based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy) 2024

Abstract:We introduce the Abductive Rule Learner with Context-awareness (ARLC), a model that solves abstract reasoning tasks based on Learn-VRF. ARLC features a novel and more broadly applicable training objective for abductive reasoning, resulting in better interpretability and higher accuracy when solving Raven’s progressive matrices (RPM). ARLC allows both programming domain knowledge and learning the rules underlying a data distribution. We evaluate ARLC on the I-RAVEN dataset, showcasing state-of-the-art accuracy across both in-distribution and out-of-distribution (unseen attribute-rule pairs) tests. ARLC surpasses neuro-symbolic and connectionist baselines, including large language models, despite having orders of magnitude fewer parameters. We show ARLC’s robustness to post-programming training by incrementally learning from examples on top of programmed knowledge, which only improves its performance and does not result in catastrophic forgetting of the programmed solution. We validate ARLC’s seamless transfer learning from a 2x2 RPM constellation to unseen constellations. Our code is available at this https URL.

[AI-25] CHEW: A Dataset of CHanging Events in Wikipedia

Link: https://arxiv.org/abs/2406.19116
Authors: Hsuvas Borkakoty, Luis Espinosa-Anke
Keywords: naturally occurring text, occurring text, introduce CHEW, Wikipedia expressed, dataset of changing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Short Paper

Abstract:We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their timeline understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.

[AI-26] Computational Life: How Well-formed Self-replicating Programs Emerge from Simple Interaction

Link: https://arxiv.org/abs/2406.19108
Authors: Blaise Agüera y Arcas, Jyrki Alakuijala, James Evans, Ben Laurie, Alexander Mordvintsev, Eyvind Niklasson, Ettore Randazzo, Luca Versari
Keywords: Artificial Life, fields of Origin, Life, Life and Artificial, Origin of Life
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 19 pages

Abstract:The fields of Origin of Life and Artificial Life both question what life is and how it emerges from a distinct set of “pre-life” dynamics. One common feature of most substrates where life emerges is a marked shift in dynamics when self-replication appears. While there are some hypotheses regarding how self-replicators arose in nature, we know very little about the general dynamics, computational principles, and necessary conditions for self-replicators to emerge. This is especially true on “computational substrates” where interactions involve logical, mathematical, or programming rules. In this paper we take a step towards understanding how self-replicators arise by studying several computational substrates based on various simple programming languages and machine instruction sets. We show that when random, non self-replicating programs are placed in an environment lacking any explicit fitness landscape, self-replicators tend to arise. We demonstrate how this occurs due to random interactions and self-modification, and can happen with and without background random mutations. We also show how increasingly complex dynamics continue to emerge following the rise of self-replicators. Finally, we show a counterexample of a minimalistic programming language where self-replicators are possible, but so far have not been observed to arise.

[AI-27] Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Link: https://arxiv.org/abs/2406.19102
Authors: Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar
Keywords: greenhouse gas emissions, water consumption, waste management, KPIs assess, climate change
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

Abstract:Environment, Social, and Governance (ESG) KPIs assess an organization’s performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.

[AI-28] Dimensions underlying the representational alignment of deep neural networks with humans

Link: https://arxiv.org/abs/2406.19087
Authors: Florian P. Mahner, Lukas Muttenthaler, Umut Güçlü, Martin N. Hebart
Keywords: artificial intelligence, machine learning, Determining the similarities, Determining, DNN
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Determining the similarities and differences between humans and artificial intelligence is an important goal both in machine learning and cognitive neuroscience. However, similarities in representations only inform us about the degree of alignment, not the factors that determine it. Drawing upon recent developments in cognitive science, we propose a generic framework for yielding comparable representations in humans and deep neural networks (DNN). Applying this framework to humans and a DNN model of natural images revealed a low-dimensional DNN embedding of both visual and semantic dimensions. In contrast to humans, DNNs exhibited a clear dominance of visual over semantic features, indicating divergent strategies for representing images. While in-silico experiments showed seemingly-consistent interpretability of DNN dimensions, a direct comparison between human and DNN representations revealed substantial differences in how they process images. By making representations directly comparable, our results reveal important challenges for representational alignment, offering a means for improving their comparability.

[AI-29] EmPO: Theory-Driven Dataset Construction for Empathetic Response Generation through Preference Optimization

Link: https://arxiv.org/abs/2406.19071
Authors: Ondrej Sotolar
Keywords: emotionally intelligent multi-turn, intelligent multi-turn conversations, Empathetic response generation, conversational agents, crucial for facilitating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: v01, 4 pages short paper, ACL style

Abstract:Empathetic response generation is a desirable aspect of conversational agents, crucial for facilitating engaging and emotionally intelligent multi-turn conversations between humans and machines. Leveraging large language models for this task has shown promising results, yet challenges persist in ensuring both the empathetic quality of the responses and retention of the generalization performance of the models. In this paper, we propose a novel approach where we construct theory-driven preference datasets and use them to align LLMs with preference optimization algorithms to address these challenges. To measure empathetic response generation, we employ the EmpatheticDialogues dataset, assessing empathy with the diff-EPITOME and BERTscore metrics, and evaluate the generalization performance on the MMLU benchmark. We make all datasets, source code, and models publicly available.

[AI-30] Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO

Link: https://arxiv.org/abs/2406.19057
Authors: Fuseini Mumuni, Alhassan Mumuni
Keywords: Segment Anything Model, Grounding DINO, Model, achieved impressive performance, zero-shot object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have great potential to revolutionize zero-shot semantic segmentation and data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall under existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted a severe limitation of the REC framework in this application setting: its tendency to make false positive predictions when the target is absent from the given image. While this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on eight publicly available datasets and reveal that these errors consistently follow a predictable pattern and can thus be mitigated by a simple strategy. Specifically, we show that these false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research in improving REC-based detection and automated segmentation. Using this technique, we evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvement in segmentation performance and annotation time savings over manual approaches.
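
The size-based filtering strategy mentioned in this abstract can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the `Detection` type, the function name, and the default `max_area_ratio=0.5` are hypothetical placeholders, not the data structures or the threshold used by the authors or by Grounding DINO.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Axis-aligned box in pixel coordinates plus a confidence score."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    score: float


def filter_by_relative_area(detections: List[Detection],
                            image_width: float,
                            image_height: float,
                            max_area_ratio: float = 0.5) -> List[Detection]:
    """Drop boxes whose area exceeds `max_area_ratio` of the image area.

    `max_area_ratio` is an illustrative value, not one reported in the paper.
    """
    image_area = image_width * image_height
    kept = []
    for det in detections:
        box_area = max(0.0, det.x_max - det.x_min) * max(0.0, det.y_max - det.y_min)
        if box_area / image_area <= max_area_ratio:
            kept.append(det)
    return kept
```

In an annotation pipeline of this kind, the surviving boxes would then be passed to the segmentation model as prompts; that step is omitted here.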

[AI-31] A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)

Link: https://arxiv.org/abs/2406.19054
Authors: Daniel Sonntag, Michael Barz, Thiago Gouvêa
Keywords: German Federal Ministry, German Federal, Federal Ministry, Ministry of Education, reveals deeper insights
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: DFKI Technical Report

Abstract:This DFKI technical report presents the anatomy of the No-IDLE prototype system (funded by the German Federal Ministry of Education and Research) that provides not only basic and fundamental research in interactive machine learning, but also reveals deeper insights into users’ behaviours, needs, and goals. Machine learning and deep learning should become accessible to millions of end users. No-IDLE’s goals and scientific challenges centre around the desire to increase the reach of interactive deep learning solutions for non-experts in machine learning. One of the key innovations described in this technical report is a methodology for interactive machine learning combined with multimodal interaction which will become central when we start interacting with semi-intelligent machines in the upcoming area of neural networks and large language models.

[AI-32] FedMap: Iterative Magnitude-Based Pruning for Communication-Efficient Federated Learning

Link: https://arxiv.org/abs/2406.19050
Authors: Alexander Herzog, Robbie Southam, Ioannis Mavromatis, Aftab Khan
Keywords: distributed machine learning, Federated Learning, preserving privacy, distributed machine, enables training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE Transactions on Neural Networks and Learning Systems

Abstract:Federated Learning (FL) is a distributed machine learning approach that enables training on decentralized data while preserving privacy. However, FL systems often involve resource-constrained client devices with limited computational power, memory, storage, and bandwidth. This paper introduces FedMap, a novel method that aims to enhance the communication efficiency of FL deployments by collaboratively learning an increasingly sparse global model through iterative, unstructured pruning. Importantly, FedMap trains a global model from scratch, unlike other methods reported in the literature, making it ideal for privacy-critical use cases such as in the medical and finance domains, where suitable pre-training data is often limited. FedMap adapts iterative magnitude-based pruning to the FL setting, ensuring all clients prune and refine the same subset of the global model parameters, therefore gradually reducing the global model size and communication overhead. The iterative nature of FedMap, forming subsequent models as subsets of predecessors, avoids parameter reactivation issues seen in prior work, resulting in stable performance. In this paper we provide an extensive evaluation of FedMap across diverse settings, datasets, model architectures, and hyperparameters, assessing performance in both IID and non-IID environments. Comparative analysis against the baseline approach demonstrates FedMap’s ability to achieve more stable client model performance. For IID scenarios, FedMap achieves over 90% pruning without significant performance degradation. In non-IID settings, it achieves at least ~80% pruning while maintaining accuracy. FedMap offers a promising solution to alleviate communication bottlenecks in FL systems while retaining model accuracy.
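
The nested-mask property behind iterative magnitude-based pruning can be shown with a small Python sketch. It is an illustration only, not FedMap itself: the function name `prune_step`, the flat weight list, and the 50% per-round pruning fraction are assumptions made for the example, and both the retraining between rounds and the federated aggregation step are omitted.

```python
from typing import List


def prune_step(weights: List[float], mask: List[bool],
               fraction: float) -> List[bool]:
    """Return a new mask that drops the smallest-magnitude `fraction`
    of the weights that are still active.

    A pruned weight is never reactivated, so every mask is a subset of
    the previous one.
    """
    active = [i for i, keep in enumerate(mask) if keep]
    n_drop = int(len(active) * fraction)
    # Sort the active indices by |weight| and drop the smallest ones.
    to_drop = set(sorted(active, key=lambda i: abs(weights[i]))[:n_drop])
    return [keep and i not in to_drop for i, keep in enumerate(mask)]


# Toy weights; in a real run they would be retrained between rounds.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08]
mask = [True] * len(weights)
history = [mask]
for _ in range(3):  # three rounds, each pruning 50% of the survivors
    mask = prune_step(weights, mask, fraction=0.5)
    history.append(mask)
```

If every client applies the same mask, only the surviving parameters need to be exchanged in each round, which is where the communication saving described in the abstract would come from.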

[AI-33] Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation

Link: https://arxiv.org/abs/2406.19049
Authors: Amartya Sanyal, Yaxi Hu, Yaodong Yu, Yian Ma, Yixin Wang, Bernhard Schölkopf
Keywords: widely observed phenomenon, machine learning, widely observed, OOD, data configurations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:“Accuracy-on-the-line” is a widely observed phenomenon in machine learning, where a model’s accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to “Accuracy-on-the-wrong-line”. This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on Out-of-distribution (OOD) error in a linear classification model, characterizing the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
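
The difference between the two regimes named in this abstract comes down to the sign of the correlation between ID and OOD accuracy across models. The Python sketch below computes a Pearson correlation by hand; the accuracy values are made-up numbers for five hypothetical models and are not results from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative values only: one entry per hypothetical model.
id_acc = [0.70, 0.75, 0.80, 0.85, 0.90]
ood_on_line = [0.50, 0.56, 0.61, 0.65, 0.72]     # rises with ID accuracy
ood_wrong_line = [0.60, 0.55, 0.52, 0.46, 0.40]  # falls as ID accuracy rises

r_on_line = pearson(id_acc, ood_on_line)
r_wrong_line = pearson(id_acc, ood_wrong_line)
```

A strongly positive coefficient corresponds to "Accuracy-on-the-line", while a negative one corresponds to the "Accuracy-on-the-wrong-line" behavior the abstract describes.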

[AI-34] BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

Link: https://arxiv.org/abs/2406.19048
Authors: Yang Song, Lin Wang
Keywords: camera features, features, Lidar features, autonomous driving, Image Enhancement Module
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:3D object detection is an important task that has been widely applied in autonomous driving. Recently, fusing multi-modal inputs, i.e., LiDAR and camera data, to perform this task has become a new trend. Existing methods, however, either ignore the sparsity of LiDAR features or fail to preserve the original spatial structure of LiDAR and the semantic density of camera features simultaneously due to the modality gap. To address these issues, this letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion, that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to mutually fuse the multi-modal features to enhance the semantics of LiDAR features and the spatial awareness of the camera features, and to adaptively select features from both modalities to build a unified 3D representation. Specifically, we introduce Pre-Fusion, consisting of a Voxel Enhancement Module (VEM) to enhance the semantics of voxel features from 2D camera features and an Image Enhancement Module (IEM) to enhance the spatial characteristics of camera features from 3D voxel features. Both VEM and IEM are bidirectionally updated to effectively reduce the modality gap. We then introduce Unified Fusion to adaptively weight and select features from the enhanced LiDAR and camera features to build a unified 3D representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: this https URL.

[AI-35] Lithium-Ion Battery System Health Monitoring and Fault Analysis from Field Data Using Gaussian Processes

Link: https://arxiv.org/abs/2406.19015
Authors: Joachim Schaeffer, Eric Lenz, Duncan Gulla, Martin Z. Bazant, Richard D. Braatz, Rolf Findeisen
Keywords: Health monitoring, safe and sustainable, sustainable operation, battery systems, Gaussian process resistance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Applications (stat.AP)

Abstract:Health monitoring, fault analysis, and detection are critical for the safe and sustainable operation of battery systems. We apply Gaussian process resistance models on lithium iron phosphate battery field data to effectively separate the time-dependent and operating point-dependent resistance. The data set contains 29 battery systems returned to the manufacturer for warranty, each with eight cells in series, totaling 232 cells and 131 million data rows. We develop probabilistic fault detection rules using recursive spatiotemporal Gaussian processes. These processes allow the quick processing of over a million data points, enabling advanced online monitoring and furthering the understanding of battery pack failure in the field. The analysis underlines that often, only a single cell shows abnormal behavior or a knee point, consistent with weakest-link failure for cells connected in series, amplified by local resistive heating. The results further the understanding of how batteries degrade and fail in the field and demonstrate the potential of efficient online monitoring based on data. We open-source the code and publish the large data set upon completion of the review of this article.

[AI-36] FedMLP: Federated Multi-Label Medical Image Classification under Task Heterogeneity

Link: https://arxiv.org/abs/2406.18995
Authors: Zhaobin Sun (1), Nannan Wu (1), Junjie Shi (1), Li Yu (1), Xin Yang (1), Kwang-Ting Cheng (2), Zengqiang Yan (1) ((1) School of Electronic Information and Communications, Huazhong University of Science and Technology, (2) School of Engineering, Hong Kong University of Science and Technology)
Keywords: enables decentralized organizations, preserving data privacy, made significant progress, collaboratively train models, Cross-silo federated learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Early accepted by MICCAI 2024

Abstract:Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy and has made significant progress in medical image classification. One common assumption is task homogeneity where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, constrained by the level of medical knowledge and the prevalence of diseases, each institution may diagnose only partial categories, resulting in task heterogeneity. How to pursue effective multi-label medical image classification under task heterogeneity is under-explored. In this paper, we first formulate such a realistic label missing setting in the multi-label FL domain and propose a two-stage method FedMLP to combat class missing from two aspects: pseudo label tagging and global knowledge learning. The former utilizes a warmed-up model to generate class prototypes and select samples with high confidence to supplement missing labels, while the latter uses a global model as a teacher for consistency regularization to prevent forgetting missing class knowledge. Experiments on two publicly available medical datasets validate the superiority of FedMLP over state-of-the-art federated semi-supervised and noisy-label learning approaches under task heterogeneity. Code is available at this https URL.

[AI-37] Semi-supervised Concept Bottleneck Models

Link: https://arxiv.org/abs/2406.18992
Authors: Lijie Hu, Tianhao Huang, Huanyi Xie, Chenyang Ren, Zhengyu Hu, Lu Yu, Di Wang
Keywords: garnered increasing attention, increasing attention due, provide concept-based explanations, black-box deep learning, achieving high final
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features, an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We propose a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 20% labeled data, we achieve 93.19% concept accuracy (96.39% in a fully supervised setting) and 75.51% prediction accuracy (79.82% in a fully supervised setting).

[AI-38] Alignment For Performance Improvement in Conversation Bots

Link: https://arxiv.org/abs/2406.18954
Authors: Raghav Garg, Kapil Sharma, Shrey Singla
Keywords: Identity Preference Optimization, achieve superior adherence, paper shows, achieve superior, predefined guidelines
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper shows that, for conversational agents (also known as bots) that must operate within predefined guidelines or ‘guardrails’, alignment methods can achieve better adherence to those guardrails than instruction fine-tuning alone. It examines traditional training approaches such as instruction fine-tuning and recent advances in direct alignment methods such as Identity Preference Optimization (IPO) and Kahneman-Tversky Optimization (KTO). The effectiveness of alignment techniques both before and after instruction tuning is highlighted, illustrating their potential to optimize conversational bots in domains that require strict adherence to specified rules, such as customer care.

[AI-39] Investigating and Defending Shortcut Learning in Personalized Diffusion Models

Link: https://arxiv.org/abs/2406.18944
Authors: Yixin Liu, Ruoxi Chen, Lichao Sun
Keywords: Personalized diffusion models, adapting pre-trained, gained popularity, popularity for adapting, specific topics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Preprint

Abstract:Personalized diffusion models have gained popularity for adapting pre-trained text-to-image models to generate images of specific topics with only a few images. However, recent studies find that these models are vulnerable to minor adversarial perturbations, and the fine-tuning performance is largely degraded on corrupted datasets. Such characteristics are further exploited to craft protective perturbations on sensitive images like portraits that prevent unauthorized generation. In response, diffusion-based purification methods have been proposed to remove these perturbations and retain generation performance. However, existing works lack detailed analysis of the fundamental shortcut learning vulnerability of personalized diffusion models and also tend to over-purify the images, causing information loss. In this paper, we take a closer look at the fine-tuning process of personalized diffusion models through the lens of shortcut learning and propose a hypothesis that could explain the underlying manipulation mechanisms of existing perturbation methods. Specifically, we find that the perturbed images are greatly shifted from their original paired prompt in the CLIP-based latent space. As a result, training with this mismatched image-prompt pair creates a construction that causes the models to dump their out-of-distribution noisy patterns to the identifier, thus causing serious performance degradation. Based on this observation, we propose a systematic approach to retain the training performance with purification that realigns the latent image and its semantic meaning and also introduces contrastive learning with a negative token to decouple the learning of the wanted clean identity and the unwanted noisy pattern, which shows strong potential against further adaptive perturbations.

[AI-40] Evaluating AI Group Fairness: a Fuzzy Logic Perspective

Link: https://arxiv.org/abs/2406.18939
Authors: Emmanouil Krasanakis, Symeon Papadopoulos
Keywords: Artificial intelligence systems, Artificial intelligence, address fairness concerns, genders or races, concerns by evaluating
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: preprint, 32 pages, 7 figures, 2 theorems, 6 appendices

Abstract:Artificial intelligence systems often address fairness concerns by evaluating and mitigating measures of group discrimination, for example measures that indicate biases against certain genders or races. However, what constitutes group fairness depends on who is asked and the social context, whereas definitions are often relaxed to accept small deviations from the statistical constraints they set out to impose. Here we decouple definitions of group fairness both from the context and from relaxation-related uncertainty by expressing them in the axiomatic system of Basic fuzzy Logic (BL) with loosely understood predicates, like encountering group members. We then evaluate the definitions in subclasses of BL, such as Product or Łukasiewicz logics. Evaluation produces continuous instead of binary truth values by choosing the logic subclass and truth values for predicates that reflect uncertain context-specific beliefs, such as stakeholder opinions gathered through questionnaires. Internally, it follows logic-specific rules to compute the truth values of definitions. We show that commonly held propositions standardize the resulting mathematical formulas and we transcribe logic and truth value choices to layperson terms, so that anyone can answer them. We also use our framework to study several literature definitions of algorithmic fairness, for which we rationalize previous expedient practices that are non-probabilistic and show how to re-interpret their formulas and parameters in new contexts.
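To illustrate how the choice of logic changes a graded truth value, the sketch below evaluates one conjunction and one implication under the Product and Łukasiewicz t-norms, two standard subclasses of BL. The predicate names and the truth values 0.8 and 0.6 are hypothetical and are not taken from the paper; the paper's own fairness definitions are not reproduced here.

```python
# Standard t-norms (fuzzy conjunction) and their residual implications.
# All input truth values are illustrative assumptions, not data from the paper.

def product_and(a: float, b: float) -> float:
    """Product t-norm: conjunction is multiplication."""
    return a * b

def product_implies(a: float, b: float) -> float:
    """Residuum of the product t-norm (Goguen implication)."""
    return 1.0 if a <= b else b / a

def lukasiewicz_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm: bounded sum minus one."""
    return max(0.0, a + b - 1.0)

def lukasiewicz_implies(a: float, b: float) -> float:
    """Residuum of the Lukasiewicz t-norm."""
    return min(1.0, 1.0 - a + b)

# Hypothetical degrees of belief for two predicates.
group_member = 0.8
positive_outcome = 0.6

results = {
    "product": (
        product_and(group_member, positive_outcome),
        product_implies(group_member, positive_outcome),
    ),
    "lukasiewicz": (
        lukasiewicz_and(group_member, positive_outcome),
        lukasiewicz_implies(group_member, positive_outcome),
    ),
}

for name, (conj, impl) in results.items():
    print(f"{name}: and={conj:.2f}, implies={impl:.2f}")
```

The same two input beliefs yield different truth values depending on the logic, which is why the choice of subclass has to be made explicit before a definition can be evaluated.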

[AI-41] Federated Graph Semantic and Structural Learning

Link: https://arxiv.org/abs/2406.18937
Authors: Wenke Huang, Guancheng Wan, Mang Ye, Bo Du
Keywords: Federated graph learning, learning collaboratively learns, identically distributed property, graph learning collaboratively, Federated graph
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Federated graph learning collaboratively learns a global graph neural network with distributed graphs, where the non-independent and identically distributed property is one of the major challenges. Most related work focuses on traditional distributed tasks such as images and speech and cannot handle graph structures. This paper first reveals that local client distortion is brought by both node-level semantics and graph-level structure. First, for node-level semantics, we find that contrasting nodes from distinct classes is beneficial to provide a well-performing discrimination. We pull the local node towards the global node of the same class and push it away from the global nodes of different classes. Second, we postulate that a well-structured graph neural network possesses similarity for neighbors due to the inherent adjacency relationships. However, aligning each node with adjacent nodes hinders discrimination due to the potential class inconsistency. We transform the adjacency relationships into a similarity distribution and leverage the global model to distill the relation knowledge into the local model, which preserves the structural information and discriminability of the local model. Empirical results on three graph datasets demonstrate the superiority of the proposed method over its counterparts.

[AI-42] Reasoning About Action and Change

Link: https://arxiv.org/abs/2406.18930
Authors: Florence Dupin de Saint-Cyr (IRIT-ADRIA, UT3), Andreas Herzig (IRIT-LILaC, CNRS), Jérôme Lang (LAMSADE, PSL, IRIT-ADRIA), Pierre Marquis (CRIL)
Keywords: ranging from basic, interfaces and applications, current issues, provide an overview, basic work
Categories: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)

Abstract:The purpose of this book is to provide an overview of AI research, ranging from basic work to interfaces and applications, with as much emphasis on results as on current issues. It is aimed at an audience of master students and Ph.D. students, and can be of interest as well for researchers and engineers who want to know more about AI. The book is split into three volumes.

[AI-43] Learning Pareto Set for Multi-Objective Continuous Robot Control

Link: https://arxiv.org/abs/2406.18924
Authors: Tianye Shu, Ke Shang, Cheng Gong, Yang Nan, Hisao Ishibuchi
Keywords: multiple conflicting objectives, Pareto-optimal policies called, Pareto set, Pareto-optimal deep policies, called the Pareto
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the least training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.
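The idea of one hypernet that emits a policy for each user preference can be sketched structurally as follows. This is a minimal, untrained illustration in plain Python: the affine hypernet, the tiny linear policy, the dimensions, and every name are assumptions made for the example, and the paper's actual architecture and training procedure are not shown.

```python
import math
import random

# Hypothetical sizes chosen only for the illustration.
OBS_DIM, ACT_DIM, PREF_DIM = 3, 2, 2
N_POLICY_PARAMS = OBS_DIM * ACT_DIM

rng = random.Random(0)
# Hypernet weights: one row per policy parameter. They are random here
# because training is omitted from this sketch.
hyper_w = [[rng.uniform(-1, 1) for _ in range(PREF_DIM)]
           for _ in range(N_POLICY_PARAMS)]
hyper_b = [rng.uniform(-1, 1) for _ in range(N_POLICY_PARAMS)]

def generate_policy_params(preference):
    """Map a preference vector to a flat policy parameter vector (affine hypernet)."""
    return [sum(w * p for w, p in zip(row, preference)) + b
            for row, b in zip(hyper_w, hyper_b)]

def policy_action(params, observation):
    """Linear policy with tanh squashing; params are read row-major as (ACT_DIM, OBS_DIM)."""
    return [
        math.tanh(sum(params[a * OBS_DIM + i] * observation[i]
                      for i in range(OBS_DIM)))
        for a in range(ACT_DIM)
    ]

obs = [0.5, -0.2, 0.1]
for pref in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
    theta = generate_policy_params(pref)
    print(pref, [round(x, 3) for x in policy_action(theta, obs)])
```

Because this toy hypernet is affine, the generated parameters trace a straight line segment as the preference moves between the two extremes; a nonlinear hypernet would be needed to represent a curved line or surface.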

[AI-44] Time Matters: Scaling Laws for Any Budget

Link: https://arxiv.org/abs/2406.18922
Authors: Itay Inbar, Luke Sernau
Keywords: primary cost driver, primary cost, cost driver, wall-clock training time, training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:A primary cost driver for training large models is wall-clock training time. We show that popular time estimates based on FLOPs are poor estimates, and construct a more accurate proxy based on memory copies. We show that with some simple accounting, we can estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law curve like Chinchilla, this lets us estimate the final loss of the model. We fit our estimate to real data with a linear regression, and apply the result to rewrite Chinchilla in terms of a model’s estimated training time as opposed to the amount of training data. This gives an expression for the loss in terms of the model’s hyperparameters alone. We show that this expression is accurate across a wide range of model hyperparameter values, enabling us to analytically make architectural decisions and train models more efficiently.
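As a rough sketch of what rewriting a Chinchilla-style law in terms of training time can look like, the code below uses the parametric form L(N, D) = E + A/N^alpha + B/D^beta and replaces the token count D with throughput multiplied by wall-clock time. The default constants are the fit reported in the original Chinchilla paper and serve only as placeholders. The `tokens_per_second` input is an assumption of this example: the paper estimates training speed from hyperparameters using a memory-copy proxy, which is not reproduced here.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta.

    Default constants are the published Chinchilla fit, used as placeholders.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

def loss_for_time_budget(n_params, tokens_per_second, seconds, **fit):
    """Express the same law in wall-clock time by setting D = throughput * time.

    tokens_per_second is a hypothetical, externally supplied estimate.
    """
    return chinchilla_loss(n_params, tokens_per_second * seconds, **fit)

ONE_DAY = 86_400
for days in (1, 2, 4):
    loss = loss_for_time_budget(n_params=1e9,
                                tokens_per_second=1e5,
                                seconds=days * ONE_DAY)
    print(f"{days} day(s): estimated loss {loss:.3f}")
```

With a fixed model size, the estimated loss falls as the time budget grows but never drops below the irreducible term E plus the model-size term.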

[AI-45] TrustUQA: A Trustful Framework for Unified Structured Data Question Answering

Link: https://arxiv.org/abs/2406.18916
Authors: Wen Zhang, Long Jin, Yushan Zhu, Jiaoyan Chen, Zhiwei Huang, Junjie Wang, Yin Hua, Lei Liang, Huajun Chen
Keywords: Large Language Models, Natural language question, Natural language, Language Models, language question answering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Natural language question answering (QA) over structured data sources such as tables and knowledge graphs (KGs) has been widely investigated, for example with Large Language Models (LLMs). The main solutions include question to formal query parsing and retrieval-based answer generation. However, current methods of the former often suffer from weak generalization, failing to deal with multiple sources simultaneously, while the latter is limited in trustfulness. In this paper, we propose UnifiedTQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated UnifiedTQA with 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods and, in comparison with the baselines that are specific to a data type, it achieves state-of-the-art on 2 of them. Furthermore, we demonstrate the potential of our method for more general QA tasks, QA over mixed structured data and QA across structured data.

[AI-46] The Rise of Artificial Intelligence in Educational Measurement: Opportunities and Ethical Challenges

Link: https://arxiv.org/abs/2406.18900
Authors: Okan Bulut, Maggie Beiting-Parrish, Jodi M. Casabianca, Sharon C. Slater, Hong Jiao, Dan Song, Christopher M. Ormerod, Deborah Gbemisola Fabiyi, Rodica Ivan, Cole Walsh, Oscar Rios, Joshua Wilson, Seyma N. Yildirim-Erbasli, Tarid Wongvorachan, Joyce Xinle Liu, Bin Tan, Polina Morilova
Keywords: enabling automated scoring, rapid content analysis, natural language processing, revolutionized assessment methods, artificial intelligence
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 59 pages, 3 figures, a joint work of the Special Interest Group on Artificial Intelligence in Measurement and Education (AIME) from the National Council of Measurement in Education (NCME)

Abstract:The integration of artificial intelligence (AI) in educational measurement has revolutionized assessment methods, enabling automated scoring, rapid content analysis, and personalized feedback through machine learning and natural language processing. These advancements provide timely, consistent feedback and valuable insights into student performance, thereby enhancing the assessment experience. However, the deployment of AI in education also raises significant ethical concerns regarding validity, reliability, transparency, fairness, and equity. Issues such as algorithmic bias and the opacity of AI decision-making processes pose risks of perpetuating inequalities and affecting assessment outcomes. Responding to these concerns, various stakeholders, including educators, policymakers, and organizations, have developed guidelines to ensure ethical AI use in education. The National Council of Measurement in Education’s Special Interest Group on AI in Measurement and Education (AIME) also focuses on establishing ethical standards and advancing research in this area. In this paper, a diverse group of AIME members examines the ethical implications of AI-powered tools in educational measurement, explores significant challenges such as automation bias and environmental impact, and proposes solutions to ensure AI’s responsible and effective use in education.

[AI-47] Autonomous Control of a Novel Closed Chain Five Bar Active Suspension via Deep Reinforcement Learning

Link: https://arxiv.org/abs/2406.18899
Authors: Nishesh Singh, Sidharth Ramesh, Abhishek Shankar, Jyotishka Duttagupta, Leander Stephen D’Souza, Sanjay Singh
Keywords: Planetary exploration requires, exploration requires traversal, planetary exploration robots, Planetary exploration, rugged terrains
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 15 pages, 11 figures

Abstract:Planetary exploration requires traversal in environments with rugged terrains. In addition, Mars rovers and other planetary exploration robots often carry sensitive scientific experiments and components onboard, which must be protected from mechanical harm. This paper deals with an active suspension system focused on chassis stabilisation and an efficient traversal method while encountering unavoidable obstacles. Soft Actor-Critic (SAC) was applied along with Proportional Integral Derivative (PID) control to stabilise the chassis and traverse large obstacles at low speeds. The model uses the rover’s distance from surrounding obstacles, the height of the obstacle, and the chassis’ orientation to actuate the control links of the suspension accurately. Simulations carried out in the Gazebo environment are used to validate the proposed active system.
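For readers unfamiliar with the PID half of this combination, here is a minimal textbook discrete PID loop driving a chassis pitch angle back to level. The plant model (control output treated directly as pitch rate), the gains, and the initial pitch are assumptions chosen for the example. The paper's SAC policy, suspension kinematics, and Gazebo setup are not modelled.

```python
class PID:
    """Textbook discrete PID controller (not the controller from the paper)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        # No derivative term on the first call, since there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(initial_pitch=0.3, steps=2000, dt=0.01):
    """Toy plant: the control signal is treated as the pitch rate (an assumption)."""
    controller = PID(kp=2.0, ki=0.5, kd=0.05, dt=dt)  # hypothetical gains
    pitch = initial_pitch
    for _ in range(steps):
        pitch += controller.step(0.0, pitch) * dt
    return pitch


final_pitch = simulate()
print(f"final pitch: {final_pitch:.5f} rad")
```

In a hybrid design of the kind the abstract describes, a learned policy could supply setpoints or corrections while a loop like this handles low-level stabilisation; how the paper actually divides the work between SAC and PID is not stated in the abstract.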

[AI-48] 360 in the Wild: Dataset for Depth Prediction and View Synthesis

Link: https://arxiv.org/abs/2406.18898
Authors: Kibaek Park, Francois Rameau, Jaesik Park, In So Kweon
Keywords: abundance of perspective, facilitated the emergence, learning-based strategies, single image depth, image depth estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large scale 360° videos dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera’s pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.

[AI-49] Sequential three-way group decision-making for double hierarchy hesitant fuzzy linguistic term set

Link: https://arxiv.org/abs/2406.18884
Authors: Nanfang Luo, Qinghua Zhang, Qin Xie, Yutai Wang, Longjun Yin, Guoyin Wang
Keywords: characterized by complexity, life scenarios, complexity and uncertainty, essential part, Group decision-making
Categories: Artificial Intelligence (cs.AI)

Abstract:Group decision-making (GDM) characterized by complexity and uncertainty is an essential part of various life scenarios. Most existing research lacks tools to fuse information quickly and interpret decision results for partially formed decisions. This limitation is particularly noticeable when there is a need to improve the efficiency of GDM. To address this issue, a novel multi-level sequential three-way decision for group decision-making (S3W-GDM) method is constructed from the perspective of granular computing. This method simultaneously considers the vagueness, hesitation, and variation of GDM problems under the double hierarchy hesitant fuzzy linguistic term sets (DHHFLTS) environment. First, for fusing information efficiently, a novel multi-level expert information fusion method is proposed, and the concepts of expert decision table and the extraction/aggregation of decision-leveled information based on the multi-level granularity are defined. Second, the neighborhood theory, outranking relation and regret theory (RT) are utilized to redesign the calculations of conditional probability and relative loss function. Then, the granular structure of DHHFLTS based on the sequential three-way decision (S3WD) is defined to improve the decision-making efficiency, and the decision-making strategy and interpretation of each decision level are proposed. Furthermore, the algorithm of S3W-GDM is given. Finally, an illustrative example of diagnosis is presented, and comparative and sensitivity analyses against other methods are performed to verify the efficiency and rationality of the proposed method.

[AI-50] Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification

Link: https://arxiv.org/abs/2406.18859
Authors: Ziyu Yang, Santhosh Cherian, Slobodan Vucetic
Keywords: highly technical documents, technical documents aimed, documents aimed primarily, Radiology reports, doctor-doctor communication
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Radiology reports are highly technical documents aimed primarily at doctor-doctor communication. There has been an increasing interest in sharing those reports with patients, which requires providing them with patient-friendly simplifications of the original reports. This study explores the suitability of large language models in automatically generating those simplifications. We examine the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain. We also propose a new evaluation protocol that employs radiologists and laypeople, where radiologists verify the factual correctness of simplifications, and laypeople assess simplicity and comprehension. Our experimental results demonstrate the effectiveness of self-correction prompting in producing high-quality simplifications. Our findings illuminate the preferences of radiologists and laypeople regarding text simplification, informing future research on this topic.

[AI-51] FFN: a Fine-grained Chinese-English Financial Domain Parallel Corpus

Link: https://arxiv.org/abs/2406.18856
Authors: Yuxin Fu, Shijing Si, Leyi Mai, Xi-ang Li
Keywords: Large Language Models, Large Language, financial domain remains, remains largely underexplored, domain remains largely
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: a simplified version of this paper is accepted by International Conference on Asian Language Processing 2024

Abstract:Large Language Models (LLMs) have stunningly advanced the field of machine translation, though their effectiveness within the financial domain remains largely underexplored. To probe this issue, we constructed a fine-grained Chinese-English parallel corpus of financial news called FFN. We acquired financial news articles published between January 1, 2014, and December 31, 2023, from mainstream media websites such as CNN, FOX, and China Daily. The dataset consists of 1,013 main texts and 809 titles, all of which have been manually corrected. We measured the translation quality of two LLMs, ChatGPT and ERNIE-bot, utilizing BLEU, TER and chrF scores as the evaluation metrics. For comparison, we also trained an OpenNMT model based on our dataset. We detail problems of LLMs and provide in-depth analysis, intending to stimulate further research and solutions in this largely uncharted territory. Our research underlines the need to optimize LLMs within the specific field of financial translation to ensure accuracy and quality.

[AI-52] LICO: Large Language Models for In-Context Molecular Optimization

Link: https://arxiv.org/abs/2406.18851
Authors: Tung Nguyen, Aditya Grover
Keywords: Optimizing black-box functions, Optimizing black-box, science and engineering, fundamental problem, Optimizing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)

Abstract:Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over 20 objective functions.

[AI-53] Learning Retrieval Augmentation for Personalized Dialogue Generation

Link: https://arxiv.org/abs/2406.18847
Authors: Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
Keywords: generating highly tailored, gained significant attention, textbf, persona dialogue generation, Personalized dialogue generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP-2023

Abstract:Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose Learning Retrieval Augmentation for Personalized DialOgue Generation (L