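This post lists the latest papers fetched each day from arXiv. It is updated automatically every morning at around 11:30, and the papers are grouped into four main areas: NLP, CV, ML, and AI.

Note: if you would like to receive the daily list by email, leave your email address in the comments. The email is also sent automatically at around 11:30 each day.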
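
The post does not say how the list is fetched. As a rough sketch only, assuming the public arXiv API is used, the query URL for each area could be built as shown here. The function name `build_query_url`, the `AREAS` mapping, and the default of 100 results are illustrative assumptions, not the author's actual script.

```python
from urllib.parse import urlencode

# Base endpoint of the public arXiv API.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category: str, max_results: int = 100) -> str:
    """Return an arXiv API URL for the newest submissions in one category.

    Hypothetical helper: the blog's real fetching code is not shown.
    """
    params = {
        "search_query": f"cat:{category}",  # restrict to one arXiv category
        "sortBy": "submittedDate",          # order by submission date
        "sortOrder": "descending",          # newest first
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


# The four areas tracked by this post, mapped to arXiv category codes.
AREAS = {"NLP": "cs.CL", "CV": "cs.CV", "ML": "cs.LG", "AI": "cs.AI"}
urls = {area: build_query_url(code) for area, code in AREAS.items()}

for area, url in urls.items():
    print(area, url)
```

Each URL returns an Atom feed; parsing it and scheduling the daily run are left out of this sketch.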

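Overview (2022-09-26)

215 papers were added today:

  • 17 in natural language processing (NLP: cs.CL)
  • 66 in computer vision (CV: cs.CV)
  • 40 in machine learning (ML: cs.LG)
  • 5 in artificial intelligence (AI: cs.AI)
  • 87 on other topics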

Natural Language Processing

NLP-0 Title: Promptagator: Few-shot Dense Retrieval From 8 Examples

Link: https://arxiv.org/abs/2209.11755
Authors: Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, Ming-Wei Chang

Abstract: Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we propose working on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-base Query Generation for Retriever (Promptagator), which leverages large language models (LLMs) as a few-shot query generator and creates task-specific retrievers based on the generated data. Powered by the LLM's generalization ability, Promptagator makes it possible to create task-specific end-to-end retrievers solely based on a few examples, without using Natural Questions or MS MARCO to train question generators or dual encoders. Surprisingly, LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO, like ColBERT v2, by more than 1.2 nDCG on average on 11 retrieval sets. Further training standard-size re-rankers using the same generated data yields another 5.0 point nDCG improvement. Our studies determine that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.

NLP-1 Title: Temporal Analysis on Topics Using Word2Vec

Link: https://arxiv.org/abs/2209.11717
Authors: Angad Sandhu, Aneesh Edara, Faizan Wajid, Ashok Agrawala

Abstract: The present study proposes a novel method of trend detection and visualization; more specifically, modeling the change in a topic over time. Where current models used for the identification and visualization of trends only convey the popularity of a singular word based on stochastic counting of usage, the approach in the present study illustrates the popularity and direction that a topic is moving in. The direction in this case is a distinct subtopic within the selected corpus. Such trends are generated by modeling the movement of a topic by using k-means clustering and cosine similarity to group the distances between clusters over time. In a convergent scenario, it can be inferred that the topics as a whole are meshing (tokens between topics becoming interchangeable). On the contrary, a divergent scenario would imply that each topic's respective tokens would not be found in the same context (the words are increasingly different from each other). The methodology was tested on a group of articles from various media houses present in the 20 Newsgroups dataset.
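
To make the convergent/divergent idea concrete, here is a toy sketch, not the authors' implementation. It assumes each topic is already summarized by one cluster centroid per time step (the paper obtains clusters with k-means over Word2Vec embeddings, which is not reproduced here). The function names and the example vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_trend(centroids_a, centroids_b):
    """Compare two topics' centroids at each time step.

    Returns a label and the similarity series. Rising similarity is read
    as convergence, falling similarity as divergence (an assumed rule).
    """
    sims = [cosine_similarity(a, b) for a, b in zip(centroids_a, centroids_b)]
    change = sims[-1] - sims[0]
    if change > 0:
        return "convergent", sims
    if change < 0:
        return "divergent", sims
    return "stable", sims


# Made-up 2-D centroids for two topics over three time steps.
topic_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
topic_b = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

label, sims = classify_trend(topic_a, topic_b)
print(label, [round(s, 3) for s in sims])
```

Comparing only the first and last time steps is the simplest possible rule; a real analysis would likely look at the whole series.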

NLP-2 Title: Best Prompts for Text-to-Image Models and How to Find Them

Link: https://arxiv.org/abs/2209.11711
Authors: Nikita Pavlichenko, Dmitry Ustalov
Comments: 12 pages (4 main pages), 4 figures, 4 tables

Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions.

NLP-3 Title: A Neural Model for Regular Grammar Induction

Link: https://arxiv.org/abs/2209.11628
Authors: Peter Belcák, David Hofer, Roger Wattenhofer
Comments: Accepted to the 21st IEEE International Conference on Machine Learning and Applications (ICMLA) 2022, 6 pages, 4 figures

Abstract: Grammatical inference is a classical problem in computational learning theory and a topic of wider influence in natural language processing. We treat grammars as a model of computation and propose a novel neural approach to induction of regular grammars from positive and negative examples. Our model is fully explainable, its intermediate results are directly interpretable as partial parses, and it can be used to learn arbitrary regular grammars when provided with sufficient data. Our method consistently attains high recall and precision scores across a range of tests of varying complexity. We make the detailed results and code readily available.

NLP-4 Title: Robust Domain Adaptation for Machine Reading Comprehension

Link: https://arxiv.org/abs/2209.11615
Authors: Liang Jiang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

Abstract: Most domain adaptation methods for machine reading comprehension (MRC) use a pre-trained question-answer (QA) construction model to generate pseudo QA pairs for MRC transfer. Such a process will inevitably introduce mismatched pairs (i.e., noisy correspondence) due to i) the unavailable QA pairs in target documents, and ii) the domain shift during applying the QA construction model to the target domain. Undoubtedly, the noisy correspondence will degenerate the performance of MRC, which however is neglected by existing works. To solve such an untouched problem, we propose to construct QA pairs by additionally using the dialogue related to the documents, as well as a new domain adaptation method for MRC. Specifically, we propose Robust Domain Adaptation for Machine Reading Comprehension (RMRC) method which consists of an answer extractor (AE), a question selector (QS), and an MRC model. Specifically, RMRC filters out the irrelevant answers by estimating the correlation to the document via the AE, and extracts the questions by fusing the candidate questions in multiple rounds of dialogue chats via the QS. With the extracted QA pairs, MRC is fine-tuned and provides the feedback to optimize the QS through a novel reinforced self-training method. Thanks to the optimization of the QS, our method will greatly alleviate the noisy correspondence problem caused by the domain shift. To the best of our knowledge, this could be the first study to reveal the influence of noisy correspondence in domain adaptation MRC models and show a feasible way to achieve robustness to mismatched pairs. Extensive experiments on three datasets demonstrate the effectiveness of our method.

NLP-5 Title: An Interdisciplinary Perspective on Evaluation and Experimental Design for Visual Text Analytics: Position Paper

Link: https://arxiv.org/abs/2209.11534
Authors: Kostiantyn Kucher, Nicole Sultanum, Angel Daza, Vasiliki Simaki, Maria Skeppstedt, Barbara Plank, Jean-Daniel Fekete, Narges Mahyar
Comments: To appear in Proceedings of the 2022 IEEE Workshop on Evaluation and Beyond - Methodological Approaches to Visualization (BELIV '22)

Abstract: Appropriate evaluation and experimental design are fundamental for empirical sciences, particularly in data-driven fields. Due to the successes in computational modeling of languages, for instance, research outcomes are having an increasingly immediate impact on end users. As the gap in adoption by end users decreases, the need increases to ensure that tools and models developed by the research communities and practitioners are reliable, trustworthy, and supportive of the users in their goals. In this position paper, we focus on the issues of evaluating visual text analytics approaches. We take an interdisciplinary perspective from the visualization and natural language processing communities, as we argue that the design and validation of visual text analytics include concerns beyond computational or visual/interactive methods on their own. We identify four key groups of challenges for evaluating visual text analytics approaches (data ambiguity, experimental design, user trust, and "big picture" concerns) and provide suggestions for research opportunities from an interdisciplinary perspective.

NLP-6 Title: MetaPrompting: Learning to Learn Better Prompts

Link: https://arxiv.org/abs/2209.11486
Authors: Yutai Hou, Hongyuan Dong, Xinghao Wang, Bohan Li, Wanxiang Che

Abstract: Prompting is regarded as a crucial advance for few-shot natural language processing. Recent research on prompting has moved from discrete-token "hard prompts" to continuous "soft prompts", which employ learnable vectors as pseudo prompt tokens and achieve better performance. Though showing promising prospects, these soft-prompting methods are observed to rely heavily on good initialization to take effect. Unfortunately, obtaining a perfect initialization for soft prompts requires understanding how language models work internally as well as elaborate design, which is no easy task and has to restart from scratch for each new task. To remedy this, we propose a generalized soft prompting method called MetaPrompting, which adopts the well-recognized model-agnostic meta-learning algorithm to automatically find a better prompt initialization that facilitates fast adaptation to new prompting tasks. Extensive experiments show that MetaPrompting tackles the soft prompt initialization problem and brings significant improvement on four different datasets (over 6 points improvement in accuracy for the 1-shot setting), achieving new state-of-the-art performance.

NLP-7 Title: ET5: A Novel End-to-end Framework for Conversational Machine Reading Comprehension

Link: https://arxiv.org/abs/2209.11484
Authors: Xiao Zhang, Heyan Huang, Zewen Chi, Xian-Ling Mao
Comments: Accepted by COLING 2022

Abstract: Conversational machine reading comprehension (CMRC) aims to assist computers to understand a natural language text and thereafter engage in a multi-turn conversation to answer questions related to the text. Existing methods typically require three steps: (1) decision making based on entailment reasoning; (2) span extraction if required by the above decision; (3) question rephrasing based on the extracted span. However, for nearly all these methods, the span extraction and question rephrasing steps cannot fully exploit the fine-grained entailment reasoning information in the decision-making step because of their relative independence, which further enlarges the information gap between decision making and question phrasing. Thus, to tackle this problem, we propose a novel end-to-end framework for conversational machine reading comprehension based on a shared parameter mechanism, called entailment reasoning T5 (ET5). Despite the light weight of our proposed framework, experimental results show that the proposed ET5 achieves new state-of-the-art results on the ShARC leaderboard with a BLEU-4 score of 55.2. Our model and code are publicly available.

NLP-8 Title: News Category Dataset

Link: https://arxiv.org/abs/2209.11429
Authors: Rishabh Misra

Abstract: People rely on news to know what is happening around the world and inform their daily lives. In today’s world, when the proliferation of fake news is rampant, having a large-scale and high-quality source of authentic news articles with the published category information is valuable to learning authentic news’ Natural Language syntax and semantics. As part of this work, we present a News Category Dataset that contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost, along with useful metadata to enable various NLP tasks. In this paper, we also produce some novel insights from the dataset and describe various existing and potential applications of our dataset.

NLP-9 Title: Zero-shot Domain Adaptation for Neural Machine Translation with Retrieved Phrase-level Prompts

Link: https://arxiv.org/abs/2209.11409
Authors: Zewei Sun, Qingnan Jiang, Shujian Huang, Jun Cao, Shanbo Cheng, Mingxuan Wang

Abstract: Domain adaptation is an important challenge for neural machine translation. However, the traditional fine-tuning solution requires multiple extra training and yields a high cost. In this paper, we propose a non-tuning paradigm, resolving domain adaptation with a prompt-based method. Specifically, we construct a bilingual phrase-level database and retrieve relevant pairs from it as a prompt for the input sentences. By utilizing Retrieved Phrase-level Prompts (RePP), we effectively boost the translation quality. Experiments show that our method improves domain-specific machine translation for 6.2 BLEU scores and improves translation constraints for 11.5% accuracy without additional training.
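
The retrieve-then-prompt idea can be illustrated with a minimal sketch. This is not the paper's system: plain substring matching stands in for its retrieval step, and the prompt layout (pairs joined by " ; " and separated from the sentence by " ||| ") is an assumption made up for this example, as are the function names and the tiny English-German phrase table.

```python
def retrieve_phrase_pairs(sentence, phrase_table, max_pairs=3):
    """Return (source, target) pairs whose source phrase occurs in the sentence.

    Substring matching is a stand-in for a real retrieval component.
    """
    lowered = sentence.lower()
    hits = [(src, tgt) for src, tgt in phrase_table.items()
            if src.lower() in lowered]
    # Prefer longer source phrases, which tend to be more specific.
    hits.sort(key=lambda pair: len(pair[0]), reverse=True)
    return hits[:max_pairs]


def build_prompted_input(sentence, pairs, sep=" ||| "):
    """Prepend the retrieved pairs to the sentence (hypothetical format)."""
    if not pairs:
        return sentence
    prompt = " ; ".join(f"{src} = {tgt}" for src, tgt in pairs)
    return f"{prompt}{sep}{sentence}"


# Toy bilingual phrase table for a medical domain.
phrase_table = {
    "myocardial infarction": "Myokardinfarkt",
    "blood pressure": "Blutdruck",
    "stock market": "Aktienmarkt",
}

sentence = "The patient had a myocardial infarction and high blood pressure."
pairs = retrieve_phrase_pairs(sentence, phrase_table)
prompted = build_prompted_input(sentence, pairs)
print(prompted)
```

The prompted string would then be passed to a translation model unchanged, which is what makes the approach tuning-free.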

NLP-10 Title: IDEA: Interactive DoublE Attentions from Label Embedding for Text Classification

Link: https://arxiv.org/abs/2209.11407
Authors: Ziyuan Wang, Hailiang Huang, Songqiao Han
Comments: Accepted by ICTAI 2022

Abstract: Current text classification methods typically encode the text merely into an embedding before a naive or complicated classifier, which ignores the suggestive information contained in the label text. As a matter of fact, humans classify documents primarily based on the semantic meaning of the subcategories. We propose a novel model structure via siamese BERT and interactive double attentions named IDEA (Interactive DoublE Attentions) to capture the information exchange of text and label names. Interactive double attentions enable the model to exploit the inter-class and intra-class information from coarse to fine, which involves distinguishing among all labels and matching the semantic subclasses of ground truth labels. Our proposed method significantly outperforms the state-of-the-art methods using label texts, with more stable results.

NLP-11 Title: Conversational QA Dataset Generation with Answer Revision

Link: https://arxiv.org/abs/2209.11396
Authors: Seonjeong Hwang, Gary Geunbae Lee
Comments: COLING 2022

Abstract: Conversational question–answer generation is a task that automatically generates a large-scale conversational question answering dataset based on input passages. In this paper, we introduce a novel framework that extracts question-worthy phrases from a passage and then generates corresponding questions considering previous conversations. In particular, our framework revises the extracted answers after generating questions so that answers exactly match paired questions. Experimental results show that our simple answer revision approach leads to significant improvement in the quality of synthetic data. Moreover, we prove that our framework can be effectively utilized for domain adaptation of conversational question answering.

NLP-12 Title: Improving Conversational Recommender System via Contextual and Time-Aware Modeling with Less Domain-Specific Knowledge

Link: https://arxiv.org/abs/2209.11386
Authors: Lingzhi Wang, Shafiq Joty, Wei Gao, Xingshan Zeng, Kam-Fai Wong

Abstract: Conversational Recommender Systems (CRS) has become an emerging research topic seeking to perform recommendations through interactive conversations, which generally consist of generation and recommendation modules. Prior work on CRS tends to incorporate more external and domain-specific knowledge like item reviews to enhance performance. Despite the fact that the collection and annotation of the external domain-specific information needs much human effort and degenerates the generalizability, too much extra knowledge introduces more difficulty to balance among them. Therefore, we propose to fully discover and extract internal knowledge from the context. We capture both entity-level and contextual-level representations to jointly model user preferences for the recommendation, where a time-aware attention is designed to emphasize the recently appeared items in entity-level representations. We further use the pre-trained BART to initialize the generation module to alleviate the data scarcity and enhance the context modeling. In addition to conducting experiments on a popular dataset (ReDial), we also include a multi-domain dataset (OpenDialKG) to show the effectiveness of our model. Experiments on both datasets show that our model achieves better performance on most evaluation metrics with less external knowledge and generalizes well to other domains. Additional analyses on the recommendation and generation tasks demonstrate the effectiveness of our model in different scenarios.

NLP-13 Title: Extending Word-Level Quality Estimation for Post-Editing Assistance

Link: https://arxiv.org/abs/2209.11378
Authors: Yizhen Wei, Takehito Utsuro, Masaaki Nagata

Abstract: We define a novel concept called extended word alignment in order to improve post-editing assistance efficiency. Based on extended word alignment, we further propose a novel task called refined word-level QE that outputs refined tags and word-level correspondences. Compared to original word-level QE, the new task is able to directly point out editing operations, thus improves efficiency. To extract extended word alignment, we adopt a supervised method based on mBERT. To solve refined word-level QE, we firstly predict original QE tags by training a regression model for sequence tagging based on mBERT and XLM-R. Then, we refine original word tags with extended word alignment. In addition, we extract source-gap correspondences, meanwhile, obtaining gap tags. Experiments on two language pairs show the feasibility of our method and give us inspirations for further improvement.

NLP-14 Title: Towards Faithful Model Explanation in NLP: A Survey

Link: https://arxiv.org/abs/2209.11326
Authors: Qing Lyu, Marianna Apidianaki, Chris Callison-Burch
Comments: 62 pages

Abstract: End-to-end neural NLP architectures are notoriously difficult to understand, which gives rise to numerous efforts towards model explainability in recent years. An essential principle of model explanation is Faithfulness, i.e., an explanation should accurately represent the reasoning process behind the model’s prediction. This survey first discusses the definition and evaluation of Faithfulness, as well as its significance for explainability. We then introduce the recent advances in faithful explanation by grouping approaches into five categories: similarity methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. Each category will be illustrated with its representative studies, advantages, and shortcomings. Finally, we discuss all the above methods in terms of their common virtues and limitations, and reflect on future work directions towards faithful explainability. For researchers interested in studying interpretability, this survey will offer an accessible and comprehensive overview of the area, laying the basis for further exploration. For users hoping to better understand their own models, this survey will be an introductory manual helping with choosing the most suitable explanation method(s).

NLP-15 Title: ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Link: https://arxiv.org/abs/2209.11302
Authors: Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg

Abstract: Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state of the art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website at this http URL
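
As a rough illustration of a program-like prompt, the sketch below assembles an import line of available actions, an object list, example programs, and an open function header for the model to complete. The exact layout, the function names, and the toy household actions are assumptions for this example and may differ from the paper's actual prompts. No language model is called here.

```python
def build_program_prompt(actions, objects, example_programs, task_name):
    """Assemble a program-like prompt ending with an open function header.

    Hypothetical layout: the actions appear as an import line, the objects
    as a list literal, followed by example programs.
    """
    parts = [
        "from actions import " + ", ".join(actions),
        "objects = " + repr(list(objects)),
    ]
    parts.extend(example_programs)
    # The header is left open so the language model writes the plan body.
    parts.append(f"def {task_name}():")
    return "\n\n".join(parts)


def uses_only_known_actions(plan_lines, actions):
    """Check that every generated step calls one of the allowed actions."""
    return all(line.split("(", 1)[0].strip() in actions for line in plan_lines)


actions = ["walk", "grab", "putin"]
objects = ["apple", "fridge"]
example = 'def throw_away_apple():\n    walk("apple")\n    grab("apple")'

prompt = build_program_prompt(actions, objects, [example], "put_apple_in_fridge")
print(prompt)

# A generated plan can be screened before it is sent to a robot.
print(uses_only_known_actions(['walk("fridge")', 'fly("apple")'], actions))
```

Listing the actions and objects explicitly is what lets a simple check like `uses_only_known_actions` reject steps the robot cannot perform.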

NLP-16 Title: XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages

Link: https://arxiv.org/abs/2209.11252
Authors: Shivprasad Sagare, Tushar Abhishek, Bhavyajeet Singh, Anubhav Sharma, Manish Gupta, Vasudeva Varma

Abstract: Multiple business scenarios require an automated generation of descriptive human-readable text from structured input data. Hence, fact-to-text generation systems have been developed for various downstream tasks like generating soccer reports, weather and financial reports, medical reports, person biographies, etc. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English, mainly due to the high availability of relevant datasets. Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed for generation across multiple languages, along with a dataset, XALIGN, for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend the XALIGN dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XALIGNV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to the best results on average across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.

Machine Learning

ML-0 Title: GLSO: Grammar-guided Latent Space Optimization for Sample-efficient Robot Design Automation

Link: https://arxiv.org/abs/2209.11748
Authors: Jiaheng Hu, Julian Whiman, Howie Choset

Abstract: Robots have been used in all sorts of automation, and yet the design of robots remains mainly a manual task. We seek to provide design tools to automate the design of robots themselves. An important challenge in robot design automation is the large and complex design search space, which grows exponentially with the number of components, making optimization difficult and sample inefficient. In this work, we present Grammar-guided Latent Space Optimization (GLSO), a framework that transforms design automation into a low-dimensional continuous optimization problem by training a graph variational autoencoder (VAE) to learn a mapping between the graph-structured design space and a continuous latent space. This transformation allows optimization to be conducted in a continuous latent space, where sample efficiency can be significantly boosted by applying algorithms such as Bayesian Optimization. GLSO guides training of the VAE using graph grammar rules and robot world space features, such that the learned latent space focuses on valid robots and is easier for the optimization algorithm to explore. Importantly, the trained VAE can be reused to search for designs specialized to multiple different tasks without retraining. We evaluate GLSO by designing robots for a set of locomotion tasks in simulation, and demonstrate that our method outperforms related state-of-the-art robot design automation methods.

ML-1 Title: Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning

Link: https://arxiv.org/abs/2209.11745
Authors: Fan Chen, Song Mei, Yu Bai

Abstract: Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) is recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL with the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021) which require either bounding a variant of the DEC which may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy similar regret bounds as E2D-TA under similar structural conditions as the DEC.

ML-2-Title From Weakly Supervised Learning to Active Learning

Link: https://arxiv.org/abs/2209.11629
Authors: Vivien Cabannes
Comments: PhD Thesis, Ecole Normale Superieure, 2022

Abstract: Applied mathematics and machine computations have raised a lot of hope since the recent success of supervised learning. Many practitioners in industries have been trying to switch from their old paradigms to machine learning. Interestingly, those data scientists spend more time scraping, annotating and cleaning data than fine-tuning models. This thesis is motivated by the following question: can we derive a more generic framework than that of supervised learning in order to learn from cluttered data? This question is approached through the lens of weakly supervised learning, assuming that the bottleneck of data collection lies in annotation. We model weak supervision as giving, rather than a unique target, a set of target candidates. We argue that one should look for an "optimistic" function that matches most of the observations. This allows us to derive a principle to disambiguate partial labels. We also discuss the advantage of incorporating unsupervised learning techniques into our framework, in particular manifold regularization approached through diffusion techniques, for which we derive a new algorithm that scales better with input dimension than the baseline method. Finally, we switch from passive to active weakly supervised learning, introducing the "active labeling" framework, in which a practitioner can query weak information about chosen data. Among others, we leverage the fact that one does not need full information to access stochastic gradients and perform stochastic gradient descent.

ML-3-Title Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration

Link: https://arxiv.org/abs/2209.11604
Authors: Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho
Comments:

Abstract: Neural network calibration is an essential task in deep learning to ensure consistency between the confidence of model prediction and the true correctness likelihood. In this paper, we propose a new post-processing calibration method called Neural Clamping, which employs a simple joint input-output transformation on a pre-trained classifier via a learnable universal input perturbation and an output temperature scaling parameter. Moreover, we provide theoretical explanations on why Neural Clamping is provably better than temperature scaling. Evaluated on CIFAR-100 and ImageNet image recognition datasets and a variety of deep neural network models, our empirical results show that Neural Clamping significantly outperforms state-of-the-art post-processing calibration methods.
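
The joint input-output transformation described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy linear classifier `W`, the negative log-likelihood objective, the plain gradient-descent loop, and the small calibration set. The paper's actual objective and training procedure may differ. The sketch only shows the shape of the idea: the classifier stays frozen while one universal input perturbation `delta` and one temperature `temp` are learned.

```python
import math

# Hypothetical frozen "pre-trained" classifier: logits = W @ x (toy weights).
W = [[4.0, -2.0], [-4.0, 2.0]]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(x, delta):
    """Classifier logits for the perturbed input x + delta."""
    xp = [xi + di for xi, di in zip(x, delta)]
    return [sum(w * v for w, v in zip(row, xp)) for row in W]

def nll(data, delta, temp):
    """Mean negative log-likelihood of softmax(logits / temp)."""
    total = 0.0
    for x, y in data:
        p = softmax([z / temp for z in logits(x, delta)])
        total -= math.log(p[y])
    return total / len(data)

def calibrate(data, steps=500, lr=0.02):
    """Gradient descent on (delta, temp); the classifier weights stay fixed."""
    delta, temp = [0.0, 0.0], 1.0
    n = len(data)
    for _ in range(steps):
        g_delta, g_temp = [0.0, 0.0], 0.0
        for x, y in data:
            z = logits(x, delta)
            p = softmax([v / temp for v in z])
            err = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
            # dL/d(delta) = W^T (p - onehot) / temp
            for j in range(len(delta)):
                g_delta[j] += sum(W[k][j] * err[k] for k in range(len(z))) / temp
            # dL/d(temp) = -sum_k z_k (p_k - onehot_k) / temp^2
            g_temp -= sum(zk * ek for zk, ek in zip(z, err)) / temp ** 2
        delta = [d - lr * g / n for d, g in zip(delta, g_delta)]
        temp = max(0.05, temp - lr * g_temp / n)  # keep the temperature positive
    return delta, temp

# Toy calibration set (input, label) with a few confidently wrong predictions.
calibration_set = [
    ([1.0, 0.0], 0), ([0.8, 0.2], 0), ([0.5, 0.0], 1), ([-1.0, 0.0], 1),
    ([-0.5, 0.5], 1), ([0.9, 0.0], 1), ([-0.8, 0.0], 0),
]
nll_before = nll(calibration_set, [0.0, 0.0], 1.0)
delta, temp = calibrate(calibration_set)
nll_after = nll(calibration_set, delta, temp)
```

Setting `delta` to zero and learning only `temp` reduces this sketch to plain temperature scaling, which is the baseline the abstract compares against.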

ML-4-Title Machine Learning and Analytical Power Consumption Models for 5G Base Stations

Link: https://arxiv.org/abs/2209.11600
Authors: Nicola Piovesan, David Lopez-Perez, Antonio De Domenico, Xinli Geng, Harvey Bao, Merouane Debbah
Comments: Accepted by IEEE Communications Magazine

Abstract: The energy consumption of the fifth generation (5G) of mobile networks is one of the major concerns of the telecom industry. However, there is currently no accurate and tractable approach to evaluate the power consumption of 5G base stations (BSs). In this article, we propose a novel model for a realistic characterisation of the power consumption of 5G multi-carrier BSs, which builds on a large data collection campaign. First, we define a machine learning architecture that allows modelling multiple 5G BS products. Then, we exploit the knowledge gathered by this framework to derive a realistic and analytically tractable power consumption model, which can help drive theoretical analyses as well as feature standardisation, development and optimisation frameworks. Notably, we demonstrate that such a model has high precision and is capable of capturing the benefits of energy saving mechanisms. We believe this analytical model represents a fundamental tool for understanding 5G BS power consumption and accurately optimising the network energy efficiency.

ML-5-Title Quantification before Selection: Active Dynamics Preference for Robust Reinforcement Learning

Link: https://arxiv.org/abs/2209.11596
Authors: Kang Xu, Yan Ma, Wei Li
Comments:

Abstract: Training a robust policy is critical for policy deployment in real-world systems or dealing with unknown dynamics mismatch in different dynamic systems. Domain Randomization (DR) is a simple and elegant approach that trains a conservative policy to counter different dynamic systems without expert knowledge about the target system parameters. However, existing works reveal that the policy trained through DR tends to be over-conservative and performs poorly in target domains. Our key insight is that dynamic systems with different parameters provide different levels of difficulty for the policy, and the difficulty of behaving well in a system is constantly changing due to the evolution of the policy. If we can actively sample the systems with proper difficulty for the policy on the fly, it will stabilize the training process and prevent the policy from becoming over-conservative or over-optimistic. To operationalize this idea, we introduce Active Dynamics Preference (ADP), which quantifies the informativeness and density of sampled system parameters. ADP actively selects system parameters with high informativeness and low density. We validate our approach in four robotic locomotion tasks with various discrepancies between the training and testing environments. Extensive results demonstrate that our approach has superior robustness for system inconsistency compared to several baselines.

ML-6-Title Differentially private partitioned variational inference

Link: https://arxiv.org/abs/2209.11595
Authors: Mikko A. Heikkilä, Matthew Ashman, Siddharth Swaroop, Richard E. Turner, Antti Honkela
Comments: 30 pages, 4 figures

Abstract: Learning a privacy-preserving model from distributed sensitive data is an increasingly important problem, often formulated in the federated learning context. Variational inference has recently been extended to the non-private federated learning setting via the partitioned variational inference algorithm. For privacy protection, the current gold standard is called differential privacy. Differential privacy guarantees privacy in a strong, mathematically clearly defined sense. In this paper, we present differentially private partitioned variational inference, the first general framework for learning a variational approximation to a Bayesian posterior distribution in the federated learning setting while minimising the number of communication rounds and providing differential privacy guarantees for data subjects. We propose three alternative implementations in the general framework, one based on perturbing local optimisation done by individual parties, and two based on perturbing global updates (one using a version of federated averaging, one adding virtual parties to the protocol), and compare their properties both theoretically and empirically. We show that perturbing the local optimisation works well with simple and complex models as long as each party has enough local data. However, the privacy is always guaranteed independently by each party. In contrast, perturbing the global updates works best with relatively simple models. Given access to suitable secure primitives, such as secure aggregation or secure shuffling, the performance can be improved by all parties guaranteeing privacy jointly.

ML-7-Title Learning Rigid Body Dynamics with Lagrangian Graph Neural Network

Link: https://arxiv.org/abs/2209.11588
Authors: Ravinder Bhattoo, Sayan Ranu, N. M. Anoop Krishnan
Comments: Accepted at NeurIPS 2022

Abstract: Lagrangian and Hamiltonian neural networks (LNN and HNN respectively) encode strong inductive biases that allow them to outperform other models of physical systems significantly. However, these models have, thus far, mostly been limited to simple systems such as pendulums and springs or a single rigid body such as a gyroscope or a rigid rotor. Here, we present a Lagrangian graph neural network (LGNN) that can learn the dynamics of rigid bodies by exploiting their topology. We demonstrate the performance of LGNN by learning the dynamics of ropes, chains, and trusses with the bars modeled as rigid bodies. LGNN also generalizes: an LGNN trained on chains with a few segments can simulate a chain with a large number of links and arbitrary link length. We also show that the LGNN can simulate unseen hybrid systems including bars and chains, on which it has not been trained. Specifically, we show that the LGNN can be used to model the dynamics of complex real-world structures such as the stability of tensegrity structures. Finally, we discuss the non-diagonal nature of the mass matrix and its ability to generalize in complex systems.

ML-8-Title Applications of Machine Learning in Chemical and Biological Oceanography

Link: https://arxiv.org/abs/2209.11557
Authors: Balamurugan Sadaiappan, Preethiya Balakrishnan, Vishal CR, Neethu T Vijayan, Mahendran Subramanian, Mangesh U Gauns
Comments: 58 Pages, 5 Figures

Abstract: Machine learning (ML) refers to computer algorithms that predict a meaningful output or categorise complex systems based on a large amount of data. ML is applied in a variety of areas, including natural science, engineering, space exploration, and even gaming development. This article focuses on the use of machine learning in the field of chemical and biological oceanography. In the prediction of global fixed nitrogen levels, partial carbon dioxide pressure, and other chemical properties, the application of ML is a promising tool. Machine learning is also utilised in the field of biological oceanography to detect planktonic forms from various images (i.e., microscopy, FlowCAM and video recorder), spectrometers, and other signal processing techniques. Moreover, ML has successfully classified mammals using their acoustics, detecting endangered mammalian and fish species in a specific environment. Most importantly, using environmental data, ML proved to be an effective method for predicting hypoxic conditions and harmful algal bloom events, an important measurement in terms of environmental monitoring. Furthermore, machine learning was used to construct a number of databases for various species that will be useful to other researchers, and the creation of new algorithms will help the marine research community better comprehend the chemistry and biology of the ocean.

ML-9-Title On Efficient Reinforcement Learning for Full-length Game of StarCraft II

Link: https://arxiv.org/abs/2209.11553
Authors: Ruo-Ze Liu, Zhen-Jia Pang, Zhou-Yu Meng, Wenhai Wang, Yang Yu, Tong Lu
Comments: 48 pages, 21 figures

Abstract: StarCraft II (SC2) poses a grand challenge for reinforcement learning (RL), of which the main difficulties include huge state space, varying action space, and a long time horizon. In this work, we investigate a set of RL techniques for the full-length game of StarCraft II. We investigate a hierarchical RL approach involving extracted macro-actions and a hierarchical architecture of neural networks. We investigate a curriculum transfer training procedure and train the agent on a single machine with 4 GPUs and 48 CPU threads. On a 64x64 map and using restrictive units, we achieve a win rate of 99% against the level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve a 93% win rate against the most difficult non-cheating level built-in AI (level-7). In this extended version of the paper, we improve our architecture to train the agent against the cheating level AIs and achieve win rates of 96%, 97%, and 94% against the level-8, level-9, and level-10 AIs, respectively. Our codes are at this https URL. To provide a baseline referring to AlphaStar for our work as well as the research and open-source community, we reproduce a scaled-down version of it, mini-AlphaStar (mAS). The latest version of mAS is 1.07, which can be trained on the raw action space which has 564 actions. It is designed to run training on a single common machine, by making the hyper-parameters adjustable. We then compare our work with mAS using the same resources and show that our method is more effective. The codes of mini-AlphaStar are at this https URL. We hope our study could shed some light on the future research of efficient reinforcement learning on SC2 and other large-scale games.

ML-10-Title A Unified Perspective on Natural Gradient Variational Inference with Gaussian Mixture Models

Link: https://arxiv.org/abs/2209.11533
Authors: Oleg Arenz, Philipp Dahlinger, Zihan Ye, Michael Volpp, Gerhard Neumann
Comments:

Abstract: Variational inference with Gaussian mixture models (GMMs) enables learning of highly-tractable yet multi-modal approximations of intractable target distributions. GMMs are particularly relevant for problem settings with up to a few hundred dimensions, for example in robotics, for modelling distributions over trajectories or joint distributions. This work focuses on two very effective methods for GMM-based variational inference that both employ independent natural gradient updates for the individual components and the categorical distribution of the weights. We show, for the first time, that their derived updates are equivalent, although their practical implementations and theoretical guarantees differ. We identify several design choices that distinguish both approaches, namely with respect to sample selection, natural gradient estimation, stepsize adaptation, and whether trust regions are enforced or the number of components adapted. We perform extensive ablations on these design choices and show that they strongly affect the efficiency of the optimization and the variability of the learned distribution. Based on our insights, we propose a novel instantiation of our generalized framework that combines first-order natural gradient estimates with trust-regions and component adaption, and significantly outperforms both previous methods in all our experiments.

ML-11-Title An artificial neural network-based system for detecting machine failures using tiny sound data: A case study

Link: https://arxiv.org/abs/2209.11527
Authors: Thanh Tran, Sebastian Bader, Jan Lundgren
Comments: 8 pages, 9 figures, conference

Abstract: In an effort to advocate the research for a deep learning-based machine failure detection system, we present a case study of our proposed system based on a tiny sound dataset. Our case study investigates a variational autoencoder (VAE) for augmenting a small drill sound dataset from Valmet AB. The Valmet dataset contains 134 sounds, divided into two categories, “Anomaly” and “Normal”, recorded from a drilling machine at Valmet AB, a company in Sundsvall, Sweden that supplies equipment and processes for the production of biofuels. Using deep learning models to detect failure drills on such a small sound dataset is typically unsuccessful. We employed a VAE to increase the number of sounds in the tiny dataset by synthesizing new sounds from original sounds. The augmented dataset was created by combining these synthesized sounds with the original sounds. We used a high-pass filter with a passband frequency of 1000 Hz and a low-pass filter with a passband frequency of 22,000 Hz to pre-process sounds in the augmented dataset before transforming them to Mel spectrograms. The pre-trained 2D-CNN Alexnet was then trained using these Mel spectrograms. Compared with training the pre-trained Alexnet on the original tiny sound dataset, using the augmented sound dataset enhanced the CNN model’s classification results by 6.62% (94.12% when trained on the augmented dataset versus 87.5% when trained on the original dataset).

ML-12-Title The complexity of unsupervised learning of lexicographic preferences

Link: https://arxiv.org/abs/2209.11505
Authors: Hélène Fargier (IRIT-ADRIA, ANITI), Pierre-François Gimenez (CIDRE), Jérôme Mengin (IRIT-ADRIA, ANITI), Bao Ngoc Le Nguyen (INSA Toulouse)
Comments:

Abstract: This paper considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP-tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees.
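
To make the idea of ranking alternatives lexicographically concrete, here is a minimal sketch. It assumes a simplified linear model in which attributes have one fixed importance order and each attribute has one fixed, unconditional preference order over its values. The attribute names, values, and the helper functions `lex_key` and `rank` are hypothetical and are not taken from the paper.

```python
def lex_key(alt, attr_order, value_pref):
    """Sort key of an alternative: value positions, most important attribute first.

    A smaller key means a more preferred alternative.
    """
    return tuple(value_pref[a].index(alt[a]) for a in attr_order)

def rank(alt, attr_order, value_pref):
    """0-based rank of alt among all value combinations (0 = most preferred)."""
    r = 0
    for a in attr_order:
        r = r * len(value_pref[a]) + value_pref[a].index(alt[a])
    return r

# Hypothetical configurator: the engine matters more than the color.
attr_order = ["engine", "color"]
value_pref = {
    "engine": ["electric", "petrol"],   # electric preferred over petrol
    "color": ["red", "blue", "white"],  # red preferred, then blue, then white
}

car_a = {"engine": "petrol", "color": "red"}
car_b = {"engine": "electric", "color": "white"}

# car_b wins: the two cars differ on the most important attribute (engine),
# so the color preference is never consulted.
preferred = min([car_a, car_b], key=lambda c: lex_key(c, attr_order, value_pref))
rank_a = rank(car_a, attr_order, value_pref)
rank_b = rank(car_b, attr_order, value_pref)
```

Under this simplified model, one natural score for a candidate model is the average rank it assigns to the alternatives users actually chose (lower is better). This is an illustrative assumption and not necessarily the exact empirical risk used in the paper.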

ML-13-Title Sequential Causal Effect Variational Autoencoder: Time Series Causal Link Estimation under Hidden Confounding

Link: https://arxiv.org/abs/2209.11497
Authors: Violeta Teodora Trifunov, Maha Shadaydeh, Joachim Denzler
Comments:

Abstract: Estimating causal effects from observational data in the presence of latent variables sometimes leads to spurious relationships which can be misconceived as causal. This is an important issue in many fields such as finance and climate science. We propose Sequential Causal Effect Variational Autoencoder (SCEVAE), a novel method for time series causality analysis under hidden confounding. It is based on the CEVAE framework and recurrent neural networks. The causal link’s intensity of the confounded variables is calculated by using direct causal criteria based on Pearl’s do-calculus. We show the efficacy of SCEVAE by applying it to synthetic datasets with both linear and nonlinear causal links. Furthermore, we apply our method to real aerosol-cloud-climate observation data. We compare our approach to a time series deconfounding method with and without substitute confounders on the synthetic data. We demonstrate that our method performs better by comparing both methods to the ground truth. In the case of real data, we use the expert knowledge of causal links and show how the use of correct proxy variables aids data reconstruction.

ML-14-Title Active Few-Shot Classification: a New Paradigm for Data-Scarce Learning Settings

Link: https://arxiv.org/abs/2209.11481
Authors: Aymane Abdali, Vincent Gripon, Lucas Drumetz, Bartosz Boguslawski
Comments:

Abstract: We consider a novel formulation of the problem of Active Few-Shot Classification (AFSC) where the objective is to classify a small, initially unlabeled, dataset given a very restrained labeling budget. This problem can be seen as a rival paradigm to classical Transductive Few-Shot Classification (TFSC), as both these approaches are applicable in similar conditions. We first propose a methodology that combines statistical inference, and an original two-tier active learning strategy that fits well into this framework. We then adapt several standard vision benchmarks from the field of TFSC. Our experiments show the potential benefits of AFSC can be substantial, with gains in average weighted accuracy of up to 10% compared to state-of-the-art TFSC methods for the same labeling budget. We believe this new paradigm could lead to new developments and standards in data-scarce learning settings.

ML-15-Title Optimizing Class Distribution in Memory for Multi-Label Online Continual Learning

Link: https://arxiv.org/abs/2209.11469
Authors: Yan-Shuo Liang, Wu-Jun Li
Comments:

Abstract: Online continual learning, especially when task identities and task boundaries are unavailable, is a challenging continual learning setting. One representative class of methods for online continual learning is replay-based methods, in which a replay buffer called memory is maintained to keep a small part of past samples for overcoming catastrophic forgetting. When tackling online continual learning, most existing replay-based methods focus on single-label problems in which each sample in the data stream has only one label. But multi-label problems may also arise in the online continual learning setting, in which each sample may have more than one label. In the online setting with multi-label samples, the class distribution in the data stream is typically highly imbalanced, and it is challenging to control the class distribution in memory since changing the number of samples belonging to one class may affect the number of samples belonging to other classes. But the class distribution in memory is critical for replay-based methods to achieve good performance, especially when the class distribution in the data stream is highly imbalanced. In this paper, we propose a simple but effective method, called optimizing class distribution in memory (OCDM), for multi-label online continual learning. OCDM formulates the memory update mechanism as an optimization problem and updates the memory by solving this problem. Experiments on two widely used multi-label datasets show that OCDM can control the class distribution in memory well and can outperform other state-of-the-art methods.
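
Treating the memory update as an optimization problem can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the uniform target distribution, the squared-distance objective, and the greedy one-at-a-time removal are all choices made here for simplicity, and the function names are hypothetical. Each sample is represented only by its tuple of class labels.

```python
def class_distribution(samples, num_classes):
    """Normalized label counts over a list of multi-label samples."""
    counts = [0] * num_classes
    for labels in samples:
        for c in labels:
            counts[c] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def distance_to_target(samples, target):
    """Squared distance between the class distribution of samples and target."""
    dist = class_distribution(samples, len(target))
    return sum((p - q) ** 2 for p, q in zip(dist, target))

def update_memory(memory, batch, capacity, target):
    """Greedily drop samples until the pool fits, staying close to target."""
    pool = list(memory) + list(batch)
    while len(pool) > capacity:
        # Remove the sample whose removal leaves the pool closest to target.
        drop = min(
            range(len(pool)),
            key=lambda i: distance_to_target(pool[:i] + pool[i + 1:], target),
        )
        pool.pop(drop)
    return pool

# Toy stream: class 0 dominates the current memory.
memory = [(0,), (0, 1), (0,), (0, 2)]
batch = [(1,), (2,), (0,)]
target = [1 / 3, 1 / 3, 1 / 3]  # assumed target: uniform over three classes

new_memory = update_memory(memory, batch, capacity=4, target=target)
```

The example shows why the multi-label case is awkward: removing the sample `(0, 1)` lowers the counts of classes 0 and 1 at once, so the per-class counts cannot be adjusted independently.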

ML-16-Title Smart Active Sampling to enhance Quality Assurance Efficiency

Link: https://arxiv.org/abs/2209.11464
Authors: Clemens Heistracher, Stefan Stricker, Pedro Casas, Daniel Schall, Jana Kemnitz
Comments:

Abstract: We propose a new sampling strategy, called smart active sampling, for quality inspections outside the production line. Based on the principles of active learning, a machine learning model decides which samples are sent to quality inspection. On the one hand, this minimizes the production of scrap parts due to earlier detection of quality violations. On the other hand, quality inspection costs are reduced for smooth operation.

ML-17-Title A Preliminary Investigation of MLOps Practices in GitHub

Link: https://arxiv.org/abs/2209.11453
Authors: Fabio Calefato, Filippo Lanubile, Luigi Quaranta
Comments: Presented at ESEM '22, the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement

Abstract: Background. The rapid and growing popularity of machine learning (ML) applications has led to an increasing interest in MLOps, that is, the practice of continuous integration and deployment (CI/CD) of ML-enabled systems. Aims. Since changes may affect not only the code but also the ML model parameters and the data themselves, the automation of traditional CI/CD needs to be extended to manage model retraining in production. Method. In this paper, we present an initial investigation of the MLOps practices implemented in a set of ML-enabled systems retrieved from GitHub, focusing on GitHub Actions and CML, two solutions to automate the development workflow. Results. Our preliminary results suggest that the adoption of MLOps workflows in open-source GitHub projects is currently rather limited. Conclusions. Issues are also identified, which can guide future research work.

ML-18-Title A Robust and Explainable Data-Driven Anomaly Detection Approach For Power Electronics

Link: https://arxiv.org/abs/2209.11427
Authors: Alexander Beattie, Pavol Mulinka, Subham Sahoo, Ioannis T. Christou, Charalampos Kalalas, Daniel Gutierrez-Rojas, Pedro H. J. Nardelli
Comments:

Abstract: Timely and accurate detection of anomalies in power electronics is becoming increasingly critical for maintaining complex production systems. Robust and explainable strategies help decrease system downtime and preempt or mitigate infrastructure cyberattacks. This work begins by explaining the types of uncertainty present in current datasets and machine learning algorithm outputs. Three techniques for combating these uncertainties are then introduced and analyzed. We further present two anomaly detection and classification approaches, namely the Matrix Profile algorithm and anomaly transformer, which are applied in the context of a power electronic converter dataset. Specifically, the Matrix Profile algorithm is shown to be well suited as a generalizable approach for detecting real-time anomalies in streaming time-series data. The STUMPY python library implementation of the iterative Matrix Profile is used for the creation of the detector. A series of custom filters is created and added to the detector to tune its sensitivity, recall, and detection accuracy. Our numerical results show that, with simple parameter tuning, the detector provides high accuracy and performance in a variety of fault scenarios.
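
For readers unfamiliar with the Matrix Profile, the following is a minimal brute-force sketch of the underlying idea. It is not the STUMPY implementation mentioned in the abstract and does not include the paper's custom filters. For every window of length `m` it computes the z-normalized Euclidean distance to the nearest non-overlapping window; a window with an unusually large distance has no close match elsewhere in the series and is a candidate anomaly. The synthetic sine signal and the injected spike are assumptions made for illustration.

```python
import math

def znorm(seq):
    """Z-normalize a window (zero mean, unit variance)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(v - mean) / std for v in seq]

def matrix_profile(series, m):
    """Naive O(n^2 * m) matrix profile: nearest-neighbor distance per window."""
    n = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n)]
    exclusion = m // 2  # skip trivial matches with heavily overlapping windows
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) <= exclusion:
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(windows[i], windows[j])))
            best = min(best, d)
        profile.append(best)
    return profile

# Synthetic periodic signal (period 8) with one injected spike at index 30.
series = [math.sin(2 * math.pi * t / 8) for t in range(64)]
series[30] += 3.0

m = 8
profile = matrix_profile(series, m)
# The window with the largest profile value is the strongest anomaly candidate.
discord = max(range(len(profile)), key=profile.__getitem__)
```

On real streaming data a library implementation is the practical choice, since the brute-force loop above scales quadratically with the series length.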

ML-19-Title LEADER: Learning Attention over Driving Behaviors for Planning under Uncertainty

Link: https://arxiv.org/abs/2209.11422
Authors: Mohamad H. Danesh, Panpan Cai, David Hsu
Comments: CoRL 2022 (oral)

Abstract: Uncertainty about human behaviors poses a significant challenge to autonomous driving in crowded urban environments. Partially observable Markov decision processes (POMDPs) offer a principled framework for planning under uncertainty, often leveraging Monte Carlo sampling to achieve online performance for complex tasks. However, sampling also raises safety concerns by potentially missing critical events. To address this, we propose a new algorithm, LEarning Attention over Driving bEhavioRs (LEADER), that learns to attend to critical human behaviors during planning. LEADER learns a neural network generator to provide attention over human behaviors in real-time situations. It integrates the attention into a belief-space planner, using importance sampling to bias reasoning towards critical events. To train the algorithm, we let the attention generator and the planner form a min-max game. By solving the min-max game, LEADER learns to perform risk-aware planning without human labeling.
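
The role of importance sampling here can be shown with a small sketch. It is an illustration under assumptions, not LEADER itself: the two behaviors, their probabilities and costs, and the fixed proposal `q` are made up, whereas in the paper the attention is produced by a learned generator. The sketch samples behaviors from `q`, which over-represents the rare critical behavior, and reweights each sample by `p / q` so the expected-cost estimate stays unbiased.

```python
import random

def importance_estimate(p, q, cost, n, rng):
    """Estimate E_p[cost] from n samples drawn from the proposal q."""
    outcomes = list(p)
    proposal_weights = [q[o] for o in outcomes]
    total = 0.0
    for _ in range(n):
        o = rng.choices(outcomes, weights=proposal_weights)[0]
        total += cost[o] * p[o] / q[o]  # importance weight corrects the bias
    return total / n

# Hypothetical belief over another driver's behavior and the resulting cost.
p = {"yield": 0.99, "cut_in": 0.01}   # true belief: cutting in is rare
q = {"yield": 0.5, "cut_in": 0.5}     # assumed attention: focus on the rare case
cost = {"yield": 0.0, "cut_in": 100.0}

true_expected_cost = sum(p[o] * cost[o] for o in p)
estimate = importance_estimate(p, q, cost, n=2000, rng=random.Random(0))
```

With plain Monte Carlo sampling from `p`, a planner that draws only a handful of scenarios would usually see no `cut_in` event at all, which is the safety concern the abstract raises.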

ML-20-Title Relation Embedding based Graph Neural Networks for Handling Heterogeneous Graph

Link: https://arxiv.org/abs/2209.11414
Authors: Junfu Wang, Yuanfang Guo, Liang Yang, Yunhong Wang
Comments:

Abstract: Heterogeneous graph learning has drawn significant attentions in recent years, due to the success of graph neural networks (GNNs) and the broad applications of heterogeneous information networks. Various heterogeneous graph neural networks have been proposed to generalize GNNs for processing the heterogeneous graphs. Unfortunately, these approaches model the heterogeneity via various complicated modules. This paper aims to propose a simple yet efficient framework to make the homogeneous GNNs have adequate ability to handle heterogeneous graphs. Specifically, we propose Relation Embedding based Graph Neural Networks (RE-GNNs), which employ only one parameter per relation to embed the importance of edge type relations and self-loop connections. To optimize these relation embeddings and the other parameters simultaneously, a gradient scaling factor is proposed to constrain the embeddings to converge to suitable values. Besides, we theoretically demonstrate that our RE-GNNs have more expressive power than the meta-path based heterogeneous GNNs. Extensive experiments on the node classification tasks validate the effectiveness of our proposed method.
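
As a rough illustration of "one parameter per relation", the following sketch performs a single propagation step in which each edge type and the self-loop carry one scalar weight. The update rule, the function name, and the toy graph are assumptions made for this example; the paper's actual layer, normalization, and gradient scaling factor are not reproduced here.

```python
def relation_weighted_step(features, edges, relation_weight, self_weight):
    """One step: h_i' = w_self * h_i + sum over edges (j -> i, r) of w_r * h_j."""
    dim = len(next(iter(features.values())))
    out = {node: [self_weight * v for v in h] for node, h in features.items()}
    for src, dst, rel in edges:
        w = relation_weight[rel]  # a single scalar shared by all edges of this type
        for k in range(dim):
            out[dst][k] += w * features[src][k]
    return out

# Toy heterogeneous graph (hypothetical node and relation names).
features = {"paper": [1.0, 0.0], "author": [0.0, 1.0], "venue": [1.0, 1.0]}
edges = [("author", "paper", "writes"), ("venue", "paper", "publishes")]
relation_weight = {"writes": 0.5, "publishes": 2.0}
updated = relation_weighted_step(features, edges, relation_weight, self_weight=1.0)
print(updated["paper"])
```

In a trained model the entries of `relation_weight` and `self_weight` would be learned jointly with the other network parameters.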

ML-21-Title Achieve the Minimum Width of Neural Networks for Universal Approximation

Link: https://arxiv.org/abs/2209.11395
Authors: Yongqiang Cai
Comments:

Abstract: The universal approximation property (UAP) of neural networks is fundamental for deep learning, and it is well known that wide neural networks are universal approximators of continuous functions within both the $L^p$ norm and the continuous/uniform norm. However, the exact minimum width, $w_{\min}$, for the UAP has not been studied thoroughly. Recently, using a decoder-memorizer-encoder scheme, Park et al. (2021) found that $w_{\min} = \max(d_x+1, d_y)$ for both the $L^p$-UAP of ReLU networks and the $C$-UAP of ReLU+STEP networks, where $d_x, d_y$ are the input and output dimensions, respectively. In this paper, we consider neural networks with an arbitrary set of activation functions. We prove that both $C$-UAP and $L^p$-UAP for functions on compact domains share a universal lower bound of the minimal width; that is, $w^*_{\min} = \max(d_x, d_y)$. In particular, the critical width, $w^*_{\min}$, for $L^p$-UAP can be achieved by leaky-ReLU networks, provided that the input or output dimension is larger than one. Our construction is based on the approximation power of neural ordinary differential equations and the ability to approximate flow maps by neural networks. The nonmonotone or discontinuous activation functions case and the one-dimensional case are also discussed.

ML-22-Title Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

Link: https://arxiv.org/abs/2209.11379
Authors: Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, Justin Gilmer
Comments:

Abstract: Recent research has proposed a series of specialized optimization algorithms for deep multi-task models. It is often claimed that these multi-task optimization (MTO) methods yield solutions that are superior to the ones found by simply optimizing a weighted average of the task losses. In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empirical validity of these claims. We show that, despite the added design and computational complexity of these algorithms, MTO methods do not yield any performance improvements beyond what is achievable via traditional optimization approaches. We highlight alternative strategies that consistently yield improvements to the performance profile and point out common training pitfalls that might cause suboptimal results. Finally, we outline challenges in reliably evaluating the performance of MTO algorithms and discuss potential solutions.
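
The baseline this abstract compares against, optimizing a weighted average of the task losses, can be sketched in a few lines. The two quadratic "task losses", the weights, and the learning rate below are hypothetical choices for illustration and have nothing to do with the paper's language and vision experiments.

```python
def weighted_loss_grad(x, weights):
    """Gradient of w1 * (x - 1)^2 + w2 * (x + 1)^2 with respect to x."""
    w1, w2 = weights
    return 2.0 * w1 * (x - 1.0) + 2.0 * w2 * (x + 1.0)

def minimize(weights, lr=0.1, steps=200, x0=0.0):
    """Plain gradient descent on the scalarized (weighted-average) objective."""
    x = x0
    for _ in range(steps):
        x -= lr * weighted_loss_grad(x, weights)
    return x

# Task 1 prefers x = 1, task 2 prefers x = -1; the weights set the trade-off.
x_star = minimize((0.75, 0.25))
print("solution for weights (0.75, 0.25):", x_star)
```

Sweeping the weights traces out different trade-offs between the two tasks, which is the simple procedure the specialized MTO methods are claimed to improve on.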

ML-23-Title A Jensen-Shannon Divergence Based Loss Function for Bayesian Neural Networks

Link: https://arxiv.org/abs/2209.11366
Authors: Ponkrshnan Thiagarajan, Susanta Ghosh
Comments: To be submitted for peer review in IEEE

Abstract: Kullback-Leibler (KL) divergence is widely used for variational inference of Bayesian Neural Networks (BNNs). However, the KL divergence has limitations such as unboundedness and asymmetry. We examine the Jensen-Shannon (JS) divergence that is more general, bounded, and symmetric. We formulate a novel loss function for BNNs based on the geometric JS divergence and show that the conventional KL divergence-based loss function is its special case. We evaluate the divergence part of the proposed loss function in a closed form for a Gaussian prior. For any other general prior, Monte Carlo approximations can be used. We provide algorithms for implementing both of these cases. We demonstrate that the proposed loss function offers an additional parameter that can be tuned to control the degree of regularisation. We derive the conditions under which the proposed loss function regularises better than the KL divergence-based loss function for Gaussian priors and posteriors. We demonstrate performance improvements over the state-of-the-art KL divergence-based BNN on the classification of a noisy CIFAR data set and a biased histopathology data set.
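
The unboundedness and asymmetry of the KL divergence, and the contrasting behavior of the JS divergence, can be checked on a small discrete example. Note the assumption: this sketch uses the standard (arithmetic-mixture) JS divergence on hand-picked discrete distributions, not the geometric JS divergence or the BNN loss proposed in the paper.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if q is 0 where p > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """Standard Jensen-Shannon divergence, bounded above by ln 2."""
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

# Illustrative distributions: p puts no mass on the third outcome.
p = [0.5, 0.5, 0.0]
q = [0.1, 0.1, 0.8]
print("KL(p||q) =", kl(p, q), " KL(q||p) =", kl(q, p))
print("JS(p,q)  =", js(p, q), " JS(q,p)  =", js(q, p))
```

Here KL(p||q) is finite while KL(q||p) is infinite, whereas JS gives the same finite value in both directions.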

ML-24-Title Convolutional Learning on Multigraphs

Link: https://arxiv.org/abs/2209.11354
Authors: Landon Butler, Alejandro Parada-Mayorga, Alejandro Ribeiro
Comments:

Abstract: Graph convolutional learning has led to many exciting discoveries in diverse areas. However, in some applications, traditional graphs are insufficient to capture the structure and intricacies of the data. In such scenarios, multigraphs arise naturally as discrete structures in which complex dynamics can be embedded. In this paper, we develop convolutional information processing on multigraphs and introduce convolutional multigraph neural networks (MGNNs). To capture the complex dynamics of information diffusion within and across each of the multigraph’s classes of edges, we formalize a convolutional signal processing model, defining the notions of signals, filtering, and frequency representations on multigraphs. Leveraging this model, we develop a multigraph learning architecture, including a sampling procedure to reduce computational complexity. The introduced architecture is applied towards optimal wireless resource allocation and a hate speech localization task, offering improved performance over traditional graph neural networks.

ML-25-Title StyleTime: Style Transfer for Synthetic Time Series Generation

Link: https://arxiv.org/abs/2209.11306
Authors: Yousef El-Laham, Svitlana Vyetrenko
Comments:

Abstract: Neural style transfer is a powerful computer vision technique that can incorporate the artistic “style” of one image to the “content” of another. The underlying theory behind the approach relies on the assumption that the style of an image is represented by the Gram matrix of its features, which is typically extracted from pre-trained convolutional neural networks (e.g., VGG-19). This idea does not straightforwardly extend to time series stylization since notions of style for two-dimensional images are not analogous to notions of style for one-dimensional time series. In this work, a novel formulation of time series style transfer is proposed for the purpose of synthetic data generation and enhancement. We introduce the concept of stylized features for time series, which is directly related to the time series realism properties, and propose a novel stylization algorithm, called StyleTime, that uses explicit feature extraction techniques to combine the underlying content (trend) of one time series with the style (distributional properties) of another. Further, we discuss evaluation metrics, and compare our work to existing state-of-the-art time series generation and augmentation schemes. To validate the effectiveness of our methods, we use stylized synthetic data as a means for data augmentation to improve the performance of recurrent neural network models on several forecasting tasks.
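
The Gram-matrix representation of image style mentioned in the abstract is simply the matrix of inner products between feature channels. The sketch below computes it for a tiny hand-written feature map; in practice the features would come from a pre-trained network such as VGG-19, and the values here are placeholders. It illustrates the image-style baseline the abstract describes, not the StyleTime algorithm.

```python
def gram_matrix(feature_maps):
    """feature_maps: list of C channels, each a flat list of N activations.
    Returns the C x C matrix of channel-by-channel inner products."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in feature_maps]
            for fi in feature_maps]

# Two channels with three spatial positions each (placeholder activations).
features = [[1.0, 2.0, 0.0],
            [0.0, 1.0, 1.0]]
G = gram_matrix(features)
print(G)
```

Because the sum runs over spatial positions, the Gram matrix discards where a feature occurs and keeps only how channels co-activate, which is why it captures texture-like "style" for images.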

ML-26-Title An Investigation of the Bias-Variance Tradeoff in Meta-Gradients

Link: https://arxiv.org/abs/2209.11303
Authors: Risto Vuorio, Jacob Beck, Shimon Whiteson, Jakob Foerster, Gregory Farquhar
Comments:

Abstract: Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. In this context, prior work has investigated the estimation of the Hessian of the RL objective, as well as tackling the problem of credit assignment to pre-adaptation behavior by making a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always adds bias and can also add variance to meta-gradient estimation. Meanwhile, meta-gradient estimation has been studied less in the important long-horizon setting, where backpropagation through the full inner optimization trajectories is not feasible. We study the bias and variance tradeoff arising from truncated backpropagation and sampling correction, and additionally compare to evolution strategies, which is a recently popular alternative strategy to long-horizon meta-learning. While prior work implicitly chooses points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study that relates existing estimators to each other.

ML-27-Title Scalable Gaussian Process Hyperparameter Optimization via Coverage Regularization

Link: https://arxiv.org/abs/2209.11280
Authors: Killian Wood, Alec M. Dunton, Amanda Muyskens, Benjamin W. Priest
Comments: 4 pages content, 3 figures, 6 tables

Abstract: Gaussian processes (GPs) are Bayesian non-parametric models popular in a variety of applications due to their accuracy and native uncertainty quantification (UQ). Tuning GP hyperparameters is critical to ensure the validity of prediction accuracy and uncertainty; uniquely estimating multiple hyperparameters in, e.g. the Matern kernel can also be a significant challenge. Moreover, training GPs on large-scale datasets is a highly active area of research: traditional maximum likelihood hyperparameter training requires quadratic memory to form the covariance matrix and has cubic training complexity. To address the scalable hyperparameter tuning problem, we present a novel algorithm which estimates the smoothness and length-scale parameters in the Matern kernel in order to improve robustness of the resulting prediction uncertainties. Using novel loss functions similar to those in conformal prediction algorithms in the computational framework provided by the hyperparameter estimation algorithm MuyGPs, we achieve improved UQ over leave-one-out likelihood maximization while maintaining a high degree of scalability as demonstrated in numerical experiments.
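
For readers unfamiliar with the two hyperparameters being tuned, the sketch below evaluates the Matern kernel in its well-known closed forms for smoothness values 0.5, 1.5, and 2.5. The function name and defaults are assumptions for this example; it does not use the MuyGPs framework or the loss functions proposed in the paper.

```python
import math

def matern(r, length_scale=1.0, nu=1.5, variance=1.0):
    """Closed-form Matern kernel for the half-integer smoothness values 0.5, 1.5, 2.5."""
    s = abs(r) / length_scale
    if nu == 0.5:
        return variance * math.exp(-s)
    if nu == 1.5:
        a = math.sqrt(3.0) * s
        return variance * (1.0 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5.0) * s
        return variance * (1.0 + a + a * a / 3.0) * math.exp(-a)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")

for nu in (0.5, 1.5, 2.5):
    print("nu =", nu, "k(0.1) =", matern(0.1, nu=nu))
```

Larger smoothness keeps nearby points more strongly correlated, while the length scale stretches or compresses the distance axis; both directly shape the predictive uncertainty the abstract is concerned with.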

ML-28-Title Environment Optimization for Multi-Agent Navigation

Link: https://arxiv.org/abs/2209.11279
Authors: Zhan Gao, Amanda Prorok
Comments:

Abstract: Traditional approaches to the design of multi-agent navigation algorithms consider the environment as a fixed constraint, despite the obvious influence of spatial constraints on agents’ performance. Yet hand-designing improved environment layouts and structures is inefficient and potentially expensive. The goal of this paper is to consider the environment as a decision variable in a system-level optimization problem, where both agent performance and environment cost can be accounted for. We begin by proposing a novel environment optimization problem. We show, through formal proofs, under which conditions the environment can change while guaranteeing completeness (i.e., all agents reach their navigation goals). Our solution leverages a model-free reinforcement learning approach. In order to accommodate a broad range of implementation scenarios, we include both online and offline optimization, and both discrete and continuous environment representations. Numerical results corroborate our theoretical findings and validate our approach.

ML-29-Title Minimizing Human Assistance: Augmenting a Single Demonstration for Deep Reinforcement Learning

Link: https://arxiv.org/abs/2209.11275
Authors: Abraham George, Alison Bartsch, Amir Barati Farimani
Comments: 7 pages, 11 figures

Abstract: The use of human demonstrations in reinforcement learning has proven to significantly improve agent performance. However, any requirement for a human to manually ‘teach’ the model is somewhat antithetical to the goals of reinforcement learning. This paper attempts to minimize human involvement in the learning process while still retaining the performance advantages by using a single human example collected through a simple-to-use virtual reality simulation to assist with RL training. Our method augments a single demonstration to generate numerous human-like demonstrations that, when combined with Deep Deterministic Policy Gradients and Hindsight Experience Replay (DDPG + HER), significantly improve training time on simple tasks and allow the agent to solve a complex task (block stacking) that DDPG + HER alone cannot solve. The model achieves this significant training advantage using a single human example, requiring less than a minute of human input.

ML-30-Title Artificial Intelligence in Material Engineering: A review on applications of AI in Material Engineering

Link: https://arxiv.org/abs/2209.11234
Authors: Lipichanda Goswami, Manoj Deka, Mohendra Roy
Comments: V1

Abstract: Recently, there has been extensive use of artificial Intelligence (AI) in the field of material engineering. This can be attributed to the development of high performance computing and thereby feasibility to test deep learning models with large parameters. In this article we tried to review some of the latest developments in the applications of AI in material engineering.

ML-31-Title Multidimensional Interactive Fixed-Effects

Link: https://arxiv.org/abs/2209.11691
Authors: Hugo Freeman
Comments:

Abstract: This paper studies a linear and additively separable model for multidimensional panel data of three or more dimensions with unobserved interactive fixed effects. Two approaches are considered to account for these unobserved interactive fixed-effects when estimating coefficients on the observed covariates. First, the model is embedded within the standard two-dimensional panel framework and restrictions are derived under which the factor structure methods in Bai (2009) lead to consistent estimation of model parameters. The second approach considers group fixed-effects and kernel methods that are more robust to the multidimensional nature of the problem. Theoretical results and simulations show the benefit of standard two-dimensional panel methods when the structure of the interactive fixed-effect term is known, but also highlight how the group fixed-effects and kernel methods perform well without knowledge of this structure. The methods are implemented to estimate the demand elasticity for beer under a handful of models for demand.

ML-32-Title Exact conservation laws for neural network integrators of dynamical systems

Link: https://arxiv.org/abs/2209.11661
Authors: Eike Hermann Müller
Comments: 21 pages, 16 figures; submitted to Journal of Computational Physics

Abstract: The solution of time dependent differential equations with neural networks has attracted a lot of attention recently. The central idea is to learn the laws that govern the evolution of the solution from data, which might be polluted with random noise. However, in contrast to other machine learning applications, usually a lot is known about the system at hand. For example, for many dynamical systems physical quantities such as energy or (angular) momentum are exactly conserved. Hence, the neural network has to learn these conservation laws from data and they will only be satisfied approximately due to finite training time and random noise. In this paper we present an alternative approach which uses Noether’s Theorem to inherently incorporate conservation laws into the architecture of the neural network. We demonstrate that this leads to better predictions for three model systems: the motion of a non-relativistic particle in a three-dimensional Newtonian gravitational potential, the motion of a massive relativistic particle in the Schwarzschild metric and a system of two interacting particles in four dimensions.

ML-33-Title Differentiable physics-enabled closure modeling for Burgers turbulence

Link: https://arxiv.org/abs/2209.11614
Authors: Varun Shankar, Vedant Puri, Ramesh Balakrishnan, Romit Maulik, Venkatasubramanian Viswanathan
Comments:

Abstract: Data-driven turbulence modeling is experiencing a surge in interest following algorithmic and hardware developments in the data sciences. We discuss an approach using the differentiable physics paradigm that combines known physics with machine learning to develop closure models for Burgers’ turbulence. We consider the 1D Burgers system as a prototypical test problem for modeling the unresolved terms in advection-dominated turbulence problems. We train a series of models that incorporate varying degrees of physical assumptions on an a posteriori loss function to test the efficacy of models across a range of system parameters, including viscosity, time, and grid resolution. We find that constraining models with inductive biases in the form of partial differential equations that contain known physics or existing closure approaches produces highly data-efficient, accurate, and generalizable models, outperforming state-of-the-art baselines. Addition of structure in the form of physics information also brings a level of interpretability to the models, potentially offering a stepping stone to the future of closure modeling.

ML-34-Title Power Management in Smart Residential Building with Deep Learning Model for Occupancy Detection by Usage Pattern of Electric Appliances

Link: https://arxiv.org/abs/2209.11520
Authors: Sangkeum Lee, Sarvar Hussain Nengroo, Hojun Jin, Yoonmee Doh, Chungho Lee, Taewook Heo, Dongsoo Har
Comments: 11 pages, 7 figures, to be submitted to 7th International Conference on Renewable Energy and Conservation, ICREC 2022

Abstract: With the growth of smart building applications, occupancy information in residential buildings is becoming more and more significant. In the context of the smart buildings’ paradigm, this kind of information is required for a wide range of purposes, including enhancing energy efficiency and occupant comfort. In this study, occupancy detection in residential building is implemented using deep learning based on technical information of electric appliances. To this end, a novel approach of occupancy detection for smart residential building system is proposed. The dataset of electric appliances, sensors, light, and HVAC, which is measured by smart metering system and is collected from 50 households, is used for simulations. To classify the occupancy among datasets, the support vector machine and autoencoder algorithm are used. Confusion matrix is utilized for accuracy, precision, recall, and F1 to demonstrate the comparative performance of the proposed method in occupancy detection. The proposed algorithm achieves occupancy detection using technical information of electric appliances by 95.7~98.4%. To validate occupancy detection data, principal component analysis and the t-distributed stochastic neighbor embedding (t-SNE) algorithm are employed. Power consumption with renewable energy system is reduced to 11.1~13.1% in smart buildings by using occupancy detection.
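
The four metrics the abstract reads off the confusion matrix follow directly from its counts. The sketch below computes them for a binary occupied/unoccupied labeling; the label vectors are made-up illustrative data, not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = occupied)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Illustrative labels only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```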

ML-35-Title Error Mitigation-Aided Optimization of Parameterized Quantum Circuits: Convergence Analysis

Link: https://arxiv.org/abs/2209.11514
Authors: Sharu Theresa Jose, Osvaldo Simeone
Comments: Submitted for journal publication

Abstract: Variational quantum algorithms (VQAs) offer the most promising path to obtaining quantum advantages via noisy intermediate-scale quantum (NISQ) processors. Such systems leverage classical optimization to tune the parameters of a parameterized quantum circuit (PQC). The goal is minimizing a cost function that depends on measurement outputs obtained from the PQC. Optimization is typically implemented via stochastic gradient descent (SGD). On NISQ computers, gate noise due to imperfections and decoherence affects the stochastic gradient estimates by introducing a bias. Quantum error mitigation (QEM) techniques can reduce the estimation bias without requiring any increase in the number of qubits, but they in turn cause an increase in the variance of the gradient estimates. This work studies the impact of quantum gate noise on the convergence of SGD for the variational eigensolver (VQE), a fundamental instance of VQAs. The main goal is ascertaining conditions under which QEM can enhance the performance of SGD for VQEs. It is shown that quantum gate noise induces a non-zero error-floor on the convergence error of SGD (evaluated with respect to a reference noiseless PQC), which depends on the number of noisy gates, the strength of the noise, as well as the eigenspectrum of the observable being measured and minimized. In contrast, with QEM, any arbitrarily small error can be obtained. Furthermore, for error levels attainable with or without QEM, QEM can reduce the number of required iterations, but only as long as the quantum noise level is sufficiently small, and a sufficiently large number of measurements is allowed at each SGD iteration. Numerical examples for a max-cut problem corroborate the main theoretical findings.
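
The bias-for-variance trade that the abstract attributes to QEM can be seen in a toy calculation. The sketch below uses two-point Richardson (zero-noise) extrapolation, one well-known mitigation technique chosen here purely for illustration; the abstract does not say which QEM method the paper analyzes, and the linear noise model and all numbers are assumptions.

```python
def noisy_expectation(ideal, bias_per_unit_noise, noise_scale):
    """Hypothetical linear noise model: the bias grows linearly with the noise scale."""
    return ideal + bias_per_unit_noise * noise_scale

def richardson_two_point(e1, e2):
    """Extrapolate to zero noise from expectations at noise scales 1 and 2."""
    return 2.0 * e1 - e2

def variance_inflation():
    """Var(2*E1 - E2) = 4*Var(E1) + Var(E2) for independent, equal-variance estimates."""
    return 2.0 ** 2 + 1.0 ** 2

ideal = -1.0
e1 = noisy_expectation(ideal, 0.3, 1.0)   # biased value at the native noise level
e2 = noisy_expectation(ideal, 0.3, 2.0)   # value with the noise deliberately doubled
mitigated = richardson_two_point(e1, e2)
print("unmitigated:", e1, "mitigated:", mitigated,
      "variance factor:", variance_inflation())
```

Under this idealized model the extrapolation removes the bias entirely, but each mitigated estimate carries five times the shot-noise variance, which is why more measurements per SGD iteration are needed for mitigation to pay off.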

ML-36-Title Image Classification using Sequence of Pixels

Link: https://arxiv.org/abs/2209.11495
Authors: Gajraj Kuldeep
Comments:

Abstract: This study compares sequential image classification methods based on recurrent neural networks. We describe methods based on recurrent neural networks such as long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM) architectures. We also review the state-of-the-art sequential image classification architectures. We mainly focus on LSTM, BiLSTM, temporal convolution network, and independent recurrent neural network architectures in the study. It is known that RNNs struggle to learn long-term dependencies in the input sequence. We use a simple feature construction method that applies the orthogonal Ramanujan periodic transform to the input sequence. Experiments demonstrate that if these features are given to LSTM or BiLSTM networks, the performance increases drastically. Our focus in this study is to increase the training accuracy while simultaneously reducing the training time for the LSTM and BiLSTM architectures, not to push the state-of-the-art results, so we use simple LSTM/BiLSTM architectures. We compare sequential input with the constructed feature as input to single-layer LSTM and BiLSTM networks for the MNIST and CIFAR datasets. We observe that sequential input to the LSTM network with 128 hidden units, trained for five epochs, results in a training accuracy of 33%, whereas constructed features as input to the same LSTM network result in a training accuracy of 90% with 1/3 less time.

ML-37-Title Computational Discovery of Energy-Efficient Heat Treatment for Microstructure Design using Deep Reinforcement Learning

Link: https://arxiv.org/abs/2209.11259
Authors: Jaber R. Mianroodi, Nima H. Siboni, Dierk Raabe
Comments:

Abstract: Deep Reinforcement Learning (DRL) is employed to develop autonomously optimized and custom-designed heat-treatment processes that are both, microstructure-sensitive and energy efficient. Different from conventional supervised machine learning, DRL does not rely on static neural network training from data alone, but a learning agent autonomously develops optimal solutions, based on reward and penalty elements, with reduced or no supervision. In our approach, a temperature-dependent Allen-Cahn model for phase transformation is used as the environment for the DRL agent, serving as the model world in which it gains experience and takes autonomous decisions. The agent of the DRL algorithm is controlling the temperature of the system, as a model furnace for heat-treatment of alloys. Microstructure goals are defined for the agent based on the desired microstructure of the phases. After training, the agent can generate temperature-time profiles for a variety of initial microstructure states to reach the final desired microstructure state. The agent’s performance and the physical meaning of the heat-treatment profiles generated are investigated in detail. In particular, the agent is capable of controlling the temperature to reach the desired microstructure starting from a variety of initial conditions. This capability of the agent in handling a variety of conditions paves the way for using such an approach also for recycling-oriented heat treatment process design where the initial composition can vary from batch to batch, due to impurity intrusion, and also for the design of energy-efficient heat treatments. For testing this hypothesis, an agent without penalty on the total consumed energy is compared with one that considers energy costs. The energy cost penalty is imposed as an additional criterion on the agent for finding the optimal temperature-time profile.

ML-38-Title Assessing Robustness of EEG Representations under Data-shifts via Latent Space and Uncertainty Analysis

Link: https://arxiv.org/abs/2209.11233
Authors: Neeraj Wagh, Jionghao Wei, Samarth Rawal, Brent M. Berry, Yogatheesan Varatharajah
Comments: Preprint under review

Abstract: The recent availability of large datasets in bio-medicine has inspired the development of representation learning methods for multiple healthcare applications. Despite advances in predictive performance, the clinical utility of such methods is limited when exposed to real-world data. Here we develop model diagnostic measures to detect potential pitfalls during deployment without assuming access to external data. Specifically, we focus on modeling realistic data shifts in electrophysiological signals (EEGs) via data transforms, and extend the conventional task-based evaluations with analyses of a) model’s latent space and b) predictive uncertainty, under these transforms. We conduct experiments on multiple EEG feature encoders and two clinically relevant downstream tasks using publicly available large-scale clinical EEGs. Within this experimental setting, our results suggest that measures of latent space integrity and model uncertainty under the proposed data shifts may help anticipate performance degradation during deployment.

ML-39-Title DFX A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Link: https://arxiv.org/abs/2209.10797
Authors: Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim
Comments: Extension of HOTCHIPS 2022 and accepted in MICRO 2022

Abstract: Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. The conventional platforms such as GPU are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.

Computer Vision

CV-0-Title Lightweight Transformers for Human Activity Recognition on Mobile Devices

Link: https://arxiv.org/abs/2209.11750
Authors: Sannara EK, François Portet, Philippe Lalanda
Comments:

Abstract: Human Activity Recognition (HAR) on mobile devices has been shown to be achievable with lightweight neural models learned from data generated by the user’s inertial measurement units (IMUs). Most approaches for instance-based HAR have used Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), or a combination of the two to achieve state-of-the-art results with real-time performance. Recently, the Transformer architecture, first in the language processing domain and then in the vision domain, has pushed the state of the art beyond classical architectures. However, such Transformer architectures are heavy in computing resources and therefore not well suited for the embedded HAR applications found in the pervasive computing domain. In this study, we present Human Activity Recognition Transformer (HART), a lightweight, sensor-wise transformer architecture that has been specifically adapted to the domain of the IMUs embedded on mobile devices. Our experiments on HAR tasks with several publicly available datasets show that HART uses fewer FLoating-point Operations Per Second (FLOPS) and parameters while outperforming current state-of-the-art results. Furthermore, we present evaluations across various architectures of their performance in heterogeneous environments and show that our models can better generalize on different sensing devices or on-body positions.

CV-1-Title Adaptive-SpikeNet Event-based Optical Flow Estimation using Spiking Neural Networks with Learnable Neuronal Dynamics

Link: https://arxiv.org/abs/2209.11741
Authors: Adarsh Kumar Kosta, Kaushik Roy
Comments:

Abstract: Event-based cameras have recently shown great potential for high-speed motion estimation owing to their ability to capture temporally rich information asynchronously. Spiking Neural Networks (SNNs), with their neuro-inspired event-driven processing, can efficiently handle such asynchronous data, while neuron models such as the leaky integrate-and-fire (LIF) model can keep track of the quintessential timing information contained in the inputs. SNNs achieve this by maintaining a dynamic state in the neuron memory, retaining important information while forgetting redundant data over time. Thus, we posit that SNNs would allow for better performance on sequential regression tasks compared to similarly sized Analog Neural Networks (ANNs). However, deep SNNs are difficult to train due to vanishing spikes at later layers. To that effect, we propose an adaptive fully-spiking framework with learnable neuronal dynamics to alleviate the spike vanishing problem. We utilize surrogate gradient-based backpropagation through time (BPTT) to train our deep SNNs from scratch. We validate our approach for the task of optical flow estimation on the Multi-Vehicle Stereo Event-Camera (MVSEC) dataset and the DSEC-Flow dataset. Our experiments on these datasets show an average reduction of 13% in average endpoint error (AEE) compared to state-of-the-art ANNs. We also explore several down-scaled models and observe that our SNN models consistently outperform similarly sized ANNs, offering 10%-16% lower AEE. These results demonstrate the importance of SNNs for smaller models and their suitability at the edge. In terms of efficiency, our SNNs offer substantial savings in network parameters (48x) and computational energy (51x) while attaining ~10% lower EPE compared to the state-of-the-art ANN implementations.
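
The leaky integrate-and-fire behaviour described in this abstract (a neuron that keeps a decaying internal state and emits a spike when that state crosses a threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions: a discrete-time update, a multiplicative leak, a fixed threshold, and a hard reset to zero. The function name `lif_neuron` and the parameter values are hypothetical, and the paper makes the neuronal dynamics learnable rather than fixed as they are here.

```python
from typing import List, Sequence


def lif_neuron(inputs: Sequence[float], leak: float = 0.5,
               threshold: float = 0.9) -> List[int]:
    """Simulate one leaky integrate-and-fire neuron over a sequence of input currents.

    Assumptions (not taken from the paper): discrete time steps,
    multiplicative leak, fixed threshold, hard reset to zero after a spike.
    """
    membrane = 0.0
    spikes: List[int] = []
    for current in inputs:
        # Leak the previous state, then integrate the new input.
        membrane = leak * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


# A constant input of 0.5 charges the membrane as 0.5, 0.75, 0.875, 0.9375,
# so the neuron fires on every fourth step.
spike_train = lif_neuron([0.5] * 8)
print(spike_train)  # prints [0, 0, 0, 1, 0, 0, 0, 1]
```

A smaller `leak` makes the neuron forget past inputs faster, which is the kind of trade-off that learnable dynamics could tune per layer.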

CV-2-Title On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks

Link: https://arxiv.org/abs/2209.11740
Authors: Hubert Leterme (UGA, LJK), Kévin Polisano (UGA, LJK), Valérie Perrier (Grenoble INP, LJK), Karteek Alahari (LJK)
Comments:

Abstract: In this paper, we aim to improve the mathematical interpretability of convolutional neural networks for image classification. When trained on natural image datasets, such networks tend to learn parameters in the first layer that closely resemble oriented Gabor filters. By leveraging the properties of discrete Gabor-like convolutions, we prove that, under specific conditions, feature maps computed by the subsequent max pooling operator tend to approximate the modulus of complex Gabor-like coefficients, and as such, are stable with respect to certain input shifts. We then compute a probabilistic measure of shift invariance for these layers. More precisely, we show that some filters, depending on their frequency and orientation, are more likely than others to produce stable image representations. We experimentally validate our theory by considering a deterministic feature extractor based on the dual-tree wavelet packet transform, a particular case of discrete Gabor-like decomposition. We demonstrate a strong correlation between shift invariance on the one hand and similarity with complex modulus on the other hand.
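
The idea that max pooling of a real Gabor-like response can approximate the complex modulus may be easier to see on a toy signal. The sketch below is not from the paper. It assumes an idealized first-layer response, namely the real part of a unit-modulus complex oscillation, and a pooling window spanning exactly one period. The helper names `real_response` and `max_pool` are hypothetical.

```python
import math
from typing import List


def real_response(num_samples: int, period: int, shift: float) -> List[float]:
    """Real part of a unit-modulus complex oscillation, sampled at n + shift."""
    omega = 2.0 * math.pi / period
    return [math.cos(omega * (n + shift)) for n in range(num_samples)]


def max_pool(signal: List[float], window: int) -> List[float]:
    """Non-overlapping 1D max pooling with stride equal to the window size."""
    return [max(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]


PERIOD = 8  # assumed oscillation period, also used as the pooling window
for shift in (0.0, 3.0, 0.5):
    pooled = max_pool(real_response(64, PERIOD, shift), PERIOD)
    # The complex modulus is 1.0 everywhere; compare the pooled values to it.
    print(shift, round(min(pooled), 4), round(max(pooled), 4))
```

In this toy setting, integer shifts leave the pooled values at the modulus (1.0 up to rounding), while a half-sample shift lowers them to cos(π/8), roughly 0.92. That gap hints at why the stability result holds only under specific conditions on frequency and window size.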

CV-3-Title Catoptric Light can be Dangerous Effective Physical-World Attack by Natural Phenomenon

Link: https://arxiv.org/abs/2209.11739
Authors: Chengyin Hu, Weiwen Shi
Comments: arXiv admin note: substantial text overlap with arXiv:2209.09652, arXiv:2209.02430

Abstract: Deep neural networks (DNNs) have achieved great success in many tasks. Therefore, it is crucial to evaluate the robustness of advanced DNNs. Traditional methods use stickers as physical perturbations to fool classifiers, but stickers are hard to make stealthy and suffer from printing loss. Some newer physical attacks use light beams (e.g., lasers, projectors), whose optical patterns are artificial rather than natural. In this work, we study a new type of physical attack, called adversarial catoptric light (AdvCL), in which adversarial perturbations are generated by a common natural phenomenon, catoptric light, to achieve stealthy and naturalistic adversarial attacks against advanced DNNs in physical environments. Carefully designed experiments demonstrate the effectiveness of the proposed method in simulated and real-world environments. The attack success rate is 94.90% on a subset of ImageNet and 83.50% in the real-world environment. We also discuss the transferability of AdvCL and defense strategies against this attack.

CV-4-Title Semantic scene descriptions as an objective of human vision

Link: https://arxiv.org/abs/2209.11737
Authors: Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Ian Charest
Comments:

Abstract: Interpreting the meaning of a visual scene requires not only identification of its constituent objects, but also a rich semantic characterization of object interrelations. Here, we study the neural mechanisms underlying visuo-semantic transformations by applying modern computational techniques to a large-scale 7T fMRI dataset of human brain responses elicited by complex natural scenes. Using semantic embeddings obtained by applying linguistic deep learning models to human-generated scene descriptions, we identify a widely distributed network of brain regions that encode semantic scene descriptions. Importantly, these semantic embeddings better explain activity in these regions than traditional object category labels. In addition, they are effective predictors of activity despite the fact that the participants did not actively engage in a semantic task, suggesting that visuo-semantic transformations are a default mode of vision. In support of this view, we then show that highly accurate reconstructions of scene captions can be directly linearly decoded from patterns of brain activity. Finally, a recurrent convolutional neural network trained on semantic embeddings further outperforms semantic embeddings in predicting brain activity, providing a mechanistic model of the brain’s visuo-semantic transformations. Together, these experimental and computational results suggest that transforming visual input into rich semantic scene descriptions may be a central objective of the visual system, and that focusing efforts on this new objective may lead to improved models of visual information processing in the human brain.

CV-5-Title Boost CTR Prediction for New Advertisements via Modeling Visual Content

Link: https://arxiv.org/abs/2209.11727
Authors: Tan Yu, Zhipeng Jin, Jie Liu, Yi Yang, Hongliang Fei, Ping Li
Comments:

Abstract: Existing advertisement click-through rate (CTR) prediction models depend mainly on behavior ID features, which are learned from historical user-ad interactions. However, behavior ID features that rely on historical user behaviors cannot describe new ads that have no previous interactions with users. To overcome the limitations of behavior ID features in modeling new ads, we exploit the visual content in ads to boost the performance of CTR prediction models. Specifically, we map each ad into a set of visual IDs based on its visual content. These visual IDs are further used for generating the visual embedding for enhancing CTR prediction models. We formulate the learning of visual IDs as a supervised quantization problem. Due to a lack of class labels for commercial images in advertisements, we exploit image textual descriptions as the supervision to optimize the image extractor for generating effective visual IDs. Meanwhile, since hard quantization is non-differentiable, we soften the quantization operation so that it supports end-to-end network training. After mapping each image into visual IDs, we learn the embedding for each visual ID based on the historical user-ad interactions accumulated in the past. Since the visual ID embedding depends only on the visual content, it generalizes well to new ads. Meanwhile, the visual ID embedding complements the ad behavior ID embedding. Thus, it can considerably boost the performance of CTR prediction models that previously relied on behavior ID features, both for new ads and for ads that have accumulated rich user behaviors. After incorporating the visual ID embedding in the CTR prediction model of Baidu online advertising, the average CTR of ads improves by 1.46%, and the total charge increases by 1.10%.
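
The step of softening a non-differentiable hard quantization can be illustrated with a small sketch. The abstract does not give the paper's exact formulation, so the code below assumes one common relaxation: a softmax over negative squared distances to a codebook, with a temperature that controls how close the result is to a hard assignment. The names `soft_quantize` and `hard_quantize`, the codebook, and the feature values are all hypothetical.

```python
import math
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_quantize(feature: Sequence[float],
                  codebook: Sequence[Sequence[float]]) -> int:
    """Index of the nearest codeword (argmin, hence non-differentiable)."""
    dists = [squared_distance(feature, code) for code in codebook]
    return dists.index(min(dists))


def soft_quantize(feature: Sequence[float],
                  codebook: Sequence[Sequence[float]],
                  temperature: float = 1.0) -> List[float]:
    """Softmax over negative distances: a smooth stand-in for the argmin."""
    logits = [-squared_distance(feature, code) / temperature for code in codebook]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # toy "visual ID" codewords
feature = [0.9, 1.2]                             # toy image feature
weights = soft_quantize(feature, codebook, temperature=0.5)
print(hard_quantize(feature, codebook), [round(w, 3) for w in weights])
```

As the temperature approaches zero, the soft weights concentrate on the codeword that `hard_quantize` would pick, so the relaxation can be tightened once training no longer needs smooth gradients.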

CV-6-Title Multilevel Robustness for 2D Vector Field Feature Tracking Selection and Comparison

Link: https://arxiv.org/abs/2209.11708
Authors: Lin Yan, Paul Aaron Ullrich, Luke P. Van Roekel, Bei Wang, Hanqi Guo
Comments:

Abstract: Critical point tracking is a core topic in scientific visualization for understanding the dynamic behavior of time-varying vector field data. The topological notion of robustness has been introduced recently to quantify the structural stability of critical points, that is, the robustness of a critical point is the minimum amount of perturbation to the vector field necessary to cancel it. A theoretical basis has been established previously that relates critical point tracking with the notion of robustness, in particular, critical points could be tracked based on their closeness in stability, measured by robustness, instead of just distance proximities within the domain. However, in practice, the computation of classic robustness may produce artifacts when a critical point is close to the boundary of the domain; thus, we do not have a complete picture of the vector field behavior within its local neighborhood. To alleviate these issues, we introduce a multilevel robustness framework for the study of 2D time-varying vector fields. We compute the robustness of critical points across varying neighborhoods to capture the multiscale nature of the data and to mitigate the boundary effect suffered by the classic robustness computation. We demonstrate via experiments that such a new notion of robustness can be combined seamlessly with existing feature tracking algorithms to improve the visual interpretability of vector fields in terms of feature tracking, selection, and comparison for large-scale scientific simulations. We observe, for the first time, that the minimum multilevel robustness is highly correlated with physical quantities used by domain scientists in studying a real-world tropical cyclone dataset. Such observation helps to increase the physical interpretability of robustness.

CV-7-Title Multivariate Wasserstein Functional Connectivity for Autism Screening

Link: https://arxiv.org/abs/2209.11703
Authors: Oleg Kachan, Alexander Bernstein
Comments:

Abstract: Most approaches to the estimation of brain functional connectivity from functional magnetic resonance imaging (fMRI) data rely on computing some measure of statistical dependence, or more generally, a distance between univariate representative time series of regions of interest (ROIs) consisting of multiple voxels. However, summarizing a ROI’s multiple time series with its mean or the first principal component (1PC) may result in the loss of information because, for example, the 1PC explains only a small fraction of the variance of the multivariate signal of neuronal activity. We propose to compare ROIs directly, without the use of representative time series, defining a new measure of multivariate connectivity between ROIs, not necessarily consisting of the same number of voxels, based on the Wasserstein distance. We assess the proposed Wasserstein functional connectivity measure on the autism screening task, demonstrating its superiority over commonly used univariate and multivariate functional connectivity measures.

CV-8-Title Edge-oriented Implicit Neural Representation with Channel Tuning

Link: https://arxiv.org/abs/2209.11697
Authors: Wonjoon Chang, Dahee Kwon, Bumjin Park
Comments:

Abstract: Implicit neural representation, which expresses an image as a continuous function rather than a discrete grid, is widely used for image processing. Despite its strong results, it still has limitations in restoring clear shapes of a given signal, such as the edges of an image. In this paper, we propose the Gradient Magnitude Adjustment algorithm, which calculates the gradient of an image for training the implicit representation. In addition, we propose the Edge-oriented Representation Network (EoREN), which can reconstruct the image with clear edges by fitting gradient information (Edge-oriented module). Furthermore, we add a Channel-tuning module to adjust the distribution of given signals, which addresses a chronic problem of fitting gradients. By separating the backpropagation paths of the two modules, EoREN can learn the true color of the image without hindering the role of gradients. We show qualitatively that our model can reconstruct complex signals, and demonstrate its general reconstruction ability with quantitative results.

CV-9-Title Dynamic camera alignment optimization problem based on Fractal Decomposition based Algorithm

Link: https://arxiv.org/abs/2209.11695
Authors: Arcadi Llanza, Nadiya Shvai, Amir Nakib
Comments:

Abstract: In this work, we tackle the Dynamic Optimization Problem (DOP) of IA in a real-world application using a recently introduced Dynamic Optimization Algorithm (DOA) called Fractal Decomposition Algorithm (FDA). We used FDA to perform IA on a CCTV camera feed from a tunnel. Because the camera viewpoint can change for multiple reasons, such as wind or maintenance, alignment is required to guarantee the correct functioning of the video-based traffic security system.

CV-10-Title Rate-Distortion in Image Coding for Machines

Link: https://arxiv.org/abs/2209.11694
Authors: Alon Harell, Anderson De Andrade, Ivan V. Bajic
Comments:

Abstract: In recent years, there has been a sharp increase in transmission of images to remote servers specifically for the purpose of computer vision. In many applications, such as surveillance, images are mostly transmitted for automated analysis, and rarely seen by humans. Using traditional compression for this scenario has been shown to be inefficient in terms of bit-rate, likely due to the focus on human based distortion metrics. Thus, it is important to create specific image coding methods for joint use by humans and machines. One way to create the machine side of such a codec is to perform feature matching of some intermediate layer in a Deep Neural Network performing the machine task. In this work, we explore the effects of the layer choice used in training a learnable codec for humans and machines. We prove, using the data processing inequality, that matching features from deeper layers is preferable in the sense of rate-distortion. Next, we confirm our findings empirically by re-training an existing model for scalable human-machine coding. In our experiments we show the trade-off between the human and machine sides of such a scalable model, and discuss the benefit of using deeper layers for training in that regard.

CV-11-Title T3VIP Transformation-based 3D Video Prediction

Link: https://arxiv.org/abs/2209.11693
Authors: Iman Nematollahi, Erick Rosete-Beas, Seyed Mahdi B. Azad, Raghu Rajan, Frank Hutter, Wolfram Burgard
Comments: Accepted at the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Abstract: For autonomous skill acquisition, robots have to learn about the physical rules governing the 3D world dynamics from their own past experience to predict and reason about plausible future outcomes. To this end, we propose a transformation-based 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals. To fully leverage all the 2D and 3D observational signals, we equip our model with automatic hyperparameter optimization (HPO) to interpret the best way of learning from them. To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera. Our extensive evaluation with simulated and real-world datasets demonstrates that our formulation leads to interpretable 3D models that predict future depth videos while achieving on-par performance with 2D models on RGB video prediction. Moreover, we demonstrate that our model outperforms 2D baselines on visuomotor control. Videos, code, dataset, and pre-trained models are available at this http URL.

CV-12-Title Meteorological Satellite Images Prediction Based on Deep Multi-scales Extrapolation Fusion

Link: https://arxiv.org/abs/2209.11682
Authors: Fang Huang, Wencong Cheng, PanFeng Wang, ZhiGang Wang, HongHong He
Comments:

Abstract: Meteorological satellite imagery is critical for meteorologists. The data have played an important role in monitoring and analyzing weather and climate changes. However, satellite imagery is a kind of observation data, and there is a significant time delay when transmitting the data back to Earth. It is important to make accurate predictions for meteorological satellite images, especially nowcasting predictions up to 2 hours ahead. In recent years, there has been growing interest in nowcasting applications for weather radar images based on deep learning. Compared to weather radar image prediction, the main challenge for meteorological satellite image prediction is the large-scale observation areas and therefore the large size of the observation products. Here we present a deep multi-scale extrapolation fusion method to address the challenge of nowcasting meteorological satellite images. First, we downsample the original large-size satellite image dataset into several image datasets with smaller resolutions, then we use a deep spatiotemporal sequence prediction method to generate multi-scale prediction images at the different resolutions separately. Second, we fuse the multi-scale prediction results into target prediction images of the original size using a conditional generative adversarial network. The experiments based on the FY-4A meteorological satellite data show that the proposed method can generate realistic prediction images that effectively capture the evolution of weather systems in detail. We believe that the general idea of this work can potentially be applied to other large-size spatiotemporal sequence prediction tasks.
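
The first step of the method, turning one large image into several lower-resolution copies, can be sketched as a simple image pyramid. The abstract does not say which downsampling operator the authors use, so this sketch assumes 2x2 average pooling. The names `downsample_2x` and `build_pyramid` and the toy 4x4 image are hypothetical.

```python
from typing import List

Image = List[List[float]]  # row-major grid of pixel intensities


def downsample_2x(image: Image) -> Image:
    """Halve height and width by averaging each non-overlapping 2x2 block."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


def build_pyramid(image: Image, levels: int) -> List[Image]:
    """Return [full resolution, 1/2 scale, 1/4 scale, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


# Toy 4x4 "satellite image" holding the values 0..15.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(image, levels=3)
print([(len(level), len(level[0])) for level in pyramid])  # prints [(4, 4), (2, 2), (1, 1)]
```

Each pyramid level would then be fed to its own sequence predictor, and the per-scale predictions fused back to the original size, which the paper does with a conditional generative adversarial network.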

CV-13-Title An Overview of Violence Detection Techniques Current Challenges and Future Directions

Link: https://arxiv.org/abs/2209.11680
Authors: Nadia Mumtaz, Naveed Ejaz, Shabana Habib, Syed Muhammad Mohsin, Prayag Tiwari, Shahab S. Band, Neeraj Kumar
Comments: Artificial Intelligence Review

Abstract: The Big Video Data generated in today’s smart cities has raised concerns about its purposeful usage: surveillance cameras, among many other sources, are the most prominent contributors to the huge volumes of data, making automated analysis difficult in terms of computation and precision. Violence Detection (VD), which broadly falls under the action and activity recognition domain, is used to analyze Big Video Data for anomalous actions caused by humans. The VD literature is traditionally based on manually engineered features, though deep learning based standalone models have since been developed for real-time VD analysis. This paper provides an overview of deep sequence learning approaches along with localization strategies for the detected violence. This overview also dives into the early image processing and machine learning-based VD literature and its possible advantages, such as efficiency compared with the current complex models. Furthermore, the datasets are discussed to provide an analysis of the current models, explaining their pros and cons, with future directions in the VD domain derived from an in-depth analysis of the previous methods.

CV-14-Title PNeRF Probabilistic Neural Scene Representations for Uncertain 3D Visual Mapping

Link: https://arxiv.org/abs/2209.11677
Authors: Yassine Ahmine, Arnab Dey, Andrew I. Comport
Comments: 7 Pages, 6 Figures, 5 Tables. Submitted to IEEE International Conference on Robotics and Automation 2023 (ICRA 2023)

Abstract: Recently, neural scene representations have provided very impressive results for representing 3D scenes visually; however, their study and progress have mainly been limited to visualization of virtual models in computer graphics or scene reconstruction in computer vision, without explicitly accounting for sensor and pose uncertainty. Using this novel scene representation in robotics applications, however, would require accounting for this uncertainty in the neural map. The aim of this paper is therefore to propose a novel method for training probabilistic neural scene representations with uncertain training data that could enable the inclusion of these representations in robotics applications. Acquiring images using cameras or depth sensors involves inherent uncertainty, and furthermore, the camera poses used for learning a 3D model are also imperfect. If these measurements are used for training without accounting for their uncertainty, then the resulting models are non-optimal, and the resulting scene representations are likely to contain artifacts such as blur and uneven geometry. In this work, the problem of integrating uncertainty into the learning process is investigated by focusing on training with uncertain information in a probabilistic manner. The proposed method involves explicitly augmenting the training likelihood with an uncertainty term such that the learnt probability distribution of the network is minimized with respect to the training uncertainty. It will be shown that this leads to more accurate image rendering quality, in addition to more precise and consistent geometry. Validation has been carried out on both synthetic and real datasets, showing that the proposed approach outperforms state-of-the-art methods. The results show notably that the proposed method is capable of rendering novel high-quality views even when the training data is limited.
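
One way to picture a training likelihood augmented with an uncertainty term is a Gaussian negative log-likelihood in which every measurement carries its own standard deviation. This is a generic heteroscedastic loss chosen for illustration, not the loss defined in the paper. The function name `gaussian_nll` and the example values are hypothetical.

```python
import math
from typing import Sequence


def gaussian_nll(predicted: Sequence[float], observed: Sequence[float],
                 sigma: Sequence[float]) -> float:
    """Mean Gaussian negative log-likelihood with a per-measurement std. dev.

    Assumption (not from the paper): each observation is modelled as
    observed ~ Normal(predicted, sigma**2), independently of the others.
    """
    total = 0.0
    for pred, obs, std in zip(predicted, observed, sigma):
        variance = std * std
        residual_term = (pred - obs) ** 2 / variance   # error, down-weighted by uncertainty
        penalty_term = math.log(2.0 * math.pi * variance)  # discourages inflating sigma
        total += 0.5 * (residual_term + penalty_term)
    return total / len(predicted)


# The same residual of 3.0 costs less when the measurement is flagged as uncertain.
confident = gaussian_nll([3.0], [0.0], [1.0])
uncertain = gaussian_nll([3.0], [0.0], [3.0])
print(round(confident, 3), round(uncertain, 3))
```

The logarithmic term keeps the loss from being minimized trivially by declaring every measurement highly uncertain.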

Summary: Neural scene representations render 3D scenes impressively, but work so far has focused on graphics visualization and vision-based reconstruction without explicitly accounting for sensor and pose uncertainty, which robotics applications require. This paper proposes training probabilistic neural scene representations from uncertain data by explicitly augmenting the training likelihood with an uncertainty term. The authors report more accurate image rendering and more precise, consistent geometry on both synthetic and real datasets, outperforming state-of-the-art methods and rendering high-quality novel views even when training data is limited.
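The abstract does not state the exact form of the uncertainty term. As a rough illustration only, one common way to let measurement uncertainty enter a training likelihood is a Gaussian negative log-likelihood with a per-sample standard deviation. The function names and the scalar setting below are assumptions made for this sketch, not the authors' formulation.

```python
import math

def gaussian_nll(pred, target, sigma):
    """Negative log-likelihood of `target` under N(pred, sigma^2)."""
    var = sigma ** 2
    return 0.5 * math.log(2 * math.pi * var) + (target - pred) ** 2 / (2 * var)

def uncertain_loss(preds, targets, sigmas):
    """Mean NLL over samples; the squared error of a sample is scaled by 1 / (2 * sigma^2)."""
    terms = [gaussian_nll(p, t, s) for p, t, s in zip(preds, targets, sigmas)]
    return sum(terms) / len(terms)
```

Under this assumed loss, the same prediction error costs less on a measurement with a larger `sigma`, so noisy pixels or poses pull less strongly on the fitted model.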

CV-15-Title Image-to-Image Translation for Autonomous Driving from Coarsely-Aligned Image Pairs

Link: https://arxiv.org/abs/2209.11673
Authors: Youya Xia, Josephine Monica, Wei-Lun Chao, Bharath Hariharan, Kilian Q Weinberger, Mark Campbell
Comments: Submitted to the International Conference on Robotics and Automation (ICRA) 2023

Abstract: A self-driving car must be able to reliably handle adverse weather conditions (e.g., snowy) to operate safely. In this paper, we investigate the idea of turning sensor inputs (i.e., images) captured in an adverse condition into a benign one (i.e., sunny), upon which the downstream tasks (e.g., semantic segmentation) can attain high accuracy. Prior work primarily formulates this as an unpaired image-to-image translation problem due to the lack of paired images captured under the exact same camera poses and semantic layouts. While perfectly-aligned images are not available, one can easily obtain coarsely-paired images. For instance, many people drive the same routes daily in both good and adverse weather; thus, images captured at close-by GPS locations can form a pair. Though data from repeated traversals are unlikely to capture the same foreground objects, we posit that they provide rich contextual information to supervise the image translation model. To this end, we propose a novel training objective leveraging coarsely-aligned image pairs. We show that our coarsely-aligned training scheme leads to a better image translation quality and improved downstream tasks, such as semantic segmentation, monocular depth estimation, and visual localization.

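Summary: A self-driving car must handle adverse weather (e.g., snow) reliably to operate safely. The paper studies translating sensor images captured in adverse conditions into benign (sunny) ones, on which downstream tasks such as semantic segmentation reach high accuracy. Prior work treats this as unpaired image-to-image translation because perfectly aligned pairs are unavailable. Coarsely paired images, however, are easy to obtain from repeated drives along the same route at nearby GPS locations. The authors propose a training objective that exploits such coarsely aligned pairs and report better translation quality and improved semantic segmentation, monocular depth estimation, and visual localization.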

CV-16-Title View-Invariant Skeleton-based Action Recognition via Global-Local Contrastive Learning

Link: https://arxiv.org/abs/2209.11634
Authors: Cunling Bian, Wei Feng, Fanbo Meng, Song Wang
Comments:

Abstract: Skeleton-based human action recognition has been drawing more interest recently due to its low sensitivity to appearance changes and the accessibility of more skeleton data. However, even the 3D skeletons captured in practice are still sensitive to the viewpoint and direction, given the occlusion of different human-body joints and the errors in human joint localization. Such view variance of skeleton data may significantly affect the performance of action recognition. To address this issue, we propose in this paper a new view-invariant representation learning approach, without any manual action labeling, for skeleton-based human action recognition. Specifically, we leverage the multi-view skeleton data simultaneously taken for the same person in the network training, by maximizing the mutual information between the representations extracted from different views, and then propose a global-local contrastive loss to model the multi-scale co-occurrence relationships in both spatial and temporal domains. Extensive experimental results show that the proposed method is robust to the view difference of the input skeleton data and significantly boosts the performance of unsupervised skeleton-based human action methods, resulting in new state-of-the-art accuracies on two challenging multi-view benchmarks of PKUMMD and NTU RGB+D.

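Summary: Skeleton-based action recognition is attracting interest because it is insensitive to appearance changes and skeleton data is increasingly accessible, yet 3D skeletons captured in practice remain sensitive to viewpoint because of joint occlusion and joint localization errors. The paper proposes a view-invariant representation learning approach that needs no manual action labels. It maximizes the mutual information between representations extracted from simultaneously captured multi-view skeletons of the same person, and adds a global-local contrastive loss to model multi-scale co-occurrence relationships in the spatial and temporal domains. The method is reported to be robust to view differences and to reach new state-of-the-art accuracy for unsupervised methods on the PKUMMD and NTU RGB+D multi-view benchmarks.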

CV-17-Title I-SPLIT Deep Network Interpretability for Split Computing

Link: https://arxiv.org/abs/2209.11607
Authors: Federico Cunico, Luigi Capogrosso, Francesco Setti, Damiano Carra, Franco Fummi, Marco Cristani
Comments: ICPR 2022

Abstract: This work makes a substantial step in the field of split computing, i.e., how to split a deep neural network to host its early part on an embedded device and the rest on a server. So far, potential split locations have been identified exploiting uniquely architectural aspects, i.e., based on the layer sizes. Under this paradigm, the efficacy of the split in terms of accuracy can be evaluated only after having performed the split and retrained the entire pipeline, making an exhaustive evaluation of all the plausible splitting points prohibitive in terms of time. Here we show that not only the architecture of the layers does matter, but the importance of the neurons contained therein too. A neuron is important if its gradient with respect to the correct class decision is high. It follows that a split should be applied right after a layer with a high density of important neurons, in order to preserve the information flowing until then. Upon this idea, we propose Interpretable Split (I-SPLIT): a procedure that identifies the most suitable splitting points by providing a reliable prediction on how well this split will perform in terms of classification accuracy, beforehand of its effective implementation. As a further major contribution of I-SPLIT, we show that the best choice for the splitting point on a multiclass categorization problem depends also on which specific classes the network has to deal with. Exhaustive experiments have been carried out on two networks, VGG16 and ResNet-50, and three datasets, Tiny-Imagenet-200, notMNIST, and Chest X-Ray Pneumonia. The source code is available at this https URL.

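Summary: This work addresses split computing, i.e., how to split a deep neural network so that its early part runs on an embedded device and the rest on a server. Split locations have so far been chosen from architectural cues alone (layer sizes), and the accuracy of a split could only be judged after splitting and retraining the whole pipeline, which makes exhaustive evaluation of all candidate points prohibitively slow. The authors show that the importance of the neurons in a layer also matters: a neuron is important if its gradient with respect to the correct class decision is high, so a split should come right after a layer with a high density of important neurons. Interpretable Split (I-SPLIT) uses this idea to predict, before implementation, how well a split will perform in classification accuracy, and shows that the best split point also depends on which classes the network must handle. Experiments cover VGG16 and ResNet-50 on Tiny-Imagenet-200, notMNIST, and Chest X-Ray Pneumonia.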
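The selection rule described in the abstract (split right after a layer with a high density of important neurons) can be illustrated with a toy ranking. This is only a sketch: the per-neuron gradient magnitudes, the threshold, and both function names are hypothetical, and the abstract does not say how I-SPLIT actually aggregates gradients.

```python
def important_density(grads, threshold):
    """Fraction of neurons in one layer whose |gradient| exceeds `threshold`."""
    return sum(1 for g in grads if abs(g) > threshold) / len(grads)

def rank_split_points(layer_grads, threshold=0.5):
    """Layer indices ordered by density of important neurons, highest first."""
    densities = [important_density(g, threshold) for g in layer_grads]
    return sorted(range(len(layer_grads)), key=lambda i: densities[i], reverse=True)

# Hypothetical gradient magnitudes for three candidate layers.
layer_grads = [
    [0.1, 0.2, 0.9, 0.0],
    [0.8, 0.7, 0.6, 0.1],
    [0.4, 0.9, 0.3, 0.2],
]
best_layer = rank_split_points(layer_grads)[0]
```

In this toy input the second layer (index 1) has three of four neurons above the threshold, so it is ranked first as the candidate to split after.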

CV-18-Title Multi-Granularity Graph Pooling for Video-based Person Re-Identification

Link: https://arxiv.org/abs/2209.11584
Authors: Honghu Pan, Yongyong Chen, Zhenyu He
Comments:

Abstract: The video-based person re-identification (ReID) aims to identify the given pedestrian video sequence across multiple non-overlapping cameras. To aggregate the temporal and spatial features of the video samples, the graph neural networks (GNNs) are introduced. However, existing graph-based models, like STGCN, perform mean/max pooling on node features to obtain the graph representation, which neglects the graph topology and node importance. In this paper, we propose the graph pooling network (GPNet) to learn the multi-granularity graph representation for the video retrieval, where the graph pooling layer is implemented to downsample the graph. We first construct a multi-granular graph, whose node features denote image embedding learned by backbone, and edges are established between the temporal and Euclidean neighborhood nodes. We then implement multiple graph convolutional layers to perform the neighborhood aggregation on the graphs. To downsample the graph, we propose a multi-head full attention graph pooling (MHFAPool) layer, which integrates the advantages of existing node clustering and node selection pooling methods. Specifically, MHFAPool takes the main eigenvector of full attention matrix as the aggregation coefficients to involve the global graph information in each pooled node. Extensive experiments demonstrate that our GPNet achieves the competitive results on four widely-used datasets, i.e., MARS, DukeMTMC-VideoReID, iLIDS-VID and PRID-2011.

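Summary: Video-based person re-identification (ReID) aims to identify a given pedestrian video sequence across multiple non-overlapping cameras, and graph neural networks (GNNs) are used to aggregate temporal and spatial features. Existing graph-based models such as STGCN apply mean/max pooling to node features, which ignores graph topology and node importance. The paper proposes the graph pooling network (GPNet), which builds a multi-granular graph whose nodes are backbone image embeddings and whose edges link temporal and Euclidean neighbors, applies graph convolutions for neighborhood aggregation, and downsamples the graph with a multi-head full attention graph pooling (MHFAPool) layer. MHFAPool uses the main eigenvector of the full attention matrix as aggregation coefficients so that each pooled node carries global graph information. GPNet is reported to be competitive on MARS, DukeMTMC-VideoReID, iLIDS-VID, and PRID-2011.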
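To make the "main eigenvector as aggregation coefficients" idea concrete, here is a minimal sketch that finds the dominant eigenvector of a small positive matrix by power iteration and uses it to weight node features. The matrix values, the sum-to-one normalization, and the function names are assumptions for illustration. The abstract does not specify how the attention matrix is built or normalized, nor how multiple heads are combined.

```python
def principal_eigenvector(matrix, iters=100):
    """Power iteration on a square matrix with positive entries.

    Returns the dominant eigenvector scaled so its entries sum to 1.
    """
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]
    return v

def pool(node_features, coeffs):
    """Weighted sum of node feature vectors, giving one pooled feature vector."""
    dim = len(node_features[0])
    return [sum(c * f[d] for c, f in zip(coeffs, node_features)) for d in range(dim)]

# Hypothetical 2-node attention matrix and node features.
attention = [[3.0, 1.0], [1.0, 1.0]]
coeffs = principal_eigenvector(attention)
pooled = pool([[1.0, 0.0], [0.0, 1.0]], coeffs)
```

Because the coefficients come from the whole matrix rather than a single row, every node contributes to the pooled vector in proportion to its global weight.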

CV-19-Title Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network

Link: https://arxiv.org/abs/2209.11582
Authors: Honghu Pan, Qiao Liu, Yongyong Chen, Yunqi He, Yuan Zheng, Feng Zheng, Zhenyu He
Comments:

Abstract: Existing methods for video-based person re-identification (ReID) mainly learn the appearance feature of a given pedestrian via a feature extractor and a feature aggregator. However, the appearance models would fail when different pedestrians have similar appearances. Considering that different pedestrians have different walking postures and body proportions, we propose to learn the discriminative pose feature beyond the appearance feature for video retrieval. Specifically, we implement a two-branch architecture to separately learn the appearance feature and pose feature, and then concatenate them together for inference. To learn the pose feature, we first detect the pedestrian pose in each frame through an off-the-shelf pose detector, and construct a temporal graph using the pose sequence. We then exploit a recurrent graph convolutional network (RGCN) to learn the node embeddings of the temporal pose graph, which devises a global information propagation mechanism to simultaneously achieve the neighborhood aggregation of intra-frame nodes and message passing among inter-frame graphs. Finally, we propose a dual-attention method consisting of node-attention and time-attention to obtain the temporal graph representation from the node embeddings, where the self-attention mechanism is employed to learn the importance of each node and each frame. We verify the proposed method on three video-based ReID datasets, i.e., Mars, DukeMTMC and iLIDS-VID, whose experimental results demonstrate that the learned pose feature can effectively improve the performance of existing appearance models.

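Summary: Existing video-based person re-identification (ReID) methods mainly learn a pedestrian's appearance feature with a feature extractor and a feature aggregator, and they fail when different pedestrians look alike. Since walking posture and body proportions differ between people, the paper learns a discriminative pose feature in addition to the appearance feature, using a two-branch architecture whose outputs are concatenated at inference. An off-the-shelf pose detector produces per-frame poses that form a temporal graph. A recurrent graph convolutional network (RGCN) learns its node embeddings with a global information propagation mechanism that combines intra-frame neighborhood aggregation with message passing between frames. A dual-attention method (node attention and time attention) then produces the temporal graph representation. Experiments on Mars, DukeMTMC, and iLIDS-VID indicate that the learned pose feature improves existing appearance models.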

CV-20-Title Towards Complete-View and High-Level Pose-based Gait Recognition

Link: https://arxiv.org/abs/2209.11577
Authors: Honghu Pan, Yongyong Chen, Tingyang Xu, Yunqi He, Zhenyu He
Comments:

Abstract: The model-based gait recognition methods usually adopt the pedestrian walking postures to identify human beings. However, existing methods did not explicitly resolve the large intra-class variance of human pose due to camera views changing. In this paper, we propose to generate multi-view pose sequences for each single-view pose sample by learning full-rank transformation matrices via lower-upper generative adversarial network (LUGAN). By the prior of camera imaging, we derive that the spatial coordinates between cross-view poses satisfy a linear transformation of a full-rank matrix, thereby, this paper employs the adversarial training to learn transformation matrices from the source pose and target views to obtain the target pose sequences. To this end, we implement a generator composed of graph convolutional (GCN) layers, fully connected (FC) layers and two-branch convolutional (CNN) layers: GCN layers and FC layers encode the source pose sequence and target view, then CNN branches learn a lower triangular matrix and an upper triangular matrix, respectively, finally they are multiplied to formulate the full-rank transformation matrix. For the purpose of adversarial training, we further devise a condition discriminator that distinguishes whether the pose sequence is true or generated. To enable the high-level correlation learning, we propose a plug-and-play module, named multi-scale hypergraph convolution (HGC), to replace the spatial graph convolutional layer in baseline, which could simultaneously model the joint-level, part-level and body-level correlations. Extensive experiments on two large gait recognition datasets, i.e., CASIA-B and OUMVLP-Pose, demonstrate that our method outperforms the baseline model and existing pose-based methods by a large margin.

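Summary: Model-based gait recognition identifies people from their walking postures, but existing methods do not explicitly handle the large intra-class variance of human pose caused by changing camera views. The paper generates multi-view pose sequences for each single-view sample by learning full-rank transformation matrices with a lower-upper generative adversarial network (LUGAN). From the camera imaging prior, the spatial coordinates of cross-view poses are related by a linear transformation with a full-rank matrix. The generator uses graph convolutional (GCN) and fully connected (FC) layers to encode the source pose sequence and target view, and two convolutional (CNN) branches to learn a lower triangular and an upper triangular matrix whose product forms the transformation. A condition discriminator judges whether a pose sequence is real or generated. A plug-and-play multi-scale hypergraph convolution (HGC) module replaces the spatial graph convolution in the baseline to model joint-level, part-level, and body-level correlations. On CASIA-B and OUMVLP-Pose the method is reported to outperform the baseline and existing pose-based methods by a large margin.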
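A small numeric check helps show why a lower-times-upper factorization is a convenient way to obtain a full-rank matrix: the determinant of the product equals the product of both diagonals, so it is non-zero whenever all diagonal entries are non-zero. The 3x3 factors below are hypothetical values chosen for this sketch. Whether LUGAN constrains the diagonals in this way is not stated in the abstract.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def transform(matrix, point):
    """Apply a 3x3 matrix to one 3D joint coordinate."""
    return [sum(matrix[i][k] * point[k] for k in range(3)) for i in range(3)]

# Hypothetical factors; in the paper's setting they would be network outputs.
LOWER = [[1, 0, 0], [2, 1, 0], [0, 3, 1]]
UPPER = [[2, 1, 0], [0, 1, 4], [0, 0, 3]]
T = matmul(LOWER, UPPER)
```

Here the diagonals multiply to 1 * 1 * 1 * 2 * 1 * 3 = 6, which matches `det3(T)`, so `T` is invertible and maps a source-view joint to a target-view joint without collapsing any dimension.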

CV-21-Title Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

Link: https://arxiv.org/abs/2209.11572
Authors: Xiang Fang, Daizong Liu, Pan Zhou, YuChong Hu
Comments:

Abstract: As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully-annotated datasets are available in one domain (“source domain”), but the domain of interest (“target domain”) only contains unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module is designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; (iii) a specific alignment module tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations.

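Summary: Video moment retrieval (VMR) localizes the target moment in an untrimmed video for a given language query. Most methods rely on many manual moment-boundary annotations, which are expensive, and models trained on one dataset lose considerable performance on an unseen domain. The paper introduces cross-domain VMR, where a fully annotated source domain is available but the target domain contains only unannotated data, and presents what the authors believe is the first study of this task. Their Multi-Modal Cross-Domain Alignment (MMCDA) network transfers annotation knowledge from source to target with three modules: (i) a domain alignment module that aligns feature distributions across domains for each modality; (ii) a cross-modal alignment module that maps video and query features into a joint embedding space and aligns the modalities in the target domain; and (iii) a specific alignment module that captures fine-grained similarity between a frame and the query for localization. Trained jointly, the modules yield domain-invariant, semantically aligned cross-modal representations.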

CV-22-Title Query-based Hard-Image Retrieval for Object Detection at Test Time

Link: https://arxiv.org/abs/2209.11559
Authors: Edward Ayers, Jonathan Sadeghi, John Redford, Romain Mueller, Puneet K. Dokania
Comments:

Abstract: There is a longstanding interest in capturing the error behaviour of object detectors by finding images where their performance is likely to be unsatisfactory. In real-world applications such as autonomous driving, it is also crucial to characterise potential failures beyond simple requirements of detection performance. For example, a missed detection of a pedestrian close to an ego vehicle will generally require closer inspection than a missed detection of a car in the distance. The problem of predicting such potential failures at test time has largely been overlooked in the literature and conventional approaches based on detection uncertainty fall short in that they are agnostic to such fine-grained characterisation of errors. In this work, we propose to reformulate the problem of finding “hard” images as a query-based hard image retrieval task, where queries are specific definitions of “hardness”, and offer a simple and intuitive method that can solve this task for a large family of queries. Our method is entirely post-hoc, does not require ground-truth annotations, is independent of the choice of a detector, and relies on an efficient Monte Carlo estimation that uses a simple stochastic model in place of the ground-truth. We show experimentally that it can be applied successfully to a wide variety of queries for which it can reliably identify hard images for a given detector without any labelled data. We provide results on ranking and classification tasks using the widely used RetinaNet, Faster-RCNN, Mask-RCNN, and Cascade Mask-RCNN object detectors.

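Summary: There is longstanding interest in capturing the error behaviour of object detectors by finding images on which they are likely to perform poorly. In applications such as autonomous driving, failures must also be characterised beyond raw detection performance: missing a pedestrian close to the ego vehicle matters more than missing a distant car. Predicting such failures at test time has been largely overlooked, and uncertainty-based approaches are agnostic to this fine-grained characterisation. The paper reformulates finding "hard" images as a query-based retrieval task in which each query is a specific definition of hardness, and offers a simple method that handles a large family of queries. The method is entirely post-hoc, needs no ground-truth annotations, is independent of the detector, and relies on an efficient Monte Carlo estimate that uses a simple stochastic model in place of the ground truth. Results on ranking and classification tasks are reported for RetinaNet, Faster-RCNN, Mask-RCNN, and Cascade Mask-RCNN.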

CV-23-Title MAGIC Mask-Guided Image Synthesis by Inverting a Quasi-Robust Classifier

Link: https://arxiv.org/abs/2209.11549
Authors: Mozhdeh Rouhsedaghat, Masoud Monajatipoor, Kai-Wei Chang, C. -C. Jay Kuo, Iacopo Masi
Comments: 12 pages, 9 figures, technical report

Abstract: We offer a method for one-shot image synthesis that allows controlling manipulations of a single image by inverting a quasi-robust classifier equipped with strong regularizers. Our proposed method, entitled Magic, samples structured gradients from a pre-trained quasi-robust classifier to better preserve the input semantics while preserving its classification accuracy, thereby guaranteeing credibility in the synthesis. Unlike current methods that use complex primitives to supervise the process or use attention maps as a weak supervisory signal, Magic aggregates gradients over the input, driven by a guide binary mask that enforces a strong, spatial prior. Magic implements a series of manipulations with a single framework achieving shape and location control, intense non-rigid shape deformations, and copy/move operations in the presence of repeating objects and gives users firm control over the synthesis by requiring simply specifying binary guide masks. Our study and findings are supported by various qualitative comparisons with the state-of-the-art on the same images sampled from ImageNet and quantitative analysis using machine perception along with a user survey of 100+ participants that endorse our synthesis quality.

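Summary: The paper presents a one-shot image synthesis method that controls manipulations of a single image by inverting a quasi-robust classifier equipped with strong regularizers. The method, Magic, samples structured gradients from a pre-trained quasi-robust classifier to better preserve the input semantics while preserving its classification accuracy, which supports the credibility of the synthesis. Unlike methods that supervise the process with complex primitives or use attention maps as a weak signal, Magic aggregates gradients over the input under a binary guide mask that enforces a strong spatial prior. A single framework covers shape and location control, intense non-rigid shape deformations, and copy/move operations in the presence of repeating objects, and the user only specifies binary guide masks. The findings are supported by qualitative comparisons with the state of the art on ImageNet images, quantitative analysis using machine perception, and a user survey with more than 100 participants.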

CV-24-Title Statistical shape representations for temporal registration of plant components in 3D

Link: https://arxiv.org/abs/2209.11526
Authors: Karoline Heiwolt, Cengiz Öztireli, Grzegorz Cielniak
Comments: 6 pages plus references, 7 figures, Submitted to ICRA 2023

Abstract: Plants are dynamic organisms. Understanding temporal variations in vegetation is an essential problem for all robots in the wild. However, associating repeated 3D scans of plants across time is challenging. A key step in this process is re-identifying and tracking the same individual plant components over time. Previously, this has been achieved by comparing their global spatial or topological location. In this work, we demonstrate how using shape features improves temporal organ matching. We present a landmark-free shape compression algorithm, which allows for the extraction of 3D shape features of leaves, characterises leaf shape and curvature efficiently in few parameters, and makes the association of individual leaves in feature space possible. The approach combines 3D contour extraction and further compression using Principal Component Analysis (PCA) to produce a shape space encoding, which is entirely learned from data and retains information about edge contours and 3D curvature. Our evaluation on temporal scan sequences of tomato plants shows that incorporating shape features improves temporal leaf-matching. A combination of shape, location, and rotation information proves most informative for recognition of leaves over time and yields a true positive rate of 75%, a 15% improvement on state-of-the-art methods. This is essential for robotic crop monitoring, which enables whole-of-lifecycle phenotyping.

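Summary: Plants are dynamic organisms, and understanding temporal variation in vegetation is essential for robots working in the wild, yet associating repeated 3D scans of plants over time is hard. A key step is re-identifying and tracking the same plant components, which was previously done by comparing global spatial or topological location. The paper shows that shape features improve temporal organ matching. It presents a landmark-free shape compression algorithm that extracts 3D leaf shape features, describes leaf shape and curvature with few parameters, and allows leaves to be associated in feature space. The approach combines 3D contour extraction with Principal Component Analysis (PCA) to produce a shape space encoding learned entirely from data. On temporal scans of tomato plants, combining shape, location, and rotation information gives a true positive rate of 75%, a 15% improvement on state-of-the-art methods, which matters for robotic crop monitoring and whole-of-lifecycle phenotyping.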

CV-25-Title WS-3D-Lane Weakly Supervised 3D Lane Detection With 2D Lane Labels

Link: https://arxiv.org/abs/2209.11523
Authors: Jianyong Ai, Wenbo Ding, Jiuhua Zhao, Jiachen Zhong
Comments: 7 pages, 8 figures

Abstract: Compared to 2D lanes, real 3D lane data is difficult to collect accurately. In this paper, we propose a novel method for training 3D lanes with only 2D lane labels, called weakly supervised 3D lane detection WS-3D-Lane. By assumptions of constant lane width and equal height on adjacent lanes, we indirectly supervise 3D lane heights in the training. To overcome the problem of the dynamic change of the camera pitch during data collection, a camera pitch self-calibration method is proposed. In anchor representation, we propose a double-layer anchor with an improved non-maximum suppression (NMS) method, which enables the anchor-based method to predict two lane lines that are close. Experiments are conducted on the basis of 3D-LaneNet under two supervision methods. Under weakly supervised setting, our WS-3D-Lane outperforms previous 3D-LaneNet: F-score rises to 92.3% on Apollo 3D synthetic dataset, and F1 rises to 74.5% on ONCE-3DLanes. Meanwhile, WS-3D-Lane in purely supervised setting makes more increments and outperforms state-of-the-art. To the best of our knowledge, WS-3D-Lane is the first attempt at 3D lane detection under weakly supervised setting.

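Summary: Real 3D lane data is harder to collect accurately than 2D lane data. The paper proposes WS-3D-Lane, a weakly supervised method that trains 3D lane detection using only 2D lane labels. Assuming constant lane width and equal height on adjacent lanes, it supervises 3D lane heights indirectly. A camera pitch self-calibration method handles pitch changes during data collection, and a double-layer anchor with an improved non-maximum suppression (NMS) method lets the anchor-based detector predict two lane lines that are close together. Built on 3D-LaneNet, WS-3D-Lane under weak supervision raises the F-score to 92.3% on the Apollo 3D synthetic dataset and F1 to 74.5% on ONCE-3DLanes. In the purely supervised setting it gains further and outperforms the state of the art. The authors describe it as the first attempt at weakly supervised 3D lane detection.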

CV-26-Title Vector Quantized Semantic Communication System

Link: https://arxiv.org/abs/2209.11519
Authors: Qifan Fu, Huiqiang Xie, Zhijin Qin, Gregory Slabaugh, Xiaoming Tao
Comments:

Abstract: Although analog semantic communication systems have received considerable attention in the literature, there is less work on digital semantic communication systems. In this paper, we develop a deep learning (DL)-enabled vector quantized (VQ) semantic communication system for image transmission, named VQ-DeepSC. Specifically, we propose a convolutional neural network (CNN)-based transceiver to extract multi-scale semantic features of images and introduce multi-scale semantic embedding spaces to perform semantic feature quantization, rendering the data compatible with digital communication systems. Furthermore, we employ adversarial training to improve the quality of received images by introducing a PatchGAN discriminator. Experimental results demonstrate that the proposed VQ-DeepSC outperforms traditional image transmission methods in terms of SSIM.

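Summary: Analog semantic communication systems have received considerable attention, but digital ones much less. The paper develops VQ-DeepSC, a deep learning (DL)-enabled vector quantized (VQ) semantic communication system for image transmission. A convolutional neural network (CNN)-based transceiver extracts multi-scale semantic features of images, and multi-scale semantic embedding spaces quantize those features so that the data is compatible with digital communication systems. Adversarial training with a PatchGAN discriminator improves the quality of received images. Experiments indicate that VQ-DeepSC outperforms traditional image transmission methods in terms of SSIM.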
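The quantization step that makes features compatible with a digital channel can be sketched as a nearest-neighbour lookup in a codebook: only the integer index needs to be transmitted, and the receiver recovers the codeword from its own copy of the codebook. The codebook values and the function name below are hypothetical. VQ-DeepSC's learned multi-scale embedding spaces are not reproduced here.

```python
def quantize(vector, codebook):
    """Return (index, codeword) of the codebook entry nearest to `vector` in squared L2 distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: dist2(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical 2-D codebook shared by transmitter and receiver.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
sent_index, _ = quantize([0.9, 1.2], codebook)   # transmitter side
received = codebook[sent_index]                  # receiver side lookup
```

The design trade-off is the usual one for vector quantization: a larger codebook lowers the reconstruction error but costs more bits per transmitted index.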

CV-27-Title Marine Video Kit A New Marine Video Dataset for Content-based Analysis and Retrieval

Link: https://arxiv.org/abs/2209.11518
Authors: Quang-Trung Truong, Tuan-Anh Vu, Tan-Sang Ha, Lokoc Jakub, Yue Him Wong Tim, Ajay Joneja, Sai-Kit Yeung
Comments: 12 pages of content with 2 pages of reference

Abstract: Effective analysis of unusual domain specific video collections represents an important practical problem, where state-of-the-art general purpose models still face limitations. Hence, it is desirable to design benchmark datasets that challenge novel powerful models for specific domains with additional constraints. It is important to remember that domain specific data may be noisier (e.g., endoscopic or underwater videos) and often require more experienced users for effective search. In this paper, we focus on single-shot videos taken from moving cameras in underwater environments which constitute a nontrivial challenge for research purposes. The first shard of a new Marine Video Kit dataset is presented to serve for video retrieval and other computer vision challenges. In addition to basic meta-data statistics, we present several insights and reference graphs based on low-level features as well as semantic annotations of selected keyframes. The analysis contains also experiments showing limitations of respected general purpose models for retrieval.

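Summary: Effective analysis of unusual domain-specific video collections is an important practical problem in which state-of-the-art general-purpose models still face limitations, so benchmark datasets that challenge powerful new models on specific domains are desirable. Domain-specific data (e.g., endoscopic or underwater videos) may be noisier and often needs more experienced users for effective search. The paper focuses on single-shot videos taken from moving cameras in underwater environments, and presents the first shard of a new Marine Video Kit dataset for video retrieval and other computer vision challenges. Besides basic metadata statistics, it provides insights and reference graphs based on low-level features and on semantic annotations of selected keyframes, along with experiments that show the limitations of well-regarded general-purpose retrieval models.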

CV-28-Title Comparison of synthetic dataset generation methods for medical intervention rooms using medical clothing detection as an example

Link: https://arxiv.org/abs/2209.11493
Authors: Patrick Schülein, Hannah Teufel, Ronja Vorpahl, Indira Emter, Yannick Bukschat, Marcus Pfister, Anke Siebert, Nils Rathmann, Steffen Diehl, Marcus Vetter
Comments:

Abstract: The availability of real data from areas with high privacy requirements, such as the medical intervention space, is low and the acquisition legally complex. Therefore, this work presents a way to create a synthetic dataset for the medical context, using medical clothing as an example. The goal is to close the reality gap between the synthetic and real data. For this purpose, methods of 3D-scanned clothing and designed clothing are compared in a Domain-Randomization and Structured-Domain-Randomization scenario using an Unreal-Engine plugin or Unity. Additionally, a Mixed-Reality dataset in front of a greenscreen and a target domain dataset were used. Our experiments show that Structured-Domain-Randomization of designed clothing together with Mixed-Reality data provides a baseline achieving 72.0% mAP on a test dataset of the clinical target domain. When additionally using 15% of available target domain train data, the gap towards 100% (660 images) target domain train data could be nearly closed: 80.05% mAP, compared with 81.95% mAP. Finally, we show that when additionally using 100% target domain train data the accuracy could be increased to 83.35% mAP.

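Summary: Real data from areas with high privacy requirements, such as medical intervention rooms, is scarce and legally complex to acquire. The paper presents a way to create a synthetic dataset for the medical context, using medical clothing detection as the example, with the goal of closing the reality gap between synthetic and real data. It compares 3D-scanned and designed clothing under Domain-Randomization and Structured-Domain-Randomization, using an Unreal-Engine plugin or Unity, and also uses a Mixed-Reality dataset recorded in front of a greenscreen and a target domain dataset. Structured-Domain-Randomization of designed clothing plus Mixed-Reality data reaches 72.0% mAP on a clinical target domain test set. Adding 15% of the available target domain training data gives 80.05% mAP, close to the 81.95% mAP reported for 100% (660 images) target domain training data, and adding 100% of the target domain training data raises accuracy to 83.35% mAP.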

CV-29-Title Grouped Adaptive Loss Weighting for Person Search

Link: https://arxiv.org/abs/2209.11492
Authors: Yanling Tian, Di Chen, Yunan Liu, Shanshan Zhang, Jian Yang
Comments: Accepted by ACM MM

Abstract: Person search is an integrated task of multiple sub-tasks such as foreground/background classification, bounding box regression and person re-identification. Therefore, person search is a typical multi-task learning problem, especially when solved in an end-to-end manner. Recently, some works enhance person search features by exploiting various auxiliary information, e.g. person joint keypoints, body part position, attributes, etc., which brings in more tasks and further complexifies a person search model. The inconsistent convergence rate of each task could potentially harm the model optimization. A straightforward solution is to manually assign different weights to different tasks, compensating for the diverse convergence rates. However, given the special case of person search, i.e. with a large number of tasks, it is impractical to weight the tasks manually. To this end, we propose a Grouped Adaptive Loss Weighting (GALW) method which adjusts the weight of each task automatically and dynamically. Specifically, we group tasks according to their convergence rates. Tasks within the same group share the same learnable weight, which is dynamically assigned by considering the loss uncertainty. Experimental results on two typical benchmarks, CUHK-SYSU and PRW, demonstrate the effectiveness of our method.

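Summary: Person search combines several sub-tasks, such as foreground/background classification, bounding box regression, and person re-identification, so it is a typical multi-task learning problem, especially when solved end to end. Recent work adds auxiliary information (joint keypoints, body part positions, attributes), which introduces more tasks and makes the model more complex. Inconsistent convergence rates across tasks can harm optimization, and with this many tasks manual weighting is impractical. The paper proposes Grouped Adaptive Loss Weighting (GALW), which adjusts each task's weight automatically and dynamically: tasks are grouped by convergence rate, and tasks in a group share one learnable weight assigned by considering the loss uncertainty. Experiments on CUHK-SYSU and PRW demonstrate the method's effectiveness.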
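The grouping idea can be sketched with a widely used uncertainty-weighting form in which each weight is `exp(-s)` for a learnable log-variance `s`. Here one `s` is shared by every task in a group. This particular formula, the task names, and the grouping are assumptions for illustration, since the abstract does not give GALW's exact loss.

```python
import math

def grouped_weighted_loss(task_losses, groups, log_vars):
    """Combine task losses with one shared uncertainty weight per group.

    task_losses: dict mapping task name to its loss value
    groups:      dict mapping group name to a list of task names
    log_vars:    dict mapping group name to the learnable log-variance s
    """
    total = 0.0
    for name, tasks in groups.items():
        s = log_vars[name]
        group_loss = sum(task_losses[t] for t in tasks)
        # exp(-s) down-weights uncertain groups; the +s term keeps s from growing without bound.
        total += math.exp(-s) * group_loss + s
    return total

# Hypothetical losses and grouping by convergence rate.
losses = {"cls": 1.0, "box": 2.0, "reid": 4.0}
groups = {"detection": ["cls", "box"], "reid": ["reid"]}
total = grouped_weighted_loss(losses, groups, {"detection": 0.0, "reid": 0.0})
```

In a real training loop the `log_vars` values would be optimized together with the network parameters, so the number of extra parameters scales with the number of groups rather than the number of tasks.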

CV-30-Title GIDP Learning a Good Initialization and Inducing Descriptor Post-enhancing for Large-scale Place Recognition

Link: https://arxiv.org/abs/2209.11488
Authors: Zhaoxin Fan, Zhenbo Song, Hongyan Liu, Jun He
Comments: 7 pages

Abstract: Large-scale place recognition is a fundamental but challenging task, which plays an increasingly important role in autonomous driving and robotics. Existing methods have achieved acceptably good performance; however, most of them concentrate on designing elaborate global descriptor learning network structures. The importance of feature generalization and descriptor post-enhancing has long been neglected. In this work, we propose a novel method named GIDP to learn a Good Initialization and Inducing Descriptor Post-enhancing for Large-scale Place Recognition. In particular, an unsupervised momentum contrast point cloud pretraining module and a reranking-based descriptor post-enhancing module are proposed respectively in GIDP. The former aims at learning a good initialization for the point cloud encoding network before training the place recognition model, while the latter aims at post-enhancing the predicted global descriptor through reranking at inference time. Extensive experiments on both indoor and outdoor datasets demonstrate that our method can achieve state-of-the-art performance using simple and general point cloud encoding backbones.

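Summary: Large-scale place recognition matters increasingly for autonomous driving and robotics, yet most existing methods focus on elaborate global-descriptor networks and neglect feature generalization and descriptor post-enhancing. GIDP adds an unsupervised momentum-contrast point cloud pretraining module, which gives the encoder a good initialization, and a reranking-based module that post-enhances the predicted global descriptor at inference time. On indoor and outdoor datasets it reaches state-of-the-art performance with simple, general point cloud backbones.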

CV-31-Title Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model

Link: https://arxiv.org/abs/2209.11477
Authors: Zhenting Qi, Ruike Zhu, Zheyu Fu, Wenhao Chai, Volodymyr Kindratenko
Comments: Accepted by ICTAI 2022

Abstract: Fight detection in videos is an emerging deep learning application with today’s prevalence of surveillance systems and streaming media. Previous work has largely relied on action recognition techniques to tackle this problem. In this paper, we propose a simple but effective method that solves the task from a new perspective: we design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator. Also, considering that collecting frame-level labels for videos is too laborious, we design a weakly supervised two-stage training scheme, where we utilize multiple-instance-learning loss calculated on video-level labels to train the score generator, and adopt the self-training technique to further improve its performance. Extensive experiments on a publicly available large-scale dataset, UBI-Fights, demonstrate the effectiveness of our method, and the performance on the dataset exceeds several previous state-of-the-art approaches. Furthermore, we collect a new dataset, VFD-2000, that specializes in video fight detection, with a larger scale and more scenarios than existing datasets. The implementation of our method and the proposed dataset will be publicly available at this https URL.

Summary: The fight detection model is designed as an action-aware feature extractor followed by an anomaly score generator. Because frame-level labels are too laborious to collect, a weakly supervised two-stage scheme trains the score generator with a multiple-instance-learning loss on video-level labels and then refines it with self-training. The method exceeds several previous state-of-the-art approaches on UBI-Fights, and the authors also collect VFD-2000, a larger and more varied video fight detection dataset.
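
To make the idea of a multiple-instance-learning loss on video-level labels concrete, here is a minimal sketch of one common form, a hinge-style ranking loss between the highest-scoring segment of a fight video and the highest-scoring segment of a normal video. The exact loss used in the paper is not given in the abstract, so the function name, the margin, and the use of the maximum score are assumptions.

```python
def mil_ranking_loss(fight_video_scores, normal_video_scores, margin=1.0):
    """Hinge loss on the top segment score of each video (bag).

    Only video-level labels are needed: each video is a bag of
    per-segment anomaly scores, and no segment is labeled individually.
    """
    top_fight = max(fight_video_scores)    # most fight-like segment in a fight video
    top_normal = max(normal_video_scores)  # most fight-like segment in a normal video
    return max(0.0, margin - top_fight + top_normal)

# Illustrative per-segment scores in [0, 1].
loss = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.3])
```

The loss reaches zero once the top segment of the fight video outscores the top segment of the normal video by at least the margin.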

CV-32-Title Unsupervised Hashing with Semantic Concept Mining

Link: https://arxiv.org/abs/2209.11475
Authors: Rong-Cheng Tu, Xian-Ling Mao, Kevin Qinghong Lin, Chengfei Cai, Weize Qin, Hongfa Wang, Wei Wei, Heyan Huang
Comments:

Abstract: Recently, to improve the unsupervised image retrieval performance, plenty of unsupervised hashing methods have been proposed by designing a semantic similarity matrix, which is based on the similarities between image features extracted by a pre-trained CNN model. However, most of these methods tend to ignore high-level abstract semantic concepts contained in images. Intuitively, concepts play an important role in calculating the similarity among images. In real-world scenarios, each image is associated with some concepts, and the similarity between two images will be larger if they share more identical concepts. Inspired by the above intuition, in this work, we propose a novel Unsupervised Hashing with Semantic Concept Mining, called UHSCM, which leverages a VLP model to construct a high-quality similarity matrix. Specifically, a set of randomly chosen concepts is first collected. Then, by employing a vision-language pretraining (VLP) model with the prompt engineering which has shown strong power in visual representation learning, the set of concepts is denoised according to the training images. Next, the proposed method UHSCM applies the VLP model with prompting again to mine the concept distribution of each image and construct a high-quality semantic similarity matrix based on the mined concept distributions. Finally, with the semantic similarity matrix as guiding information, a novel hashing loss with a modified contrastive loss based regularization item is proposed to optimize the hashing network. Extensive experiments on three benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task.

Summary: Most unsupervised hashing methods build the semantic similarity matrix from pre-trained CNN features and ignore the high-level semantic concepts in images. UHSCM collects a set of randomly chosen concepts, denoises it against the training images with a prompted vision-language pretraining (VLP) model, mines each image's concept distribution with the same model, and builds the similarity matrix from those distributions. A hashing loss with a modified contrastive regularization term then trains the hashing network, which outperforms state-of-the-art baselines on three benchmark datasets.
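
The step of turning mined concept distributions into a similarity matrix can be sketched as follows. This is a simplified illustration that assumes cosine similarity between per-image concept distributions; the abstract does not state which similarity measure UHSCM uses, and the toy distributions are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(concept_dists):
    """Pairwise similarity of images from their concept distributions."""
    return [[cosine(u, v) for v in concept_dists] for u in concept_dists]

# Toy example: images 0 and 1 share a concept, image 2 does not.
dists = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
S = similarity_matrix(dists)
```

Images that share more concepts receive a higher entry in `S`, which matches the intuition stated in the abstract.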

CV-33-Title TeST Test-time Self-Training under Distribution Shift

Link: https://arxiv.org/abs/2209.11459
Authors: Samarth Sinha, Peter Gehler, Francesco Locatello, Bernt Schiele
Comments:

Abstract: Despite their recent success, deep neural networks continue to perform poorly when they encounter distribution shifts at test time. Many recently proposed approaches try to counter this by aligning the model to the new distribution prior to inference. With no labels available, this requires unsupervised objectives to adapt the model on the observed test data. In this paper, we propose Test-Time Self-Training (TeST): a technique that takes as input a model trained on some source data and a novel data distribution at test time, and learns invariant and robust representations using a student-teacher framework. We find that models adapted using TeST significantly improve over baseline test-time adaptation algorithms. TeST achieves performance competitive with modern domain adaptation algorithms, while having access to 5-10x less data at the time of adaptation. We thoroughly evaluate a variety of baselines on two tasks, object detection and image segmentation, and find that TeST sets the new state of the art for test-time domain adaptation algorithms.

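Summary: Deep neural networks still perform poorly under distribution shift at test time, and with no labels available, adaptation must rely on unsupervised objectives. Test-Time Self-Training (TeST) takes a model trained on source data and a novel test-time distribution and learns invariant, robust representations with a student-teacher framework. TeST improves significantly over baseline test-time adaptation algorithms and is competitive with modern domain adaptation methods while using 5-10x less data at adaptation time, as evaluated on object detection and image segmentation.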

CV-34-Title Motion Guided Deep Dynamic 3D Garments

Link: https://arxiv.org/abs/2209.11449
Authors: Meng Zhang, Duygu Ceylan, Niloy J. Mitra
Comments: 11 pages

Abstract: Realistic dynamic garments on animated characters have many AR/VR applications. While authoring such dynamic garment geometry is still a challenging task, data-driven simulation provides an attractive alternative, especially if it can be controlled simply using the motion of the underlying character. In this work, we focus on motion guided dynamic 3D garments, especially for loose garments. In a data-driven setup, we first learn a generative space of plausible garment geometries. Then, we learn a mapping to this space to capture the motion dependent dynamic deformations, conditioned on the previous state of the garment as well as its relative position with respect to the underlying body. Technically, we model garment dynamics, driven using the input character motion, by predicting per-frame local displacements in a canonical state of the garment that is enriched with frame-dependent skinning weights to bring the garment to the global space. We resolve any remaining per-frame collisions by predicting residual local displacements. The resultant garment geometry is used as history to enable iterative rollout prediction. We demonstrate plausible generalization to unseen body shapes and motion inputs, and show improvements over multiple state-of-the-art alternatives.

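Summary: Data-driven simulation is an attractive alternative to authoring dynamic garment geometry, especially for loose garments driven only by the motion of the underlying character. The method first learns a generative space of plausible garment geometries, then learns a mapping into that space conditioned on the garment's previous state and its position relative to the body. It predicts per-frame local displacements in a canonical garment state, applies frame-dependent skinning weights to reach global space, resolves remaining collisions with residual displacements, and feeds the result back as history for iterative rollout. The approach generalizes plausibly to unseen body shapes and motions and improves over multiple state-of-the-art alternatives.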

CV-35-Title Rethinking Performance Gains in Image Dehazing Networks

Link: https://arxiv.org/abs/2209.11448
Authors: Yuda Song, Yang Zhou, Hui Qian, Xin Du
Comments:

Abstract: Image dehazing is an active topic in low-level vision, and many image dehazing networks have been proposed with the rapid development of deep learning. Although these networks' pipelines work fine, the key mechanism for improving image dehazing performance remains unclear. For this reason, we do not aim to propose a dehazing network with fancy modules; rather, we make minimal modifications to the popular U-Net to obtain a compact dehazing network. Specifically, we swap out the convolutional blocks in U-Net for residual blocks with a gating mechanism, fuse the feature maps of main paths and skip connections using the selective kernel, and call the resulting U-Net variant gUNet. As a result, with a significantly reduced overhead, gUNet is superior to state-of-the-art methods on multiple image dehazing datasets. Finally, we verify the contribution of these key designs to the performance gain of image dehazing networks through extensive ablation studies.

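Summary: Rather than proposing a dehazing network with fancy modules, the authors make minimal changes to U-Net: convolutional blocks become residual blocks with a gating mechanism, and the feature maps of main paths and skip connections are fused with a selective kernel. The resulting variant, gUNet, outperforms state-of-the-art methods on multiple image dehazing datasets at significantly lower overhead, and ablation studies verify which designs drive the gain.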

CV-36-Title Understanding Open-Set Recognition by Jacobian Norm of Representation

Link: https://arxiv.org/abs/2209.11436
Authors: Jaewoo Park, Hojin Park, Eunju Jeong, Andrew Beng Jin Teoh
Comments:

Abstract: In contrast to conventional closed-set recognition, open-set recognition (OSR) assumes the presence of an unknown class, which the model does not see during training. One predominant approach in OSR is metric learning, where a model is trained to separate the inter-class representations of known class data. Numerous works in OSR have reported that, even though the models are trained only with the known class data, the models become aware of the unknown and learn to separate the unknown class representations from the known class representations. This paper analyzes this emergent phenomenon by observing the Jacobian norm of representation. We theoretically show that minimizing the intra-class distances within the known set reduces the Jacobian norm of known class representations, while maximizing the inter-class distances within the known set increases the Jacobian norm of the unknown class. The closed-set metric learning thus separates the unknown from the known by forcing their Jacobian norm values to differ. We empirically validate our theoretical framework with ample evidence using standard OSR datasets. Moreover, under our theoretical framework, we explain how the standard deep learning techniques can be helpful for OSR and use the framework as a guiding principle to develop an effective OSR model.

Summary: Open-set recognition (OSR) assumes an unknown class that the model never sees during training, yet metric-learning models trained only on known classes still learn to separate unknown-class representations. The paper explains this through the Jacobian norm of the representation: minimizing intra-class distances lowers the Jacobian norm for known classes, while maximizing inter-class distances raises it for the unknown class. The theory is validated on standard OSR datasets and used as a guiding principle for building an effective OSR model.
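
The quantity at the center of this analysis, the Jacobian norm of a representation, can be estimated numerically. The sketch below assumes the Frobenius norm and a forward finite-difference approximation; the toy linear map stands in for a learned encoder and is purely illustrative. The paper may define or compute the norm differently.

```python
import math

def jacobian_frobenius_norm(f, x, eps=1e-6):
    """Approximate ||J_f(x)||_F with forward finite differences.

    f maps a list of floats to a list of floats.
    """
    base = f(x)
    squared_sum = 0.0
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps          # perturb one input coordinate
        out = f(shifted)
        for a, b in zip(out, base):
            d = (a - b) / eps      # one Jacobian entry, d f_j / d x_i
            squared_sum += d * d
    return math.sqrt(squared_sum)

# Toy "representation": a linear map with Jacobian diag(2, 3).
def toy_encoder(x):
    return [2.0 * x[0], 3.0 * x[1]]

norm = jacobian_frobenius_norm(toy_encoder, [1.0, 1.0])
```

For a real network, automatic differentiation would replace the finite differences; the sketch only shows what is being measured.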

CV-37-Title Towards Frame Rate Agnostic Multi-Object Tracking

Link: https://arxiv.org/abs/2209.11404
Authors: Weitao Feng, Lei Bai, Yongqiang Yao, Fengwei Yu, Wanli Ouyang
Comments: 21 pages; Author version

Abstract: Multi-Object Tracking (MOT) is one of the most fundamental computer vision tasks which contributes to a variety of video analysis applications. Despite the recent promising progress, current MOT research is still limited to a fixed sampling frame rate of the input stream. In fact, we empirically find that the accuracy of all recent state-of-the-art trackers drops dramatically when the input frame rate changes. For a more intelligent tracking solution, we shift the attention of our research work to the problem of Frame Rate Agnostic MOT (FraMOT). In this paper, we propose a Frame Rate Agnostic MOT framework with Periodic training Scheme (FAPS) to tackle the FraMOT problem for the first time. Specifically, we propose a Frame Rate Agnostic Association Module (FAAM) that infers and encodes the frame rate information to aid identity matching across multi-frame-rate inputs, improving the capability of the learned model in handling complex motion-appearance relations in FraMOT. Besides, the association gap between training and inference is enlarged in FraMOT because those post-processing steps not included in training make a larger difference in lower frame rate scenarios. To address it, we propose Periodic Training Scheme (PTS) to reflect all post-processing steps in training via tracking pattern matching and fusion. Along with the proposed approaches, we make the first attempt to establish an evaluation method for this new task of FraMOT in two different modes, i.e., known frame rate and unknown frame rate, aiming to handle a more complex situation. The quantitative experiments on the challenging MOT datasets (FraMOT version) have clearly demonstrated that the proposed approaches can handle different frame rates better and thus improve the robustness against complicated scenarios.

Summary: Current multi-object tracking (MOT) research assumes a fixed input frame rate, and the accuracy of recent state-of-the-art trackers drops sharply when the frame rate changes. The authors define Frame Rate Agnostic MOT (FraMOT) and propose FAPS, which combines a Frame Rate Agnostic Association Module (FAAM) that infers and encodes frame rate information for identity matching with a Periodic Training Scheme (PTS) that reflects all post-processing steps in training through tracking pattern matching and fusion. They also set up an evaluation method with known and unknown frame rate modes, and experiments on the FraMOT versions of challenging MOT datasets show better handling of different frame rates.

CV-38-Title LGDN Language-Guided Denoising Network for Video-Language Modeling

Link: https://arxiv.org/abs/2209.11388
Authors: Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu
Comments: Accepted by NeurIPS 2022

Abstract: Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and text description are semantically correlated, and focus on video-language modeling at video level. However, this hypothesis often fails for two reasons: (1) With the rich semantics of video contents, it is difficult to cover all frames with a single video-level description; (2) A raw video typically has noisy/meaningless information (e.g., scenery shot, transition or teaser). Although a number of recent works deploy attention mechanism to alleviate this problem, the irrelevant/noisy information still makes it very difficult to address. To overcome such challenge, we thus propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling. Different from most existing methods that utilize all extracted video frames, LGDN dynamically filters out the misaligned or redundant frames under the language supervision and obtains only 2–4 salient frames per video for cross-modal token-level alignment. Extensive experiments on five public datasets show that our LGDN outperforms the state-of-the-arts by large margins. We also provide detailed ablation study to reveal the critical importance of solving the noise issue, in hope of inspiring future video-language work.

Summary: Video-language methods usually assume that frames and the text description are semantically correlated, but a single video-level description rarely covers every frame, and raw videos contain noisy or meaningless content such as scenery shots, transitions, or teasers. The Language-Guided Denoising Network (LGDN) filters out misaligned or redundant frames under language supervision and keeps only 2 to 4 salient frames per video for cross-modal token-level alignment. It outperforms the state of the art by large margins on five public datasets.
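
The frame-filtering idea can be illustrated with a minimal top-k selection. The sketch assumes each frame already has a scalar relevance score with respect to the text (for example, a frame-text similarity); how LGDN computes and learns these scores is not described in the abstract, so the scores and the function name are hypothetical.

```python
def select_salient_frames(frame_scores, k=3):
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:k])

# Illustrative relevance scores for a five-frame clip.
scores = [0.1, 0.8, 0.3, 0.9, 0.05]
kept = select_salient_frames(scores, k=2)
```

Returning the indices in temporal order keeps the surviving frames usable as a short sequence for the downstream alignment step.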

CV-39-Title Tensor-Based Multi-Modality Feature Selection and Regression for Alzheimer's Disease Diagnosis

Link: https://arxiv.org/abs/2209.11372
Authors: Jun Yu, Zhaoming Kong, Liang Zhan, Li Shen, Lifang He
Comments:

Abstract: The assessment of Alzheimer's Disease (AD) and Mild Cognitive Impairment (MCI) associated with brain changes remains a challenging task. Recent studies have demonstrated that combining multi-modality imaging techniques can better reflect pathological characteristics and contribute to more accurate diagnosis of AD and MCI. In this paper, we propose a novel tensor-based multi-modality feature selection and regression method for diagnosis and biomarker identification of AD and MCI from normal controls. Specifically, we leverage the tensor structure to exploit high-level correlation information inherent in the multi-modality data, and investigate tensor-level sparsity in the multilinear regression model. We present the practical advantages of our method for the analysis of ADNI data using three imaging modalities (VBM-MRI, FDG-PET and AV45-PET) with clinical parameters of disease severity and cognitive scores. The experimental results demonstrate the superior performance of our proposed method against the state-of-the-art for the disease diagnosis and the identification of disease-specific regions and modality-related differences. The code for this work is publicly available at this https URL.

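Summary: The paper proposes a tensor-based multi-modality feature selection and regression method for diagnosing Alzheimer's Disease (AD) and Mild Cognitive Impairment (MCI) against normal controls and for identifying biomarkers. The tensor structure exploits high-level correlations in the multi-modality data, and tensor-level sparsity is investigated in the multilinear regression model. On ADNI data with three imaging modalities (VBM-MRI, FDG-PET, and AV45-PET) plus clinical disease-severity and cognitive scores, the method outperforms the state of the art in diagnosis and in identifying disease-specific regions and modality-related differences.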

CV-40-Title CUTS A Fully Unsupervised Framework for Medical Image Segmentation

Link: https://arxiv.org/abs/2209.11359
Authors: Matthew Amodio, Feng Gao, Arman Avesta, Sanjay Aneja, Lucian V. Del Priore, Jay Wang, Smita Krishnaswamy
Comments:

Abstract: In this work we introduce CUTS (Contrastive and Unsupervised Training for Segmentation) the first fully unsupervised deep learning framework for medical image segmentation, facilitating the use of the vast majority of imaging data that is not labeled or annotated. Segmenting medical images into regions of interest is a critical task for facilitating both patient diagnoses and quantitative research. A major limiting factor in this segmentation is the lack of labeled data, as getting expert annotations for each new set of imaging data or task can be expensive, labor intensive, and inconsistent across annotators: thus, we utilize self-supervision based on pixel-centered patches from the images themselves. Our unsupervised approach is based on a training objective with both contrastive learning and autoencoding aspects. Previous contrastive learning approaches for medical image segmentation have focused on image-level contrastive training, rather than our intra-image patch-level approach or have used this as a pre-training task where the network needed further supervised training afterwards. By contrast, we build the first entirely unsupervised framework that operates at the pixel-centered-patch level. Specifically, we add novel augmentations, a patch reconstruction loss, and introduce a new pixel clustering and identification framework. Our model achieves improved results on several key medical imaging tasks, as verified by held-out expert annotations on the task of segmenting geographic atrophy (GA) regions of images of the retina.

Summary: CUTS (Contrastive and Unsupervised Training for Segmentation) is presented as the first fully unsupervised deep learning framework for medical image segmentation, motivated by the cost and inconsistency of expert annotation. It applies self-supervision to pixel-centered patches with a training objective that combines contrastive learning and autoencoding, adding novel augmentations, a patch reconstruction loss, and a new pixel clustering and identification framework. Results are verified with held-out expert annotations on segmenting geographic atrophy (GA) regions in retinal images.

CV-41-Title NasHD Efficient ViT Architecture Performance Ranking using Hyperdimensional Computing

Link: https://arxiv.org/abs/2209.11356
Authors: Dongning Ma, Pengfei Zhao, Xun Jiao
Comments:

Abstract: Neural Architecture Search (NAS) is an automated architecture engineering method for deep learning design automation, which serves as an alternative to the manual and error-prone process of model development, selection, evaluation and performance estimation. However, one major obstacle of NAS is its extremely demanding computation resource requirements and time-consuming iterations, particularly when the dataset scales. In this paper, targeting the emerging vision transformer (ViT), we present NasHD, a hyperdimensional computing based supervised learning model to rank the performance given the architectures and configurations. Different from other learning based methods, NasHD is faster thanks to the highly parallel processing of the HDC architecture. We also evaluate two HDC encoding schemes for NasHD, Gram-based and Record-based, on their performance and efficiency. On the VIMER-UFO benchmark dataset of 8 applications from a diverse range of domains, NasHD Record can rank the performance of nearly 100K vision transformer models in about 1 minute while still achieving results comparable with sophisticated models.

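Summary: Neural Architecture Search (NAS) is computationally demanding, especially as datasets scale. NasHD is a supervised learning model based on hyperdimensional computing (HDC) that ranks the performance of vision transformer (ViT) architectures and configurations, and it gains speed from the highly parallel processing of HDC. Two HDC encoding schemes, Gram-based and Record-based, are compared; on the VIMER-UFO benchmark of 8 applications, NasHD Record ranks nearly 100K ViT models in about 1 minute with results comparable to sophisticated models.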

CV-42-Title Learning Interpretable Dynamics from Images of a Freely Rotating 3D Rigid Body

Link: https://arxiv.org/abs/2209.11355
Authors: Justice Mason, Christine Allen-Blanchette, Nicholas Zolman, Elizabeth Davison, Naomi Leonard
Comments: 8 pages, 7 figures

Abstract: In many real-world settings, image observations of freely rotating 3D rigid bodies, such as satellites, may be available when low-dimensional measurements are not. However, the high-dimensionality of image data precludes the use of classical estimation techniques to learn the dynamics, and a lack of interpretability reduces the usefulness of standard deep learning methods. In this work, we present a physics-informed neural network model to estimate and predict 3D rotational dynamics from image sequences. We achieve this using a multi-stage prediction pipeline that maps individual images to a latent representation homeomorphic to $\mathbf{SO}(3)$, computes angular velocities from latent pairs, and predicts future latent states using the Hamiltonian equations of motion with a learned representation of the Hamiltonian. We demonstrate the efficacy of our approach on a new rotating rigid-body dataset with sequences of rotating cubes and rectangular prisms with uniform and non-uniform density.

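Summary: Images of freely rotating 3D rigid bodies, such as satellites, may be available when low-dimensional measurements are not, but their high dimensionality rules out classical estimation and standard deep learning lacks interpretability. The proposed physics-informed network maps each image to a latent representation homeomorphic to SO(3), computes angular velocities from latent pairs, and predicts future latent states with the Hamiltonian equations of motion and a learned Hamiltonian. It is evaluated on a new dataset of rotating cubes and rectangular prisms with uniform and non-uniform density.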
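
For context on the "Hamiltonian equations of motion" mentioned above, the textbook form for a torque-free rigid body is sketched below. The symbols are assumptions introduced here, not notation from the paper: $R \in \mathbf{SO}(3)$ is the attitude, $J$ the body-frame inertia matrix, $\Pi = J\omega$ the body angular momentum, and $\hat{\omega}$ the skew-symmetric matrix of $\omega$. The paper learns its Hamiltonian from data, so its exact formulation may differ.

```latex
% Kinetic-energy Hamiltonian of a free rigid body (body frame)
H(\Pi) = \tfrac{1}{2}\, \Pi^{\top} J^{-1} \Pi,
\qquad
\omega = \frac{\partial H}{\partial \Pi} = J^{-1} \Pi

% Equations of motion: Euler's equations and the attitude kinematics
\dot{\Pi} = \Pi \times \omega,
\qquad
\dot{R} = R\, \hat{\omega}
```

Because $\omega$ is the gradient of $H$, replacing the analytic $H$ with a learned function still yields dynamics of this structure, which is what makes the latent predictions interpretable.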

CV-43-Title Oracle Analysis of Representations for Deep Open Set Detection

Link: https://arxiv.org/abs/2209.11350
Authors: Risheek Garrepalli, Alan Fern, Thomas G. Dietterich
Comments:

Abstract: The problem of detecting a novel class at run time is known as Open Set Detection and is important for various real-world applications such as medical applications and autonomous driving. Open Set Detection within the context of deep learning involves solving two problems: (i) mapping the input images into a latent representation that contains enough information to detect the outliers, and (ii) learning an anomaly scoring function that can extract this information from the latent representation to identify the anomalies. Research in deep anomaly detection methods has progressed slowly. One reason may be that most papers simultaneously introduce new representation learning techniques and new anomaly scoring approaches. The goal of this work is to improve this methodology by providing ways of separately measuring the effectiveness of the representation learning and anomaly scoring. This work makes two methodological contributions. The first is to introduce the notion of Oracle anomaly detection for quantifying the information available in a learned latent representation. The second is to introduce Oracle representation learning, which produces a representation that is guaranteed to be sufficient for accurate anomaly detection. These two techniques help researchers to separate the quality of the learned representation from the performance of the anomaly scoring mechanism so that they can debug and improve their systems. The methods also provide an upper limit on how much open category detection can be improved through better anomaly scoring mechanisms. The combination of the two oracles gives an upper limit on the performance that any open category detection method could achieve. This work introduces these two oracle techniques and demonstrates their utility by applying them to several leading open category detection methods.

Summary: Open set detection in deep learning requires both a latent representation informative enough to detect outliers and an anomaly scoring function that extracts that information, and most papers change both at once. This work measures them separately with two oracles: Oracle anomaly detection, which quantifies the information available in a learned representation, and Oracle representation learning, which yields a representation guaranteed to be sufficient for accurate anomaly detection. Together they bound the performance any open category detection method could achieve, and they are applied to several leading methods.

CV-44-Title Swin2SR SwinV2 Transformer for Compressed Image Super-Resolution and Restoration

Link: https://arxiv.org/abs/2209.11345
Authors: Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte
Comments: European Conference on Computer Vision (ECCV 2022) Workshops

Abstract: Compression plays an important role in the efficient transmission and storage of images and videos through band-limited systems such as streaming services, virtual reality or videogames. However, compression unavoidably leads to artifacts and the loss of the original information, which may severely degrade the visual quality. For these reasons, quality enhancement of compressed images has become a popular research topic. While most state-of-the-art image restoration methods are based on convolutional neural networks, transformer-based methods such as SwinIR show impressive performance on these tasks. In this paper, we explore the novel Swin Transformer V2 to improve SwinIR for image super-resolution and, in particular, the compressed input scenario. Using this method we can tackle the major issues in training transformer vision models, such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger. We conduct experiments on three representative tasks: JPEG compression artifact removal, image super-resolution (classical and lightweight), and compressed image super-resolution. Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR, and is a top-5 solution at the “AIM 2022 Challenge on Super-Resolution of Compressed Image and Video”.

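Summary: Compression introduces artifacts and information loss that can severely degrade visual quality. Swin2SR uses Swin Transformer V2 to improve SwinIR for image super-resolution, particularly with compressed inputs, and addresses training instability, the resolution gap between pre-training and fine-tuning, and data hunger. It is tested on JPEG compression artifact removal, classical and lightweight image super-resolution, and compressed image super-resolution, where it improves SwinIR's training convergence and performance and ranks as a top-5 solution at the "AIM 2022 Challenge on Super-Resolution of Compressed Image and Video".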

CV-45-Title Fast Disparity Estimation from a Single Compressed Light Field Measurement

Link: https://arxiv.org/abs/2209.11342
Authors: Emmanuel Martinez, Edwin Vargas, Henry Arguello
Comments:

Abstract: The abundant spatial and angular information from light fields has allowed the development of multiple disparity estimation approaches. However, the acquisition of light fields requires high storage and processing cost, limiting the use of this technology in practical applications. To overcome these drawbacks, the compressive sensing (CS) theory has allowed the development of optical architectures to acquire a single coded light field measurement. This measurement is decoded using an optimization algorithm or deep neural network that requires high computational costs. The traditional approach for disparity estimation from compressed light fields requires first recovering the entire light field and then a post-processing step, thus requiring long times. In contrast, this work proposes a fast disparity estimation from a single compressed measurement by omitting the recovery step required in traditional approaches. Specifically, we propose to jointly optimize an optical architecture for acquiring a single coded light field snapshot and a convolutional neural network (CNN) for estimating the disparity maps. Experimentally, the proposed method estimates disparity maps comparable with those obtained from light fields reconstructed using deep learning approaches. Furthermore, the proposed method is 20 times faster in training and inference than the best method that estimates the disparity from reconstructed light fields.

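Summary: Estimating disparity from a compressed light field traditionally means recovering the whole light field first and then post-processing it, which is slow. This work skips the recovery step by jointly optimizing an optical architecture that acquires a single coded light field snapshot and a convolutional neural network (CNN) that estimates the disparity maps. The resulting maps are comparable to those from light fields reconstructed with deep learning, and the method is 20 times faster in training and inference than the best reconstruction-based method.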

CV-46-Title A domain adaptive deep learning solution for scanpath prediction of paintings

Link: https://arxiv.org/abs/2209.11338
Authors: Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Alessandro Bruno
Comments: Accepted at CBMI 2022, Graz, Austria

Abstract: Cultural heritage understanding and preservation is an important issue for society as it represents a fundamental aspect of its identity. Paintings represent a significant part of cultural heritage and are continuously the subject of study. However, the way viewers perceive paintings is strictly related to the so-called HVS (Human Vision System) behaviour. This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings. In more detail, we introduce a new approach to predicting human visual attention, which impacts several cognitive functions for humans, including the fundamental understanding of a scene, and then extend it to painting images. The proposed new architecture ingests images and returns scanpaths, sequences of points featuring a high likelihood of catching viewers’ attention. We use an FCNN (Fully Convolutional Neural Network), in which we exploit differentiable channel-wise selection and Soft-Argmax modules. We also incorporate learnable Gaussian distributions onto the network bottleneck to simulate visual attention process bias in natural scene images. Furthermore, to reduce the effect of shifts between different domains (i.e. natural images, paintings), we urge the model to learn unsupervised general features from other domains using a gradient reversal classifier. The results obtained by our model outperform existing state-of-the-art ones in terms of accuracy and efficiency.

Summary: The paper studies viewers' eye movements while looking at paintings. A fully convolutional neural network (FCNN) with differentiable channel-wise selection and Soft-Argmax modules takes an image and returns a scanpath, a sequence of points likely to catch the viewer's attention; learnable Gaussian distributions at the network bottleneck simulate visual attention bias in natural scene images. A gradient reversal classifier pushes the model toward general features that transfer from natural images to paintings, and the results outperform existing state-of-the-art methods in accuracy and efficiency.
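
The Soft-Argmax module named in the abstract replaces the non-differentiable argmax over a heatmap with an expected coordinate under a softmax distribution. A minimal sketch follows, assuming a 2D heatmap given as nested lists and a temperature parameter `beta`; both are illustrative choices and not details from the paper.

```python
import math

def soft_argmax_2d(heatmap, beta=10.0):
    """Return the expected (x, y) position under softmax(beta * heatmap)."""
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    y = sum(r * sum(row) for r, row in enumerate(weights)) / total
    x = sum(c * w for row in weights for c, w in enumerate(row)) / total
    return x, y

# A heatmap with a single strong peak at column 2, row 1.
heatmap = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 5.0],
           [0.0, 0.0, 0.0]]
x, y = soft_argmax_2d(heatmap)
```

A larger `beta` pushes the result toward the hard argmax, while a smaller one averages over more of the heatmap; the output stays differentiable with respect to the heatmap values either way.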

CV-47-Title UNav An Infrastructure-Independent Vision-Based Navigation System for People with Blindness and Low vision

Link: https://arxiv.org/abs/2209.11336
Authors: Anbang Yang, Mahya Beheshti, Todd E Hudson, Rajesh Vedanthan, Wachara Riewpaiboon, Pattanasak Mongkolwat, Chen Feng, John-Ross Rizzo
Comments:

Abstract: Vision-based localization approaches now underpin newly emerging navigation pipelines for myriad use cases from robotics to assistive technologies. Compared to sensor-based solutions, vision-based localization does not require pre-installed sensor infrastructure, which is costly, time-consuming, and/or often infeasible at scale. Herein, we propose a novel vision-based localization pipeline for a specific use case: navigation support for end-users with blindness and low vision. Given a query image taken by an end-user on a mobile application, the pipeline leverages a visual place recognition (VPR) algorithm to find similar images in a reference image database of the target space. The geolocations of these similar images are utilized in downstream tasks that employ a weighted-average method to estimate the end-user’s location and a perspective-n-point (PnP) algorithm to estimate the end-user’s direction. Additionally, this system implements Dijkstra’s algorithm to calculate a shortest path based on a navigable map that includes trip origin and destination. The topometric map used for localization and navigation is built using a customized graphical user interface that projects a 3D reconstructed sparse map, built from a sequence of images, to the corresponding a priori 2D floor plan. Sequential images used for map construction can be collected in a pre-mapping step or scavenged through public databases/citizen science. The end-to-end system can be installed on any internet-accessible device with a camera that hosts a custom mobile application. For evaluation purposes, mapping and localization were tested in a complex hospital environment. The evaluation results demonstrate that our system can achieve localization with an average error of less than 1 meter without knowledge of the camera’s intrinsic parameters, such as focal length.
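The abstract states that the system uses Dijkstra's algorithm to compute a shortest path on a navigable map. The sketch below shows the textbook algorithm on a small weighted graph; the graph layout, node names, and the function `shortest_path` are hypothetical and are not taken from the UNav code.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights.

    Returns (distance, path); distance is infinity and path is empty
    when the target cannot be reached.
    """
    dist = {source: 0.0}
    prev = {}
    visited = set()
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale queue entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical navigable map: nodes are waypoints, weights are metres.
floor_plan = {
    "entrance": {"hallway": 1, "lobby": 4},
    "hallway": {"lobby": 2, "clinic": 5},
    "lobby": {"clinic": 1},
    "clinic": {},
}
print(shortest_path(floor_plan, "entrance", "clinic"))
```

In a real topometric map the edge weights would come from the floor plan geometry; here they are made-up values chosen so that the indirect route is the shortest.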


CV-48-Title Privacy-Preserving Person Detection Using Low-Resolution Infrared Cameras

Link: https://arxiv.org/abs/2209.11335
Authors: Thomas Dubail, Fidel Alejandro Guerrero Peña, Heitor Rapela Medeiros, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli
Comments:

Abstract: In intelligent building management, knowing the number of people and their location in a room is important for better control of its illumination, ventilation, and heating, with reduced costs and improved comfort. This is typically achieved by detecting people using compact embedded devices that are installed on the room’s ceiling and that integrate a low-resolution infrared camera, which conceals each person’s identity. However, for accurate detection, state-of-the-art deep learning models still require supervised training using a large annotated dataset of images. In this paper, we investigate cost-effective methods that are suitable for person detection based on low-resolution infrared images. Results indicate that for such images, we can reduce the amount of supervision and computation, while still achieving a high level of detection accuracy. Going from single-shot detectors that require bounding box annotations of each person in an image, to auto-encoders that only rely on unlabelled images that do not contain people, allows for considerable savings in terms of annotation costs, and for models with lower computational costs. We validate these experimental findings on two challenging top-view datasets with low-resolution infrared images.


CV-49-Title FuTH-Net Fusing Temporal Relations and Holistic Features for Aerial Video Classification

Link: https://arxiv.org/abs/2209.11316
Authors: Pu Jin, Lichao Mou, Yuansheng Hua, Gui-Song Xia, Xiao Xiang Zhu
Comments:

Abstract: Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition due to their low cost and fast mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies, which are important for describing complicated dynamics. In this paper, we propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by the multi-scale temporal relations in a novel fusion module for yielding more discriminative video representations. More specifically, FuTH-Net employs a two-pathway architecture: (1) a holistic representation pathway to learn a general feature of both frame appearances and short-term temporal variations and (2) a temporal relation pathway to capture multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterwards, a novel fusion module is proposed to spatiotemporally integrate the two features learned from the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results. This demonstrates its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at this https URL.


CV-50-Title Colonoscopy Landmark Detection using Vision Transformers

Link: https://arxiv.org/abs/2209.11304
Authors: Aniruddha Tamhane, Tse’ela Mida, Erez Posner, Moshe Bouhnik
Comments:

Abstract: Colonoscopy is a routine outpatient procedure used to examine the colon and rectum for any abnormalities including polyps, diverticula and narrowing of colon structures. A significant amount of the clinician’s time is spent in post-processing snapshots taken during the colonoscopy procedure, for maintaining medical records or further investigation. Automating this step can save time and improve the efficiency of the process. In our work, we have collected a dataset of 120 colonoscopy videos and 2416 snapshots taken during the procedure, that have been annotated by experts. Further, we have developed a novel, vision-transformer based landmark detection algorithm that identifies key anatomical landmarks (the appendiceal orifice, ileocecal valve/cecum landmark and rectum retroflexion) from snapshots taken during colonoscopy. Our algorithm uses an adaptive gamma correction during preprocessing to maintain a consistent brightness for all images. We then use a vision transformer as the feature extraction backbone and a fully connected network based classifier head to categorize a given frame into four classes: the three landmarks or a non-landmark frame. We compare the vision transformer (ViT-B/16) backbone with ResNet-101 and ConvNext-B backbones that have been trained similarly. We report an accuracy of 82% with the vision transformer backbone on a test dataset of snapshots.


CV-51-Title Deep Domain Adaptation for Detecting Bomb Craters in Aerial Images

Link: https://arxiv.org/abs/2209.11299
Authors: Marco Geiger, Dominik Martin, Niklas Kühl
Comments: 56th Annual Hawaii International Conference on System Sciences (HICSS-56)

Abstract: The aftermath of air raids can still be seen for decades after the devastating events. Unexploded ordnance (UXO) is an immense danger to human life and the environment. Through the assessment of wartime images, experts can infer the occurrence of a dud. The current manual analysis process is expensive and time-consuming, thus automated detection of bomb craters by using deep learning is a promising way to improve the UXO disposal process. However, these methods require a large amount of manually labeled training data. This work leverages domain adaptation with moon surface images to address the problem of automated bomb crater detection with deep learning under the constraint of limited training data. This paper contributes to both academia and practice (1) by providing a solution approach for automated bomb crater detection with limited training data and (2) by demonstrating the usability and associated challenges of using synthetic images for domain adaptation.


CV-52-Title T2FPV Constructing High-Fidelity First-Person View Datasets From Real-World Pedestrian Trajectories

Link: https://arxiv.org/abs/2209.11294
Authors: Benjamin Stoler, Meghdeep Jana, Soonmin Hwang, Jean Oh
Comments:

Abstract: Predicting pedestrian motion is essential for developing socially-aware robots that interact in a crowded environment. While the natural visual perspective for a social interaction setting is an egocentric view, the majority of existing work in trajectory prediction has been investigated purely in the top-down trajectory space. To support first-person view trajectory prediction research, we present T2FPV, a method for constructing high-fidelity first-person view datasets given a real-world, top-down trajectory dataset; we showcase our approach on the ETH/UCY pedestrian dataset to generate the egocentric visual data of all interacting pedestrians. We report that the bird’s-eye view assumption used in the original ETH/UCY dataset, i.e., an agent can observe everyone in the scene with perfect information, does not hold in the first-person views; only a fraction of agents are fully visible during each 20-timestep scene used commonly in existing work. We evaluate existing trajectory prediction approaches under varying levels of realistic perception – displacement errors suffer a 356% increase compared to the top-down, perfect information setting. To promote research in first-person view trajectory prediction, we release our T2FPV-ETH dataset and software tools.


CV-53-Title FusionVAE A Deep Hierarchical Variational Autoencoder for RGB Image Fusion

Link: https://arxiv.org/abs/2209.11277
Authors: Fabian Duffhauss, Ngo Anh Vien, Hanna Ziesche, Gerhard Neumann
Comments: Accepted at ECCV 2022

Abstract: Sensor fusion can significantly improve the performance of many computer vision tasks. However, traditional fusion approaches are either not data-driven and cannot exploit prior knowledge nor find regularities in a given dataset or they are restricted to a single application. We overcome this shortcoming by presenting a novel deep hierarchical variational autoencoder called FusionVAE that can serve as a basis for many fusion tasks. Our approach is able to generate diverse image samples that are conditioned on multiple noisy, occluded, or only partially visible input images. We derive and optimize a variational lower bound for the conditional log-likelihood of FusionVAE. In order to assess the fusion capabilities of our model thoroughly, we created three novel datasets for image fusion based on popular computer vision datasets. In our experiments, we show that FusionVAE learns a representation of aggregated information that is relevant to fusion tasks. The results demonstrate that our approach outperforms traditional methods significantly. Furthermore, we present the advantages and disadvantages of different design choices.


CV-54-Title Capsule Network based Contrastive Learning of Unsupervised Visual Representations

Link: https://arxiv.org/abs/2209.11276
Authors: Harsh Panwar, Ioannis Patras
Comments:

Abstract: Capsule Networks have shown tremendous advancement in the past decade, outperforming traditional CNNs in various tasks due to their equivariant properties. With the use of vector I/O, which provides information on both the magnitude and direction of an object or its part, there lies an enormous possibility of using Capsule Networks in unsupervised learning environments for visual representation tasks such as multi-class image classification. In this paper, we propose the Contrastive Capsule (CoCa) Model, a Siamese-style Capsule Network using contrastive loss with our novel architecture, training and testing algorithm. We evaluate the model on the unsupervised image classification CIFAR-10 dataset and achieve a top-1 test accuracy of 70.50% and a top-5 test accuracy of 98.10%. Due to our efficient architecture, our model has 31 times fewer parameters and 71 times fewer FLOPs than the current SOTA in both supervised and unsupervised learning.


CV-55-Title Optimization of FPGA-based CNN Accelerators Using Metaheuristics

Link: https://arxiv.org/abs/2209.11272
Authors: Sadiq M. Sait, Aiman El-Maleh, Mohammad Altakrouri, Ahmad Shawahna
Comments: 23 pages, 7 figures, 9 tables. In The Journal of Supercomputing, 2022

Abstract: In recent years, convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields and with accuracy that was not possible before. However, this comes with extensive computational requirements, which made general CPUs unable to deliver the desired real-time performance. At the same time, FPGAs have seen a surge in interest for accelerating CNN inference. This is due to their ability to create custom designs with different levels of parallelism. Furthermore, FPGAs provide better performance per watt compared to GPUs. The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs), each of which is tailored for a subset of layers. However, the growing complexity of CNN architectures makes optimizing the resources available on the target FPGA device to deliver optimal performance more challenging. In this paper, we present a CNN accelerator and an accompanying automated design methodology that employs metaheuristics for partitioning available FPGA resources to design a Multi-CLP accelerator. Specifically, the proposed design tool adopts simulated annealing (SA) and tabu search (TS) algorithms to find the number of CLPs required and their respective configurations to achieve optimal performance on a given target FPGA device. Here, the focus is on the key specifications and hardware resources, including digital signal processors, block RAMs, and off-chip memory bandwidth. Experimental results and comparisons using four well-known benchmark CNNs are presented demonstrating that the proposed acceleration framework is both encouraging and promising. The SA-/TS-based Multi-CLP achieves 1.31x - 2.37x higher throughput than the state-of-the-art Single-/Multi-CLP approaches in accelerating AlexNet, SqueezeNet 1.1, VGGNet, and GoogLeNet architectures on the Xilinx VC707 and VC709 FPGA boards.


CV-56-Title Recurrence-free Survival Prediction under the Guidance of Automatic Gross Tumor Volume Segmentation for Head and Neck Cancers

Link: https://arxiv.org/abs/2209.11268
Authors: Kai Wang, Yunxiang Li, Michael Dohopolski, Tao Peng, Weiguo Lu, You Zhang, Jing Wang
Comments: MICCAI 2022, HECKTOR Challenge Submission

Abstract: For Head and Neck Cancers (HNC) patient management, automatic gross tumor volume (GTV) segmentation and accurate pre-treatment cancer recurrence prediction are of great importance to assist physicians in designing personalized management plans, which have the potential to improve the treatment outcome and quality of life for HNC patients. In this paper, we developed an automated primary tumor (GTVp) and lymph nodes (GTVn) segmentation method based on combined pre-treatment positron emission tomography/computed tomography (PET/CT) scans of HNC patients. We extracted radiomics features from the segmented tumor volume and constructed a multi-modality tumor recurrence-free survival (RFS) prediction model, which fused the prediction results from separate CT radiomics, PET radiomics, and clinical models. We performed 5-fold cross-validation to train and evaluate our methods on the MICCAI 2022 HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR) dataset. The ensemble prediction results on the testing cohort achieved Dice scores of 0.77 and 0.73 for GTVp and GTVn segmentation, respectively, and a C-index value of 0.67 for RFS prediction. The code is publicly available (this https URL). Our team’s name is AIRT.
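The segmentation results above are reported as Dice scores. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions of this sketch, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # one shared voxel out of four foreground voxels
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.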


CV-57-Title 3DPCT 3D Point Cloud Transformer with Dual Self-attention

Link: https://arxiv.org/abs/2209.11255
Authors: Dening Lu, Kyle Gao, Qian Xie, Linlin Xu, Jonathan Li
Comments: 10 pages, 5 figures, 4 tables

Abstract: Transformers have resulted in remarkable achievements in the field of image processing. Inspired by this great success, the application of Transformers to 3D point cloud processing has drawn more and more attention. This paper presents a novel point cloud representational learning network, 3D Point Cloud Transformer with Dual Self-attention (3DPCT) and an encoder-decoder structure. Specifically, 3DPCT has a hierarchical encoder, which contains two local-global dual-attention modules for the classification task (three modules for the segmentation task), with each module consisting of a Local Feature Aggregation (LFA) block and a Global Feature Learning (GFL) block. The GFL block is dual self-attention, with both point-wise and channel-wise self-attention to improve feature extraction. Moreover, in LFA, to better leverage the local information extracted, a novel point-wise self-attention model, named as Point-Patch Self-Attention (PPSA), is designed. The performance is evaluated on both classification and segmentation datasets, containing both synthetic and real-world data. Extensive experiments demonstrate that the proposed method achieved state-of-the-art results on both classification and segmentation tasks.


CV-58-Title Dual-Cycle Self-Supervised Dual-View Fluorescence Microscopy Image Reconstruction using CycleGAN

Link: https://arxiv.org/abs/2209.11729
Authors: Tomas Kerepecky, Jiaming Liu, Xue Wen Ng, David W. Piston, Ulugbek S. Kamilov
Comments: 7 pages, 5 figures

Abstract: Three-dimensional fluorescence microscopy often suffers from anisotropy, where the resolution along the axial direction is lower than that within the lateral imaging plane. We address this issue by presenting Dual-Cycle, a new framework for joint deconvolution and fusion of dual-view fluorescence images. Inspired by the recent Neuroclear method, Dual-Cycle is designed as a cycle-consistent generative network trained in a self-supervised fashion by combining a dual-view generator and prior-guided degradation model. We validate Dual-Cycle on both synthetic and real data showing its state-of-the-art performance without any external training data.


CV-59-Title Deep Learning-based Anonymization of Chest Radiographs A Utility-preserving Measure for Patient Privacy

Link: https://arxiv.org/abs/2209.11531
Authors: Kai Packhäuser, Sebastian Gündel, Florian Thamm, Felix Denzinger, Andreas Maier
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract: Robust and reliable anonymization of chest radiographs constitutes an essential step before publishing large datasets of such for research purposes. The conventional anonymization process is carried out by obscuring personal information in the images with black boxes and removing or replacing meta-information. However, such simple measures retain biometric information in the chest radiographs, allowing patients to be re-identified by a linkage attack. Therefore, we see an urgent need to obfuscate the biometric information appearing in the images. To the best of our knowledge, we propose the first deep learning-based approach to targetedly anonymize chest radiographs while maintaining data utility for diagnostic and machine learning purposes. Our model architecture is a composition of three independent neural networks that, when collectively used, allow for learning a deformation field that is able to impede patient re-identification. The individual influence of each component is investigated with an ablation study. Quantitative results on the ChestX-ray14 dataset show a reduction of patient re-identification from 81.8% to 58.6% in the area under the receiver operating characteristic curve (AUC) with little impact on the abnormality classification performance. This indicates the ability to preserve underlying abnormality patterns while increasing patient privacy. Furthermore, we compare the proposed deep learning-based anonymization approach with differentially private image pixelization, and demonstrate the superiority of our method towards resolving the privacy-utility trade-off for chest radiographs.


CV-60-Title Segmentation-based Information Extraction and Amalgamation in Fundus Images for Glaucoma Detection

Link: https://arxiv.org/abs/2209.11456
Authors: Yanni Wang, Gang Yang, Dayong Ding, Jianchun Zao
Comments:

Abstract: Glaucoma is a severe blinding disease, for which automatic detection methods are urgently needed to alleviate the scarcity of ophthalmologists. Many works have proposed to employ deep learning methods that involve the segmentation of optic disc and cup for glaucoma detection, in which the segmentation process is often considered merely as an upstream sub-task. The relationship between fundus images and segmentation masks in terms of joint decision-making in glaucoma assessment is rarely explored. We propose a novel segmentation-based information extraction and amalgamation method for the task of glaucoma detection, which leverages the robustness of segmentation masks without disregarding the rich information in the original fundus images. Experimental results on both private and public datasets demonstrate that our proposed method outperforms all models that utilize solely either fundus images or masks.


CV-61-Title Modular Degradation Simulation and Restoration for Under-Display Camera

Link: https://arxiv.org/abs/2209.11455
Authors: Yang Zhou, Yuda Song, Xin Du
Comments:

Abstract: Under-display camera (UDC) provides an elegant solution for full-screen smartphones. However, UDC captured images suffer from severe degradation since sensors lie under the display. Although this issue can be tackled by image restoration networks, these networks require large-scale image pairs for training. To this end, we propose a modular network dubbed MPGNet trained using the generative adversarial network (GAN) framework for simulating UDC imaging. Specifically, we note that the UDC imaging degradation process contains brightness attenuation, blurring, and noise corruption. Thus we model each degradation with a characteristic-related modular network, and all modular networks are cascaded to form the generator. Together with a pixel-wise discriminator and supervised loss, we can train the generator to simulate the UDC imaging degradation process. Furthermore, we present a Transformer-style network named DWFormer for UDC image restoration. For practical purposes, we use depth-wise convolution instead of the multi-head self-attention to aggregate local spatial information. Moreover, we propose a novel channel attention module to aggregate global information, which is critical for brightness recovery. We conduct evaluations on the UDC benchmark, and our method surpasses the previous state-of-the-art models by 1.23 dB on the P-OLED track and 0.71 dB on the T-OLED track, respectively.


CV-62-Title Learning to screen Glaucoma like the ophthalmologists

Link: https://arxiv.org/abs/2209.11431
Authors: Junde Wu, Huihui Fang, Fei Li, Huazhu Fu, Yanwu Xu
Comments:

Abstract: GAMMA Challenge is organized to encourage the AI models to screen the glaucoma from a combination of 2D fundus image and 3D optical coherence tomography volume, like the ophthalmologists.


CV-63-Title Automated detection of Alzheimer disease using MRI images and deep neural networks- A review

Link: https://arxiv.org/abs/2209.11282
Authors: Narotam Singh, Patteshwari.D, Neha Soni, Amita Kapoor
Comments: 22 Pages, 5 Figures, 7 Tables

Abstract: Early detection of Alzheimer disease is crucial for deploying interventions and slowing the disease progression. A lot of machine learning and deep learning algorithms have been explored in the past decade with the aim of building an automated detection for Alzheimer. Advancements in data augmentation techniques and advanced deep learning architectures have opened up new frontiers in this field, and research is moving at a rapid speed. Hence, the purpose of this survey is to provide an overview of recent research on deep learning models for Alzheimer disease diagnosis. In addition to categorizing the numerous data sources, neural network architectures, and commonly used assessment measures, we also classify implementation and reproducibility. Our objective is to assist interested researchers in keeping up with the newest developments and in reproducing earlier investigations as benchmarks. In addition, we also indicate future research directions for this topic.


CV-64-Title Hierarchical Graph Convolutional Network Built by Multiscale Atlases for Brain Disorder Diagnosis Using Functional Connectivity

Link: https://arxiv.org/abs/2209.11232
Authors: Mianxin Liu, Han Zhang, Feng Shi, Dinggang Shen
Comments:

Abstract: Functional connectivity network (FCN) data from functional magnetic resonance imaging (fMRI) is increasingly used for the diagnoses of brain disorders. However, state-of-the-art studies used to build the FCN using a single brain parcellation atlas at a certain spatial scale, which largely neglected functional interactions across different spatial scales in hierarchical manners. In this study, we propose a novel framework to perform multiscale FCN analysis for brain disorder diagnosis. We first use a set of well-defined multiscale atlases to compute multiscale FCNs. Then, we utilize biologically meaningful brain hierarchical relationships among the regions in multiscale atlases to perform nodal pooling across multiple spatial scales, namely “Atlas-guided Pooling”. Accordingly, we propose a Multiscale-Atlases-based Hierarchical Graph Convolutional Network (MAHGCN), built on the stacked layers of graph convolution and the atlas-guided pooling, for a comprehensive extraction of diagnostic information from multiscale FCNs. Experiments on neuroimaging data from 1792 subjects demonstrate the effectiveness of our proposed method in the diagnoses of Alzheimer’s disease (AD), the prodromal stage of AD (i.e., mild cognitive impairment [MCI]), as well as autism spectrum disorder (ASD), with accuracy of 88.9%, 78.6%, and 72.7% respectively. All results show significant advantages of our proposed method over other competing methods. This study not only demonstrates the feasibility of brain disorder diagnosis using resting-state fMRI empowered by deep learning, but also highlights that the functional interactions in the multiscale brain hierarchy are worth being explored and integrated into deep learning network architectures for better understanding the neuropathology of brain disorders.

CV-65-Title A Trio-Method for Retinal Vessel Segmentation using Image Processing

Link: https://arxiv.org/abs/2209.11230
Authors: Mahendra Kumar Gourisaria, Vinayak Singh, Manoj Sahni
Comments: Accepted at 26th UK Conference on Medical Image Understanding and Analysis (MIUA-2022) (Abstract short paper)

Abstract: Inner Retinal neurons are a most essential part of the retina and they are supplied with blood via retinal vessels. This paper primarily focuses on the segmentation of retinal vessels using a triple preprocessing approach. DRIVE database was taken into consideration and preprocessed by Gabor Filtering, Gaussian Blur, and Edge Detection by Sobel and Pruning. Segmentation was driven out by 2 proposed U-Net architectures. Both the architectures were compared in terms of all the standard performance metrics. Preprocessing generated varied interesting results which impacted the results shown by the UNet architectures for segmentation. This real-time deployment can help in the efficient pre-processing of images with better segmentation and detection.
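
Of the three preprocessing steps the abstract names, Sobel edge detection is the simplest to sketch. The pure-Python version below computes the gradient magnitude with the standard 3x3 Sobel kernels; the toy image and the choice to leave border pixels at zero are assumptions for illustration, and the paper's actual pipeline (Gabor filtering, Gaussian blur, pruning) is not reproduced.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude for interior pixels; borders stay 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            out[r][c] = math.hypot(gx, gy)
    return out


# Toy image with a vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edges = sobel_magnitude(image)
print(edges[1])
```

The interior pixels next to the edge receive a large response, while a uniform image would produce zeros everywhere.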

Artificial Intelligence

AI-0-Title Evaluating Agent Interactions Through Episodic Knowledge Graphs

Link: https://arxiv.org/abs/2209.11746
Authors: Selene Báez Santamaría, Piek Vossen, Thomas Baier
Comments:

Abstract: We present a new method based on episodic Knowledge Graphs (eKGs) for evaluating (multimodal) conversational agents in open domains. This graph is generated by interpreting raw signals during conversation and is able to capture the accumulation of knowledge over time. We apply structural and semantic analysis of the resulting graphs and translate the properties into qualitative measures. We compare these measures with existing automatic and manual evaluation metrics commonly used for conversational agents. Our results show that our Knowledge-Graph-based evaluation provides more qualitative insights into interaction and the agent’s behavior.
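
The structural analysis mentioned in the abstract can be pictured as tracking simple graph statistics while triples accumulate over a conversation. The sketch below is a loose illustration only: the triple format, the chosen statistics (node count, edge count, average degree, density), and the example turns are assumptions, not the measures used in the paper.

```python
def graph_stats(triples):
    """Compute basic structural statistics for a set of (subject, predicate, object) triples."""
    nodes = set()
    edges = set()
    for subj, pred, obj in triples:
        nodes.update((subj, obj))
        edges.add((subj, pred, obj))
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        "avg_degree": 2 * m / n if n else 0.0,
        # Density of a directed graph without self-loops.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
    }


# Hypothetical triples extracted from two conversation turns.
turns = [
    [("user", "likes", "jazz")],
    [("user", "lives_in", "Amsterdam"), ("jazz", "is_a", "genre")],
]

accumulated = []
history = []
for turn in turns:
    accumulated.extend(turn)  # the graph grows as the conversation proceeds
    history.append(graph_stats(accumulated))

for step, stats in enumerate(history, start=1):
    print(step, stats)
```

Comparing the statistics from turn to turn gives a rough picture of how much new knowledge each turn contributed.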

AI-1-Title The "Beatrix" Resurrections: Robust Backdoor Detection via Gram Matrices

Link: https://arxiv.org/abs/2209.11715
Authors: Wanlun Ma, Derui Wang, Ruoxi Sun, Minhui Xue, Sheng Wen, Yang Xiang
Comments: 19 pages, 23 figures. Code availability: this https URL

Abstract: Deep Neural Networks (DNNs) are susceptible to backdoor attacks during training. The model corrupted in this way functions normally, but when triggered by certain patterns in the input, produces a predefined target label. Existing defenses usually rely on the assumption of the universal backdoor setting in which poisoned samples share the same uniform trigger. However, recent advanced backdoor attacks show that this assumption is no longer valid in dynamic backdoors where the triggers vary from input to input, thereby defeating the existing defenses. In this work, we propose a novel technique, Beatrix (backdoor detection via Gram matrix). Beatrix utilizes Gram matrix to capture not only the feature correlations but also the appropriately high-order information of the representations. By learning class-conditional statistics from activation patterns of normal samples, Beatrix can identify poisoned samples by capturing the anomalies in activation patterns. To further improve the performance in identifying target labels, Beatrix leverages kernel-based testing without making any prior assumptions on representation distribution. We demonstrate the effectiveness of our method through extensive evaluation and comparison with state-of-the-art defensive techniques. The experimental results show that our approach achieves an F1 score of 91.1% in detecting dynamic backdoors, while the state of the art can only reach 36.9%.
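
The abstract describes capturing feature correlations with Gram matrices and flagging inputs whose activation statistics deviate from those learned on normal samples. The following is a minimal sketch of that idea, not the authors' implementation: the median/MAD deviation score, the function names, and the toy activations are assumptions, and activations are assumed non-negative (for example, post-ReLU) so that the fractional root is well defined.

```python
import statistics


def gram_features(activations, order=1):
    """Upper-triangular entries of the order-p Gram matrix of a channels x positions map."""
    powered = [[v ** order for v in channel] for channel in activations]
    feats = []
    for i in range(len(powered)):
        for j in range(i, len(powered)):
            dot = sum(a * b for a, b in zip(powered[i], powered[j]))
            feats.append(dot ** (1.0 / order))
    return feats


def fit_clean_stats(clean_rows):
    """Per-feature median and median absolute deviation over clean samples of one class."""
    columns = list(zip(*clean_rows))
    medians = [statistics.median(col) for col in columns]
    mads = [
        statistics.median([abs(v - m) for v in col])
        for col, m in zip(columns, medians)
    ]
    return medians, mads


def deviation(features, medians, mads, eps=1e-6):
    """Sum of MAD-normalized deviations; larger means more anomalous (assumed score)."""
    return sum(abs(f - m) / (d + eps) for f, m, d in zip(features, medians, mads))


# Toy activations: 2 channels x 3 positions per sample.
clean = [
    [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    [[1.1, 0.0, 0.9], [0.0, 1.0, 0.1]],
    [[0.9, 0.1, 1.0], [0.1, 0.9, 0.0]],
]
suspect = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

clean_rows = [gram_features(a) for a in clean]
medians, mads = fit_clean_stats(clean_rows)
clean_scores = [deviation(r, medians, mads) for r in clean_rows]
suspect_score = deviation(gram_features(suspect), medians, mads)
print(max(clean_scores), suspect_score)
```

In this toy setup the suspect sample, whose channels are far more correlated than in any clean sample, scores well above every clean sample. A practical detector would also need a threshold calibrated on held-out clean data.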

AI-2-Title Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation

Link: https://arxiv.org/abs/2209.11679
Authors: Chenxu Wang, Fuli Feng, Yang Zhang, Qifan Wang, Xunhan Hu, Xiangnan He
Comments:

Abstract: Historical interactions are the default choice for recommender model training, which typically exhibit high sparsity, i.e., most user-item pairs are unobserved missing data. A standard choice is treating the missing data as negative training samples and estimating interaction likelihood between user-item pairs along with the observed interactions. In this way, some potential interactions are inevitably mislabeled during training, which will hurt the model fidelity, hindering the model to recall the mislabeled items, especially the long-tail ones. In this work, we investigate the mislabeling issue from a new perspective of aleatoric uncertainty, which describes the inherent randomness of missing data. The randomness pushes us to go beyond merely the interaction likelihood and embrace aleatoric uncertainty modeling. Towards this end, we propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework that consists of a new uncertainty estimator along with a normal recommender model. According to the theory of aleatoric uncertainty, we derive a new recommendation objective to learn the estimator. As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty, which is demonstrated to improve the recommendation performance of less popular items without sacrificing the overall performance. We instantiate AUR on three representative recommender models: Matrix Factorization (MF), LightGCN, and VAE from mainstream model architectures. Extensive results on two real-world datasets validate the effectiveness of AUR w.r.t. better recommendation results, especially on long-tail items.
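
The abstract does not state the training objective, so the sketch below uses the Gaussian negative log-likelihood commonly used for aleatoric uncertainty as a stand-in. It shows why a per-pair uncertainty estimate can soften the penalty for a possibly mislabeled negative; the function name and the numbers are illustrative assumptions, not the AUR objective itself.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Gaussian negative log-likelihood with a learned log-variance (constant term dropped)."""
    return 0.5 * math.exp(-log_var) * (target - mean) ** 2 + 0.5 * log_var


# A missing user-item pair is treated as a negative (target 0.0),
# but the recommender believes an interaction is likely (mean 0.8).
confident = gaussian_nll(target=0.0, mean=0.8, log_var=-2.0)  # low uncertainty
uncertain = gaussian_nll(target=0.0, mean=0.8, log_var=0.0)   # high uncertainty
print(confident, uncertain)
```

With higher estimated uncertainty the loss for the disputed label is smaller, so the model is pushed less strongly toward a label that may be wrong. The `0.5 * log_var` term keeps the estimator from inflating the variance everywhere.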

AI-3-Title The SpeakIn Speaker Verification System for Far-Field Speaker Verification Challenge 2022

Link: https://arxiv.org/abs/2209.11625
Authors: Yu Zheng, Jinghan Peng, Yihao Chen, Yajun Zhang, Jialong Wang, Min Liu, Minqiang Xu
Comments: 5 pages. arXiv admin note: text overlap with arXiv:2209.10846

Abstract: This paper describes speaker verification (SV) systems submitted by the SpeakIn team to the Task 1 and Task 2 of the Far-Field Speaker Verification Challenge 2022 (FFSVC2022). SV tasks of the challenge focus on the problem of fully supervised far-field speaker verification (Task 1) and semi-supervised far-field speaker verification (Task 2). In Task 1, we used the VoxCeleb and FFSVC2020 datasets as train datasets. And for Task 2, we only used the VoxCeleb dataset as train set. The ResNet-based and RepVGG-based architectures were developed for this challenge. Global statistic pooling structure and MQMHA pooling structure were used to aggregate the frame-level features across time to obtain utterance-level representation. We adopted AM-Softmax and AAM-Softmax to classify the resulting embeddings. We innovatively propose a staged transfer learning method. In the pre-training stage we reserve the speaker weights, and there are no positive samples to train them in this stage. Then we fine-tune these weights with both positive and negative samples in the second stage. Compared with the traditional transfer learning strategy, this strategy can better improve the model performance. The Sub-Mean and AS-Norm backend methods were used to solve the problem of domain mismatch. In the fusion stage, three models were fused in Task1 and two models were fused in Task2. On the FFSVC2022 leaderboard, the EER of our submission is 3.0049% and the corresponding minDCF is 0.2938 in Task1. In Task2, EER and minDCF are 6.2060% and 0.5232 respectively. Our approach leads to excellent performance and ranks 1st in both challenge tasks.
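
The AS-Norm backend named in the abstract normalizes a trial score against cohort statistics. A minimal sketch of adaptive symmetric score normalization follows, assuming the top-N cohort scores are selected separately for the enrollment and test sides; the cohort values, `top_n`, and the function names are illustrative assumptions rather than the team's configuration.

```python
import statistics


def top_stats(cohort_scores, top_n):
    """Mean and standard deviation of the top-N highest cohort scores."""
    top = sorted(cohort_scores, reverse=True)[:top_n]
    return statistics.mean(top), statistics.pstdev(top)


def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n=2):
    """Average of the enrollment-side and test-side normalized scores."""
    mu_e, sd_e = top_stats(enroll_cohort_scores, top_n)
    mu_t, sd_t = top_stats(test_cohort_scores, top_n)
    # Guard against a zero standard deviation in degenerate cohorts.
    sd_e, sd_t = max(sd_e, 1e-8), max(sd_t, 1e-8)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)


# Hypothetical similarity scores of each side against a cohort of impostor speakers.
enroll_vs_cohort = [0.1, 0.3, 0.5, 0.2]
test_vs_cohort = [0.2, 0.6, 0.4, 0.0]
normalized = as_norm(0.8, enroll_vs_cohort, test_vs_cohort, top_n=2)
print(normalized)
```

Because each trial is rescaled by statistics from its own closest impostors, a single decision threshold transfers better across recording conditions, which is why this kind of normalization is used against domain mismatch.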

AI-4-Title involve-MI: Informative Planning with High-Dimensional Non-Parametric Beliefs

Link: https://arxiv.org/abs/2209.11591
Authors: Gilad Rotman, Vadim Indelman
Comments:

Abstract: One of the most complex tasks of decision making and planning is to gather information. This task becomes even more complex when the state is high-dimensional and its belief cannot be expressed with a parametric distribution. Although the state is high-dimensional, in many problems only a small fraction of it might be involved in transitioning the state and generating observations. We exploit this fact to calculate an information-theoretic expected reward, mutual information (MI), over a much lower-dimensional subset of the state, to improve efficiency and without sacrificing accuracy. A similar approach was used in previous works, yet specifically for Gaussian distributions, and we here extend it for general distributions. Moreover, we apply the dimensionality reduction for cases in which the new states are augmented to the previous, yet again without sacrificing accuracy. We then continue by developing an estimator for the MI which works in a Sequential Monte Carlo (SMC) manner, and avoids the reconstruction of future belief’s surfaces. Finally, we show how this work is applied to the informative planning optimization problem. This work is then evaluated in a simulation of an active SLAM problem, where the improvement in both accuracy and timing is demonstrated.
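
For readers unfamiliar with the reward used here, mutual information between a state variable X and an observation Z is I(X; Z) = Σ p(x, z) log [p(x, z) / (p(x) p(z))]. The sketch below evaluates it on a small discrete joint table. This toy setting is an assumption chosen for clarity; the paper targets high-dimensional non-parametric beliefs and a Sequential Monte Carlo estimator, neither of which is reproduced here.

```python
import math


def mutual_information(joint):
    """Mutual information in nats for a discrete joint table joint[x][z]."""
    px = [sum(row) for row in joint]
    pz = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for x, row in enumerate(joint):
        for z, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (px[x] * pz[z]))
    return mi


# An observation that is independent of the state carries no information.
independent = [[0.25, 0.25], [0.25, 0.25]]
# An observation that fully determines the state carries log(2) nats.
informative = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(informative))
```

An informative planner would prefer the action whose expected observation yields the larger value.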

Attachment Download

Click to download today's full paper list