Friendly reminder: if you would like to receive the daily paper digest by email, leave your email address in the comments; emails are sent automatically around 11:30 every day.


Overview (2022-09-26)


  • 17 papers on Natural Language Processing (NLP: cs.CL)
  • 66 papers on Computer Vision (CV: cs.CV)
  • 40 papers on Machine Learning (ML: cs.LG)
  • 5 papers on Artificial Intelligence (AI: cs.AI)
  • 87 papers on other topics


NLP-0-Title: Promptagator: Few-shot Dense Retrieval From 8 Examples

Link: https://arxiv.org/abs/2209.11755
Authors: Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, Ming-Wei Chang


Abstract: Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we suggest working on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-based Query Generation for Retriever (Promptagator), which leverages large language models (LLMs) as few-shot query generators, and creates task-specific retrievers based on the generated data. Powered by LLMs' generalization ability, Promptagator makes it possible to create task-specific end-to-end retrievers solely based on a few examples, without using Natural Questions or MS MARCO to train question generators or dual encoders. Surprisingly, LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO, such as ColBERT v2, by more than 1.2 nDCG on average over 11 retrieval sets. Further training standard-size re-rankers using the same generated data yields another 5.0-point nDCG improvement. Our studies determine that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.
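
A hypothetical sketch of assembling a Promptagator-style few-shot prompt: a task description plus at most 8 (document, query) examples, followed by the new document for the LLM to complete. The template wording below is an assumption, not the paper's exact format.

```python
# Hypothetical few-shot prompt assembly for LLM query generation.
def build_query_gen_prompt(task_description, examples, new_document, max_examples=8):
    """Assemble a prompt asking an LLM to write a query for new_document."""
    lines = [task_description, ""]
    for doc, query in examples[:max_examples]:
        lines += [f"Document: {doc}", f"Query: {query}", ""]
    lines += [f"Document: {new_document}", "Query:"]  # the LLM completes this line
    return "\n".join(lines)

prompt = build_query_gen_prompt(
    "Given a document, write a question that it answers.",
    [("Paris is the capital of France.", "what is the capital of france")],
    "The Nile is the longest river in Africa.",
)
```

The generated (document, query) pairs would then serve as synthetic training data for a task-specific dual encoder.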


NLP-1-Title: Temporal Analysis on Topics Using Word2Vec

Link: https://arxiv.org/abs/2209.11717
Authors: Angad Sandhu, Aneesh Edara, Faizan Wajid, Ashok Agrawala


Abstract: The present study proposes a novel method of trend detection and visualization - more specifically, modeling the change in a topic over time. Where current models used for the identification and visualization of trends only convey the popularity of a single word based on stochastic counting of usage, the approach in the present study illustrates the popularity and direction in which a topic is moving. The direction in this case is a distinct subtopic within the selected corpus. Such trends are generated by modeling the movement of a topic using k-means clustering and cosine similarity to group the distances between clusters over time. In a convergent scenario, it can be inferred that the topics as a whole are meshing (tokens between topics are becoming interchangeable). On the contrary, a divergent scenario would imply that each topic's respective tokens would not be found in the same context (the words grow increasingly different from each other). The methodology was tested on a group of articles from various media houses present in the 20 Newsgroups dataset.
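
The pipeline the abstract describes (cluster each time window with k-means, then compare cluster centroids across windows via cosine similarity) can be sketched on synthetic 2-D "topic" embeddings; the data, dimensionality, and initialization below are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def kmeans(X, k, iters=20):
    # Simple Lloyd's algorithm with deterministic, spread-out initialization.
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def centroid_cosine(cents):
    a, b = cents
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Early window: two topics pointing in clearly different directions.
early = np.vstack([rng.normal([1.0, 0.0], 0.05, (50, 2)),
                   rng.normal([0.0, 1.0], 0.05, (50, 2))])
# Later window: the same topics have drifted toward each other (convergence).
late = np.vstack([rng.normal([1.0, 0.9], 0.05, (50, 2)),
                  rng.normal([0.9, 1.0], 0.05, (50, 2))])
sim_early = centroid_cosine(kmeans(early, 2))
sim_late = centroid_cosine(kmeans(late, 2))
```

Rising centroid similarity across windows signals the convergent scenario; falling similarity signals divergence.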


NLP-2-Title: Best Prompts for Text-to-Image Models and How to Find Them

Link: https://arxiv.org/abs/2209.11711
Authors: Nikita Pavlichenko, Dmitry Ustalov
Comments: 12 pages (4 main pages), 4 figures, 4 tables


Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions.
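
As a rough illustration of the idea, a genetic algorithm over keyword sets might look as follows; the keyword pool is typical of text-to-image prompts, and the fitness function is a stand-in for the human feedback the paper collects.

```python
import random

KEYWORDS = ["highly detailed", "4k", "trending on artstation", "cinematic lighting",
            "matte painting", "octane render", "sharp focus", "illustration"]
PREFERRED = {"highly detailed", "cinematic lighting", "sharp focus"}  # assumed ratings

def fitness(keyword_set):
    # Stand-in for aggregated human preference scores; penalize long prompts.
    return len(keyword_set & PREFERRED) - 0.1 * len(keyword_set)

def mutate(ks, rng):
    ks = set(ks)
    ks.symmetric_difference_update({rng.choice(KEYWORDS)})  # toggle one keyword
    return ks

def crossover(a, b, rng):
    child = {k for k in a | b if rng.random() < 0.5}
    return child or set(a)

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [set(rng.sample(KEYWORDS, 3)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # keep the fitter half
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

In the human-in-the-loop version, the fitness call would be replaced by raters comparing images generated with each keyword set.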


NLP-3-Title: A Neural Model for Regular Grammar Induction

Link: https://arxiv.org/abs/2209.11628
Authors: Peter Belcák, David Hofer, Roger Wattenhofer
Comments: Accepted to the 21st IEEE International Conference on Machine Learning and Applications (ICMLA) 2022, 6 pages, 4 figures


Abstract: Grammatical inference is a classical problem in computational learning theory and a topic of wider influence in natural language processing. We treat grammars as a model of computation and propose a novel neural approach to induction of regular grammars from positive and negative examples. Our model is fully explainable, its intermediate results are directly interpretable as partial parses, and it can be used to learn arbitrary regular grammars when provided with sufficient data. Our method consistently attains high recall and precision scores across a range of tests of varying complexity. We make the detailed results and code readily available.
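
For readers unfamiliar with the setting: a regular grammar corresponds to a finite automaton, and induction from positive and negative examples means finding an automaton that accepts the former and rejects the latter. A minimal membership checker (illustrative only, not the paper's neural model):

```python
def dfa_accepts(transitions, start, accepting, string):
    """transitions maps (state, symbol) -> state; missing entries reject."""
    state = start
    for ch in string:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state in accepting

# DFA for the regular language (ab)*
dfa = {("q0", "a"): "q1", ("q1", "b"): "q0"}
ok = all(dfa_accepts(dfa, "q0", {"q0"}, s) for s in ["", "ab", "abab"])
bad = any(dfa_accepts(dfa, "q0", {"q0"}, s) for s in ["a", "ba", "abb"])
```

An induction method is judged by exactly this criterion: the recall and precision of the induced automaton on held-out positive and negative strings.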


NLP-4-Title: Robust Domain Adaptation for Machine Reading Comprehension

Link: https://arxiv.org/abs/2209.11615
Authors: Liang Jiang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng


Abstract: Most domain adaptation methods for machine reading comprehension (MRC) use a pre-trained question-answer (QA) construction model to generate pseudo QA pairs for MRC transfer. Such a process will inevitably introduce mismatched pairs (i.e., noisy correspondence) due to i) the unavailability of QA pairs in target documents, and ii) the domain shift when applying the QA construction model to the target domain. Undoubtedly, the noisy correspondence degrades the performance of MRC, yet it is neglected by existing works. To solve this untouched problem, we propose to construct QA pairs by additionally using the dialogue related to the documents, together with a new domain adaptation method for MRC. Specifically, we propose the Robust Domain Adaptation for Machine Reading Comprehension (RMRC) method, which consists of an answer extractor (AE), a question selector (QS), and an MRC model. RMRC filters out irrelevant answers by estimating their correlation with the document via the AE, and extracts questions by fusing candidate questions from multiple rounds of dialogue chats via the QS. With the extracted QA pairs, the MRC model is fine-tuned and provides feedback to optimize the QS through a novel reinforced self-training method. Thanks to the optimization of the QS, our method greatly alleviates the noisy correspondence problem caused by the domain shift. To the best of our knowledge, this could be the first study to reveal the influence of noisy correspondence in domain adaptation for MRC models and to show a feasible way to achieve robustness to mismatched pairs. Extensive experiments on three datasets demonstrate the effectiveness of our method.


NLP-5-Title: An Interdisciplinary Perspective on Evaluation and Experimental Design for Visual Text Analytics: Position Paper

Link: https://arxiv.org/abs/2209.11534
Authors: Kostiantyn Kucher, Nicole Sultanum, Angel Daza, Vasiliki Simaki, Maria Skeppstedt, Barbara Plank, Jean-Daniel Fekete, Narges Mahyar
Comments: To appear in Proceedings of the 2022 IEEE Workshop on Evaluation and Beyond - Methodological Approaches to Visualization (BELIV '22)


Abstract: Appropriate evaluation and experimental design are fundamental for empirical sciences, particularly in data-driven fields. Due to the successes in computational modeling of languages, for instance, research outcomes are having an increasingly immediate impact on end users. As the gap in adoption by end users decreases, the need increases to ensure that tools and models developed by the research communities and practitioners are reliable, trustworthy, and supportive of the users in their goals. In this position paper, we focus on the issues of evaluating visual text analytics approaches. We take an interdisciplinary perspective from the visualization and natural language processing communities, as we argue that the design and validation of visual text analytics include concerns beyond computational or visual/interactive methods on their own. We identify four key groups of challenges for evaluating visual text analytics approaches (data ambiguity, experimental design, user trust, and "big picture" concerns) and provide suggestions for research opportunities from an interdisciplinary perspective.


NLP-6-Title: MetaPrompting: Learning to Learn Better Prompts

Link: https://arxiv.org/abs/2209.11486
Authors: Yutai Hou, Hongyuan Dong, Xinghao Wang, Bohan Li, Wanxiang Che


Abstract: Prompting is regarded as one of the crucial advances in few-shot natural language processing. Recent research on prompting has moved from discrete-token-based "hard prompts" to continuous "soft prompts", which employ learnable vectors as pseudo prompt tokens and achieve better performance. Though showing promising prospects, these soft-prompting methods are observed to rely heavily on good initialization to take effect. Unfortunately, obtaining a perfect initialization for soft prompts requires understanding of the language model's inner workings and elaborate design, which is no easy task and has to be restarted from scratch for each new task. To remedy this, we propose a generalized soft-prompting method called MetaPrompting, which adopts the well-recognized model-agnostic meta-learning algorithm to automatically find a better prompt initialization that facilitates fast adaptation to new prompting tasks. Extensive experiments show MetaPrompting tackles the soft prompt initialization problem and brings significant improvement on four different datasets (over 6 points of accuracy improvement in the 1-shot setting), achieving new state-of-the-art performance.
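
The meta-learning idea can be illustrated with a toy first-order MAML loop on a 1-D "prompt" parameter: each task has its own optimum, and the meta-learned initialization settles where one inner gradient step adapts well to any task. This is a scalar stand-in for MetaPrompting's high-dimensional soft prompts, not the paper's algorithm.

```python
def task_grad(theta, target):
    return 2 * (theta - target)  # gradient of the task loss (theta - target)^2

def maml_init(targets, inner_lr=0.1, outer_lr=0.05, steps=200):
    theta0 = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for t in targets:
            adapted = theta0 - inner_lr * task_grad(theta0, t)  # inner adaptation step
            meta_grad += task_grad(adapted, t)                  # first-order outer gradient
        theta0 -= outer_lr * meta_grad / len(targets)
    return theta0

init = maml_init([1.0, 3.0])  # task optima at 1 and 3
```

For quadratic task losses the meta-optimal initialization sits at the mean of the task optima, which the loop converges to.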


NLP-7-Title: ET5: A Novel End-to-end Framework for Conversational Machine Reading Comprehension

Link: https://arxiv.org/abs/2209.11484
Authors: Xiao Zhang, Heyan Huang, Zewen Chi, Xian-Ling Mao
Comments: Accepted by COLING 2022


Abstract: Conversational machine reading comprehension (CMRC) aims to assist computers in understanding a natural language text and thereafter engaging in a multi-turn conversation to answer questions related to the text. Existing methods typically require three steps: (1) decision making based on entailment reasoning; (2) span extraction, if required by the above decision; (3) question rephrasing based on the extracted span. However, for nearly all these methods, the span extraction and question rephrasing steps cannot fully exploit the fine-grained entailment reasoning information of the decision-making step because of their relative independence, which further enlarges the information gap between decision making and question rephrasing. Thus, to tackle this problem, we propose a novel end-to-end framework for conversational machine reading comprehension based on a shared-parameter mechanism, called entailment reasoning T5 (ET5). Despite the light weight of our proposed framework, experimental results show that the proposed ET5 achieves new state-of-the-art results on the ShARC leaderboard with a BLEU-4 score of 55.2. Our model and code are publicly available at this https URL.


NLP-8-Title: News Category Dataset

Link: https://arxiv.org/abs/2209.11429
Authors: Rishabh Misra


Abstract: People rely on news to know what is happening around the world and to inform their daily lives. In today's world, where the proliferation of fake news is rampant, having a large-scale and high-quality source of authentic news articles with published category information is valuable for learning the natural language syntax and semantics of authentic news. As part of this work, we present a News Category Dataset that contains around 200k news headlines from 2012 to 2018 obtained from HuffPost, along with useful metadata to enable various NLP tasks. In this paper, we also produce some novel insights from the dataset and describe various existing and potential applications of our dataset.
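
Datasets of this kind are typically distributed as JSON Lines, one article per line; a minimal loading sketch follows. The field names "category" and "headline" match the commonly distributed version of this dataset, but should be verified against the actual file.

```python
import json
from collections import Counter

# In practice these lines would come from the downloaded .json file.
lines = [
    '{"category": "POLITICS", "headline": "Example headline one"}',
    '{"category": "WELLNESS", "headline": "Example headline two"}',
    '{"category": "POLITICS", "headline": "Example headline three"}',
]
records = [json.loads(line) for line in lines]
category_counts = Counter(r["category"] for r in records)
```

The (headline, category) pairs then serve directly as a supervised text classification corpus.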


NLP-9-Title: Zero-shot Domain Adaptation for Neural Machine Translation with Retrieved Phrase-level Prompts

Link: https://arxiv.org/abs/2209.11409
Authors: Zewei Sun, Qingnan Jiang, Shujian Huang, Jun Cao, Shanbo Cheng, Mingxuan Wang


Abstract: Domain adaptation is an important challenge for neural machine translation. However, the traditional fine-tuning solution requires multiple rounds of extra training and incurs a high cost. In this paper, we propose a non-tuning paradigm, resolving domain adaptation with a prompt-based method. Specifically, we construct a bilingual phrase-level database and retrieve relevant pairs from it as a prompt for the input sentences. By utilizing Retrieved Phrase-level Prompts (RePP), we effectively boost the translation quality. Experiments show that our method improves domain-specific machine translation by 6.2 BLEU points and improves translation-constraint accuracy by 11.5% without additional training.
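
A toy sketch of the retrieval step: look up bilingual phrase pairs that overlap with the input sentence and prepend them as a prompt. The word-overlap scoring, the sample phrase table, and the prompt separator below are stand-ins, not the paper's method.

```python
phrase_table = [  # (source-domain phrase, target-language phrase); illustrative entries
    ("dense retrieval", "recuperacion densa"),
    ("neural network", "red neuronal"),
    ("machine translation", "traduccion automatica"),
]

def retrieve_prompts(sentence, table, top_k=2):
    words = set(sentence.lower().split())
    scored = [(len(set(src.split()) & words), src, tgt) for src, tgt in table]
    scored.sort(reverse=True)
    return [(src, tgt) for score, src, tgt in scored[:top_k] if score > 0]

def build_input(sentence, table):
    pairs = retrieve_prompts(sentence, table)
    prompt = " ; ".join(f"{src} => {tgt}" for src, tgt in pairs)
    return f"{prompt} ||| {sentence}" if prompt else sentence

out = build_input("neural machine translation improves retrieval", phrase_table)
```

The augmented input is then fed to the translation model unchanged, which is what makes the paradigm "non-tuning".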


NLP-10-Title: IDEA: Interactive DoublE Attentions from Label Embedding for Text Classification

Link: https://arxiv.org/abs/2209.11407
Authors: Ziyuan Wang, Hailiang Huang, Songqiao Han
Comments: Accepted by ICTAI 2022


Abstract: Current text classification methods typically encode the text merely into an embedding before a naive or complicated classifier, which ignores the suggestive information contained in the label text. As a matter of fact, humans classify documents primarily based on the semantic meaning of the subcategories. We propose a novel model structure via Siamese BERT and interactive double attentions, named IDEA (Interactive DoublE Attentions), to capture the information exchange between texts and label names. Interactive double attentions enable the model to exploit inter-class and intra-class information from coarse to fine, which involves distinguishing among all labels and matching the semantic subclasses of the ground-truth labels. Our proposed method significantly outperforms state-of-the-art methods that use label texts, with more stable results.
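
The general shape of a text-label cross attention (shapes only; this is not the IDEA architecture itself) can be sketched with NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, labels):
    """text: (T, d) token embeddings; labels: (L, d) label-name embeddings."""
    scores = text @ labels.T / np.sqrt(text.shape[1])  # (T, L) similarity scores
    text_to_label = softmax(scores, axis=1) @ labels   # label-aware token representations
    label_to_text = softmax(scores.T, axis=1) @ text   # text-aware label representations
    return text_to_label, label_to_text

rng = np.random.default_rng(0)
t2l, l2t = cross_attention(rng.normal(size=(6, 8)), rng.normal(size=(3, 8)))
```

Attention runs in both directions, which is the "double" part: tokens attend over label names and label names attend over tokens.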


NLP-11-Title: Conversational QA Dataset Generation with Answer Revision

Link: https://arxiv.org/abs/2209.11396
Authors: Seonjeong Hwang, Gary Geunbae Lee
Comments: COLING 2022


Abstract: Conversational question–answer generation is a task that automatically generates a large-scale conversational question answering dataset based on input passages. In this paper, we introduce a novel framework that extracts question-worthy phrases from a passage and then generates corresponding questions considering previous conversations. In particular, our framework revises the extracted answers after generating questions so that answers exactly match paired questions. Experimental results show that our simple answer revision approach leads to significant improvement in the quality of synthetic data. Moreover, we prove that our framework can be effectively utilized for domain adaptation of conversational question answering.


NLP-12-Title: Improving Conversational Recommender System via Contextual and Time-Aware Modeling with Less Domain-Specific Knowledge

Link: https://arxiv.org/abs/2209.11386
Authors: Lingzhi Wang, Shafiq Joty, Wei Gao, Xingshan Zeng, Kam-Fai Wong


Abstract: Conversational Recommender Systems (CRS) have become an emerging research topic seeking to perform recommendations through interactive conversations, and generally consist of generation and recommendation modules. Prior work on CRS tends to incorporate additional external and domain-specific knowledge, such as item reviews, to enhance performance. However, collecting and annotating external domain-specific information requires much human effort and degrades generalizability, and too much extra knowledge makes it difficult to balance among the sources. Therefore, we propose to fully discover and extract internal knowledge from the context. We capture both entity-level and contextual-level representations to jointly model user preferences for the recommendation, where a time-aware attention is designed to emphasize recently appeared items in the entity-level representations. We further use pre-trained BART to initialize the generation module to alleviate data scarcity and enhance context modeling. In addition to conducting experiments on a popular dataset (ReDial), we also include a multi-domain dataset (OpenDialKG) to show the effectiveness of our model. Experiments on both datasets show that our model achieves better performance on most evaluation metrics with less external knowledge, and generalizes well to other domains. Additional analyses of the recommendation and generation tasks demonstrate the effectiveness of our model in different scenarios.


NLP-13-Title: Extending Word-Level Quality Estimation for Post-Editing Assistance

Link: https://arxiv.org/abs/2209.11378
Authors: Yizhen Wei, Takehito Utsuro, Masaaki Nagata


Abstract: We define a novel concept called extended word alignment in order to improve post-editing assistance efficiency. Based on extended word alignment, we further propose a novel task called refined word-level QE that outputs refined tags and word-level correspondences. Compared to original word-level QE, the new task is able to directly point out editing operations, thus improving efficiency. To extract extended word alignment, we adopt a supervised method based on mBERT. To solve refined word-level QE, we first predict original QE tags by training a regression model for sequence tagging based on mBERT and XLM-R. Then, we refine the original word tags with extended word alignment. In addition, we extract source-gap correspondences while obtaining gap tags. Experiments on two language pairs show the feasibility of our method and suggest directions for further improvement.


NLP-14-Title: Towards Faithful Model Explanation in NLP: A Survey

Link: https://arxiv.org/abs/2209.11326
Authors: Qing Lyu, Marianna Apidianaki, Chris Callison-Burch
Comments: 62 pages


Abstract: End-to-end neural NLP architectures are notoriously difficult to understand, which gives rise to numerous efforts towards model explainability in recent years. An essential principle of model explanation is Faithfulness, i.e., an explanation should accurately represent the reasoning process behind the model’s prediction. This survey first discusses the definition and evaluation of Faithfulness, as well as its significance for explainability. We then introduce the recent advances in faithful explanation by grouping approaches into five categories: similarity methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. Each category will be illustrated with its representative studies, advantages, and shortcomings. Finally, we discuss all the above methods in terms of their common virtues and limitations, and reflect on future work directions towards faithful explainability. For researchers interested in studying interpretability, this survey will offer an accessible and comprehensive overview of the area, laying the basis for further exploration. For users hoping to better understand their own models, this survey will be an introductory manual helping with choosing the most suitable explanation method(s).


NLP-15-Title: ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Link: https://arxiv.org/abs/2209.11302
Authors: Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg


Abstract: Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even to generate action sequences directly, given an instruction in natural language and no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible for a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation to function across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state-of-the-art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website at this http URL.
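
A hedged sketch of what a program-like prompt might look like: available actions and objects rendered as Python-style code, with the LLM asked to complete a plan function. All names below are illustrative, not the paper's exact template.

```python
def make_progprompt(actions, objects, task):
    """Render actions and objects as code so the LLM completes a plan function."""
    header = "\n".join(f"def {a}(obj): ..." for a in actions)
    objs = "objects = [" + ", ".join(repr(o) for o in objects) + "]"
    return f"{header}\n{objs}\n\ndef {task}():\n    # LLM completes the plan below\n"

prompt = make_progprompt(
    actions=["grab", "open", "put_in"],
    objects=["salmon", "fridge", "microwave"],
    task="microwave_salmon",
)
```

Because the prompt enumerates only the actions and objects the robot actually has, the completion is constrained toward executable plans rather than free-form text.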


NLP-16-Title: XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages

Link: https://arxiv.org/abs/2209.11252
Authors: Shivprasad Sagare, Tushar Abhishek, Bhavyajeet Singh, Anubhav Sharma, Manish Gupta, Vasudeva Varma


Abstract: Multiple business scenarios require the automated generation of descriptive, human-readable text from structured input data. Hence, fact-to-text generation systems have been developed for various downstream tasks, such as generating soccer reports, weather and financial reports, medical reports, and person biographies. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English, mainly due to the high availability of relevant datasets. Only recently was the problem of cross-lingual fact-to-text (XF2T) generation across multiple languages proposed, along with a dataset, XALIGN, for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend the XALIGN dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese, and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XALIGNV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings, and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to the best results on average across the twelve languages. We make our code, dataset, and model publicly available, and hope that this will help advance further research in this critical area.



ML-0-Title: GLSO: Grammar-guided Latent Space Optimization for Sample-efficient Robot Design Automation

Link: https://arxiv.org/abs/2209.11748
Authors: Jiaheng Hu, Julian Whiman, Howie Choset


Abstract: Robots have been used in all sorts of automation, and yet the design of robots remains mainly a manual task. We seek to provide design tools to automate the design of robots themselves. An important challenge in robot design automation is the large and complex design search space which grows exponentially with the number of components, making optimization difficult and sample inefficient. In this work, we present Grammar-guided Latent Space Optimization (GLSO), a framework that transforms design automation into a low-dimensional continuous optimization problem by training a graph variational autoencoder (VAE) to learn a mapping between the graph-structured design space and a continuous latent space. This transformation allows optimization to be conducted in a continuous latent space, where sample efficiency can be significantly boosted by applying algorithms such as Bayesian Optimization. GLSO guides training of the VAE using graph grammar rules and robot world space features, such that the learned latent space focus on valid robots and is easier for the optimization algorithm to explore. Importantly, the trained VAE can be reused to search for designs specialized to multiple different tasks without retraining. We evaluate GLSO by designing robots for a set of locomotion tasks in simulation, and demonstrate that our method outperforms related state-of-the-art robot design automation methods.
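
The key mechanic, optimizing in a continuous latent space through a decoder rather than over discrete designs, can be sketched with toy stand-ins. Here random search substitutes for Bayesian optimization, and both the "decoder" and the objective are placeholders, not the paper's VAE or locomotion reward.

```python
import numpy as np

def decoder(z):
    return np.tanh(z)  # placeholder for the VAE decoder mapping z -> design

def objective(design):
    target = np.array([0.5, -0.25])          # placeholder "good design"
    return -np.sum((design - target) ** 2)   # higher is better

def latent_search(dim=2, n_samples=500, seed=0):
    # Random search over the latent space; BO would pick samples adaptively.
    rng = np.random.default_rng(seed)
    best_z, best_score = None, -np.inf
    for _ in range(n_samples):
        z = rng.normal(size=dim)
        score = objective(decoder(z))
        if score > best_score:
            best_z, best_score = z, score
    return best_z, best_score

z, score = latent_search()
```

The point of the construction is that every candidate is evaluated through the decoder, so the search never has to manipulate the discrete graph-structured design directly.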


ML-1-Title: Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning

Link: https://arxiv.org/abs/2209.11745
Authors: Fan Chen, Song Mei, Yu Bai


Abstract: Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL with the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021) which require either bounding a variant of the DEC which may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy similar regret bounds as E2D-TA under similar structural conditions as the DEC.


ML-2-Title: From Weakly Supervised Learning to Active Learning

Link: https://arxiv.org/abs/2209.11629
Authors: Vivien Cabannes
Comments: PhD Thesis, Ecole Normale Superieure, 2022


Abstract: Applied mathematics and machine computations have raised a lot of hope since the recent success of supervised learning. Many practitioners in industry have been trying to switch from their old paradigms to machine learning. Interestingly, those data scientists spend more time scraping, annotating and cleaning data than fine-tuning models. This thesis is motivated by the following question: can we derive a more generic framework than that of supervised learning in order to learn from cluttered data? This question is approached through the lens of weakly supervised learning, assuming that the bottleneck of data collection lies in annotation. We model weak supervision as giving, rather than a unique target, a set of target candidates. We argue that one should look for an "optimistic" function that matches most of the observations. This allows us to derive a principle to disambiguate partial labels. We also discuss the advantage of incorporating unsupervised learning techniques into our framework, in particular manifold regularization approached through diffusion techniques, for which we derive a new algorithm that scales better with input dimension than the baseline method. Finally, we switch from passive to active weakly supervised learning, introducing the "active labeling" framework, in which a practitioner can query weak information about chosen data. Among others, we leverage the fact that one does not need full information to access stochastic gradients and perform stochastic gradient descent.
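
The "optimistic" principle for partial labels can be sketched concretely: among each example's candidate labels, take the one the current model scores highest as the training target. This is a toy rendering of the idea, not the thesis's full formulation.

```python
import numpy as np

def optimistic_targets(scores, candidate_sets):
    """scores: (n, k) model scores; candidate_sets: candidate labels per example."""
    targets = []
    for i, candidates in enumerate(candidate_sets):
        best = max(candidates, key=lambda c: scores[i, c])  # optimistic pick
        targets.append(best)
    return targets

scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])
# Each example comes with a set of candidate labels instead of a unique target.
targets = optimistic_targets(scores, [[0, 1], [1, 2]])
```

Training then proceeds as if the optimistic picks were true labels, re-disambiguating as the model improves.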


ML-3-Title: Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration

Link: https://arxiv.org/abs/2209.11604
Authors: Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho


Abstract: Neural network calibration is an essential task in deep learning to ensure consistency between the confidence of model prediction and the true correctness likelihood. In this paper, we propose a new post-processing calibration method called Neural Clamping, which employs a simple joint input-output transformation on a pre-trained classifier via a learnable universal input perturbation and an output temperature scaling parameter. Moreover, we provide theoretical explanations on why Neural Clamping is provably better than temperature scaling. Evaluated on CIFAR-100 and ImageNet image recognition datasets and a variety of deep neural network models, our empirical results show that Neural Clamping significantly outperforms state-of-the-art post-processing calibration methods.
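
For context, the temperature-scaling baseline that Neural Clamping extends divides the logits by a scalar T fitted on held-out data to minimize negative log-likelihood. A minimal sketch on synthetic overconfident logits, with a grid search standing in for gradient-based fitting:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    return float(min(grid, key=lambda T: nll(logits, labels, T)))

rng = np.random.default_rng(0)
model_choice = rng.integers(0, 3, size=200)  # class the model favors
logits = rng.normal(scale=0.5, size=(200, 3))
logits[np.arange(200), model_choice] += 4.0  # large, confident margin
# True labels agree with the model only ~70% of the time -> overconfidence.
flip = rng.random(200) < 0.3
labels = np.where(flip, (model_choice + 1) % 3, model_choice)
T = fit_temperature(logits, labels)
```

Because the model is more confident than it is accurate, the fitted T exceeds 1, softening the predicted probabilities; Neural Clamping additionally learns an input perturbation jointly with T.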


ML-4-标题 Machine Learning and Analytical Power Consumption Models for 5G Base Stations

链接: https://arxiv.org/abs/2209.11600
作者: Nicola Piovesan, David Lopez-Perez, Antonio De Domenico, Xinli Geng, Harvey Bao, Merouane Debbah
备注: Accepted by IEEE Communications Magazine


Abstract: The energy consumption of the fifth generation (5G) of mobile networks is one of the major concerns of the telecom industry. However, there is currently no accurate and tractable approach to evaluate the power consumption of 5G base stations (BSs). In this article, we propose a novel model for a realistic characterisation of the power consumption of 5G multi-carrier BSs, which builds on a large data collection campaign. At first, we define a machine learning architecture that allows modelling multiple 5G BS products. Then, we exploit the knowledge gathered by this framework to derive a realistic and analytically tractable power consumption model, which can help drive theoretical analyses as well as feature standardisation, development and optimisation frameworks. Notably, we demonstrate that such a model has high precision, and that it is able to capture the benefits of energy saving mechanisms. We believe this analytical model represents a fundamental tool for understanding 5G BSs power consumption and accurately optimising the network energy efficiency.


ML-5-标题 Quantification before Selection: Active Dynamics Preference for Robust Reinforcement Learning

链接: https://arxiv.org/abs/2209.11596
作者: Kang Xu, Yan Ma, Wei Li


Abstract: Training a robust policy is critical for policy deployment in real-world systems or dealing with unknown dynamics mismatch in different dynamic systems. Domain Randomization (DR) is a simple and elegant approach that trains a conservative policy to counter different dynamic systems without expert knowledge about the target system parameters. However, existing works reveal that the policy trained through DR tends to be over-conservative and performs poorly in target domains. Our key insight is that dynamic systems with different parameters provide different levels of difficulty for the policy, and the difficulty of behaving well in a system is constantly changing due to the evolution of the policy. If we can actively sample the systems with proper difficulty for the policy on the fly, it will stabilize the training process and prevent the policy from becoming over-conservative or over-optimistic. To operationalize this idea, we introduce Active Dynamics Preference (ADP), which quantifies the informativeness and density of sampled system parameters. ADP actively selects system parameters with high informativeness and low density. We validate our approach in four robotic locomotion tasks with various discrepancies between the training and testing environments. Extensive results demonstrate that our approach has superior robustness for system inconsistency compared to several baselines.


ML-6-标题 Differentially private partitioned variational inference

链接: https://arxiv.org/abs/2209.11595
作者: Mikko A. Heikkilä, Matthew Ashman, Siddharth Swaroop, Richard E. Turner, Antti Honkela
备注: 30 pages, 4 figures


Abstract: Learning a privacy-preserving model from distributed sensitive data is an increasingly important problem, often formulated in the federated learning context. Variational inference has recently been extended to the non-private federated learning setting via the partitioned variational inference algorithm. For privacy protection, the current gold standard is called differential privacy. Differential privacy guarantees privacy in a strong, mathematically clearly defined sense. In this paper, we present differentially private partitioned variational inference, the first general framework for learning a variational approximation to a Bayesian posterior distribution in the federated learning setting while minimising the number of communication rounds and providing differential privacy guarantees for data subjects. We propose three alternative implementations in the general framework, one based on perturbing local optimisation done by individual parties, and two based on perturbing global updates (one using a version of federated averaging, one adding virtual parties to the protocol), and compare their properties both theoretically and empirically. We show that perturbing the local optimisation works well with simple and complex models as long as each party has enough local data. However, the privacy is always guaranteed independently by each party. In contrast, perturbing the global updates works best with relatively simple models. Given access to suitable secure primitives, such as secure aggregation or secure shuffling, the performance can be improved by all parties guaranteeing privacy jointly.
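Both of the paper's strategies, perturbing local optimisation results and perturbing global updates, ultimately come down to adding calibrated noise to a bounded-sensitivity quantity. As a hedged illustration (the paper's exact clipping and calibration are not given in the abstract), here is the classic Gaussian mechanism; the sensitivity, epsilon and delta values below are illustrative.

```python
import math
import random

def gaussian_mechanism(values, sensitivity, epsilon, delta, rng):
    """Perturb a vector with Gaussian noise for (epsilon, delta)-DP.

    Uses the classic calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon,
    valid for epsilon <= 1; tighter analyses exist.
    """
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return [v + rng.gauss(0.0, sigma) for v in values]

rng = random.Random(0)
update = [0.12, -0.05, 0.33]  # e.g., a clipped local parameter update
noisy = gaussian_mechanism(update, sensitivity=1.0, epsilon=0.5, delta=1e-5, rng=rng)
print(len(noisy) == len(update))  # True
```

The trade-off the paper studies then becomes visible: noising local updates lets each party guarantee its own privacy, while noising the aggregated global update adds less total noise but needs trust or secure aggregation.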


ML-7-标题 Learning Rigid Body Dynamics with Lagrangian Graph Neural Network

链接: https://arxiv.org/abs/2209.11588
作者: Ravinder Bhattoo, Sayan Ranu, N. M. Anoop Krishnan
备注: Accepted at NeurIPS 2022


Abstract: Lagrangian and Hamiltonian neural networks (LNN and HNN, respectively) encode strong inductive biases that allow them to significantly outperform other models of physical systems. However, these models have, thus far, mostly been limited to simple systems such as pendulums and springs, or to a single rigid body such as a gyroscope or a rigid rotor. Here, we present a Lagrangian graph neural network (LGNN) that can learn the dynamics of rigid bodies by exploiting their topology. We demonstrate the performance of LGNN by learning the dynamics of ropes, chains, and trusses with the bars modeled as rigid bodies. LGNN also exhibits generalizability: an LGNN trained on chains with a few segments can simulate a chain with a large number of links and arbitrary link length. We also show that the LGNN can simulate unseen hybrid systems, including bars and chains, on which it has not been trained. Specifically, we show that the LGNN can be used to model the dynamics of complex real-world structures, such as the stability of tensegrity structures. Finally, we discuss the non-diagonal nature of the mass matrix and its ability to generalize in complex systems.


ML-8-标题 Applications of Machine Learning in Chemical and Biological Oceanography

链接: https://arxiv.org/abs/2209.11557
作者: Balamurugan Sadaiappan, Preethiya Balakrishnan, Vishal CR, Neethu T Vijayan, Mahendran Subramanian, Mangesh U Gauns
备注: 58 Pages, 5 Figures


Abstract: Machine learning (ML) refers to computer algorithms that predict a meaningful output or categorise complex systems based on a large amount of data. ML is applied in a variety of areas, including natural science, engineering, space exploration, and even game development. This article focuses on the use of machine learning in the field of chemical and biological oceanography. In the prediction of global fixed nitrogen levels, carbon dioxide partial pressure, and other chemical properties, the application of ML is a promising tool. Machine learning is also utilised in the field of biological oceanography to detect planktonic forms from various images (i.e., microscopy, FlowCAM and video recorders), spectrometers, and other signal processing techniques. Moreover, ML has successfully classified mammals using their acoustics, detecting endangered mammalian and fish species in a specific environment. Most importantly, using environmental data, ML has proved to be an effective method for predicting hypoxic conditions and harmful algal bloom events, an important measurement in terms of environmental monitoring. Furthermore, machine learning has been used to construct a number of databases for various species that will be useful to other researchers, and the creation of new algorithms will help the marine research community better comprehend the chemistry and biology of the ocean.


ML-9-标题 On Efficient Reinforcement Learning for Full-length Game of StarCraft II

链接: https://arxiv.org/abs/2209.11553
作者: Ruo-Ze Liu, Zhen-Jia Pang, Zhou-Yu Meng, Wenhai Wang, Yang Yu, Tong Lu
备注: 48 pages,21 figures


Abstract: StarCraft II (SC2) poses a grand challenge for reinforcement learning (RL), whose main difficulties include a huge state space, a varying action space, and a long time horizon. In this work, we investigate a set of RL techniques for the full-length game of StarCraft II. We investigate a hierarchical RL approach involving extracted macro-actions and a hierarchical architecture of neural networks. We investigate a curriculum transfer training procedure and train the agent on a single machine with 4 GPUs and 48 CPU threads. On a 64x64 map and using restrictive units, we achieve a win rate of 99% against the level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve a 93% win rate against the most difficult non-cheating built-in AI (level-7). In this extended version of the paper, we improve our architecture to train the agent against the cheating-level AIs and achieve win rates of 96%, 97%, and 94% against the level-8, level-9, and level-10 AIs, respectively. Our codes are at this https URL. To provide a baseline for our work, as well as for the research and open-source community, we reproduce a scaled-down version of AlphaStar, called mini-AlphaStar (mAS). The latest version of mAS is 1.07, which can be trained on the raw action space, which has 564 actions. It is designed to run training on a single common machine, by making the hyper-parameters adjustable. We then compare our work with mAS using the same resources and show that our method is more effective. The codes of mini-AlphaStar are at this https URL. We hope our study could shed some light on future research on efficient reinforcement learning for SC2 and other large-scale games.


ML-10-标题 A Unified Perspective on Natural Gradient Variational Inference with Gaussian Mixture Models

链接: https://arxiv.org/abs/2209.11533
作者: Oleg Arenz, Philipp Dahlinger, Zihan Ye, Michael Volpp, Gerhard Neumann


Abstract: Variational inference with Gaussian mixture models (GMMs) enables learning of highly-tractable yet multi-modal approximations of intractable target distributions. GMMs are particularly relevant for problem settings with up to a few hundred dimensions, for example in robotics, for modelling distributions over trajectories or joint distributions. This work focuses on two very effective methods for GMM-based variational inference that both employ independent natural gradient updates for the individual components and for the categorical distribution of the weights. We show for the first time, that their derived updates are equivalent, although their practical implementations and theoretical guarantees differ. We identify several design choices that distinguish both approaches, namely with respect to sample selection, natural gradient estimation, stepsize adaptation, and whether trust regions are enforced or the number of components adapted. We perform extensive ablations on these design choices and show that they strongly affect the efficiency of the optimization and the variability of the learned distribution. Based on our insights, we propose a novel instantiation of our generalized framework, that combines first-order natural gradient estimates with trust-regions and component adaption, and significantly outperforms both previous methods in all our experiments.


ML-11-标题 An artificial neural network-based system for detecting machine failures using tiny sound data: A case study

链接: https://arxiv.org/abs/2209.11527
作者: Thanh Tran, Sebastian Bader, Jan Lundgren
备注: 8 pages, 9 figures, conference


Abstract: In an effort to advocate the research for a deep learning-based machine failure detection system, we present a case study of our proposed system based on a tiny sound dataset. Our case study investigates a variational autoencoder (VAE) for augmenting a small drill sound dataset from Valmet AB. The Valmet dataset contains 134 sounds that have been divided into two categories: “Anomaly” and “Normal” recorded from a drilling machine in Valmet AB, a company in Sundsvall, Sweden that supplies equipment and processes for the production of biofuels. Using deep learning models to detect failure drills on such a small sound dataset is typically unsuccessful. We employed a VAE to increase the number of sounds in the tiny dataset by synthesizing new sounds from original sounds. The augmented dataset was created by combining these synthesized sounds with the original sounds. We used a high-pass filter with a passband frequency of 1000 Hz and a low-pass filter with a passband frequency of 22,000 Hz to pre-process sounds in the augmented dataset before transforming them to Mel spectrograms. The pre-trained 2D-CNN AlexNet was then trained using these Mel spectrograms. When compared to using the original tiny sound dataset to train pre-trained AlexNet, using the augmented sound dataset enhanced the CNN model’s classification results by 6.62% (94.12% when trained on the augmented dataset versus 87.5% when trained on the original dataset).

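The pre-processing step above (high-pass at 1000 Hz, low-pass at 22,000 Hz) can be illustrated with simple first-order IIR filters. The paper presumably uses proper higher-order filters (e.g. Butterworth designs via scipy.signal) and a Mel-spectrogram library, so treat this stdlib-only sketch as a stand-in for the idea, not the authors' implementation.

```python
import math

def one_pole_lowpass(x, cutoff_hz, fs_hz):
    """First-order IIR low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    y, prev = [], 0.0
    for s in x:
        prev = prev + a * (s - prev)
        y.append(prev)
    return y

def one_pole_highpass(x, cutoff_hz, fs_hz):
    """Complementary first-order high-pass: the input minus its low-passed version."""
    low = one_pole_lowpass(x, cutoff_hz, fs_hz)
    return [s - l for s, l in zip(x, low)]

fs = 44100
# 0.1 s test tone at 100 Hz: below the 1000 Hz high-pass cutoff, so it should be attenuated.
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(4410)]
filtered = one_pole_highpass(one_pole_lowpass(tone, 22000, fs), 1000, fs)

# The out-of-band tone loses most of its energy.
energy = lambda sig: sum(s * s for s in sig)
print(energy(filtered) < 0.5 * energy(tone))  # True
```

The band-limiting matters because the drill's diagnostic content lies between the cutoffs; frequencies outside that band only add noise to the Mel spectrograms fed to AlexNet.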

ML-12-标题 The complexity of unsupervised learning of lexicographic preferences

链接: https://arxiv.org/abs/2209.11505
作者: Hélène Fargier (IRIT-ADRIA, ANITI), Pierre-François Gimenez (CIDRE), Jérôme Mengin (IRIT-ADRIA, ANITI), Bao Ngoc Le Nguyen (INSA Toulouse)


Abstract: This paper considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees.


ML-13-标题 Sequential Causal Effect Variational Autoencoder: Time Series Causal Link Estimation under Hidden Confounding

链接: https://arxiv.org/abs/2209.11497
作者: Violeta Teodora Trifunov, Maha Shadaydeh, Joachim Denzler


Abstract: Estimating causal effects from observational data in the presence of latent variables sometimes leads to spurious relationships which can be misconceived as causal. This is an important issue in many fields such as finance and climate science. We propose Sequential Causal Effect Variational Autoencoder (SCEVAE), a novel method for time series causality analysis under hidden confounding. It is based on the CEVAE framework and recurrent neural networks. The causal link’s intensity of the confounded variables is calculated by using direct causal criteria based on Pearl’s do-calculus. We show the efficacy of SCEVAE by applying it to synthetic datasets with both linear and nonlinear causal links. Furthermore, we apply our method to real aerosol-cloud-climate observation data. We compare our approach to a time series deconfounding method with and without substitute confounders on the synthetic data. We demonstrate that our method performs better by comparing both methods to the ground truth. In the case of real data, we use the expert knowledge of causal links and show how the use of correct proxy variables aids data reconstruction.


ML-14-标题 Active Few-Shot Classification: a New Paradigm for Data-Scarce Learning Settings

链接: https://arxiv.org/abs/2209.11481
作者: Aymane Abdali, Vincent Gripon, Lucas Drumetz, Bartosz Boguslawski


Abstract: We consider a novel formulation of the problem of Active Few-Shot Classification (AFSC) where the objective is to classify a small, initially unlabeled, dataset given a very restrained labeling budget. This problem can be seen as a rival paradigm to classical Transductive Few-Shot Classification (TFSC), as both these approaches are applicable in similar conditions. We first propose a methodology that combines statistical inference, and an original two-tier active learning strategy that fits well into this framework. We then adapt several standard vision benchmarks from the field of TFSC. Our experiments show the potential benefits of AFSC can be substantial, with gains in average weighted accuracy of up to 10% compared to state-of-the-art TFSC methods for the same labeling budget. We believe this new paradigm could lead to new developments and standards in data-scarce learning settings.


ML-15-标题 Optimizing Class Distribution in Memory for Multi-Label Online Continual Learning

链接: https://arxiv.org/abs/2209.11469
作者: Yan-Shuo Liang, Wu-Jun Li


Abstract: Online continual learning, especially when task identities and task boundaries are unavailable, is a challenging continual learning setting. One representative class of methods for online continual learning is replay-based methods, in which a replay buffer called memory is maintained to keep a small part of past samples for overcoming catastrophic forgetting. When tackling online continual learning, most existing replay-based methods focus on single-label problems in which each sample in the data stream has only one label. But multi-label problems may also happen in the online continual learning setting, in which each sample may have more than one label. In the online setting with multi-label samples, the class distribution in the data stream is typically highly imbalanced, and it is challenging to control the class distribution in memory since changing the number of samples belonging to one class may affect the number of samples belonging to other classes. But the class distribution in memory is critical for replay-based methods to achieve good performance, especially when the class distribution in the data stream is highly imbalanced. In this paper, we propose a simple but effective method, called optimizing class distribution in memory (OCDM), for multi-label online continual learning. OCDM formulates the memory update mechanism as an optimization problem and updates the memory by solving this problem. Experiments on two widely used multi-label datasets show that OCDM can control the class distribution in memory well and can outperform other state-of-the-art methods.

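The abstract does not spell out OCDM's optimization problem, but the goal, controlling the class distribution in memory, can be illustrated with a greedy single-label stand-in: when the buffer is full, evict a sample from the currently most frequent class so the distribution stays close to uniform. (Multi-label samples, the paper's actual setting, make eviction harder because one sample counts toward several classes.) All names here are illustrative.

```python
from collections import Counter

def update_memory(memory, sample, label, capacity):
    """Greedy class-balancing replay-buffer update (single-label simplification).

    When the buffer is full, evict the oldest sample of the most represented
    class, then insert the new sample.
    """
    if len(memory) < capacity:
        memory.append((sample, label))
        return memory
    counts = Counter(lbl for _, lbl in memory)
    majority = counts.most_common(1)[0][0]
    victim = next(i for i, (_, lbl) in enumerate(memory) if lbl == majority)
    memory.pop(victim)
    memory.append((sample, label))
    return memory

memory = []
# Imbalanced stream: 8 samples of class 0 followed by 3 of class 1.
stream = [(x, 0) for x in range(8)] + [(x, 1) for x in range(3)]
for sample, label in stream:
    update_memory(memory, sample, label, capacity=6)

counts = Counter(lbl for _, lbl in memory)
print(counts[0], counts[1])  # 3 3 -- the buffer ends up balanced
```

Even this crude rule ends with an even 3/3 split despite the 8/3 stream; OCDM replaces the greedy rule with a principled optimization that also handles the multi-label coupling between classes.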

ML-16-标题 Smart Active Sampling to enhance Quality Assurance Efficiency

链接: https://arxiv.org/abs/2209.11464
作者: Clemens Heistracher, Stefan Stricker, Pedro Casas, Daniel Schall, Jana Kemnitz


Abstract: We propose a new sampling strategy, called smart active sampling, for quality inspections outside the production line. Based on the principles of active learning, a machine learning model decides which samples are sent to quality inspection. On the one hand, this minimizes the production of scrap parts due to earlier detection of quality violations. On the other hand, quality inspection costs are reduced for smooth operation.


ML-17-标题 A Preliminary Investigation of MLOps Practices in GitHub

链接: https://arxiv.org/abs/2209.11453
作者: Fabio Calefato, Filippo Lanubile, Luigi Quaranta
备注: Presented at ESEM '22, the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement


Abstract: Background. The rapid and growing popularity of machine learning (ML) applications has led to an increasing interest in MLOps, that is, the practice of continuous integration and deployment (CI/CD) of ML-enabled systems. Aims. Since changes may affect not only the code but also the ML model parameters and the data themselves, the automation of traditional CI/CD needs to be extended to manage model retraining in production. Method. In this paper, we present an initial investigation of the MLOps practices implemented in a set of ML-enabled systems retrieved from GitHub, focusing on GitHub Actions and CML, two solutions to automate the development workflow. Results. Our preliminary results suggest that the adoption of MLOps workflows in open-source GitHub projects is currently rather limited. Conclusions. Issues are also identified, which can guide future research work.


ML-18-标题 A Robust and Explainable Data-Driven Anomaly Detection Approach For Power Electronics

链接: https://arxiv.org/abs/2209.11427
作者: Alexander Beattie, Pavol Mulinka, Subham Sahoo, Ioannis T. Christou, Charalampos Kalalas, Daniel Gutierrez-Rojas, Pedro H. J. Nardelli


Abstract: Timely and accurate detection of anomalies in power electronics is becoming increasingly critical for maintaining complex production systems. Robust and explainable strategies help decrease system downtime and preempt or mitigate infrastructure cyberattacks. This work begins by explaining the types of uncertainty present in current datasets and machine learning algorithm outputs. Three techniques for combating these uncertainties are then introduced and analyzed. We further present two anomaly detection and classification approaches, namely the Matrix Profile algorithm and anomaly transformer, which are applied in the context of a power electronic converter dataset. Specifically, the Matrix Profile algorithm is shown to be well suited as a generalizable approach for detecting real-time anomalies in streaming time-series data. The STUMPY python library implementation of the iterative Matrix Profile is used for the creation of the detector. A series of custom filters is created and added to the detector to tune its sensitivity, recall, and detection accuracy. Our numerical results show that, with simple parameter tuning, the detector provides high accuracy and performance in a variety of fault scenarios.
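The Matrix Profile underlying the detector can be sketched in a few lines: for every length-m window of the series, record the z-normalized Euclidean distance to its nearest non-trivial neighbor; windows with unusually large values (discords) are anomaly candidates. This naive O(n^2) version is only illustrative of the definition — STUMPY implements the fast algorithms (e.g. `stumpy.stump`) used in practice.

```python
import math

def znorm(seq):
    """Z-normalize a subsequence (zero mean, unit variance)."""
    mu = sum(seq) / len(seq)
    sd = math.sqrt(sum((s - mu) ** 2 for s in seq) / len(seq)) or 1.0
    return [(s - mu) / sd for s in seq]

def matrix_profile(ts, m):
    """Naive matrix profile: distance from each length-m subsequence to its
    nearest neighbor outside a trivial-match exclusion zone."""
    subs = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    profile = []
    for i, a in enumerate(subs):
        best = math.inf
        for j, b in enumerate(subs):
            if abs(i - j) >= m // 2 + 1:  # skip trivial self-matches
                d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
                best = min(best, d)
        profile.append(best)
    return profile

# Repetitive signal (period 10) with one injected anomaly at index 25.
ts = [math.sin(2 * math.pi * n / 10) for n in range(60)]
ts[25] += 3.0
profile = matrix_profile(ts, m=8)
discord = max(range(len(profile)), key=profile.__getitem__)
print(18 <= discord <= 25)  # the discord window overlaps the injected anomaly
```

Normal windows find a near-identical neighbor one period away, so their profile values are near zero; only windows covering the spike stand out, which is why thresholding or ranking the profile works as a streaming anomaly detector.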


ML-19-标题 LEADER: Learning Attention over Driving Behaviors for Planning under Uncertainty

链接: https://arxiv.org/abs/2209.11422
作者: Mohamad H. Danesh, Panpan Cai, David Hsu
备注: CoRL 2022 (oral)


Abstract: Uncertainty on human behaviors poses a significant challenge to autonomous driving in crowded urban environments. The partially observable Markov decision processes (POMDPs) offer a principled framework for planning under uncertainty, often leveraging Monte Carlo sampling to achieve online performance for complex tasks. However, sampling also raises safety concerns by potentially missing critical events. To address this, we propose a new algorithm, LEarning Attention over Driving bEhavioRs (LEADER), that learns to attend to critical human behaviors during planning. LEADER learns a neural network generator to provide attention over human behaviors in real-time situations. It integrates the attention into a belief-space planner, using importance sampling to bias reasoning towards critical events. To train the algorithm, we let the attention generator and the planner form a min-max game. By solving the min-max game, LEADER learns to perform risk-aware planning without human labeling.


ML-20-标题 Relation Embedding based Graph Neural Networks for Handling Heterogeneous Graph

链接: https://arxiv.org/abs/2209.11414
作者: Junfu Wang, Yuanfang Guo, Liang Yang, Yunhong Wang


Abstract: Heterogeneous graph learning has drawn significant attentions in recent years, due to the success of graph neural networks (GNNs) and the broad applications of heterogeneous information networks. Various heterogeneous graph neural networks have been proposed to generalize GNNs for processing the heterogeneous graphs. Unfortunately, these approaches model the heterogeneity via various complicated modules. This paper aims to propose a simple yet efficient framework to make the homogeneous GNNs have adequate ability to handle heterogeneous graphs. Specifically, we propose Relation Embedding based Graph Neural Networks (RE-GNNs), which employ only one parameter per relation to embed the importance of edge type relations and self-loop connections. To optimize these relation embeddings and the other parameters simultaneously, a gradient scaling factor is proposed to constrain the embeddings to converge to suitable values. Besides, we theoretically demonstrate that our RE-GNNs have more expressive power than the meta-path based heterogeneous GNNs. Extensive experiments on the node classification tasks validate the effectiveness of our proposed method.


ML-21-标题 Achieve the Minimum Width of Neural Networks for Universal Approximation

链接: https://arxiv.org/abs/2209.11395
作者: Yongqiang Cai


Abstract: The universal approximation property (UAP) of neural networks is fundamental for deep learning, and it is well known that wide neural networks are universal approximators of continuous functions within both the L^p norm and the continuous/uniform norm. However, the exact minimum width, w_min, for the UAP has not been studied thoroughly. Recently, using a decoder-memorizer-encoder scheme, Park et al. [2021] found that w_min = max(d_x+1, d_y) for both the L^p-UAP of ReLU networks and the C-UAP of ReLU+STEP networks, where d_x, d_y are the input and output dimensions, respectively. In this paper, we consider neural networks with an arbitrary set of activation functions. We prove that both C-UAP and L^p-UAP for functions on compact domains share a universal lower bound on the minimal width; that is, w*_min = max(d_x, d_y). In particular, the critical width, w*_min, for L^p-UAP can be achieved by leaky-ReLU networks, provided that the input or output dimension is larger than one. Our construction is based on the approximation power of neural ordinary differential equations and the ability to approximate flow maps by neural networks. The nonmonotone or discontinuous activation function case and the one-dimensional case are also discussed.


ML-22-标题 Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

链接: https://arxiv.org/abs/2209.11379
作者: Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, Justin Gilmer


Abstract: Recent research has proposed a series of specialized optimization algorithms for deep multi-task models. It is often claimed that these multi-task optimization (MTO) methods yield solutions that are superior to the ones found by simply optimizing a weighted average of the task losses. In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empirical validity of these claims. We show that, despite the added design and computational complexity of these algorithms, MTO methods do not yield any performance improvements beyond what is achievable via traditional optimization approaches. We highlight alternative strategies that consistently yield improvements to the performance profile and point out common training pitfalls that might cause suboptimal results. Finally, we outline challenges in reliably evaluating the performance of MTO algorithms and discuss potential solutions.


ML-23-标题 A Jensen-Shannon Divergence Based Loss Function for Bayesian Neural Networks

链接: https://arxiv.org/abs/2209.11366
作者: Ponkrshnan Thiagarajan, Susanta Ghosh
备注: To be submitted for peer review in IEEE


Abstract: Kullback-Leibler (KL) divergence is widely used for variational inference of Bayesian Neural Networks (BNNs). However, the KL divergence has limitations such as unboundedness and asymmetry. We examine the Jensen-Shannon (JS) divergence that is more general, bounded, and symmetric. We formulate a novel loss function for BNNs based on the geometric JS divergence and show that the conventional KL divergence-based loss function is its special case. We evaluate the divergence part of the proposed loss function in a closed form for a Gaussian prior. For any other general prior, Monte Carlo approximations can be used. We provide algorithms for implementing both of these cases. We demonstrate that the proposed loss function offers an additional parameter that can be tuned to control the degree of regularisation. We derive the conditions under which the proposed loss function regularises better than the KL divergence-based loss function for Gaussian priors and posteriors. We demonstrate performance improvements over the state-of-the-art KL divergence-based BNN on the classification of a noisy CIFAR data set and a biased histopathology data set.

摘要:Kullback-Leibler(KL)散度被广泛用于贝叶斯神经网络(BNN)的变分推断。然而,KL散度存在无界性和不对称性等局限。我们研究了更一般、有界且对称的Jensen-Shannon(JS)散度,基于几何JS散度为BNN构建了一个新的损失函数,并证明传统的基于KL散度的损失函数是它的特例。对于高斯先验,我们以闭式形式计算所提损失函数中的散度项;对于其他一般先验,则可以使用蒙特卡洛近似。我们给出了实现这两种情况的算法。我们证明所提损失函数提供了一个可调节的附加参数,用于控制正则化程度,并推导了在高斯先验和后验下该损失函数比基于KL散度的损失函数正则化效果更好的条件。在带噪声的CIFAR数据集和有偏的组织病理学数据集的分类任务上,我们的方法相比最先进的基于KL散度的BNN取得了性能提升。
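
For intuition, the closed-form pieces the abstract mentions are easy to sketch for univariate Gaussians. The following is an illustrative reconstruction, not the paper's code: the Gaussian KL divergence in closed form, the normalized geometric mean of two Gaussians (which is again Gaussian), and a skew-geometric JS divergence built from them. The function names and the skew parameter `a` are our own.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) in closed form."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def geometric_mean_gauss(m1, s1, m2, s2, a):
    """Normalized geometric mean p^(1-a) * q^a of two Gaussians (again Gaussian)."""
    prec = (1 - a) / s1**2 + a / s2**2          # precision of the geometric mean
    var = 1.0 / prec
    mean = var * ((1 - a) * m1 / s1**2 + a * m2 / s2**2)
    return mean, math.sqrt(var)

def js_geometric(m1, s1, m2, s2, a=0.5):
    """Skew-geometric JS divergence: (1-a)*KL(p||G_a) + a*KL(q||G_a)."""
    mg, sg = geometric_mean_gauss(m1, s1, m2, s2, a)
    return (1 - a) * kl_gauss(m1, s1, mg, sg) + a * kl_gauss(m2, s2, mg, sg)
```

At `a = 0.5` this quantity is symmetric in its two arguments, unlike the KL divergence, which is exactly the asymmetry the paper sets out to avoid.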

ML-24-标题 Convolutional Learning on Multigraphs

链接: https://arxiv.org/abs/2209.11354
作者: Landon Butler, Alejandro Parada-Mayorga, Alejandro Ribeiro


Abstract: Graph convolutional learning has led to many exciting discoveries in diverse areas. However, in some applications, traditional graphs are insufficient to capture the structure and intricacies of the data. In such scenarios, multigraphs arise naturally as discrete structures in which complex dynamics can be embedded. In this paper, we develop convolutional information processing on multigraphs and introduce convolutional multigraph neural networks (MGNNs). To capture the complex dynamics of information diffusion within and across each of the multigraph’s classes of edges, we formalize a convolutional signal processing model, defining the notions of signals, filtering, and frequency representations on multigraphs. Leveraging this model, we develop a multigraph learning architecture, including a sampling procedure to reduce computational complexity. The introduced architecture is applied towards optimal wireless resource allocation and a hate speech localization task, offering improved performance over traditional graph neural networks.
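
To make the filtering notion concrete, here is a minimal sketch (ours, not the paper's implementation) of a convolutional filter on a multigraph with two edge classes, represented by adjacency matrices `A1` and `A2`: the output mixes powers of both shift operators with filter taps `h[k][l]`.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def multigraph_filter(A1, A2, x, h):
    """y = sum_{k,l} h[k][l] * A1^k A2^l x  (toy two-edge-class multigraph filter)."""
    n = len(x)
    y = [0.0] * n
    z_l = x[:]                      # A2^l x, built up over l
    for l in range(len(h[0])):
        w = z_l[:]                  # A1^k A2^l x, built up over k
        for k in range(len(h)):
            y = [yi + h[k][l] * wi for yi, wi in zip(y, w)]
            w = matvec(A1, w)
        z_l = matvec(A2, z_l)
    return y
```

With a single tap `h = [[1]]` the filter is the identity, and `h = [[0], [1]]` applies one hop over the first edge class, mirroring how ordinary graph convolutions reduce to polynomials in a single shift operator.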


ML-25-标题 StyleTime: Style Transfer for Synthetic Time Series Generation

链接: https://arxiv.org/abs/2209.11306
作者: Yousef El-Laham, Svitlana Vyetrenko


Abstract: Neural style transfer is a powerful computer vision technique that can incorporate the artistic “style” of one image to the “content” of another. The underlying theory behind the approach relies on the assumption that the style of an image is represented by the Gram matrix of its features, which is typically extracted from pre-trained convolutional neural networks (e.g., VGG-19). This idea does not straightforwardly extend to time series stylization since notions of style for two-dimensional images are not analogous to notions of style for one-dimensional time series. In this work, a novel formulation of time series style transfer is proposed for the purpose of synthetic data generation and enhancement. We introduce the concept of stylized features for time series, which is directly related to the time series realism properties, and propose a novel stylization algorithm, called StyleTime, that uses explicit feature extraction techniques to combine the underlying content (trend) of one time series with the style (distributional properties) of another. Further, we discuss evaluation metrics, and compare our work to existing state-of-the-art time series generation and augmentation schemes. To validate the effectiveness of our methods, we use stylized synthetic data as a means for data augmentation to improve the performance of recurrent neural network models on several forecasting tasks.
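
The decomposition into content (trend) and style (distributional properties) can be illustrated with a deliberately simple stand-in for the paper's explicit feature extraction: `stylize` below keeps one series' moving-average trend and rescales its residuals to match another series' residual statistics. This is a toy sketch of the idea, not the StyleTime algorithm itself.

```python
import statistics

def moving_average(x, w=3):
    """Simple centered trend estimate (edges use a shrinking window)."""
    n = len(x)
    return [statistics.fmean(x[max(0, i - w): min(n, i + w + 1)]) for i in range(n)]

def stylize(content, style, w=3):
    """Toy stylization: content trend + residuals rescaled to style residual stats."""
    ct = moving_average(content, w)
    st = moving_average(style, w)
    c_res = [a - b for a, b in zip(content, ct)]
    s_res = [a - b for a, b in zip(style, st)]
    c_sd = statistics.pstdev(c_res) or 1.0
    s_sd = statistics.pstdev(s_res)
    s_mu = statistics.fmean(s_res)
    return [t + s_mu + (r / c_sd) * s_sd for t, r in zip(ct, c_res)]
```

The output follows the content series' trend while its fluctuations carry the style series' residual mean and spread, which is the shape of the content/style split the abstract describes.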


ML-26-标题 An Investigation of the Bias-Variance Tradeoff in Meta-Gradients

链接: https://arxiv.org/abs/2209.11303
作者: Risto Vuorio, Jacob Beck, Shimon Whiteson, Jakob Foerster, Gregory Farquhar


Abstract: Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. In this context, prior work has investigated the estimation of the Hessian of the RL objective, as well as tackling the problem of credit assignment to pre-adaptation behavior by making a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always adds bias and can also add variance to meta-gradient estimation. Meanwhile, meta-gradient estimation has been studied less in the important long-horizon setting, where backpropagation through the full inner optimization trajectories is not feasible. We study the bias and variance tradeoff arising from truncated backpropagation and sampling correction, and additionally compare to evolution strategies, which is a recently popular alternative strategy to long-horizon meta-learning. While prior work implicitly chooses points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study that relates existing estimators to each other.


ML-27-标题 Scalable Gaussian Process Hyperparameter Optimization via Coverage Regularization

链接: https://arxiv.org/abs/2209.11280
作者: Killian Wood, Alec M. Dunton, Amanda Muyskens, Benjamin W. Priest
备注: 4 pages content, 3 figures, 6 tables


Abstract: Gaussian processes (GPs) are Bayesian non-parametric models popular in a variety of applications due to their accuracy and native uncertainty quantification (UQ). Tuning GP hyperparameters is critical to ensure the validity of prediction accuracy and uncertainty; uniquely estimating multiple hyperparameters in, e.g. the Matern kernel can also be a significant challenge. Moreover, training GPs on large-scale datasets is a highly active area of research: traditional maximum likelihood hyperparameter training requires quadratic memory to form the covariance matrix and has cubic training complexity. To address the scalable hyperparameter tuning problem, we present a novel algorithm which estimates the smoothness and length-scale parameters in the Matern kernel in order to improve robustness of the resulting prediction uncertainties. Using novel loss functions similar to those in conformal prediction algorithms in the computational framework provided by the hyperparameter estimation algorithm MuyGPs, we achieve improved UQ over leave-one-out likelihood maximization while maintaining a high degree of scalability as demonstrated in numerical experiments.
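
For reference, the half-integer Matern kernels whose smoothness (`nu`) and length-scale parameters the paper estimates have simple closed forms. This sketch is our code, using the standard formulas for the three common cases.

```python
import math

def matern(r, nu, lengthscale=1.0, variance=1.0):
    """Matern kernel k(r) for half-integer smoothness nu in {0.5, 1.5, 2.5}."""
    s = r / lengthscale
    if nu == 0.5:                               # exponential kernel
        return variance * math.exp(-s)
    if nu == 1.5:
        t = math.sqrt(3.0) * s
        return variance * (1.0 + t) * math.exp(-t)
    if nu == 2.5:
        t = math.sqrt(5.0) * s
        return variance * (1.0 + t + t * t / 3.0) * math.exp(-t)
    raise ValueError("only nu in {0.5, 1.5, 2.5} implemented")
```

Larger `nu` yields smoother sample paths, which is why jointly identifying `nu` and the length-scale from data is the estimation challenge the abstract highlights.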


ML-28-标题 Environment Optimization for Multi-Agent Navigation

链接: https://arxiv.org/abs/2209.11279
作者: Zhan Gao, Amanda Prorok


Abstract: Traditional approaches to the design of multi-agent navigation algorithms consider the environment as a fixed constraint, despite the obvious influence of spatial constraints on agents’ performance. Yet hand-designing improved environment layouts and structures is inefficient and potentially expensive. The goal of this paper is to consider the environment as a decision variable in a system-level optimization problem, where both agent performance and environment cost can be accounted for. We begin by proposing a novel environment optimization problem. We show, through formal proofs, under which conditions the environment can change while guaranteeing completeness (i.e., all agents reach their navigation goals). Our solution leverages a model-free reinforcement learning approach. In order to accommodate a broad range of implementation scenarios, we include both online and offline optimization, and both discrete and continuous environment representations. Numerical results corroborate our theoretical findings and validate our approach.


ML-29-标题 Minimizing Human Assistance: Augmenting a Single Demonstration for Deep Reinforcement Learning

链接: https://arxiv.org/abs/2209.11275
作者: Abraham George, Alison Bartsch, Amir Barati Farimani
备注: 7 pages, 11 figures


Abstract: The use of human demonstrations in reinforcement learning has proven to significantly improve agent performance. However, any requirement for a human to manually ‘teach’ the model is somewhat antithetical to the goals of reinforcement learning. This paper attempts to minimize human involvement in the learning process while still retaining the performance advantages by using a single human example collected through a simple-to-use virtual reality simulation to assist with RL training. Our method augments a single demonstration to generate numerous human-like demonstrations that, when combined with Deep Deterministic Policy Gradients and Hindsight Experience Replay (DDPG + HER), significantly improve training time on simple tasks and allows the agent to solve a complex task (block stacking) that DDPG + HER alone cannot solve. The model achieves this significant training advantage using a single human example, requiring less than a minute of human input.

摘要:事实证明,在强化学习中使用人类演示可以显著提升智能体的性能。然而,任何需要人工"教授"模型的要求都与强化学习的目标有所相悖。本文尝试在保留性能优势的同时,尽量减少学习过程中的人类参与:仅使用通过一个易用的虚拟现实仿真采集的单次人类演示来辅助RL训练。我们的方法对单次演示进行增广,生成大量类似人类的演示;将其与深度确定性策略梯度和事后经验回放(DDPG + HER)结合后,可显著缩短简单任务的训练时间,并使智能体能够解决DDPG + HER单独无法解决的复杂任务(积木堆叠)。该模型仅凭一个人类示例、不到一分钟的人工输入即可获得这一显著的训练优势。
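
A minimal sketch of the augmentation idea (not the authors' exact procedure, which generates "human-like" variations from a VR-collected demonstration): jitter the recorded actions of the single demonstration to produce many slightly different copies. The noise scale here is a hypothetical parameter of our own.

```python
import random

def augment_demo(demo, n_copies=100, noise=0.01, seed=0):
    """Jitter a single demonstration's per-step actions into many variant demos."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_copies):
        out.append([[a + rng.gauss(0.0, noise) for a in step] for step in demo])
    return out
```

Each copy preserves the structure of the original trajectory while differing slightly in its actions, giving the replay buffer the diversity that a set of separate human demonstrations would otherwise provide.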

ML-30-标题 Artificial Intelligence in Material Engineering: A review on applications of AI in Material Engineering

链接: https://arxiv.org/abs/2209.11234
作者: Lipichanda Goswami, Manoj Deka, Mohendra Roy
备注: V1


Abstract: Recently, there has been extensive use of artificial Intelligence (AI) in the field of material engineering. This can be attributed to the development of high performance computing and thereby feasibility to test deep learning models with large parameters. In this article we tried to review some of the latest developments in the applications of AI in material engineering.


ML-31-标题 Multidimensional Interactive Fixed-Effects

链接: https://arxiv.org/abs/2209.11691
作者: Hugo Freeman


Abstract: This paper studies a linear and additively separable model for multidimensional panel data of three or more dimensions with unobserved interactive fixed effects. Two approaches are considered to account for these unobserved interactive fixed-effects when estimating coefficients on the observed covariates. First, the model is embedded within the standard two-dimensional panel framework and restrictions are derived under which the factor structure methods in Bai (2009) lead to consistent estimation of model parameters. The second approach considers group fixed-effects and kernel methods that are more robust to the multidimensional nature of the problem. Theoretical results and simulations show the benefit of standard two-dimensional panel methods when the structure of the interactive fixed-effect term is known, but also highlight how the group fixed-effects and kernel methods perform well without knowledge of this structure. The methods are implemented to estimate the demand elasticity for beer under a handful of models for demand.
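
As a hedged illustration of the kind of model the abstract describes (notation ours, not necessarily the paper's), a three-dimensional panel with an interactive fixed-effects error term can be written as:

```latex
% Three-dimensional panel with unobserved interactive fixed effects (illustrative)
y_{ijt} = \beta' x_{ijt} + \lambda_{ij}' f_t + \varepsilon_{ijt},
\qquad i = 1,\dots,N,\; j = 1,\dots,M,\; t = 1,\dots,T,
```

where the loadings λ_{ij} and factors f_t are unobserved. Stacking the (i, j) pairs into a single cross-sectional index recovers a two-dimensional factor model of the Bai (2009) form, which is the embedding the first approach in the abstract exploits.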


ML-32-标题 Exact conservation laws for neural network integrators of dynamical systems

链接: https://arxiv.org/abs/2209.11661
作者: Eike Hermann Müller
备注: 21 pages, 16 figures; submitted to Journal of Computational Physics


Abstract: The solution of time dependent differential equations with neural networks has attracted a lot of attention recently. The central idea is to learn the laws that govern the evolution of the solution from data, which might be polluted with random noise. However, in contrast to other machine learning applications, usually a lot is known about the system at hand. For example, for many dynamical systems physical quantities such as energy or (angular) momentum are exactly conserved. Hence, the neural network has to learn these conservation laws from data and they will only be satisfied approximately due to finite training time and random noise. In this paper we present an alternative approach which uses Noether’s Theorem to inherently incorporate conservation laws into the architecture of the neural network. We demonstrate that this leads to better predictions for three model systems: the motion of a non-relativistic particle in a three-dimensional Newtonian gravitational potential, the motion of a massive relativistic particle in the Schwarzschild metric and a system of two interacting particles in four dimensions.
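
The Noether construction the abstract relies on can be stated compactly in its textbook form; the network architecture is then built so that the learned dynamics inherit the symmetry and hence the conserved charge.

```latex
% Noether's theorem: if L(q, \dot{q}) is invariant under q -> q + \epsilon K(q),
% the charge below is conserved along every solution of the Euler-Lagrange equations.
Q = \frac{\partial L}{\partial \dot{q}} \cdot K(q) = \text{const.}
% Example: rotational invariance of a central potential conserves angular momentum
% \mathbf{L} = \mathbf{q} \times \mathbf{p}.
```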


ML-33-标题 Differentiable physics-enabled closure modeling for Burgers turbulence

链接: https://arxiv.org/abs/2209.11614
作者: Varun Shankar, Vedant Puri, Ramesh Balakrishnan, Romit Maulik, Venkatasubramanian Viswanathan


Abstract: Data-driven turbulence modeling is experiencing a surge in interest following algorithmic and hardware developments in the data sciences. We discuss an approach using the differentiable physics paradigm that combines known physics with machine learning to develop closure models for Burgers’ turbulence. We consider the 1D Burgers system as a prototypical test problem for modeling the unresolved terms in advection-dominated turbulence problems. We train a series of models that incorporate varying degrees of physical assumptions on an a posteriori loss function to test the efficacy of models across a range of system parameters, including viscosity, time, and grid resolution. We find that constraining models with inductive biases in the form of partial differential equations that contain known physics or existing closure approaches produces highly data-efficient, accurate, and generalizable models, outperforming state-of-the-art baselines. Addition of structure in the form of physics information also brings a level of interpretability to the models, potentially offering a stepping stone to the future of closure modeling.
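
For context, the standard LES form of the 1D problem (not copied from the paper): filtering Burgers' equation with a spatial filter leaves an unclosed subgrid stress, and it is this term that the learned closure models must supply.

```latex
% Filtered 1D Burgers equation; \tau is the unclosed subgrid term to be modeled.
\frac{\partial \bar{u}}{\partial t}
  + \frac{1}{2}\frac{\partial \bar{u}^2}{\partial x}
  = \nu \frac{\partial^2 \bar{u}}{\partial x^2}
  - \frac{\partial \tau}{\partial x},
\qquad
\tau = \frac{1}{2}\left(\overline{u^2} - \bar{u}^2\right).
```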


ML-34-标题 Power Management in Smart Residential Building with Deep Learning Model for Occupancy Detection by Usage Pattern of Electric Appliances

链接: https://arxiv.org/abs/2209.11520
作者: Sangkeum Lee, Sarvar Hussain Nengroo, Hojun Jin, Yoonmee Doh, Chungho Lee, Taewook Heo, Dongsoo Har
备注: 11 pages, 7 figures, to be submitted to 7th International Conference on Renewable Energy and Conservation, ICREC 2022


Abstract: With the growth of smart building applications, occupancy information in residential buildings is becoming more and more significant. In the context of the smart buildings’ paradigm, this kind of information is required for a wide range of purposes, including enhancing energy efficiency and occupant comfort. In this study, occupancy detection in residential building is implemented using deep learning based on technical information of electric appliances. To this end, a novel approach of occupancy detection for smart residential building system is proposed. The dataset of electric appliances, sensors, light, and HVAC, which is measured by smart metering system and is collected from 50 households, is used for simulations. To classify the occupancy among datasets, the support vector machine and autoencoder algorithm are used. Confusion matrix is utilized for accuracy, precision, recall, and F1 to demonstrate the comparative performance of the proposed method in occupancy detection. The proposed algorithm achieves occupancy detection using technical information of electric appliances by 95.7~98.4%. To validate occupancy detection data, principal component analysis and the t-distributed stochastic neighbor embedding (t-SNE) algorithm are employed. Power consumption with renewable energy system is reduced to 11.1~13.1% in smart buildings by using occupancy detection.
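
The abstract's evaluation metrics follow directly from confusion-matrix counts; for reference, a minimal sketch (our code, standard definitions):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it penalizes a classifier that trades one for the other, which matters for imbalanced occupied/unoccupied labels.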


ML-35-标题 Error Mitigation-Aided Optimization of Parameterized Quantum Circuits: Convergence Analysis

链接: https://arxiv.org/abs/2209.11514
作者: Sharu Theresa Jose, Osvaldo Simeone
备注: Submitted for journal publication


Abstract: Variational quantum algorithms (VQAs) offer the most promising path to obtaining quantum advantages via noisy intermediate-scale quantum (NISQ) processors. Such systems leverage classical optimization to tune the parameters of a parameterized quantum circuit (PQC). The goal is minimizing a cost function that depends on measurement outputs obtained from the PQC. Optimization is typically implemented via stochastic gradient descent (SGD). On NISQ computers, gate noise due to imperfections and decoherence affects the stochastic gradient estimates by introducing a bias. Quantum error mitigation (QEM) techniques can reduce the estimation bias without requiring any increase in the number of qubits, but they in turn cause an increase in the variance of the gradient estimates. This work studies the impact of quantum gate noise on the convergence of SGD for the variational eigensolver (VQE), a fundamental instance of VQAs. The main goal is ascertaining conditions under which QEM can enhance the performance of SGD for VQEs. It is shown that quantum gate noise induces a non-zero error-floor on the convergence error of SGD (evaluated with respect to a reference noiseless PQC), which depends on the number of noisy gates, the strength of the noise, as well as the eigenspectrum of the observable being measured and minimized. In contrast, with QEM, any arbitrarily small error can be obtained. Furthermore, for error levels attainable with or without QEM, QEM can reduce the number of required iterations, but only as long as the quantum noise level is sufficiently small, and a sufficiently large number of measurements is allowed at each SGD iteration. Numerical examples for a max-cut problem corroborate the main theoretical findings.

摘要:变分量子算法(VQA)为通过含噪中等规模量子(NISQ)处理器获得量子优势提供了最有希望的途径。此类系统利用经典优化来调节参数化量子电路(PQC)的参数,目标是最小化依赖于PQC测量输出的成本函数。优化通常通过随机梯度下降(SGD)实现。在NISQ计算机上,由于器件缺陷和退相干产生的门噪声会给随机梯度估计引入偏差。量子误差缓解(QEM)技术可以在不增加量子比特数量的情况下减小估计偏差,但代价是梯度估计的方差增大。本工作研究量子门噪声对变分量子本征求解器(VQE,VQA的一个基本实例)中SGD收敛性的影响,主要目标是确定QEM在何种条件下能够提升VQE中SGD的性能。结果表明,量子门噪声会使SGD的收敛误差(相对于无噪声的参考PQC评估)产生非零的误差下限,该下限取决于含噪门的数量、噪声强度,以及被测量和最小化的可观测量的本征谱。相反,使用QEM可以达到任意小的误差。此外,对于使用或不使用QEM均可达到的误差水平,只要量子噪声水平足够小,并且每次SGD迭代允许足够多的测量次数,QEM就能减少所需的迭代次数。一个最大割问题的数值示例证实了主要理论结果。

ML-36-标题 Image Classification using Sequence of Pixels

链接: https://arxiv.org/abs/2209.11495
作者: Gajraj Kuldeep


Abstract: This study compares sequential image classification methods based on recurrent neural networks. We describe methods based on recurrent neural networks such as Long-Short-Term memory(LSTM), bidirectional Long-Short-Term memory(BiLSTM) architectures, etc. We also review the state-of-the-art sequential image classification architectures. We mainly focus on LSTM, BiLSTM, temporal convolution network, and independent recurrent neural network architecture in the study. It is known that RNN lacks in learning long-term dependencies in the input sequence. We use a simple feature construction method using orthogonal Ramanujan periodic transform on the input sequence. Experiments demonstrate that if these features are given to LSTM or BiLSTM networks, the performance increases drastically. Our focus in this study is to increase the training accuracy simultaneously reducing the training time for the LSTM and BiLSTM architecture, but not on pushing the state-of-the-art results, so we use simple LSTM/BiLSTM architecture. We compare sequential input with the constructed feature as input to single layer LSTM and BiLSTM network for MNIST and CIFAR datasets. We observe that sequential input to the LSTM network with 128 hidden unit training for five epochs results in training accuracy of 33% whereas constructed features as input to the same LSTM network results in training accuracy of 90% with 1/3 lesser time.
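
The feature construction rests on Ramanujan sums. As a toy illustration (our code; the paper's orthogonal Ramanujan periodic transform involves more structure), the sum c_q(n) and a naive projection of a pixel sequence onto a few periodic components:

```python
import math

def ramanujan_sum(q, n):
    """Ramanujan sum c_q(n) = sum over a coprime to q of cos(2*pi*a*n/q)."""
    return round(sum(math.cos(2 * math.pi * a * n / q)
                     for a in range(1, q + 1) if math.gcd(a, q) == 1), 10)

def ramanujan_features(seq, periods=(1, 2, 3, 4)):
    """Project a pixel sequence onto Ramanujan periodic components (toy features)."""
    return [sum(x * ramanujan_sum(q, n) for n, x in enumerate(seq)) for q in periods]
```

Known identities such as c_1(n) = 1, c_2(n) = (-1)^n and c_q(0) = φ(q) (Euler's totient) make this easy to sanity-check; the resulting features expose periodic structure that a plain pixel sequence fed to an LSTM would have to discover on its own.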


ML-37-标题 Computational Discovery of Energy-Efficient Heat Treatment for Microstructure Design using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2209.11259
作者: Jaber R. Mianroodi, Nima H. Siboni, Dierk Raabe


Abstract: Deep Reinforcement Learning (DRL) is employed to develop autonomously optimized and custom-designed heat-treatment processes that are both, microstructure-sensitive and energy efficient. Different from conventional supervised machine learning, DRL does not rely on static neural network training from data alone, but a learning agent autonomously develops optimal solutions, based on reward and penalty elements, with reduced or no supervision. In our approach, a temperature-dependent Allen-Cahn model for phase transformation is used as the environment for the DRL agent, serving as the model world in which it gains experience and takes autonomous decisions. The agent of the DRL algorithm is controlling the temperature of the system, as a model furnace for heat-treatment of alloys. Microstructure goals are defined for the agent based on the desired microstructure of the phases. After training, the agent can generate temperature-time profiles for a variety of initial microstructure states to reach the final desired microstructure state. The agent’s performance and the physical meaning of the heat-treatment profiles generated are investigated in detail. In particular, the agent is capable of controlling the temperature to reach the desired microstructure starting from a variety of initial conditions. This capability of the agent in handling a variety of conditions paves the way for using such an approach also for recycling-oriented heat treatment process design where the initial composition can vary from batch to batch, due to impurity intrusion, and also for the design of energy-efficient heat treatments. For testing this hypothesis, an agent without penalty on the total consumed energy is compared with one that considers energy costs. The energy cost penalty is imposed as an additional criterion on the agent for finding the optimal temperature-time profile.

摘要:本文采用深度强化学习(DRL)来开发自主优化、按需定制的热处理工艺,使其既对微观组织敏感又节能。与传统的有监督机器学习不同,DRL不只依赖于从数据出发的静态神经网络训练,而是由学习智能体基于奖励和惩罚机制、在很少或没有监督的情况下自主寻找最优解。在我们的方法中,一个温度相关的Allen-Cahn相变模型充当DRL智能体的环境,即智能体在其中积累经验并作出自主决策的模型世界。DRL算法的智能体控制系统温度,相当于一台用于合金热处理的模型炉。微观组织目标根据各相的期望微观组织为智能体设定。训练完成后,智能体可以针对各种初始微观组织状态生成温度-时间曲线,以达到最终期望的微观组织状态。我们详细研究了智能体的表现以及所生成热处理曲线的物理意义。特别地,该智能体能够从各种初始条件出发,通过控制温度达到期望的微观组织。智能体处理多种初始条件的能力,也为将这一方法用于面向回收的热处理工艺设计(由于杂质混入,初始成分可能因批次而异)以及节能热处理的设计铺平了道路。为检验这一假设,我们将不对总能耗施加惩罚的智能体与考虑能耗成本的智能体进行了比较;能耗惩罚作为附加准则,用于寻找最优的温度-时间曲线。
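
The two agents compared at the end of the abstract differ only in the reward. A hedged sketch of such a reward (our simplification, with a scalar phase-fraction error standing in for the microstructure objective):

```python
def reward(phase_fraction, target, energy_step, energy_weight=0.0):
    """Reward = -microstructure error - optional energy penalty (toy shaping)."""
    return -abs(phase_fraction - target) - energy_weight * energy_step
```

With `energy_weight = 0` the agent is rewarded only for reaching the target microstructure; a positive weight adds the energy-cost criterion, trading off how aggressively the furnace temperature is driven against the energy consumed per step.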

ML-38-标题 Assessing Robustness of EEG Representations under Data-shifts via Latent Space and Uncertainty Analysis

链接: https://arxiv.org/abs/2209.11233
作者: Neeraj Wagh, Jionghao Wei, Samarth Rawal, Brent M. Berry, Yogatheesan Varatharajah
备注: Preprint under review


Abstract: The recent availability of large datasets in bio-medicine has inspired the development of representation learning methods for multiple healthcare applications. Despite advances in predictive performance, the clinical utility of such methods is limited when exposed to real-world data. Here we develop model diagnostic measures to detect potential pitfalls during deployment without assuming access to external data. Specifically, we focus on modeling realistic data shifts in electrophysiological signals (EEGs) via data transforms, and extend the conventional task-based evaluations with analyses of a) model’s latent space and b) predictive uncertainty, under these transforms. We conduct experiments on multiple EEG feature encoders and two clinically relevant downstream tasks using publicly available large-scale clinical EEGs. Within this experimental setting, our results suggest that measures of latent space integrity and model uncertainty under the proposed data shifts may help anticipate performance degradation during deployment.


ML-39-标题 DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

链接: https://arxiv.org/abs/2209.10797
作者: Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim
备注: Extension of HOTCHIPS 2022 and accepted in MICRO 2022


Abstract: Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. The conventional platforms such as GPU are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.

摘要:Transformer是一种广泛用于数据中心自然语言处理(NLP)服务的深度学习语言模型。在Transformer模型中,生成式预训练Transformer(GPT)在文本生成即自然语言生成(NLG)中取得了卓越的性能;NLG需要在摘要阶段处理很长的输入上下文,随后在生成阶段每次产生一个词。GPU等传统平台擅长摘要阶段大规模输入的并行处理,但由于生成阶段的串行特性,其性能显著下降。因此,需要一个高效的硬件平台来解决文本生成的串行特性所导致的高延迟。本文提出DFX,一种多FPGA加速设备,可在摘要和生成两个阶段以低延迟和高吞吐量端到端地执行GPT-2模型推理。DFX采用模型并行和针对模型与硬件协同优化的数据流,以便在多设备间快速并行地执行工作负载。其计算核心基于自定义指令运行,端到端地提供GPT-2所需的运算。我们在四块Xilinx Alveo U280 FPGA上实现了所提出的硬件架构,充分利用高带宽存储器(HBM)的全部通道和最大数量的计算资源以获得高硬件效率。在现代GPT-2模型上,DFX相比四块NVIDIA V100 GPU实现了5.58倍的加速和3.99倍的能效提升;其性价比也比GPU设备高8.21倍,表明DFX是云数据中心文本生成工作负载的一个有前景的解决方案。


CV-0-标题 Lightweight Transformers for Human Activity Recognition on Mobile Devices

链接: https://arxiv.org/abs/2209.11750
作者: Sannara EK, François Portet, Philippe Lalanda


Abstract: Human Activity Recognition (HAR) on mobile devices has shown to be achievable with lightweight neural models learned from data generated by the user’s inertial measurement units (IMUs). Most approaches for instanced-based HAR have used Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTMs), or a combination of the two to achieve state-of-the-art results with real-time performances. Recently, the Transformers architecture in the language processing domain and then in the vision domain has pushed further the state-of-the-art over classical architectures. However, such Transformers architecture is heavyweight in computing resources, which is not well suited for embedded applications of HAR that can be found in the pervasive computing domain. In this study, we present Human Activity Recognition Transformer (HART), a lightweight, sensor-wise transformer architecture that has been specifically adapted to the domain of the IMUs embedded on mobile devices. Our experiments on HAR tasks with several publicly available datasets show that HART uses fewer FLoating-point Operations Per Second (FLOPS) and parameters while outperforming current state-of-the-art results. Furthermore, we present evaluations across various architectures on their performances in heterogeneous environments and show that our models can better generalize on different sensing devices or on-body positions.

摘要:研究表明,利用从用户惯性测量单元(IMU)数据中学习的轻量级神经模型,可以在移动设备上实现人类活动识别(HAR)。大多数基于实例的HAR方法使用卷积神经网络(CNN)、长短期记忆网络(LSTM)或两者的组合,以实时的性能达到最先进的结果。近来,Transformer架构先后在语言处理领域和视觉领域将最先进水平推进到经典架构之上。然而,这种Transformer架构对计算资源的需求很大,并不适合普适计算领域中HAR的嵌入式应用。在本研究中,我们提出了人类活动识别Transformer(HART),一种轻量级、按传感器划分的Transformer架构,专门针对移动设备上嵌入的IMU领域进行了适配。我们在多个公开数据集上的HAR任务实验表明,HART使用更少的每秒浮点运算量(FLOPS)和参数,同时超越了当前的最先进结果。此外,我们评估了各种架构在异构环境中的性能,并表明我们的模型能够更好地泛化到不同的传感设备或佩戴位置。

CV-1-标题 Adaptive-SpikeNet: Event-based Optical Flow Estimation using Spiking Neural Networks with Learnable Neuronal Dynamics

链接: https://arxiv.org/abs/2209.11741
作者: Adarsh Kumar Kosta, Kaushik Roy


Abstract: Event-based cameras have recently shown great potential for high-speed motion estimation owing to their ability to capture temporally rich information asynchronously. Spiking Neural Networks (SNNs), with their neuro-inspired event-driven processing can efficiently handle such asynchronous data, while neuron models such as the leaky-integrate and fire (LIF) can keep track of the quintessential timing information contained in the inputs. SNNs achieve this by maintaining a dynamic state in the neuron memory, retaining important information while forgetting redundant data over time. Thus, we posit that SNNs would allow for better performance on sequential regression tasks compared to similarly sized Analog Neural Networks (ANNs). However, deep SNNs are difficult to train due to vanishing spikes at later layers. To that effect, we propose an adaptive fully-spiking framework with learnable neuronal dynamics to alleviate the spike vanishing problem. We utilize surrogate gradient-based backpropagation through time (BPTT) to train our deep SNNs from scratch. We validate our approach for the task of optical flow estimation on the Multi-Vehicle Stereo Event-Camera (MVSEC) dataset and the DSEC-Flow dataset. Our experiments on these datasets show an average reduction of 13% in average endpoint error (AEE) compared to state-of-the-art ANNs. We also explore several down-scaled models and observe that our SNN models consistently outperform similarly sized ANNs offering 10%-16% lower AEE. These results demonstrate the importance of SNNs for smaller models and their suitability at the edge. In terms of efficiency, our SNNs offer substantial savings in network parameters (48x) and computational energy (51x) while attaining ~10% lower EPE compared to the state-of-the-art ANN implementations.

摘要:基于事件的相机能够异步捕获时间上丰富的信息,因而近来在高速运动估计方面展现出巨大潜力。脉冲神经网络(SNN)凭借其受神经启发的事件驱动处理方式,可以高效处理此类异步数据;而诸如泄漏积分发放(LIF)之类的神经元模型则能够追踪输入中蕴含的关键时序信息。SNN通过在神经元内部维持动态状态来实现这一点:保留重要信息,同时随时间遗忘冗余数据。因此,我们认为在序列回归任务上,SNN会比规模相当的模拟神经网络(ANN)表现更好。然而,由于深层的脉冲会逐渐消失,深度SNN难以训练。为此,我们提出了一个具有可学习神经元动力学的自适应全脉冲框架,以缓解脉冲消失问题,并利用基于替代梯度的时间反向传播(BPTT)从零开始训练深度SNN。我们在多车辆立体事件相机(MVSEC)数据集和DSEC-Flow数据集上验证了该方法在光流估计任务上的有效性。实验表明,与最先进的ANN相比,我们的方法平均端点误差(AEE)平均降低13%。我们还探索了若干缩小规模的模型,并观察到我们的SNN模型始终优于规模相当的ANN,AEE低10%-16%。这些结果表明SNN对小模型及边缘部署的重要性。在效率方面,与最先进的ANN实现相比,我们的SNN大幅节省了网络参数(48倍)和计算能耗(51倍),同时EPE降低约10%。
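
The LIF dynamics the abstract refers to are simple to sketch; here the leak factor is fixed, whereas in the paper it is one of the learnable neuronal parameters. A minimal sketch, not the authors' implementation:

```python
def lif_run(inputs, leak=0.9, threshold=1.0):
    """Leaky-integrate-and-fire neuron: v <- leak*v + I, spike and reset at threshold."""
    v, spikes = 0.0, []
    for current in inputs:
        v = leak * v + current          # leaky integration of input current
        if v >= threshold:
            spikes.append(1)
            v = 0.0                     # hard reset after a spike
        else:
            spikes.append(0)
    return spikes
```

The membrane state `v` is the dynamic memory the abstract describes: it accumulates evidence from the input stream while the leak gradually forgets old, redundant data.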

CV-2-标题 On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks

链接: https://arxiv.org/abs/2209.11740
作者: Hubert Leterme (UGA, LJK), Kévin Polisano (UGA, LJK), Valérie Perrier (Grenoble INP, LJK), Karteek Alahari (LJK)


Abstract: In this paper, we aim to improve the mathematical interpretability of convolutional neural networks for image classification. When trained on natural image datasets, such networks tend to learn parameters in the first layer that closely resemble oriented Gabor filters. By leveraging the properties of discrete Gabor-like convolutions, we prove that, under specific conditions, feature maps computed by the subsequent max pooling operator tend to approximate the modulus of complex Gabor-like coefficients, and as such, are stable with respect to certain input shifts. We then compute a probabilistic measure of shift invariance for these layers. More precisely, we show that some filters, depending on their frequency and orientation, are more likely than others to produce stable image representations. We experimentally validate our theory by considering a deterministic feature extractor based on the dual-tree wavelet packet transform, a particular case of discrete Gabor-like decomposition. We demonstrate a strong correlation between shift invariance on the one hand and similarity with complex modulus on the other hand.


CV-3-标题 Catoptric Light can be Dangerous: Effective Physical-World Attack by Natural Phenomenon

链接: https://arxiv.org/abs/2209.11739
作者: Chengyin Hu, Weiwen Shi
备注: arXiv admin note: substantial text overlap with arXiv:2209.09652, arXiv:2209.02430


Abstract: Deep neural networks (DNNs) have achieved great success in many tasks. Therefore, it is crucial to evaluate the robustness of advanced DNNs. The traditional methods use stickers as physical perturbations to fool the classifiers, which is difficult to achieve stealthiness and there exists printing loss. Some new types of physical attacks use light beam to perform attacks (e.g., laser, projector), whose optical patterns are artificial rather than natural. In this work, we study a new type of physical attack, called adversarial catoptric light (AdvCL), in which adversarial perturbations are generated by common natural phenomena, catoptric light, to achieve stealthy and naturalistic adversarial attacks against advanced DNNs in physical environments. Carefully designed experiments demonstrate the effectiveness of the proposed method in simulated and real-world environments. The attack success rate is 94.90% in a subset of ImageNet and 83.50% in the real-world environment. We also discuss some of AdvCL’s transferability and defense strategy against this attack.


CV-4-标题 Semantic scene descriptions as an objective of human vision

链接: https://arxiv.org/abs/2209.11737
作者: Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Ian Charest


Abstract: Interpreting the meaning of a visual scene requires not only identification of its constituent objects, but also a rich semantic characterization of object interrelations. Here, we study the neural mechanisms underlying visuo-semantic transformations by applying modern computational techniques to a large-scale 7T fMRI dataset of human brain responses elicited by complex natural scenes. Using semantic embeddings obtained by applying linguistic deep learning models to human-generated scene descriptions, we identify a widely distributed network of brain regions that encode semantic scene descriptions. Importantly, these semantic embeddings better explain activity in these regions than traditional object category labels. In addition, they are effective predictors of activity despite the fact that the participants did not actively engage in a semantic task, suggesting that visuo-semantic transformations are a default mode of vision. In support of this view, we then show that highly accurate reconstructions of scene captions can be directly linearly decoded from patterns of brain activity. Finally, a recurrent convolutional neural network trained on semantic embeddings further outperforms semantic embeddings in predicting brain activity, providing a mechanistic model of the brain’s visuo-semantic transformations. Together, these experimental and computational results suggest that transforming visual input into rich semantic scene descriptions may be a central objective of the visual system, and that focusing efforts on this new objective may lead to improved models of visual information processing in the human brain.


CV-5-标题 Boost CTR Prediction for New Advertisements via Modeling Visual Content

链接: https://arxiv.org/abs/2209.11727
作者: Tan Yu, Zhipeng Jin, Jie Liu, Yi Yang, Hongliang Fei, Ping Li


Abstract: Existing advertisements click-through rate (CTR) prediction models are mainly dependent on behavior ID features, which are learned based on the historical user-ad interactions. Nevertheless, behavior ID features relying on historical user behaviors are not feasible to describe new ads without previous interactions with users. To overcome the limitations of behavior ID features in modeling new ads, we exploit the visual content in ads to boost the performance of CTR prediction models. Specifically, we map each ad into a set of visual IDs based on its visual content. These visual IDs are further used for generating the visual embedding for enhancing CTR prediction models. We formulate the learning of visual IDs into a supervised quantization problem. Due to a lack of class labels for commercial images in advertisements, we exploit image textual descriptions as the supervision to optimize the image extractor for generating effective visual IDs. Meanwhile, since the hard quantization is non-differentiable, we soften the quantization operation to make it support the end-to-end network training. After mapping each image into visual IDs, we learn the embedding for each visual ID based on the historical user-ad interactions accumulated in the past. Since the visual ID embedding depends only on the visual content, it generalizes well to new ads. Meanwhile, the visual ID embedding complements the ad behavior ID embedding. Thus, it can considerably boost the performance of the CTR prediction models previously relying on behavior ID features for both new ads and ads that have accumulated rich user behaviors. After incorporating the visual ID embedding in the CTR prediction model of Baidu online advertising, the average CTR of ads improves by 1.46%, and the total charge increases by 1.10%.
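The softened quantization step can be sketched as a temperature-controlled softmax over distances to the visual-ID codebook. This is a minimal illustrative parametrization, not Baidu's production implementation; the abstract only states that the hard quantization is softened to support end-to-end training.

```python
import numpy as np

def soft_visual_ids(features, codebook, tau=0.1):
    """Differentiable soft assignment of image features to a visual-ID codebook:
    softmax over negative squared distances, controlled by a temperature tau."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)        # rows: soft one-hot visual IDs

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))
feats = codebook[[2, 5]] + 0.01 * rng.standard_normal((2, 4))
w = soft_visual_ids(feats, codebook)
print(w.argmax(axis=1))   # → [2 5]  (hard IDs recovered at low temperature)
```

At low temperature the soft codes approach hard one-hot assignments, while remaining differentiable with respect to both the features and the codebook.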


CV-6-标题 Multilevel Robustness for 2D Vector Field Feature Tracking, Selection and Comparison

链接: https://arxiv.org/abs/2209.11708
作者: Lin Yan, Paul Aaron Ullrich, Luke P. Van Roekel, Bei Wang, Hanqi Guo


Abstract: Critical point tracking is a core topic in scientific visualization for understanding the dynamic behavior of time-varying vector field data. The topological notion of robustness has been introduced recently to quantify the structural stability of critical points, that is, the robustness of a critical point is the minimum amount of perturbation to the vector field necessary to cancel it. A theoretical basis has been established previously that relates critical point tracking with the notion of robustness, in particular, critical points could be tracked based on their closeness in stability, measured by robustness, instead of just distance proximities within the domain. However, in practice, the computation of classic robustness may produce artifacts when a critical point is close to the boundary of the domain; thus, we do not have a complete picture of the vector field behavior within its local neighborhood. To alleviate these issues, we introduce a multilevel robustness framework for the study of 2D time-varying vector fields. We compute the robustness of critical points across varying neighborhoods to capture the multiscale nature of the data and to mitigate the boundary effect suffered by the classic robustness computation. We demonstrate via experiments that such a new notion of robustness can be combined seamlessly with existing feature tracking algorithms to improve the visual interpretability of vector fields in terms of feature tracking, selection, and comparison for large-scale scientific simulations. We observe, for the first time, that the minimum multilevel robustness is highly correlated with physical quantities used by domain scientists in studying a real-world tropical cyclone dataset. Such observation helps to increase the physical interpretability of robustness.


CV-7-标题 Multivariate Wasserstein Functional Connectivity for Autism Screening

链接: https://arxiv.org/abs/2209.11703
作者: Oleg Kachan, Alexander Bernstein


Abstract: Most approaches to the estimation of brain functional connectivity from the functional magnetic resonance imaging (fMRI) data rely on computing some measure of statistical dependence, or more generally, a distance between univariate representative time series of regions of interest (ROIs) consisting of multiple voxels. However, summarizing a ROI’s multiple time series with its mean or the first principal component (1PC) may result in the loss of information as, for example, 1PC explains only a small fraction of the variance of the multivariate signal of the neuronal activity. We propose to compare ROIs directly, without the use of representative time series, defining a new measure of multivariate connectivity between ROIs, not necessarily consisting of the same number of voxels, based on the Wasserstein distance. We assess the proposed Wasserstein functional connectivity measure on the autism screening task, demonstrating its superiority over commonly used univariate and multivariate functional connectivity measures.
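For intuition: between two equal-size point sets with uniform weights, the 2-Wasserstein distance reduces exactly to an assignment problem. The brute-force sketch below (practical only for tiny ROIs, and only one possible reading of the construction; real solvers such as the Hungarian algorithm scale far better) demonstrates two defining properties.

```python
import numpy as np
from itertools import permutations

def wasserstein2(roi_a, roi_b):
    """Exact 2-Wasserstein distance between two equal-size sets of voxel time
    series with uniform weights, via brute-force assignment over permutations."""
    cost = ((roi_a[:, None, :] - roi_b[None, :, :]) ** 2).sum(-1)
    n = len(roi_a)
    best = min(sum(cost[i, p[i]] for i in range(n)) for p in permutations(range(n)))
    return (best / n) ** 0.5

rng = np.random.default_rng(0)
roi = rng.standard_normal((4, 6))       # 4 voxels, 6 time points
shift = np.full(6, 0.5)
print(wasserstein2(roi, roi + shift))   # equals ||shift|| for a pure translation
```

The self-distance is zero, and a pure translation of the whole voxel set moves the distance by exactly the translation norm, which a mean- or 1PC-based summary would also capture; the Wasserstein distance additionally reacts to any reshaping of the voxel cloud.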


CV-8-标题 Edge-oriented Implicit Neural Representation with Channel Tuning

链接: https://arxiv.org/abs/2209.11697
作者: Wonjoon Chang, Dahee Kwon, Bumjin Park


Abstract: Implicit neural representation, which expresses an image as a continuous function rather than a discrete grid form, is widely used for image processing. Despite its outperforming results, limitations remain in restoring the clear shapes of a given signal, such as the edges of an image. In this paper, we propose a Gradient Magnitude Adjustment algorithm, which calculates the gradient of an image for training the implicit representation. In addition, we propose an Edge-oriented Representation Network (EoREN) that can reconstruct the image with clear edges by fitting gradient information (Edge-oriented module). Furthermore, we add a Channel-tuning module to adjust the distribution of given signals so that it solves a chronic problem of fitting gradients. By separating the backpropagation paths of the two modules, EoREN can learn the true color of the image without hindering the role of gradients. We qualitatively show that our model can reconstruct complex signals and demonstrate the general reconstruction ability of our model with quantitative results.
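The gradient target itself is simple to compute. A sketch of a per-pixel gradient magnitude via central differences follows; using `np.gradient` is an assumption for illustration, since the abstract does not specify the exact operator the Edge-oriented module fits against.

```python
import numpy as np

def gradient_magnitude(img):
    """Per-pixel gradient magnitude via central differences:
    the kind of edge target an implicit representation could be trained on."""
    gy, gx = np.gradient(img.astype(float))   # derivatives along rows, columns
    return np.hypot(gx, gy)

img = np.zeros((8, 8))
img[:, 4:] = 1.0                  # a vertical step edge
m = gradient_magnitude(img)
print(m[0])                       # response concentrated on the edge columns
```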


CV-9-标题 Dynamic camera alignment optimization problem based on Fractal Decomposition based Algorithm

链接: https://arxiv.org/abs/2209.11695
作者: Arcadi Llanza, Nadiya Shvai, Amir Nakib


Abstract: In this work, we tackle the Dynamic Optimization Problem (DOP) of image alignment (IA) in a real-world application using a Dynamic Optimization Algorithm (DOA) called the Fractal Decomposition Algorithm (FDA), introduced recently. We used FDA to perform IA on a CCTV camera feed from a tunnel. As the camera viewpoint can change for multiple reasons, such as wind or maintenance, the alignment is required to guarantee the correct functioning of the video-based traffic security system.


CV-10-标题 Rate-Distortion in Image Coding for Machines

链接: https://arxiv.org/abs/2209.11694
作者: Alon Harell, Anderson De Andrade, Ivan V. Bajic


Abstract: In recent years, there has been a sharp increase in transmission of images to remote servers specifically for the purpose of computer vision. In many applications, such as surveillance, images are mostly transmitted for automated analysis, and rarely seen by humans. Using traditional compression for this scenario has been shown to be inefficient in terms of bit-rate, likely due to the focus on human based distortion metrics. Thus, it is important to create specific image coding methods for joint use by humans and machines. One way to create the machine side of such a codec is to perform feature matching of some intermediate layer in a Deep Neural Network performing the machine task. In this work, we explore the effects of the layer choice used in training a learnable codec for humans and machines. We prove, using the data processing inequality, that matching features from deeper layers is preferable in the sense of rate-distortion. Next, we confirm our findings empirically by re-training an existing model for scalable human-machine coding. In our experiments we show the trade-off between the human and machine sides of such a scalable model, and discuss the benefit of using deeper layers for training in that regard.


CV-11-标题 T3VIP: Transformation-based 3D Video Prediction

链接: https://arxiv.org/abs/2209.11693
作者: Iman Nematollahi, Erick Rosete-Beas, Seyed Mahdi B. Azad, Raghu Rajan, Frank Hutter, Wolfram Burgard
备注: Accepted at the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)


Abstract: For autonomous skill acquisition, robots have to learn about the physical rules governing the 3D world dynamics from their own past experience to predict and reason about plausible future outcomes. To this end, we propose a transformation-based 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals. To fully leverage all the 2D and 3D observational signals, we equip our model with automatic hyperparameter optimization (HPO) to interpret the best way of learning from them. To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera. Our extensive evaluation with simulated and real-world datasets demonstrates that our formulation leads to interpretable 3D models that predict future depth videos while achieving on-par performance with 2D models on RGB video prediction. Moreover, we demonstrate that our model outperforms 2D baselines on visuomotor control. Videos, code, dataset, and pre-trained models are available at this http URL.


CV-12-标题 Meteorological Satellite Images Prediction Based on Deep Multi-scales Extrapolation Fusion

链接: https://arxiv.org/abs/2209.11682
作者: Fang Huang, Wencong Cheng, PanFeng Wang, ZhiGang Wang, HongHong He


Abstract: Meteorological satellite imagery is critical for meteorologists. The data have played an important role in monitoring and analyzing weather and climate changes. However, satellite imagery is a kind of observational data, and there is a significant time delay when transmitting the data back to Earth. It is important to make accurate predictions for meteorological satellite images, especially nowcasting predictions up to 2 hours ahead. In recent years, there has been growing interest in research on nowcasting prediction applications of weather radar images based on deep learning. Compared to the weather radar images prediction problem, the main challenge for meteorological satellite images prediction is the large-scale observation areas and therefore the large sizes of the observation products. Here we present a deep multi-scales extrapolation fusion method to address the challenge of meteorological satellite images nowcasting prediction. First, we downsample the original large-size satellite image dataset to several image datasets with smaller resolutions, then use a deep spatiotemporal sequence prediction method to generate the multi-scale prediction images at the different resolutions separately. Second, we fuse the multi-scale prediction results into the target prediction images at the original size using a conditional generative adversarial network. The experiments based on the FY-4A meteorological satellite data show that the proposed method can generate realistic prediction images that effectively capture the evolutions of the weather systems in detail. We believe that the general idea of this work can be potentially applied to other spatiotemporal sequence prediction tasks with a large size.
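The first stage can be sketched as repeated 2×2 average pooling that builds the lower-resolution inputs for the per-scale extrapolation models. The pooling operator is an assumption; the abstract only states that the full-size dataset is downsampled.

```python
import numpy as np

def multiscale_downsample(frame, n_scales=3):
    """Build a multi-resolution pyramid of a satellite frame by repeated
    2x2 average pooling (one illustrative choice of downsampling operator)."""
    pyramid = [frame]
    for _ in range(n_scales - 1):
        h, w = pyramid[-1].shape
        p = pyramid[-1][:h // 2 * 2, :w // 2 * 2]              # crop to even size
        pyramid.append(p.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid

pyr = multiscale_downsample(np.arange(64, dtype=float).reshape(8, 8))
print([p.shape for p in pyr])   # → [(8, 8), (4, 4), (2, 2)]
```

Each scale is then predicted separately, and the conditional GAN in the second stage fuses the per-scale predictions back to the original resolution.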


CV-13-标题 An Overview of Violence Detection Techniques: Current Challenges and Future Directions

链接: https://arxiv.org/abs/2209.11680
作者: Nadia Mumtaz, Naveed Ejaz, Shabana Habib, Syed Muhammad Mohsin, Prayag Tiwari, Shahab S. Band, Neeraj Kumar
备注: Artificial Intelligence Review


Abstract: The Big Video Data generated in today’s smart cities has raised concerns from a purposeful-usage perspective, where surveillance cameras, among many others, are the most prominent resources contributing to the huge volumes of data, making automated analysis a difficult task in terms of computation and preciseness. Violence Detection (VD), broadly falling under the Action and Activity recognition domain, is used to analyze Big Video data for anomalous actions caused by humans. The VD literature is traditionally based on manually engineered features, though advancements to deep learning based standalone models have been developed for real-time VD analysis. This paper focuses on an overview of deep sequence learning approaches along with localization strategies for the detected violence. The overview also dives into the initial image processing and machine learning-based VD literature and their possible advantages, such as efficiency, against the current complex models. Furthermore, the datasets are discussed to provide an analysis of the current models, explaining their pros and cons, with future directions in the VD domain derived from an in-depth analysis of the previous methods.


CV-14-标题 PNeRF: Probabilistic Neural Scene Representations for Uncertain 3D Visual Mapping

链接: https://arxiv.org/abs/2209.11677
作者: Yassine Ahmine, Arnab Dey, Andrew I. Comport
备注: 7 Pages, 6 Figures, 5 Tables. Submitted to IEEE International Conference on Robotics and Automation 2023 (ICRA 2023)


Abstract: Recently neural scene representations have provided very impressive results for representing 3D scenes visually, however, their study and progress have mainly been limited to visualization of virtual models in computer graphics or scene reconstruction in computer vision without explicitly accounting for sensor and pose uncertainty. Using this novel scene representation in robotics applications, however, would require accounting for this uncertainty in the neural map. The aim of this paper is therefore to propose a novel method for training probabilistic neural scene representations with uncertain training data that could enable the inclusion of these representations in robotics applications. Acquiring images using cameras or depth sensors contains inherent uncertainty, and furthermore, the camera poses used for learning a 3D model are also imperfect. If these measurements are used for training without accounting for their uncertainty, then the resulting models are non-optimal, and the resulting scene representations are likely to contain artifacts such as blur and uneven geometry. In this work, the problem of uncertainty integration to the learning process is investigated by focusing on training with uncertain information in a probabilistic manner. The proposed method involves explicitly augmenting the training likelihood with an uncertainty term such that the learnt probability distribution of the network is minimized with respect to the training uncertainty. It will be shown that this leads to more accurate image rendering quality, in addition to more precise and consistent geometry. Validation has been carried out on both synthetic and real datasets showing that the proposed approach outperforms state-of-the-art methods. The results show notably that the proposed method is capable of rendering novel high-quality views even when the training data is limited.
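A minimal sketch of such an uncertainty-augmented objective, assuming a heteroscedastic Gaussian likelihood (the paper's exact term is not reproduced here): residuals on uncertain measurements are penalized less, while a log-sigma term prevents the model from inflating the uncertainty for free.

```python
import numpy as np

def uncertain_nll(pred, target, sigma):
    """Heteroscedastic Gaussian negative log-likelihood: each residual is
    weighted by its measurement uncertainty sigma, plus a log(sigma) penalty."""
    pred, target, sigma = map(np.asarray, (pred, target, sigma))
    return float(np.mean(0.5 * ((pred - target) / sigma) ** 2 + np.log(sigma)))

# the same large residual costs less when the measurement is known to be uncertain
print(uncertain_nll(3.0, 0.0, 3.0), uncertain_nll(3.0, 0.0, 1.0))
```

Minimizing this loss over sigma also recovers the familiar fixed point sigma = |residual|, which is how such objectives can jointly calibrate uncertainty while fitting the scene.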


CV-15-标题 Image-to-Image Translation for Autonomous Driving from Coarsely-Aligned Image Pairs

链接: https://arxiv.org/abs/2209.11673
作者: Youya Xia, Josephine Monica, Wei-Lun Chao, Bharath Hariharan, Kilian Q Weinberger, Mark Campbell
备注: Submitted to the International Conference on Robotics and Automation (ICRA) 2023


Abstract: A self-driving car must be able to reliably handle adverse weather conditions (e.g., snowy) to operate safely. In this paper, we investigate the idea of turning sensor inputs (i.e., images) captured in an adverse condition into a benign one (i.e., sunny), upon which the downstream tasks (e.g., semantic segmentation) can attain high accuracy. Prior work primarily formulates this as an unpaired image-to-image translation problem due to the lack of paired images captured under the exact same camera poses and semantic layouts. While perfectly-aligned images are not available, one can easily obtain coarsely-paired images. For instance, many people drive the same routes daily in both good and adverse weather; thus, images captured at close-by GPS locations can form a pair. Though data from repeated traversals are unlikely to capture the same foreground objects, we posit that they provide rich contextual information to supervise the image translation model. To this end, we propose a novel training objective leveraging coarsely-aligned image pairs. We show that our coarsely-aligned training scheme leads to a better image translation quality and improved downstream tasks, such as semantic segmentation, monocular depth estimation, and visual localization.
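Coarse pairing by GPS proximity can be sketched as a nearest-neighbour query with a distance cutoff. The function name, cutoff, and coordinates below are illustrative assumptions, not from the paper.

```python
import numpy as np

def coarse_pairs(gps_good, gps_adverse, max_dist):
    """Pair each adverse-weather frame with the nearest good-weather frame by
    GPS position, keeping only pairs closer than max_dist."""
    d = np.linalg.norm(gps_adverse[:, None, :] - gps_good[None, :, :], axis=-1)
    nn = d.argmin(axis=1)                          # nearest good-weather frame
    keep = d[np.arange(len(nn)), nn] < max_dist
    return [(int(i), int(nn[i])) for i in np.flatnonzero(keep)]

good = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])       # sunny-route GPS fixes
adverse = np.array([[0.05, 0.0], [1.9, 0.1], [9.0, 9.0]])   # snowy-route GPS fixes
print(coarse_pairs(good, adverse, max_dist=0.5))   # → [(0, 0), (1, 2)]
```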


CV-16-标题 View-Invariant Skeleton-based Action Recognition via Global-Local Contrastive Learning

链接: https://arxiv.org/abs/2209.11634
作者: Cunling Bian, Wei Feng, Fanbo Meng, Song Wang


Abstract: Skeleton-based human action recognition has been drawing more interest recently due to its low sensitivity to appearance changes and the accessibility of more skeleton data. However, even the 3D skeletons captured in practice are still sensitive to the viewpoint and direction, given the occlusion of different human-body joints and the errors in human joint localization. Such view variance of skeleton data may significantly affect the performance of action recognition. To address this issue, we propose in this paper a new view-invariant representation learning approach, without any manual action labeling, for skeleton-based human action recognition. Specifically, we leverage the multi-view skeleton data simultaneously taken for the same person in the network training, by maximizing the mutual information between the representations extracted from different views, and then propose a global-local contrastive loss to model the multi-scale co-occurrence relationships in both spatial and temporal domains. Extensive experimental results show that the proposed method is robust to the view difference of the input skeleton data and significantly boosts the performance of unsupervised skeleton-based human action methods, resulting in new state-of-the-art accuracies on two challenging multi-view benchmarks of PKUMMD and NTU RGB+D.
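Maximizing mutual information between views is commonly implemented with an InfoNCE-style contrastive loss. The generic sketch below (not the paper's exact global-local loss) shows the key behaviour: matched multi-view embeddings score lower than mismatched ones.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE contrastive loss between two views' embeddings; minimizing it
    maximizes a lower bound on the mutual information between the views."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                        # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return float(-np.log(np.diag(p)).mean())        # matched pairs on the diagonal

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))
print(info_nce(z, z) < info_nce(z, z[::-1]))   # matched views score lower
```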


CV-17-标题 I-SPLIT: Deep Network Interpretability for Split Computing

链接: https://arxiv.org/abs/2209.11607
作者: Federico Cunico, Luigi Capogrosso, Francesco Setti, Damiano Carra, Franco Fummi, Marco Cristani
备注: ICPR 2022


Abstract: This work makes a substantial step in the field of split computing, i.e., how to split a deep neural network to host its early part on an embedded device and the rest on a server. So far, potential split locations have been identified exploiting uniquely architectural aspects, i.e., based on the layer sizes. Under this paradigm, the efficacy of the split in terms of accuracy can be evaluated only after having performed the split and retrained the entire pipeline, making an exhaustive evaluation of all the plausible splitting points prohibitive in terms of time. Here we show that not only does the architecture of the layers matter, but so does the importance of the neurons contained therein. A neuron is important if its gradient with respect to the correct class decision is high. It follows that a split should be applied right after a layer with a high density of important neurons, in order to preserve the information flowing until then. Upon this idea, we propose Interpretable Split (I-SPLIT): a procedure that identifies the most suitable splitting points by providing, ahead of its actual implementation, a reliable prediction of how well the split will perform in terms of classification accuracy. As a further major contribution of I-SPLIT, we show that the best choice for the splitting point on a multiclass categorization problem depends also on which specific classes the network has to deal with. Exhaustive experiments have been carried out on two networks, VGG16 and ResNet-50, and three datasets, Tiny-Imagenet-200, notMNIST, and Chest X-Ray Pneumonia. The source code is available at this https URL.
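The importance signal is the gradient of the correct-class logit with respect to a candidate layer's units. For a toy ReLU network it can be written in closed form; the network, its sizes, and the mean-threshold used to call a neuron "important" are all illustrative assumptions here, not I-SPLIT's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # first layer (before the candidate split)
W2 = rng.standard_normal((3, 16))   # classification head (after the split)
x = rng.standard_normal(8)
true_class = 1

h = np.maximum(W1 @ x, 0.0)         # activations at the candidate split point
# gradient of the correct-class logit w.r.t. the pre-ReLU units at the split
# (the ReLU gate zeroes the gradient of inactive neurons):
grad_h = W2[true_class] * (h > 0)
importance = np.abs(grad_h)         # per-neuron importance score
density = float((importance > importance.mean()).mean())
print(density)                      # density of important neurons at this layer
```

Comparing this density across layers, without any retraining, is the kind of signal the procedure uses to rank candidate split points.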


CV-18-标题 Multi-Granularity Graph Pooling for Video-based Person Re-Identification

链接: https://arxiv.org/abs/2209.11584
作者: Honghu Pan, Yongyong Chen, Zhenyu He


Abstract: The video-based person re-identification (ReID) aims to identify the given pedestrian video sequence across multiple non-overlapping cameras. To aggregate the temporal and spatial features of the video samples, the graph neural networks (GNNs) are introduced. However, existing graph-based models, like STGCN, perform the mean/max pooling on node features to obtain the graph representation, which neglect the graph topology and node importance. In this paper, we propose the graph pooling network (GPNet) to learn the multi-granularity graph representation for the video retrieval, where the graph pooling layer is implemented to downsample the graph. We first construct a multi-granular graph, whose node features denote image embedding learned by backbone, and edges are established between the temporal and Euclidean neighborhood nodes. We then implement multiple graph convolutional layers to perform the neighborhood aggregation on the graphs. To downsample the graph, we propose a multi-head full attention graph pooling (MHFAPool) layer, which integrates the advantages of existing node clustering and node selection pooling methods. Specifically, MHFAPool takes the main eigenvector of full attention matrix as the aggregation coefficients to involve the global graph information in each pooled nodes. Extensive experiments demonstrate that our GPNet achieves the competitive results on four widely-used datasets, i.e., MARS, DukeMTMC-VideoReID, iLIDS-VID and PRID-2011.
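The eigenvector-based pooling can be sketched with a single head: build a full attention matrix over the nodes, take its principal (left) eigenvector as the aggregation coefficients, and form the weighted sum of node features. The dot-product attention and temperature are assumptions for illustration, not MHFAPool's exact multi-head formulation.

```python
import numpy as np

def attention_pool(node_feats, tau=1.0):
    """Pool a graph's node features using the principal eigenvector of a
    row-stochastic full attention matrix as the aggregation coefficients."""
    d = node_feats.shape[1]
    a = node_feats @ node_feats.T / (np.sqrt(d) * tau)
    a = np.exp(a - a.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)           # row-stochastic full attention
    vals, vecs = np.linalg.eig(a.T)             # left principal eigenvector
    w = np.abs(vecs[:, np.argmax(vals.real)].real)
    w /= w.sum()                                # global aggregation coefficients
    return w @ node_feats

rng = np.random.default_rng(0)
pooled = attention_pool(rng.standard_normal((5, 8)))
print(pooled.shape)   # one graph-level embedding
```

Because the attention matrix is positive and row-stochastic, Perron-Frobenius guarantees a positive principal eigenvector, so the coefficients reflect how strongly each node is attended to by the whole graph.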


CV-19-标题 Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network

链接: https://arxiv.org/abs/2209.11582
作者: Honghu Pan, Qiao Liu, Yongyong Chen, Yunqi He, Yuan Zheng, Feng Zheng, Zhenyu He


Abstract: Existing methods for video-based person re-identification (ReID) mainly learn the appearance feature of a given pedestrian via a feature extractor and a feature aggregator. However, the appearance models would fail when different pedestrians have similar appearances. Considering that different pedestrians have different walking postures and body proportions, we propose to learn the discriminative pose feature beyond the appearance feature for video retrieval. Specifically, we implement a two-branch architecture to separately learn the appearance feature and pose feature, and then concatenate them together for inference. To learn the pose feature, we first detect the pedestrian pose in each frame through an off-the-shelf pose detector, and construct a temporal graph using the pose sequence. We then exploit a recurrent graph convolutional network (RGCN) to learn the node embeddings of the temporal pose graph, which devises a global information propagation mechanism to simultaneously achieve the neighborhood aggregation of intra-frame nodes and message passing among inter-frame graphs. Finally, we propose a dual-attention method consisting of node-attention and time-attention to obtain the temporal graph representation from the node embeddings, where the self-attention mechanism is employed to learn the importance of each node and each frame. We verify the proposed method on three video-based ReID datasets, i.e., Mars, DukeMTMC and iLIDS-VID, whose experimental results demonstrate that the learned pose feature can effectively improve the performance of existing appearance models.


CV-20-标题 Towards Complete-View and High-Level Pose-based Gait Recognition

链接: https://arxiv.org/abs/2209.11577
作者: Honghu Pan, Yongyong Chen, Tingyang Xu, Yunqi He, Zhenyu He


Abstract: The model-based gait recognition methods usually adopt the pedestrian walking postures to identify human beings. However, existing methods did not explicitly resolve the large intra-class variance of human pose due to camera views changing. In this paper, we propose to generate multi-view pose sequences for each single-view pose sample by learning full-rank transformation matrices via a lower-upper generative adversarial network (LUGAN). From the prior of camera imaging, we derive that the spatial coordinates between cross-view poses satisfy a linear transformation of a full-rank matrix; thereby, this paper employs adversarial training to learn transformation matrices from the source pose and target views to obtain the target pose sequences. To this end, we implement a generator composed of graph convolutional (GCN) layers, fully connected (FC) layers and two-branch convolutional (CNN) layers: GCN layers and FC layers encode the source pose sequence and target view, then the CNN branches learn a lower triangular matrix and an upper triangular matrix, respectively; finally, they are multiplied to formulate the full-rank transformation matrix. For the purpose of adversarial training, we further devise a condition discriminator that distinguishes whether the pose sequence is true or generated. To enable the high-level correlation learning, we propose a plug-and-play module, named multi-scale hypergraph convolution (HGC), to replace the spatial graph convolutional layer in the baseline, which could simultaneously model the joint-level, part-level and body-level correlations. Extensive experiments on two large gait recognition datasets, i.e., CASIA-B and OUMVLP-Pose, demonstrate that our method outperforms the baseline model and existing pose-based methods by a large margin.
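The full-rank guarantee from multiplying the two branches' outputs is easy to verify when the triangular factors have unit diagonals. The unit diagonal is an assumption added here so that det(L) = det(U) = 1; the abstract only specifies lower and upper triangular matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# the two CNN branches' outputs, sketched as unit-diagonal triangular factors
L = np.tril(rng.standard_normal((n, n)), -1) + np.eye(n)   # lower-triangular branch
U = np.triu(rng.standard_normal((n, n)), 1) + np.eye(n)    # upper-triangular branch
T = L @ U                                # full-rank transformation matrix
print(np.linalg.matrix_rank(T))          # det(T) = det(L) * det(U) = 1, so rank n
```

Parameterizing a square matrix as an LU product is a compact way to let a network emit a transformation that is full rank by construction, rather than hoping an unconstrained output happens to be invertible.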


CV-21-标题 Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval

链接: https://arxiv.org/abs/2209.11572
作者: Xiang Fang, Daizong Liu, Pan Zhou, YuChong Hu


Abstract: As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully-annotated datasets are available in one domain (the "source domain"), but the domain of interest (the "target domain") only contains unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module is designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; (iii) a specific alignment module tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations.
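Aligning feature distributions across domains or modalities is often implemented with a Maximum Mean Discrepancy (MMD) criterion. Using an RBF-kernel MMD here is an assumption for illustration; MMCDA's exact alignment losses are not reproduced from this abstract.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Maximum Mean Discrepancy with an RBF kernel: zero iff the two sample
    sets come from the same distribution (in the kernel's RKHS sense)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.standard_normal((16, 4))    # "source domain" features
tgt_far = src + 5.0                   # unaligned "target domain" features
print(mmd_rbf(src, src), mmd_rbf(src, tgt_far))   # 0 vs. clearly positive
```

Minimizing such a discrepancy between source and target features (per modality, or between modalities in a joint embedding space) is the standard mechanism behind the three alignment modules described above.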

摘要:作为多媒体信息检索中日益流行的任务,视频时刻检索(VMR)旨在根据给定的语言查询,从未修剪的视频中定位目标时刻。以前的大多数方法严重依赖大量人工标注(即时刻边界),而这些标注在实践中获取成本极高。此外,由于不同数据集之间存在领域差距,将这些预训练模型直接应用于未见过的领域会导致显著的性能下降。在本文中,我们关注一项新任务:跨域VMR,其中一个领域("源域")拥有完全标注的数据集,而感兴趣的领域("目标域")仅包含未标注的数据集。据我们所知,我们提出了关于跨域VMR的首项研究。为了解决这一新任务,我们提出了一个新颖的多模态跨域对齐(MMCDA)网络,将标注知识从源域迁移到目标域。然而,由于源域与目标域之间的领域差异以及视频与查询之间的语义差距,将训练好的模型直接应用于目标域通常会导致性能下降。为了解决这个问题,我们开发了三个新颖的模块:(i)域对齐模块旨在对齐每种模态在不同领域之间的特征分布;(ii)跨模态对齐模块旨在将视频和查询特征映射到联合嵌入空间,并对齐目标域中不同模态之间的特征分布;(iii)特定对齐模块试图获得特定帧与给定查询之间的细粒度相似性,以实现最优定位。通过联合训练这三个模块,我们的MMCDA可以学习域不变且语义对齐的跨模态表示。
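
作为示意,域对齐模块常用的一种做法是最小化两个域特征分布之间的最大均值差异(MMD)。下面是一个基于RBF核的MMD平方的极简NumPy实现(仅为说明性示例,MMCDA实际使用的对齐损失可能不同):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel -- one standard
    way to align feature distributions across domains. The alignment losses
    used by MMCDA itself may differ; this is an illustrative choice."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 4))        # "source domain" features
tgt_far = rng.normal(3.0, 1.0, size=(100, 4))    # misaligned target features
tgt_near = rng.normal(0.0, 1.0, size=(100, 4))   # well-aligned target features
```

训练时将该值作为损失项最小化,即可把目标域特征分布拉向源域。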

CV-22-标题 Query-based Hard-Image Retrieval for Object Detection at Test Time

链接: https://arxiv.org/abs/2209.11559
作者: Edward Ayers, Jonathan Sadeghi, John Redford, Romain Mueller, Puneet K. Dokania


Abstract: There is a longstanding interest in capturing the error behaviour of object detectors by finding images where their performance is likely to be unsatisfactory. In real-world applications such as autonomous driving, it is also crucial to characterise potential failures beyond simple requirements of detection performance. For example, a missed detection of a pedestrian close to an ego vehicle will generally require closer inspection than a missed detection of a car in the distance. The problem of predicting such potential failures at test time has largely been overlooked in the literature and conventional approaches based on detection uncertainty fall short in that they are agnostic to such fine-grained characterisation of errors. In this work, we propose to reformulate the problem of finding “hard” images as a query-based hard image retrieval task, where queries are specific definitions of “hardness”, and offer a simple and intuitive method that can solve this task for a large family of queries. Our method is entirely post-hoc, does not require ground-truth annotations, is independent of the choice of a detector, and relies on an efficient Monte Carlo estimation that uses a simple stochastic model in place of the ground-truth. We show experimentally that it can be applied successfully to a wide variety of queries for which it can reliably identify hard images for a given detector without any labelled data. We provide results on ranking and classification tasks using the widely used RetinaNet, Faster-RCNN, Mask-RCNN, and Cascade Mask-RCNN object detectors.

摘要:长期以来,人们一直希望通过寻找对象检测器可能表现不佳的图像来刻画其错误行为。在自动驾驶等实际应用中,除了简单的检测性能要求之外,刻画潜在失效同样至关重要。例如,与漏检远处的汽车相比,漏检靠近自车的行人通常需要更仔细的检查。在测试时预测此类潜在失效的问题在文献中基本被忽视,而基于检测不确定性的传统方法也力有不逮,因为它们无法对错误进行这种细粒度的刻画。在这项工作中,我们建议将寻找"困难"图像的问题重新表述为基于查询的困难图像检索任务,其中查询是对"困难度"的特定定义,并提供一种简单直观的方法,可以为一大类查询解决此任务。我们的方法完全是事后的,不需要真值标注,与检测器的选择无关,并且依赖于一种高效的蒙特卡洛估计,该估计使用简单的随机模型代替真值。我们通过实验表明,该方法可以成功应用于各种查询,在没有任何标注数据的情况下,为给定检测器可靠地识别困难图像。我们使用广泛使用的RetinaNet、Faster-RCNN、Mask-RCNN和Cascade Mask-RCNN对象检测器给出了排序和分类任务的结果。
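
论文中的蒙特卡洛估计思想可以用一个极简示意来说明:在一个简单随机模型下,给定每个目标被漏检的概率,对"期望漏检数"这类查询做采样估计(函数名与具体分布均为本文的假设,并非论文原方法):

```python
import random

def mc_hardness(miss_probs, n_samples=10000, rng=None):
    """Estimate E[# missed objects] for one image by Monte Carlo,
    sampling each object's miss event independently -- a stand-in for
    the paper's simple stochastic model used in place of the ground-truth."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    total = 0
    for _ in range(n_samples):
        total += sum(rng.random() < p for p in miss_probs)
    return total / n_samples

# Analytic expectation is sum(miss_probs); the MC estimate should be close.
probs = [0.9, 0.2, 0.05]
est = mc_hardness(probs)
```

对更复杂的查询(例如"靠近自车的行人漏检数"),只需在采样循环中换用相应的计分函数即可,框架不变。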

CV-23-标题 MAGIC Mask-Guided Image Synthesis by Inverting a Quasi-Robust Classifier

链接: https://arxiv.org/abs/2209.11549
作者: Mozhdeh Rouhsedaghat, Masoud Monajatipoor, Kai-Wei Chang, C. -C. Jay Kuo, Iacopo Masi
备注: 12 pages, 9 figures, technical report


Abstract: We offer a method for one-shot image synthesis that allows controlling manipulations of a single image by inverting a quasi-robust classifier equipped with strong regularizers. Our proposed method, entitled Magic, samples structured gradients from a pre-trained quasi-robust classifier to better preserve the input semantics while preserving its classification accuracy, thereby guaranteeing credibility in the synthesis. Unlike current methods that use complex primitives to supervise the process or use attention maps as a weak supervisory signal, Magic aggregates gradients over the input, driven by a guide binary mask that enforces a strong, spatial prior. Magic implements a series of manipulations with a single framework achieving shape and location control, intense non-rigid shape deformations, and copy/move operations in the presence of repeating objects and gives users firm control over the synthesis by requiring simply specifying binary guide masks. Our study and findings are supported by various qualitative comparisons with the state-of-the-art on the same images sampled from ImageNet and quantitative analysis using machine perception along with a user survey of 100+ participants that endorse our synthesis quality.


CV-24-标题 Statistical shape representations for temporal registration of plant components in 3D

链接: https://arxiv.org/abs/2209.11526
作者: Karoline Heiwolt, Cengiz Öztireli, Grzegorz Cielniak
备注: 6 pages plus references, 7 figures, Submitted to ICRA 2023


Abstract: Plants are dynamic organisms. Understanding temporal variations in vegetation is an essential problem for all robots in the wild. However, associating repeated 3D scans of plants across time is challenging. A key step in this process is re-identifying and tracking the same individual plant components over time. Previously, this has been achieved by comparing their global spatial or topological location. In this work, we demonstrate how using shape features improves temporal organ matching. We present a landmark-free shape compression algorithm, which allows for the extraction of 3D shape features of leaves, characterises leaf shape and curvature efficiently in a few parameters, and makes the association of individual leaves in feature space possible. The approach combines 3D contour extraction and further compression using Principal Component Analysis (PCA) to produce a shape space encoding, which is entirely learned from data and retains information about edge contours and 3D curvature. Our evaluation on temporal scan sequences of tomato plants shows that incorporating shape features improves temporal leaf-matching. A combination of shape, location, and rotation information proves most informative for recognition of leaves over time and yields a true positive rate of 75%, a 15% improvement on state-of-the-art methods. This is essential for robotic crop monitoring, which enables whole-of-lifecycle phenotyping.
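
基于PCA的形状空间编码可以用几行NumPy勾勒出来:对展平的轮廓坐标做SVD得到主成分,再用少量系数编码/重建每片叶子的形状(以下仅为示意,数据与维度均为虚构,与论文的具体流程可能不同):

```python
import numpy as np

def fit_shape_space(contours, k):
    """contours: (n_samples, n_points * dim) flattened contour coordinates.
    Returns the mean shape and top-k principal directions
    (a landmark-free, data-learned shape code)."""
    mean = contours.mean(axis=0)
    X = contours - mean
    # SVD yields principal components without forming the covariance matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return mean, Vt[:k]

def encode(contour, mean, basis):
    return (contour - mean) @ basis.T      # k-dim shape feature

def decode(code, mean, basis):
    return mean + code @ basis             # approximate reconstruction

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 8))            # toy stand-in for flattened contours
mean, basis = fit_shape_space(data, k=8)   # k = full rank -> exact reconstruction
code = encode(data[0], mean, basis)
rec = decode(code, mean, basis)
```

在特征空间中比较不同时间的叶子编码(例如用欧氏距离),即可实现文中所说的跨时间器官匹配。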


CV-25-标题 WS-3D-Lane Weakly Supervised 3D Lane Detection With 2D Lane Labels

链接: https://arxiv.org/abs/2209.11523
作者: Jianyong Ai, Wenbo Ding, Jiuhua Zhao, Jiachen Zhong
备注: 7 pages, 8 figures


Abstract: Compared to 2D lanes, real 3D lane data is difficult to collect accurately. In this paper, we propose a novel method for training 3D lanes with only 2D lane labels, called weakly supervised 3D lane detection WS-3D-Lane. By assumptions of constant lane width and equal height on adjacent lanes, we indirectly supervise 3D lane heights in the training. To overcome the problem of the dynamic change of the camera pitch during data collection, a camera pitch self-calibration method is proposed. In anchor representation, we propose a double-layer anchor with an improved non-maximum suppression (NMS) method, which enables the anchor-based method to predict two lane lines that are close. Experiments are conducted on the base of 3D-LaneNet under two supervision methods. Under weakly supervised setting, our WS-3D-Lane outperforms previous 3D-LaneNet: F-score rises to 92.3% on Apollo 3D synthetic dataset, and F1 rises to 74.5% on ONCE-3DLanes. Meanwhile, WS-3D-Lane in purely supervised setting makes more increments and outperforms state-of-the-art. To the best of our knowledge, WS-3D-Lane is the first try of 3D lane detection under weakly supervised setting.

摘要:与2D车道线相比,真实的3D车道线数据很难准确采集。在本文中,我们提出了一种仅使用2D车道线标签训练3D车道线的新方法,称为弱监督3D车道线检测WS-3D-Lane。通过车道宽度恒定以及相邻车道高度相等的假设,我们在训练中间接监督3D车道线高度。为了克服数据采集过程中相机俯仰角动态变化的问题,我们提出了一种相机俯仰角自标定方法。在锚点表示方面,我们提出了一种带有改进的非极大值抑制(NMS)方法的双层锚点,使基于锚点的方法能够预测两条相距很近的车道线。实验在3D-LaneNet的基础上在两种监督方式下进行。在弱监督设置下,我们的WS-3D-Lane优于先前的3D-LaneNet:在Apollo 3D合成数据集上F得分上升到92.3%,在ONCE-3DLanes上F1上升到74.5%。同时,纯监督设置下的WS-3D-Lane取得了更大的提升,并优于最先进的方法。据我们所知,WS-3D-Lane是在弱监督设置下进行3D车道线检测的首次尝试。

CV-26-标题 Vector Quantized Semantic Communication System

链接: https://arxiv.org/abs/2209.11519
作者: Qifan Fu, Huiqiang Xie, Zhijin Qin, Gregory Slabaugh, Xiaoming Tao


Abstract: Although analog semantic communication systems have received considerable attention in the literature, there is less work on digital semantic communication systems. In this paper, we develop a deep learning (DL)-enabled vector quantized (VQ) semantic communication system for image transmission, named VQ-DeepSC. Specifically, we propose a convolutional neural network (CNN)-based transceiver to extract multi-scale semantic features of images and introduce multi-scale semantic embedding spaces to perform semantic feature quantization, rendering the data compatible with digital communication systems. Furthermore, we employ adversarial training to improve the quality of received images by introducing a PatchGAN discriminator. Experimental results demonstrate that the proposed VQ-DeepSC outperforms traditional image transmission methods in terms of SSIM.
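
向量量化这一步的核心,是把每个语义特征向量映射到码本中最近的码字,数字信道上只需传输其索引。下面是一个极简示意(码本与特征均为虚构示例,并非VQ-DeepSC的实际配置):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each feature vector in z (n, d) to its nearest codeword in the
    codebook (m, d), VQ-style: only the indices need to be transmitted,
    making the features compatible with a digital channel."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, m) sq. distances
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
z = np.array([[0.1, -0.1], [3.6, 4.2]])
idx, zq = vector_quantize(z, codebook)
```

接收端只需用同一码本按索引查表即可恢复量化后的语义特征。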


CV-27-标题 Marine Video Kit A New Marine Video Dataset for Content-based Analysis and Retrieval

链接: https://arxiv.org/abs/2209.11518
作者: Quang-Trung Truong, Tuan-Anh Vu, Tan-Sang Ha, Lokoc Jakub, Yue Him Wong Tim, Ajay Joneja, Sai-Kit Yeung
备注: 12 pages of content with 2 pages of reference


Abstract: Effective analysis of unusual domain specific video collections represents an important practical problem, where state-of-the-art general purpose models still face limitations. Hence, it is desirable to design benchmark datasets that challenge novel powerful models for specific domains with additional constraints. It is important to remember that domain specific data may be noisier (e.g., endoscopic or underwater videos) and often require more experienced users for effective search. In this paper, we focus on single-shot videos taken from moving cameras in underwater environments which constitute a nontrivial challenge for research purposes. The first shard of a new Marine Video Kit dataset is presented to serve for video retrieval and other computer vision challenges. In addition to basic meta-data statistics, we present several insights and reference graphs based on low-level features as well as semantic annotations of selected keyframes. The analysis contains also experiments showing limitations of respected general purpose models for retrieval.


CV-28-标题 Comparison of synthetic dataset generation methods for medical intervention rooms using medical clothing detection as an example

链接: https://arxiv.org/abs/2209.11493
作者: Patrick Schülein, Hannah Teufel, Ronja Vorpahl, Indira Emter, Yannick Bukschat, Marcus Pfister, Anke Siebert, Nils Rathmann, Steffen Diehl, Marcus Vetter


Abstract: The availability of real data from areas with high privacy requirements, such as the medical intervention space, is low and the acquisition legally complex. Therefore, this work presents a way to create a synthetic dataset for the medical context, using medical clothing as an example. The goal is to close the reality gap between the synthetic and real data. For this purpose, methods of 3D-scanned clothing and designed clothing are compared in a Domain-Randomization and Structured-Domain-Randomization scenario using an Unreal-Engine plugin or Unity. Additionally a Mixed-Reality dataset in front of a greenscreen and a target domain dataset were used. Our experiments show, that Structured-Domain-Randomization of designed clothing together with Mixed-Reality data provide a baseline achieving 72.0% mAP on a test dataset of the clinical target domain. When additionally using 15% of available target domain train data, the gap towards 100% (660 images) target domain train data could be nearly closed 80.05% mAP (81.95% mAP). Finally we show that when additionally using 100% target domain train data the accuracy could be increased to 83.35% mAP.


CV-29-标题 Grouped Adaptive Loss Weighting for Person Search

链接: https://arxiv.org/abs/2209.11492
作者: Yanling Tian, Di Chen, Yunan Liu, Shanshan Zhang, Jian Yang
备注: Accepted by ACM MM


Abstract: Person search is an integrated task of multiple sub-tasks such as foreground/background classification, bounding box regression and person re-identification. Therefore, person search is a typical multi-task learning problem, especially when solved in an end-to-end manner. Recently, some works enhance person search features by exploiting various auxiliary information, e.g. person joint keypoints, body part position, attributes, etc., which brings in more tasks and further complexifies a person search model. The inconsistent convergence rate of each task could potentially harm the model optimization. A straightforward solution is to manually assign different weights to different tasks, compensating for the diverse convergence rates. However, given the special case of person search, i.e. with a large number of tasks, it is impractical to weight the tasks manually. To this end, we propose a Grouped Adaptive Loss Weighting (GALW) method which adjusts the weight of each task automatically and dynamically. Specifically, we group tasks according to their convergence rates. Tasks within the same group share the same learnable weight, which is dynamically assigned by considering the loss uncertainty. Experimental results on two typical benchmarks, CUHK-SYSU and PRW, demonstrate the effectiveness of our method.

摘要:人员搜索是由前景/背景分类、边界框回归和行人重识别等多个子任务组成的集成任务。因此,人员搜索是一个典型的多任务学习问题,尤其是以端到端方式求解时。最近,一些工作通过利用各种辅助信息(例如人体关节关键点、身体部位位置、属性等)来增强人员搜索特征,这引入了更多任务,并使人员搜索模型进一步复杂化。各任务不一致的收敛速度可能会损害模型优化。一个直接的解决方案是手动为不同的任务分配不同的权重,以补偿不同的收敛速度。但是,鉴于人员搜索任务数量众多这一特殊情况,手动加权任务是不切实际的。为此,我们提出了一种分组自适应损失加权(GALW)方法,可自动且动态地调整每个任务的权重。具体而言,我们根据收敛速度对任务进行分组。同一组中的任务共享相同的可学习权重,该权重通过考虑损失不确定性动态分配。在CUHK-SYSU和PRW这两个典型基准上的实验结果证明了我们方法的有效性。
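
按不确定性为分组任务加权的一种常见形式是Kendall式的损失加权:总损失为 Σ exp(-s_g)·L_i + s_g,其中同组任务共享可学习的对数方差 s_g。以下为示意性实现(GALW的精确公式可能不同,这里仅演示"同组共享权重"的结构):

```python
import math

def grouped_uncertainty_loss(task_losses, groups, log_vars):
    """task_losses: per-task loss values; groups[i]: group id of task i;
    log_vars[g]: a learnable log-variance shared by all tasks in group g
    (Kendall-style uncertainty weighting -- our stand-in for GALW's form)."""
    total = 0.0
    for loss, g in zip(task_losses, groups):
        s = log_vars[g]
        total += math.exp(-s) * loss + s   # precision-weighted loss + regularizer
    return total

# Two groups: fast-converging tasks 0,1 share log_vars[0]; slow task 2 uses log_vars[1].
total = grouped_uncertainty_loss([0.5, 0.3, 2.0], [0, 0, 1], [0.0, 1.0])
```

训练时 log_vars 与网络参数一同优化:收敛慢、不确定性高的组会被自动降权,正则项 s 则防止权重坍缩到零。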

CV-30-标题 GIDP Learning a Good Initialization and Inducing Descriptor Post-enhancing for Large-scale Place Recognition

链接: https://arxiv.org/abs/2209.11488
作者: Zhaoxin Fan, Zhenbo Song, Hongyan Liu, Jun He
备注: 7 pages


Abstract: Large-scale place recognition is a fundamental but challenging task, which plays an increasingly important role in autonomous driving and robotics. Existing methods have achieved acceptable good performance, however, most of them are concentrating on designing elaborate global descriptor learning network structures. The importance of feature generalization and descriptor post-enhancing has long been neglected. In this work, we propose a novel method named GIDP to learn a Good Initialization and Inducing Descriptor Poseenhancing for Large-scale Place Recognition. In particular, an unsupervised momentum contrast point cloud pretraining module and a reranking-based descriptor post-enhancing module are proposed respectively in GIDP. The former aims at learning a good initialization for the point cloud encoding network before training the place recognition model, while the later aims at post-enhancing the predicted global descriptor through reranking at inference time. Extensive experiments on both indoor and outdoor datasets demonstrate that our method can achieve state-of-the-art performance using simple and general point cloud encoding backbones.

摘要:大规模地点识别是一项基础但具有挑战性的任务,在自动驾驶和机器人技术中发挥着越来越重要的作用。现有方法已取得尚可的性能,但其中大多数都集中于设计精巧的全局描述符学习网络结构,而特征泛化和描述符后增强的重要性长期被忽视。在这项工作中,我们提出了一种名为GIDP的新方法,通过学习良好的初始化并引入描述符后增强来实现大规模地点识别。具体而言,GIDP分别提出了一个无监督动量对比点云预训练模块和一个基于重排序的描述符后增强模块。前者旨在在训练地点识别模型之前为点云编码网络学习良好的初始化,后者旨在在推理时通过重排序对预测的全局描述符进行后增强。在室内和室外数据集上的大量实验表明,我们的方法可以使用简单通用的点云编码骨干网络实现最先进的性能。

CV-31-标题 Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model

链接: https://arxiv.org/abs/2209.11477
作者: Zhenting Qi, Ruike Zhu, Zheyu Fu, Wenhao Chai, Volodymyr Kindratenko
备注: Accepted by ICTAI 2022


Abstract: Fight detection in videos is an emerging deep learning application with today’s prevalence of surveillance systems and streaming media. Previous work has largely relied on action recognition techniques to tackle this problem. In this paper, we propose a simple but effective method that solves the task from a new perspective: we design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator. Also, considering that collecting frame-level labels for videos is too laborious, we design a weakly supervised two-stage training scheme, where we utilize multiple-instance-learning loss calculated on video-level labels to train the score generator, and adopt the self-training technique to further improve its performance. Extensive experiments on a publicly available large-scale dataset, UBI-Fights, demonstrate the effectiveness of our method, and the performance on the dataset exceeds several previous state-of-the-art approaches. Furthermore, we collect a new dataset, VFD-2000, that specializes in video fight detection, with a larger scale and more scenarios than existing datasets. The implementation of our method and the proposed dataset will be publicly available at this https URL.

摘要:视频中的打斗检测是随着当今监控系统和流媒体的普及而兴起的深度学习应用。以前的工作主要依靠动作识别技术来解决这个问题。在本文中,我们提出了一种简单而有效的方法,从新的角度解决该任务:我们将打斗检测模型设计为动作感知特征提取器和异常得分生成器的组合。另外,考虑到为视频收集帧级标签过于费力,我们设计了一个弱监督的两阶段训练方案,其中我们利用在视频级标签上计算的多实例学习损失来训练得分生成器,并采用自训练技术进一步提升其性能。在公开的大规模数据集UBI-Fights上的大量实验证明了我们方法的有效性,其在该数据集上的性能超过了先前多种最先进的方法。此外,我们收集了一个专门用于视频打斗检测的新数据集VFD-2000,其规模比现有数据集更大、场景更丰富。我们方法的实现和所提出的数据集将在此https URL上公开。
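
视频级标签下的多实例学习损失可以用"取片段最大异常分数再做二元交叉熵"来示意(这只是MIL损失的一个极简版本,并非论文的精确形式):

```python
import math

def mil_video_loss(segment_scores, video_label):
    """Weakly supervised: only a video-level label is available, so take the
    max anomaly score over segments as the video score (multiple-instance
    learning) and apply binary cross-entropy to it."""
    p = max(segment_scores)                 # most anomalous segment stands for the video
    p = min(max(p, 1e-7), 1 - 1e-7)         # clamp for numerical safety
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1 - p))

loss_pos = mil_video_loss([0.1, 0.9, 0.2], 1)   # fight video: low loss if any segment fires
loss_neg = mil_video_loss([0.1, 0.9, 0.2], 0)   # normal video: the max segment is penalized
```

这正是弱监督的关键:正样本视频只要有一个片段得分高即可,负样本视频则压低所有片段的得分。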

CV-32-标题 Unsupervised Hashing with Semantic Concept Mining

链接: https://arxiv.org/abs/2209.11475
作者: Rong-Cheng Tu, Xian-Ling Mao, Kevin Qinghong Lin, Chengfei Cai, Weize Qin, Hongfa Wang, Wei Wei, Heyan Huang


Abstract: Recently, to improve the unsupervised image retrieval performance, plenty of unsupervised hashing methods have been proposed by designing a semantic similarity matrix, which is based on the similarities between image features extracted by a pre-trained CNN model. However, most of these methods tend to ignore high-level abstract semantic concepts contained in images. Intuitively, concepts play an important role in calculating the similarity among images. In real-world scenarios, each image is associated with some concepts, and the similarity between two images will be larger if they share more identical concepts. Inspired by the above intuition, in this work, we propose a novel Unsupervised Hashing with Semantic Concept Mining, called UHSCM, which leverages a VLP model to construct a high-quality similarity matrix. Specifically, a set of randomly chosen concepts is first collected. Then, by employing a vision-language pretraining (VLP) model with the prompt engineering which has shown strong power in visual representation learning, the set of concepts is denoised according to the training images. Next, the proposed method UHSCM applies the VLP model with prompting again to mine the concept distribution of each image and construct a high-quality semantic similarity matrix based on the mined concept distributions. Finally, with the semantic similarity matrix as guiding information, a novel hashing loss with a modified contrastive loss based regularization item is proposed to optimize the hashing network. Extensive experiments on three benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task.


CV-33-标题 TeST Test-time Self-Training under Distribution Shift

链接: https://arxiv.org/abs/2209.11459
作者: Samarth Sinha, Peter Gehler, Francesco Locatello, Bernt Schiele


Abstract: Despite their recent success, deep neural networks continue to perform poorly when they encounter distribution shifts at test time. Many recently proposed approaches try to counter this by aligning the model to the new distribution prior to inference. With no labels available this requires unsupervised objectives to adapt the model on the observed test data. In this paper, we propose Test-Time Self-Training (TeST): a technique that takes as input a model trained on some source data and a novel data distribution at test time, and learns invariant and robust representations using a student-teacher framework. We find that models adapted using TeST significantly improve over baseline test-time adaptation algorithms. TeST achieves competitive performance to modern domain adaptation algorithms, while having access to 5-10x less data at time of adaption. We thoroughly evaluate a variety of baselines on two tasks: object detection and image segmentation, and find that models adapted with TeST improve consistently on both. We find that TeST sets the new state-of-the-art for test-time domain adaptation algorithms.

摘要:尽管最近取得了成功,但深度神经网络在测试时遇到分布偏移时表现仍然不佳。最近提出的许多方法试图通过在推理之前将模型与新分布对齐来应对这一问题。由于没有可用的标签,这需要无监督的目标来使模型适应观察到的测试数据。在本文中,我们提出了测试时自训练(TeST):一种以在某些源数据上训练的模型和测试时的新数据分布为输入,并使用师生框架学习不变且鲁棒表示的技术。我们发现使用TeST自适应的模型显著优于基线测试时自适应算法。TeST在自适应时只需少5-10倍的数据,即可达到与现代领域自适应算法相当的性能。我们在对象检测和图像分割两项任务上全面评估了各种基线,并发现用TeST自适应的模型表现更佳。我们发现TeST为测试时领域自适应算法树立了新的最高水平。
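
师生框架中常见的一种做法是让教师参数为学生参数的指数滑动平均(EMA):教师生成稳定的伪标签,学生在其上训练,教师再缓慢跟随学生。以下为示意(TeST的具体更新规则可能不同):

```python
def ema_update(teacher, student, decay=0.99):
    """One EMA step of a student-teacher scheme: the teacher's parameters
    slowly track the student's -- a common choice in self-training setups;
    the exact rule used in TeST may differ."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]

teacher, student = [0.0, 1.0], [1.0, 1.0]   # toy 2-parameter "models"
for _ in range(3):          # each round: teacher pseudo-labels, student trains,
    teacher = ema_update(teacher, student)  # then the teacher tracks the student
```

EMA使教师对单批测试数据的噪声不敏感,这正是其伪标签比学生自身预测更稳定的原因。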

CV-34-标题 Motion Guided Deep Dynamic 3D Garments

链接: https://arxiv.org/abs/2209.11449
作者: Meng Zhang, Duygu Ceylan, Niloy J. Mitra
备注: 11 pages


Abstract: Realistic dynamic garments on animated characters have many AR/VR applications. While authoring such dynamic garment geometry is still a challenging task, data-driven simulation provides an attractive alternative, especially if it can be controlled simply using the motion of the underlying character. In this work, we focus on motion guided dynamic 3D garments, especially for loose garments. In a data-driven setup, we first learn a generative space of plausible garment geometries. Then, we learn a mapping to this space to capture the motion dependent dynamic deformations, conditioned on the previous state of the garment as well as its relative position with respect to the underlying body. Technically, we model garment dynamics, driven using the input character motion, by predicting per-frame local displacements in a canonical state of the garment that is enriched with frame-dependent skinning weights to bring the garment to the global space. We resolve any remaining per-frame collisions by predicting residual local displacements. The resultant garment geometry is used as history to enable iterative rollout prediction. We demonstrate plausible generalization to unseen body shapes and motion inputs, and show improvements over multiple state-of-the-art alternatives.


CV-35-标题 Rethinking Performance Gains in Image Dehazing Networks

链接: https://arxiv.org/abs/2209.11448
作者: Yuda Song, Yang Zhou, Hui Qian, Xin Du


Abstract: Image dehazing is an active topic in low-level vision, and many image dehazing networks have been proposed with the rapid development of deep learning. Although these networks’ pipelines work fine, the key mechanism to improving image dehazing performance remains unclear. For this reason, we do not target to propose a dehazing network with fancy modules; rather, we make minimal modifications to popular U-Net to obtain a compact dehazing network. Specifically, we swap out the convolutional blocks in U-Net for residual blocks with the gating mechanism, fuse the feature maps of main paths and skip connections using the selective kernel, and call the resulting U-Net variant gUNet. As a result, with a significantly reduced overhead, gUNet is superior to state-of-the-art methods on multiple image dehazing datasets. Finally, we verify these key designs to the performance gain of image dehazing networks through extensive ablation studies.

摘要:图像去雾是低层视觉中的一个活跃课题,随着深度学习的快速发展,已经提出了许多图像去雾网络。尽管这些网络的流程运行良好,但提升图像去雾性能的关键机制仍不清楚。因此,我们的目标不是提出一个带有花哨模块的去雾网络;相反,我们对流行的U-Net进行最小的修改,以获得一个紧凑的去雾网络。具体而言,我们将U-Net中的卷积块替换为带有门控机制的残差块,使用选择性核融合主路径和跳跃连接的特征图,并将所得的U-Net变体称为gUNet。结果,gUNet以显著降低的开销,在多个图像去雾数据集上优于最先进的方法。最后,我们通过大量消融研究验证了这些关键设计对图像去雾网络性能增益的作用。
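
用选择性核的方式融合主路径与跳跃连接,本质上是对两个分支做逐通道的softmax注意力加权。以下为示意(实际中注意力权重由一个小型可学习网络从特征中产生,这里为演示直接传入):

```python
import numpy as np

def selective_fusion(main, skip, w_main, w_skip):
    """Fuse a main-path feature map and a skip connection with per-channel
    softmax attention, in the spirit of selective-kernel fusion. main/skip:
    (spatial, channels); w_main/w_skip: per-channel logits (in a real network
    these come from a small learned module)."""
    logits = np.stack([w_main, w_skip])          # (2, C)
    a = np.exp(logits - logits.max(0))
    a = a / a.sum(0)                             # softmax over the two branches
    return a[0][None, :] * main + a[1][None, :] * skip  # broadcast over spatial dim

main = np.ones((4, 2)); skip = np.zeros((4, 2))  # toy (spatial=4, channels=2) maps
fused = selective_fusion(main, skip, np.array([0.0, 10.0]), np.array([0.0, -10.0]))
```

通道0的两分支权重相等(各0.5),通道1则几乎完全选择主路径,体现了"按通道动态选择来源"的思想。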

CV-36-标题 Understanding Open-Set Recognition by Jacobian Norm of Representation

链接: https://arxiv.org/abs/2209.11436
作者: Jaewoo Park, Hojin Park, Eunju Jeong, Andrew Beng Jin Teoh


Abstract: In contrast to conventional closed-set recognition, open-set recognition (OSR) assumes the presence of an unknown class, which is not seen to a model during training. One predominant approach in OSR is metric learning, where a model is trained to separate the inter-class representations of known class data. Numerous works in OSR reported that, even though the models are trained only with the known class data, the models become aware of the unknown, and learn to separate the unknown class representations from the known class representations. This paper analyzes this emergent phenomenon by observing the Jacobian norm of representation. We theoretically show that minimizing the intra-class distances within the known set reduces the Jacobian norm of known class representations while maximizing the inter-class distances within the known set increases the Jacobian norm of the unknown class. The closed-set metric learning thus separates the unknown from the known by forcing their Jacobian norm values to differ. We empirically validate our theoretical framework with ample pieces of evidence using standard OSR datasets. Moreover, under our theoretical framework, we explain how the standard deep learning techniques can be helpful for OSR and use the framework as a guiding principle to develop an effective OSR model.

摘要:与传统的闭集识别相反,开集识别(OSR)假设存在训练期间模型未见过的未知类别。OSR中的一种主要方法是度量学习,即训练模型来分离已知类别数据的类间表示。OSR中的许多工作报告说,即使模型仅用已知类别的数据进行训练,模型也会感知到未知类别,并学会将未知类别的表示与已知类别的表示分开。本文通过观察表示的雅可比范数来分析这种涌现现象。我们从理论上表明,最小化已知集合内的类内距离会减小已知类别表示的雅可比范数,而最大化已知集合内的类间距离会增大未知类别的雅可比范数。因此,闭集度量学习通过迫使已知与未知的雅可比范数值产生差异,将二者分开。我们使用标准OSR数据集的大量证据对我们的理论框架进行了实证验证。此外,在我们的理论框架下,我们解释了标准深度学习技术如何有助于OSR,并将该框架作为指导原则来开发有效的OSR模型。
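
表示的雅可比范数可以用中心差分数值估计。以下示意在一个线性映射上验证:其雅可比就是矩阵本身,Frobenius范数应为√14(示例函数与数值均为说明性选择):

```python
import numpy as np

def jacobian_norm(f, x, eps=1e-5):
    """Frobenius norm of the Jacobian of f at x via central finite differences --
    a numerical stand-in for the paper's representation-Jacobian analysis.
    A small norm indicates a locally flat (known-class-like) region."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))  # i-th Jacobian column
    J = np.stack(cols, axis=1)
    return np.linalg.norm(J)

A = np.array([[1.0, 2.0], [0.0, 3.0]])
norm = jacobian_norm(lambda v: A @ v, np.array([0.5, -0.5]))  # = ||A||_F = sqrt(14)
```

对神经网络表示,只需将 `f` 换成编码器的前向函数,即可比较已知类与未知类输入处的范数差异。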

CV-37-标题 Towards Frame Rate Agnostic Multi-Object Tracking

链接: https://arxiv.org/abs/2209.11404
作者: Weitao Feng, Lei Bai, Yongqiang Yao, Fengwei Yu, Wanli Ouyang
备注: 21 pages; Author version


Abstract: Multi-Object Tracking (MOT) is one of the most fundamental computer vision tasks which contributes to a variety of video analysis applications. Despite the recent promising progress, current MOT research is still limited to a fixed sampling frame rate of the input stream. In fact, we empirically find that the accuracy of all recent state-of-the-art trackers drops dramatically when the input frame rate changes. For a more intelligent tracking solution, we shift the attention of our research work to the problem of Frame Rate Agnostic MOT (FraMOT). In this paper, we propose a Frame Rate Agnostic MOT framework with Periodic training Scheme (FAPS) to tackle the FraMOT problem for the first time. Specifically, we propose a Frame Rate Agnostic Association Module (FAAM) that infers and encodes the frame rate information to aid identity matching across multi-frame-rate inputs, improving the capability of the learned model in handling complex motion-appearance relations in FraMOT. Besides, the association gap between training and inference is enlarged in FraMOT because those post-processing steps not included in training make a larger difference in lower frame rate scenarios. To address it, we propose Periodic Training Scheme (PTS) to reflect all post-processing steps in training via tracking pattern matching and fusion. Along with the proposed approaches, we make the first attempt to establish an evaluation method for this new task of FraMOT in two different modes, i.e., known frame rate and unknown frame rate, aiming to handle a more complex situation. The quantitative experiments on the challenging MOT datasets (FraMOT version) have clearly demonstrated that the proposed approaches can handle different frame rates better and thus improve the robustness against complicated scenarios.


CV-38-标题 LGDN Language-Guided Denoising Network for Video-Language Modeling

链接: https://arxiv.org/abs/2209.11388
作者: Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu
备注: Accepted by NeurIPS2022


Abstract: Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and text description are semantically correlated, and focus on video-language modeling at video level. However, this hypothesis often fails for two reasons: (1) With the rich semantics of video contents, it is difficult to cover all frames with a single video-level description; (2) A raw video typically has noisy/meaningless information (e.g., scenery shot, transition or teaser). Although a number of recent works deploy attention mechanism to alleviate this problem, the irrelevant/noisy information still makes it very difficult to address. To overcome such challenge, we thus propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling. Different from most existing methods that utilize all extracted video frames, LGDN dynamically filters out the misaligned or redundant frames under the language supervision and obtains only 2–4 salient frames per video for cross-modal token-level alignment. Extensive experiments on five public datasets show that our LGDN outperforms the state-of-the-arts by large margins. We also provide detailed ablation study to reveal the critical importance of solving the noise issue, in hope of inspiring future video-language work.

摘要:随着网络视频的快速增长,视频-语言建模引起了广泛关注。大多数现有方法假定视频帧和文本描述在语义上是相关的,并专注于视频级别的视频-语言建模。然而,这一假设常常不成立,原因有二:(1)由于视频内容语义丰富,很难用单个视频级描述覆盖所有帧;(2)原始视频通常包含嘈杂/无意义的信息(例如风景镜头、转场或预告片)。尽管最近的许多工作部署注意力机制来缓解此问题,但无关/嘈杂的信息仍然很难处理。为了克服这一挑战,我们提出了一个高效且有效的模型,称为语言引导去噪网络(LGDN),用于视频-语言建模。与使用所有提取视频帧的大多数现有方法不同,LGDN在语言监督下动态过滤掉未对齐或冗余的帧,每个视频仅保留2-4个显著帧,用于跨模态词元级对齐。在五个公共数据集上的大量实验表明,我们的LGDN大幅优于最先进的方法。我们还提供了详细的消融研究,以揭示解决噪声问题的关键重要性,希望能启发未来的视频-语言工作。
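
语言引导的帧过滤可以用"按与查询嵌入的余弦相似度选取top-k显著帧"来示意(这是对LGDN机制的简化,嵌入与数值均为虚构示例,并非其精确实现):

```python
import numpy as np

def select_salient_frames(frame_feats, text_feat, k=2):
    """Score each frame by cosine similarity to the language-query embedding
    and keep the top-k as 'salient' frames -- a simplified stand-in for
    LGDN's language-guided frame filtering."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = f @ t                    # cosine similarity per frame
    keep = np.argsort(-sims)[:k]    # indices of the k most relevant frames
    return np.sort(keep), sims

# Frames 0-1 point along the query; frame 2 is orthogonal, frame 3 opposite.
frames = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
keep, sims = select_salient_frames(frames, np.array([1.0, 0.0]), k=2)
```

被过滤掉的低相似度帧(如转场、风景镜头)不再参与后续的跨模态词元级对齐。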

CV-39-标题 Tensor-Based Multi-Modality Feature Selection and Regression for Alzheimers Disease Diagnosis

链接: https://arxiv.org/abs/2209.11372
作者: Jun Yu, Zhaoming Kong, Liang Zhan, Li Shen, Lifang He


Abstract: The assessment of Alzheimer’s Disease (AD) and Mild Cognitive Impairment (MCI) associated with brain changes remains a challenging task. Recent studies have demonstrated that combination of multi-modality imaging techniques can better reflect pathological characteristics and contribute to more accurate diagnosis of AD and MCI. In this paper, we propose a novel tensor-based multi-modality feature selection and regression method for diagnosis and biomarker identification of AD and MCI from normal controls. Specifically, we leverage the tensor structure to exploit high-level correlation information inherent in the multi-modality data, and investigate tensor-level sparsity in the multilinear regression model. We present the practical advantages of our method for the analysis of ADNI data using three imaging modalities (VBM- MRI, FDG-PET and AV45-PET) with clinical parameters of disease severity and cognitive scores. The experimental results demonstrate the superior performance of our proposed method against the state-of-the-art for the disease diagnosis and the identification of disease-specific regions and modality-related differences. The code for this work is publicly available at this https URL.

摘要:与大脑变化相关的阿尔茨海默病(AD)和轻度认知障碍(MCI)的评估仍然是一项艰巨的任务。最近的研究表明,结合多模态成像技术可以更好地反映病理特征,并有助于更准确地诊断AD和MCI。在本文中,我们提出了一种新颖的基于张量的多模态特征选择与回归方法,用于从正常对照中诊断AD和MCI并识别生物标志物。具体而言,我们利用张量结构来挖掘多模态数据中固有的高阶相关信息,并研究多线性回归模型中的张量级稀疏性。我们展示了我们的方法在分析ADNI数据方面的实际优势,所用数据包括三种成像模态(VBM-MRI、FDG-PET和AV45-PET)以及疾病严重程度和认知评分等临床参数。实验结果表明,我们提出的方法在疾病诊断以及疾病特异性区域和模态相关差异的识别方面优于最先进的方法。此项工作的代码可在此https URL公开获取。

CV-40-标题 CUTS A Fully Unsupervised Framework for Medical Image Segmentation

链接: https://arxiv.org/abs/2209.11359
作者: Matthew Amodio, Feng Gao, Arman Avesta, Sanjay Aneja, Lucian V. Del Priore, Jay Wang, Smita Krishnaswamy


Abstract: In this work we introduce CUTS (Contrastive and Unsupervised Training for Segmentation) the first fully unsupervised deep learning framework for medical image segmentation, facilitating the use of the vast majority of imaging data that is not labeled or annotated. Segmenting medical images into regions of interest is a critical task for facilitating both patient diagnoses and quantitative research. A major limiting factor in this segmentation is the lack of labeled data, as getting expert annotations for each new set of imaging data or task can be expensive, labor intensive, and inconsistent across annotators: thus, we utilize self-supervision based on pixel-centered patches from the images themselves. Our unsupervised approach is based on a training objective with both contrastive learning and autoencoding aspects. Previous contrastive learning approaches for medical image segmentation have focused on image-level contrastive training, rather than our intra-image patch-level approach or have used this as a pre-training task where the network needed further supervised training afterwards. By contrast, we build the first entirely unsupervised framework that operates at the pixel-centered-patch level. Specifically, we add novel augmentations, a patch reconstruction loss, and introduce a new pixel clustering and identification framework. Our model achieves improved results on several key medical imaging tasks, as verified by held-out expert annotations on the task of segmenting geographic atrophy (GA) regions of images of the retina.


CV-41-标题 NasHD Efficient ViT Architecture Performance Ranking using Hyperdimensional Computing

链接: https://arxiv.org/abs/2209.11356
作者: Dongning Ma, Pengfei Zhao, Xun Jiao


Abstract: Neural Architecture Search (NAS) is an automated architecture engineering method for deep learning design automation, which serves as an alternative to the manual and error-prone process of model development, selection, evaluation and performance estimation. However, one major obstacle of NAS is the extremely demanding computation resource requirements and time-consuming iterations particularly when the dataset scales. In this paper, targeting at the emerging vision transformer (ViT), we present NasHD, a hyperdimensional computing based supervised learning model to rank the performance given the architectures and configurations. Different from other learning based methods, NasHD is faster thanks to the high parallel processing of HDC architecture. We also evaluated two HDC encoding schemes: Gram-based and Record-based of NasHD on their performance and efficiency. On the VIMER-UFO benchmark dataset of 8 applications from a diverse range of domains, NasHD Record can rank the performance of nearly 100K vision transformer models with about 1 minute while still achieving comparable results with sophisticated models.
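NasHD 的核心是用超维计算(HDC)对网络架构配置进行编码并排序。下面是一个通用的 Record-based HDC 编码的最小示意(numpy 实现):每个配置字段对应一个随机"位置"超向量,每个量化取值对应一个"级别"超向量,逐字段绑定(按位相乘)后捆绑(求和取符号)。维度、字段数和配置取值均为假设性的示意参数,并非论文的具体编码器:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048          # hypervector dimensionality (illustrative)
N_FEATS = 6       # hypothetical number of encoded ViT config fields
N_LEVELS = 16     # quantization levels per field

# Random bipolar ID vectors (one per field position) and level vectors.
ids = rng.choice([-1, 1], size=(N_FEATS, D))
levels = rng.choice([-1, 1], size=(N_LEVELS, D))

def encode_record(config):
    """Record-based encoding: bind each field's level vector with its
    position ID (elementwise product), then bundle (sum + sign)."""
    bound = np.array([ids[i] * levels[v] for i, v in enumerate(config)])
    return np.sign(bound.sum(axis=0) + 0.1)  # +0.1 breaks ties away from 0

def similarity(a, b):
    """Normalized dot product between two bipolar hypervectors."""
    return float(a @ b) / D

h1 = encode_record([3, 7, 2, 9, 0, 5])
h2 = encode_record([3, 7, 2, 9, 0, 4])    # differs in one field
h3 = encode_record([15, 1, 12, 0, 8, 11])  # unrelated config
```

相似的配置得到高度相似的超向量(h1 与 h2),无关配置的相似度接近 0(h1 与 h3),这使得简单的相似度检索即可高并行地比较架构。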


CV-42-标题 Learning Interpretable Dynamics from Images of a Freely Rotating 3D Rigid Body

链接: https://arxiv.org/abs/2209.11355
作者: Justice Mason, Christine Allen-Blanchette, Nicholas Zolman, Elizabeth Davison, Naomi Leonard
备注: 8 pages, 7 figures


Abstract: In many real-world settings, image observations of freely rotating 3D rigid bodies, such as satellites, may be available when low-dimensional measurements are not. However, the high-dimensionality of image data precludes the use of classical estimation techniques to learn the dynamics and a lack of interpretability reduces the usefulness of standard deep learning methods. In this work, we present a physics-informed neural network model to estimate and predict 3D rotational dynamics from image sequences. We achieve this using a multi-stage prediction pipeline that maps individual images to a latent representation homeomorphic to \mathbf{SO}(3), computes angular velocities from latent pairs, and predicts future latent states using the Hamiltonian equations of motion with a learned representation of the Hamiltonian. We demonstrate the efficacy of our approach on a new rotating rigid-body dataset with sequences of rotating cubes and rectangular prisms with uniform and non-uniform density.

摘要:在许多现实场景中,当低维测量不可用时,仍可能获得自由旋转3D刚体(例如卫星)的图像观测。然而,图像数据的高维性使得经典估计技术无法用于学习其动力学,而缺乏可解释性又降低了标准深度学习方法的实用性。在这项工作中,我们提出了一个物理信息神经网络模型,用于从图像序列中估计和预测3D旋转动力学。我们通过一个多阶段预测管道实现这一目标:它将单幅图像映射到与 \mathbf{SO}(3) 同胚的潜在表示,从潜在表示对中计算角速度,并利用哈密顿运动方程以及学习到的哈密顿量表示来预测未来的潜在状态。我们在一个新的旋转刚体数据集上证明了方法的有效性,该数据集包含具有均匀和非均匀密度的旋转立方体和长方体棱柱序列。
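该摘要所建模的无外力矩刚体旋转动力学在体坐标系下就是经典的欧拉方程 I·ω̇ = (Iω) × ω。下面是一个独立于论文网络的最小数值示意(RK4 积分,惯量取值为假设值),可用于验证动能和角动量模长在自由旋转下守恒:

```python
import numpy as np

I = np.diag([1.0, 2.0, 3.0])   # hypothetical principal moments of inertia
I_inv = np.linalg.inv(I)

def omega_dot(w):
    # Torque-free Euler equations in the body frame: I w' = (I w) x w
    return I_inv @ np.cross(I @ w, w)

def rk4_step(w, dt):
    """One classical Runge-Kutta 4 step for the angular velocity."""
    k1 = omega_dot(w)
    k2 = omega_dot(w + 0.5 * dt * k1)
    k3 = omega_dot(w + 0.5 * dt * k2)
    k4 = omega_dot(w + dt * k3)
    return w + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

w = np.array([0.3, 1.0, 0.2])
E0 = 0.5 * w @ I @ w             # rotational kinetic energy
L0 = np.linalg.norm(I @ w)       # angular momentum magnitude
for _ in range(1000):
    w = rk4_step(w, 1e-3)
E1, L1 = 0.5 * w @ I @ w, np.linalg.norm(I @ w)
```

论文中的管道在潜在空间中以学习到的哈密顿量做同类积分;此处仅演示其底层物理约束(能量与角动量守恒)。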

CV-43-标题 Oracle Analysis of Representations for Deep Open Set Detection

链接: https://arxiv.org/abs/2209.11350
作者: Risheek Garrepalli, Alan Fern, Thomas G. Dietterich


Abstract: The problem of detecting a novel class at run time is known as Open Set Detection & is important for various real-world applications like medical application, autonomous driving, etc. Open Set Detection within context of deep learning involves solving two problems: (i) Must map the input images into a latent representation that contains enough information to detect the outliers, and (ii) Must learn an anomaly scoring function that can extract this information from the latent representation to identify the anomalies. Research in deep anomaly detection methods has progressed slowly. One reason may be that most papers simultaneously introduce new representation learning techniques and new anomaly scoring approaches. The goal of this work is to improve this methodology by providing ways of separately measuring the effectiveness of the representation learning and anomaly scoring. This work makes two methodological contributions. The first is to introduce the notion of Oracle anomaly detection for quantifying the information available in a learned latent representation. The second is to introduce Oracle representation learning, which produces a representation that is guaranteed to be sufficient for accurate anomaly detection. These two techniques help researchers to separate the quality of the learned representation from the performance of the anomaly scoring mechanism so that they can debug and improve their systems. The methods also provide an upper limit on how much open category detection can be improved through better anomaly scoring mechanisms. The combination of the two oracles gives an upper limit on the performance that any open category detection method could achieve. This work introduces these two oracle techniques and demonstrates their utility by applying them to several leading open category detection methods.

摘要:在运行时检测新类别的问题被称为开放集检测(Open Set Detection),它对医疗应用、自动驾驶等各种现实应用十分重要。在深度学习背景下,开放集检测需要解决两个问题:(i)必须将输入图像映射到包含足够信息以检测离群值的潜在表示;(ii)必须学习一个异常评分函数,能够从潜在表示中提取这些信息以识别异常。深度异常检测方法的研究进展缓慢。原因之一可能是大多数论文同时引入新的表示学习技术和新的异常评分方法。这项工作的目的是通过提供分别衡量表示学习和异常评分有效性的方法来改进这一研究方法。这项工作做出了两项方法论贡献。第一是引入 oracle 异常检测的概念,以量化学习到的潜在表示中可用的信息。第二是引入 oracle 表示学习,它产生的表示可保证足以进行准确的异常检测。这两种技术帮助研究人员将学习表示的质量与异常评分机制的性能分开,以便调试和改进其系统。这些方法还为通过更好的异常评分机制能在多大程度上改进开放类别检测提供了上限。两个 oracle 的组合给出了任何开放类别检测方法所能达到性能的上限。这项工作介绍了这两种 oracle 技术,并通过将它们应用于几种领先的开放类别检测方法来展示其实用性。
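"量化潜在表示中可用信息"这一思想可以用一个简化的替代示意来说明(并非论文中 oracle 的确切定义):对正常样本的潜在向量拟合高斯分布,以马氏距离作为异常分数,再用 AUC 衡量该表示能支撑的检测上限。数据为随机生成的假设性潜在向量:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent representations: nominal points cluster near the
# origin, novel-class points are shifted away.
nominal = rng.normal(0.0, 1.0, size=(500, 8))
novel = rng.normal(3.0, 1.0, size=(100, 8))

# Simplified "oracle" scoring: fit a Gaussian to nominal latents and
# score by squared Mahalanobis distance.
mu = nominal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(nominal, rowvar=False))

def score(x):
    d = x - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

# Rank-based AUC of the score separating nominal (0) from novel (1).
s = np.concatenate([score(nominal), score(novel)])
y = np.concatenate([np.zeros(500), np.ones(100)])
order = np.argsort(s)
ranks = np.empty(len(s)); ranks[order] = np.arange(len(s))
auc = (ranks[y == 1].mean() - (100 - 1) / 2) / 500
```

若 AUC 接近 1,说明该潜在表示本身已携带足够的异常信息,检测性能的瓶颈在于评分函数而非表示,这正是论文想分离的两个因素。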

CV-44-标题 Swin2SR SwinV2 Transformer for Compressed Image Super-Resolution and Restoration

链接: https://arxiv.org/abs/2209.11345
作者: Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte
备注: European Conference on Computer Vision (ECCV 2022) Workshops


Abstract: Compression plays an important role on the efficient transmission and storage of images and videos through band-limited systems such as streaming services, virtual reality or videogames. However, compression unavoidably leads to artifacts and the loss of the original information, which may severely degrade the visual quality. For these reasons, quality enhancement of compressed images has become a popular research topic. While most state-of-the-art image restoration methods are based on convolutional neural networks, other transformers-based methods such as SwinIR, show impressive performance on these tasks. In this paper, we explore the novel Swin Transformer V2, to improve SwinIR for image super-resolution, and in particular, the compressed input scenario. Using this method we can tackle the major issues in training transformer vision models, such as training instability, resolution gaps between pre-training and fine-tuning, and hunger on data. We conduct experiments on three representative tasks: JPEG compression artifacts removal, image super-resolution (classical and lightweight), and compressed image super-resolution. Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR, and is a top-5 solution at the “AIM 2022 Challenge on Super-Resolution of Compressed Image and Video”.

摘要:压缩在通过流媒体服务、虚拟现实或视频游戏等带宽受限系统高效传输和存储图像与视频方面起着重要作用。但是,压缩不可避免地会导致伪影和原始信息的丢失,这可能严重降低视觉质量。出于这些原因,压缩图像的质量增强已成为热门研究课题。虽然大多数最先进的图像恢复方法基于卷积神经网络,但其他基于 Transformer 的方法(例如 SwinIR)在这些任务上表现出令人印象深刻的性能。在本文中,我们探索新型的 Swin Transformer V2,以改进用于图像超分辨率的 SwinIR,特别是在压缩输入场景下。使用这种方法,我们可以解决训练 Transformer 视觉模型中的主要问题,例如训练不稳定、预训练和微调之间的分辨率差距以及对数据的饥渴。我们在三个代表性任务上进行了实验:JPEG压缩伪影去除、图像超分辨率(经典和轻量级)以及压缩图像超分辨率。实验结果表明,我们的方法 Swin2SR 可以改善 SwinIR 的训练收敛性和性能,并且是"AIM 2022 压缩图像和视频超分辨率挑战赛"的前5名解决方案之一。

CV-45-标题 Fast Disparity Estimation from a Single Compressed Light Field Measurement

链接: https://arxiv.org/abs/2209.11342
作者: Emmanuel Martinez, Edwin Vargas, Henry Arguello


Abstract: The abundant spatial and angular information from light fields has allowed the development of multiple disparity estimation approaches. However, the acquisition of light fields requires high storage and processing cost, limiting the use of this technology in practical applications. To overcome these drawbacks, the compressive sensing (CS) theory has allowed the development of optical architectures to acquire a single coded light field measurement. This measurement is decoded using an optimization algorithm or deep neural network that requires high computational costs. The traditional approach for disparity estimation from compressed light fields requires first recovering the entire light field and then a post-processing step, thus requiring long times. In contrast, this work proposes a fast disparity estimation from a single compressed measurement by omitting the recovery step required in traditional approaches. Specifically, we propose to jointly optimize an optical architecture for acquiring a single coded light field snapshot and a convolutional neural network (CNN) for estimating the disparity maps. Experimentally, the proposed method estimates disparity maps comparable with those obtained from light fields reconstructed using deep learning approaches. Furthermore, the proposed method is 20 times faster in training and inference than the best method that estimates the disparity from reconstructed light fields.


CV-46-标题 A domain adaptive deep learning solution for scanpath prediction of paintings

链接: https://arxiv.org/abs/2209.11338
作者: Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Alessandro Bruno
备注: Accepted at CBMI2022 graz, austria


Abstract: Cultural heritage understanding and preservation is an important issue for society as it represents a fundamental aspect of its identity. Paintings represent a significant part of cultural heritage, and are the subject of study continuously. However, the way viewers perceive paintings is strictly related to the so-called HVS (Human Vision System) behaviour. This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings. In further details, we introduce a new approach to predicting human visual attention, which impacts several cognitive functions for humans, including the fundamental understanding of a scene, and then extend it to painting images. The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers’ attention. We use an FCNN (Fully Convolutional Neural Network), in which we exploit a differentiable channel-wise selection and Soft-Argmax modules. We also incorporate learnable Gaussian distributions onto the network bottleneck to simulate visual attention process bias in natural scene images. Furthermore, to reduce the effect of shifts between different domains (i.e. natural images, painting), we urge the model to learn unsupervised general features from other domains using a gradient reversal classifier. The results obtained by our model outperform existing state-of-the-art ones in terms of accuracy and efficiency.
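摘要中提到的 Soft-Argmax 模块是扫描路径预测里常用的可微分取点操作:对热力图做 softmax 得到概率分布,再取坐标的期望值。下面是一个独立的 numpy 最小示意(温度系数 beta 为假设值):

```python
import numpy as np

def soft_argmax(heatmap, beta=100.0):
    """Differentiable argmax: softmax over the flattened heatmap, then
    the expected (x, y) coordinate under that distribution."""
    h, w = heatmap.shape
    p = np.exp(beta * (heatmap - heatmap.max()))  # stabilized softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())

hm = np.zeros((5, 7))
hm[3, 4] = 1.0          # a single sharp peak at (x=4, y=3)
x, y = soft_argmax(hm)
```

与硬 argmax 不同,这个期望坐标对热力图是可导的,因而可以端到端训练;beta 越大,结果越接近硬 argmax。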


CV-47-标题 UNav An Infrastructure-Independent Vision-Based Navigation System for People with Blindness and Low vision

链接: https://arxiv.org/abs/2209.11336
作者: Anbang Yang, Mahya Beheshti, Todd E Hudson, Rajesh Vedanthan, Wachara Riewpaiboon, Pattanasak Mongkolwat, Chen Feng, John-Ross Rizzo


Abstract: Vision-based localization approaches now underpin newly emerging navigation pipelines for myriad use cases from robotics to assistive technologies. Compared to sensor-based solutions, vision-based localization does not require pre-installed sensor infrastructure, which is costly, time-consuming, and/or often infeasible at scale. Herein, we propose a novel vision-based localization pipeline for a specific use case: navigation support for end-users with blindness and low vision. Given a query image taken by an end-user on a mobile application, the pipeline leverages a visual place recognition (VPR) algorithm to find similar images in a reference image database of the target space. The geolocations of these similar images are utilized in downstream tasks that employ a weighted-average method to estimate the end-user’s location and a perspective-n-point (PnP) algorithm to estimate the end-user’s direction. Additionally, this system implements Dijkstra’s algorithm to calculate a shortest path based on a navigable map that includes trip origin and destination. The topometric map used for localization and navigation is built using a customized graphical user interface that projects a 3D reconstructed sparse map, built from a sequence of images, to the corresponding a priori 2D floor plan. Sequential images used for map construction can be collected in a pre-mapping step or scavenged through public databases/citizen science. The end-to-end system can be installed on any internet-accessible device with a camera that hosts a custom mobile application. For evaluation purposes, mapping and localization were tested in a complex hospital environment. The evaluation results demonstrate that our system can achieve localization with an average error of less than 1 meter without knowledge of the camera’s intrinsic parameters, such as focal length.

摘要:基于视觉的定位方法如今支撑着从机器人到辅助技术等无数用例的新兴导航管道。与基于传感器的解决方案相比,基于视觉的定位不需要预先安装传感器基础设施,后者成本高、耗时,且往往难以大规模部署。在本文中,我们针对一个特定用例提出了一个新颖的基于视觉的定位管道:为失明和低视力的最终用户提供导航支持。给定最终用户在移动应用上拍摄的查询图像,该管道利用视觉位置识别(VPR)算法在目标空间的参考图像数据库中寻找相似图像。这些相似图像的地理位置被用于下游任务:采用加权平均法估计最终用户的位置,并用透视n点(PnP)算法估计最终用户的朝向。此外,该系统实现了 Dijkstra 算法,基于包含行程起点和目的地的可导航地图计算最短路径。用于定位和导航的拓扑度量地图通过定制的图形用户界面构建,该界面将由图像序列构建的3D重建稀疏地图投影到相应的先验2D平面图上。用于构建地图的序列图像可以在预建图步骤中收集,也可以从公共数据库/公民科学项目中获取。端到端系统可以安装在任何带摄像头、可访问互联网并承载定制移动应用的设备上。出于评估目的,我们在复杂的医院环境中测试了建图和定位。评估结果表明,我们的系统在不知道相机内参(例如焦距)的情况下,可以实现平均误差小于1米的定位。
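摘要中提到系统用 Dijkstra 算法在可导航地图上计算最短路径。下面是基于最小堆的 Dijkstra 最小示意;图中的节点名称和边权重均为假设性的示意数据,并非论文使用的医院地图:

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest path on a weighted digraph {node: [(nbr, weight), ...]}."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    seen = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the path by walking predecessors back from dst.
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return path[::-1], dist[dst]

# Hypothetical navigable map: nodes are waypoints on a 2D floor plan.
g = {"entry": [("hall", 2), ("stairs", 5)],
     "hall": [("stairs", 1), ("clinic", 4)],
     "stairs": [("clinic", 1)],
     "clinic": []}
path, cost = dijkstra(g, "entry", "clinic")
```

在实际系统中,路径点由拓扑度量地图给出,边权重可取平面图上的欧氏距离。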

CV-48-标题 Privacy-Preserving Person Detection Using Low-Resolution Infrared Cameras

链接: https://arxiv.org/abs/2209.11335
作者: Thomas Dubail, Fidel Alejandro Guerrero Peña, Heitor Rapela Medeiros, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli


Abstract: In intelligent building management, knowing the number of people and their location in a room are important for better control of its illumination, ventilation, and heating with reduced costs and improved comfort. This is typically achieved by detecting people using compact embedded devices that are installed on the room’s ceiling, and that integrate low-resolution infrared camera, which conceals each person’s identity. However, for accurate detection, state-of-the-art deep learning models still require supervised training using a large annotated dataset of images. In this paper, we investigate cost-effective methods that are suitable for person detection based on low-resolution infrared images. Results indicate that for such images, we can reduce the amount of supervision and computation, while still achieving a high level of detection accuracy. Going from single-shot detectors that require bounding box annotations of each person in an image, to auto-encoders that only rely on unlabelled images that do not contain people, allows for considerable savings in terms of annotation costs, and for models with lower computational costs. We validate these experimental findings on two challenging top-view datasets with low-resolution infrared images.


CV-49-标题 FuTH-Net Fusing Temporal Relations and Holistic Features for Aerial Video Classification

链接: https://arxiv.org/abs/2209.11316
作者: Pu Jin, Lichao Mou, Yuansheng Hua, Gui-Song Xia, Xiao Xiang Zhu


Abstract: Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition due to its low cost and fast mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current researches mainly focus on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies which are important for describing complicated dynamics. In this paper, we propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by the multi-scale temporal relations in a novel fusion module for yielding more discriminative video representations. More specially, FuTH-Net employs a two-pathway architecture: (1) a holistic representation pathway to learn a general feature of both frame appearances and shortterm temporal variations and (2) a temporal relation pathway to capture multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterwards, a novel fusion module is proposed to spatiotemporal integrate the two features learned from the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves the state-of-the-art results. This demonstrates its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at this https URL.

摘要:由于成本低、机动性强,无人机(UAV)现已广泛应用于数据采集。随着航拍视频数量的增加,对这些视频进行自动解析的需求正在激增。为实现这一目标,当前的研究主要集中于通过沿空间和时间维度的卷积来提取整体特征。但是,这些方法受限于较小的时间感受野,无法充分捕获对描述复杂动态十分重要的长期时间依赖性。在本文中,我们提出了一个新颖的深度神经网络,称为 FuTH-Net,不仅建模整体特征,还建模用于航拍视频分类的时间关系。此外,整体特征在一个新颖的融合模块中由多尺度时间关系加以细化,以产生更具判别性的视频表示。更具体地,FuTH-Net 采用双通路架构:(1)整体表示通路,学习帧外观和短期时间变化的一般特征;(2)时间关系通路,捕获任意帧之间的多尺度时间关系,提供长期时间依赖性。随后,提出了一个新颖的融合模块,对从这两条通路学到的两种特征进行时空整合。我们的模型在两个航拍视频分类数据集(ERA 和 Drone-Action)上进行了评估,并取得了最新的结果。这证明了它在不同识别任务(事件分类和人类动作识别)上的有效性和良好的泛化能力。为了促进进一步的研究,我们在此 HTTPS URL 上发布了代码。

CV-50-标题 Colonoscopy Landmark Detection using Vision Transformers

链接: https://arxiv.org/abs/2209.11304
作者: Aniruddha Tamhane, Tse’ela Mida, Erez Posner, Moshe Bouhnik


Abstract: Colonoscopy is a routine outpatient procedure used to examine the colon and rectum for any abnormalities including polyps, diverticula and narrowing of colon structures. A significant amount of the clinician’s time is spent in post-processing snapshots taken during the colonoscopy procedure, for maintaining medical records or further investigation. Automating this step can save time and improve the efficiency of the process. In our work, we have collected a dataset of 120 colonoscopy videos and 2416 snapshots taken during the procedure, that have been annotated by experts. Further, we have developed a novel, vision-transformer based landmark detection algorithm that identifies key anatomical landmarks (the appendiceal orifice, ileocecal valve/cecum landmark and rectum retroflexion) from snapshots taken during colonoscopy. Our algorithm uses an adaptive gamma correction during preprocessing to maintain a consistent brightness for all images. We then use a vision transformer as the feature extraction backbone and a fully connected network based classifier head to categorize a given frame into four classes: the three landmarks or a non-landmark frame. We compare the vision transformer (ViT-B/16) backbone with ResNet-101 and ConvNext-B backbones that have been trained similarly. We report an accuracy of 82% with the vision transformer backbone on a test dataset of snapshots.


CV-51-标题 Deep Domain Adaptation for Detecting Bomb Craters in Aerial Images

链接: https://arxiv.org/abs/2209.11299
作者: Marco Geiger, Dominik Martin, Niklas Kühl
备注: 56th Annual Hawaii International Conference on System Sciences (HICSS-56)


Abstract: The aftermath of air raids can still be seen for decades after the devastating events. Unexploded ordnance (UXO) is an immense danger to human life and the environment. Through the assessment of wartime images, experts can infer the occurrence of a dud. The current manual analysis process is expensive and time-consuming, thus automated detection of bomb craters by using deep learning is a promising way to improve the UXO disposal process. However, these methods require a large amount of manually labeled training data. This work leverages domain adaptation with moon surface images to address the problem of automated bomb crater detection with deep learning under the constraint of limited training data. This paper contributes to both academia and practice (1) by providing a solution approach for automated bomb crater detection with limited training data and (2) by demonstrating the usability and associated challenges of using synthetic images for domain adaptation.


CV-52-标题 T2FPV Constructing High-Fidelity First-Person View Datasets From Real-World Pedestrian Trajectories

链接: https://arxiv.org/abs/2209.11294
作者: Benjamin Stoler, Meghdeep Jana, Soonmin Hwang, Jean Oh


Abstract: Predicting pedestrian motion is essential for developing socially-aware robots that interact in a crowded environment. While the natural visual perspective for a social interaction setting is an egocentric view, the majority of existing work in trajectory prediction has been investigated purely in the top-down trajectory space. To support first-person view trajectory prediction research, we present T2FPV, a method for constructing high-fidelity first-person view datasets given a real-world, top-down trajectory dataset; we showcase our approach on the ETH/UCY pedestrian dataset to generate the egocentric visual data of all interacting pedestrians. We report that the bird’s-eye view assumption used in the original ETH/UCY dataset, i.e., an agent can observe everyone in the scene with perfect information, does not hold in the first-person views; only a fraction of agents are fully visible during each 20-timestep scene used commonly in existing work. We evaluate existing trajectory prediction approaches under varying levels of realistic perception – displacement errors suffer a 356% increase compared to the top-down, perfect information setting. To promote research in first-person view trajectory prediction, we release our T2FPV-ETH dataset and software tools.

摘要:预测行人运动对于开发能在拥挤环境中交互的具有社会意识的机器人至关重要。虽然社交互动场景的自然视觉视角是以自我为中心的视角,但现有的轨迹预测工作大多仅在自上而下的轨迹空间中进行研究。为了支持第一人称视角的轨迹预测研究,我们提出了 T2FPV,这是一种在给定真实的自上而下轨迹数据集的情况下构建高保真第一人称视角数据集的方法;我们在 ETH/UCY 行人数据集上展示了该方法,为所有交互行人生成以自我为中心的视觉数据。我们报告说,原始 ETH/UCY 数据集中使用的鸟瞰视角假设,即智能体可以以完美信息观察场景中的每个人,在第一人称视角中并不成立;在现有工作常用的每个20时间步的场景中,只有一小部分智能体是完全可见的。我们在不同程度的真实感知水平下评估了现有的轨迹预测方法:与自上而下的完美信息设置相比,位移误差增加了356%。为了促进第一人称视角轨迹预测的研究,我们发布了 T2FPV-ETH 数据集和软件工具。

CV-53-标题 FusionVAE A Deep Hierarchical Variational Autoencoder for RGB Image Fusion

链接: https://arxiv.org/abs/2209.11277
作者: Fabian Duffhauss, Ngo Anh Vien, Hanna Ziesche, Gerhard Neumann
备注: Accepted at ECCV 2022


Abstract: Sensor fusion can significantly improve the performance of many computer vision tasks. However, traditional fusion approaches are either not data-driven and cannot exploit prior knowledge nor find regularities in a given dataset or they are restricted to a single application. We overcome this shortcoming by presenting a novel deep hierarchical variational autoencoder called FusionVAE that can serve as a basis for many fusion tasks. Our approach is able to generate diverse image samples that are conditioned on multiple noisy, occluded, or only partially visible input images. We derive and optimize a variational lower bound for the conditional log-likelihood of FusionVAE. In order to assess the fusion capabilities of our model thoroughly, we created three novel datasets for image fusion based on popular computer vision datasets. In our experiments, we show that FusionVAE learns a representation of aggregated information that is relevant to fusion tasks. The results demonstrate that our approach outperforms traditional methods significantly. Furthermore, we present the advantages and disadvantages of different design choices.


CV-54-标题 Capsule Network based Contrastive Learning of Unsupervised Visual Representations

链接: https://arxiv.org/abs/2209.11276
作者: Harsh Panwar, Ioannis Patras


Abstract: Capsule Networks have shown tremendous advancement in the past decade, outperforming the traditional CNNs in various task due to it’s equivariant properties. With the use of vector I/O which provides information of both magnitude and direction of an object or it’s part, there lies an enormous possibility of using Capsule Networks in unsupervised learning environment for visual representation tasks such as multi class image classification. In this paper, we propose Contrastive Capsule (CoCa) Model which is a Siamese style Capsule Network using Contrastive loss with our novel architecture, training and testing algorithm. We evaluate the model on unsupervised image classification CIFAR-10 dataset and achieve a top-1 test accuracy of 70.50% and top-5 test accuracy of 98.10%. Due to our efficient architecture our model has 31 times less parameters and 71 times less FLOPs than the current SOTA in both supervised and unsupervised learning.
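CoCa 使用对比损失训练孪生式胶囊网络。下面用 numpy 给出对比学习中常见的 NT-Xent 式损失的最小示意(这只是该类损失的通用形式,而非论文的具体实现;嵌入数据为随机生成的假设值):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent-style loss over a batch of paired embeddings (2N views):
    each view's positive is its counterpart; all other views are negatives."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-logprob[np.arange(2 * n), pos].mean())

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.01 * rng.normal(size=(8, 16))   # near-identical views
shuffled = rng.normal(size=(8, 16))                    # unrelated "views"
```

当两组视图对齐(来自同一图像的增广)时损失明显低于无关视图,这正是对比目标在无监督设置下拉近正样本、推远负样本的机制。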


CV-55-标题 Optimization of FPGA-based CNN Accelerators Using Metaheuristics

链接: https://arxiv.org/abs/2209.11272
作者: Sadiq M. Sait, Aiman El-Maleh, Mohammad Altakrouri, Ahmad Shawahna
备注: 23 pages, 7 figures, 9 tables. in The Journal of Supercomputing, 2022


Abstract: In recent years, convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields and with accuracy that was not possible before. However, this comes with extensive computational requirements, which made general CPUs unable to deliver the desired real-time performance. At the same time, FPGAs have seen a surge in interest for accelerating CNN inference. This is due to their ability to create custom designs with different levels of parallelism. Furthermore, FPGAs provide better performance per watt compared to GPUs. The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs), each of which is tailored for a subset of layers. However, the growing complexity of CNN architectures makes optimizing the resources available on the target FPGA device to deliver optimal performance more challenging. In this paper, we present a CNN accelerator and an accompanying automated design methodology that employs metaheuristics for partitioning available FPGA resources to design a Multi-CLP accelerator. Specifically, the proposed design tool adopts simulated annealing (SA) and tabu search (TS) algorithms to find the number of CLPs required and their respective configurations to achieve optimal performance on a given target FPGA device. Here, the focus is on the key specifications and hardware resources, including digital signal processors, block RAMs, and off-chip memory bandwidth. Experimental results and comparisons using four well-known benchmark CNNs are presented demonstrating that the proposed acceleration framework is both encouraging and promising. The SA-/TS-based Multi-CLP achieves 1.31x - 2.37x higher throughput than the state-of-the-art Single-/Multi-CLP approaches in accelerating AlexNet, SqueezeNet 1.1, VGGNet, and GoogLeNet architectures on the Xilinx VC707 and VC709 FPGA boards.

摘要:近年来,卷积神经网络(CNN)展示了它们在许多领域解决问题的能力,并达到了以前无法实现的准确性。但是,这伴随着巨大的计算需求,使得通用CPU无法提供所需的实时性能。与此同时,利用FPGA加速CNN推理的兴趣激增。这是因为FPGA能够创建具有不同并行度的定制设计。此外,与GPU相比,FPGA每瓦性能更好。基于FPGA的CNN加速器的当前趋势是实现多个卷积层处理器(CLP),每个处理器针对一部分层量身定制。但是,CNN体系结构日益增长的复杂性使得优化目标FPGA设备上的可用资源以获得最佳性能变得更具挑战性。在本文中,我们提出了一个CNN加速器和配套的自动化设计方法,该方法采用元启发式算法对可用FPGA资源进行划分,以设计多CLP加速器。具体而言,所提出的设计工具采用模拟退火(SA)和禁忌搜索(TS)算法来寻找所需的CLP数量及其各自的配置,以在给定的目标FPGA设备上实现最佳性能。这里的重点是关键规格和硬件资源,包括数字信号处理器、块RAM和片外内存带宽。文中给出了使用四个著名基准CNN的实验结果和比较,表明所提出的加速框架既令人鼓舞又有前景。在 Xilinx VC707 和 VC709 FPGA 板上加速 AlexNet、SqueezeNet 1.1、VGGNet 和 GoogLeNet 架构时,基于 SA/TS 的多 CLP 的吞吐量比最先进的单/多 CLP 方法高 1.31 倍至 2.37 倍。
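论文用模拟退火等元启发式算法把卷积层划分到多个 CLP。下面是模拟退火用于层划分的一个玩具示意:目标函数取最繁忙 CLP 的负载(吞吐瓶颈),每层的计算开销和 CLP 数量均为假设值,并非论文的真实资源模型:

```python
import math
import random

random.seed(0)
layer_work = [8, 3, 5, 9, 2, 7, 4, 6]   # hypothetical per-layer compute cost
N_CLP = 3

def cost(assign):
    """Toy objective: throughput is limited by the busiest CLP."""
    loads = [0.0] * N_CLP
    for layer, clp in enumerate(assign):
        loads[clp] += layer_work[layer]
    return max(loads)

def anneal(steps=5000, t0=10.0, alpha=0.999):
    """Simulated annealing: mutate one layer's CLP assignment per step,
    accept worse moves with probability exp(-delta / temperature)."""
    assign = [random.randrange(N_CLP) for _ in layer_work]
    best, best_cost, t = assign[:], cost(assign), t0
    for _ in range(steps):
        cand = assign[:]
        cand[random.randrange(len(cand))] = random.randrange(N_CLP)
        delta = cost(cand) - cost(assign)
        if delta <= 0 or random.random() < math.exp(-delta / t):
            assign = cand
            if cost(assign) < best_cost:
                best, best_cost = assign[:], cost(assign)
        t *= alpha
    return best, best_cost

assign, c = anneal()
```

真实设计工具的代价模型还需考虑 DSP、块RAM 和片外带宽等约束,但接受准则与降温调度的骨架与此一致。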

CV-56-标题 Recurrence-free Survival Prediction under the Guidance of Automatic Gross Tumor Volume Segmentation for Head and Neck Cancers

链接: https://arxiv.org/abs/2209.11268
作者: Kai Wang, Yunxiang Li, Michael Dohopolski, Tao Peng, Weiguo Lu, You Zhang, Jing Wang
备注: MICCAI 2022, HECKTOR Challenge Submission


Abstract: For Head and Neck Cancers (HNC) patient management, automatic gross tumor volume (GTV) segmentation and accurate pre-treatment cancer recurrence prediction are of great importance to assist physicians in designing personalized management plans, which have the potential to improve the treatment outcome and quality of life for HNC patients. In this paper, we developed an automated primary tumor (GTVp) and lymph nodes (GTVn) segmentation method based on combined pre-treatment positron emission tomography/computed tomography (PET/CT) scans of HNC patients. We extracted radiomics features from the segmented tumor volume and constructed a multi-modality tumor recurrence-free survival (RFS) prediction model, which fused the prediction results from separate CT radiomics, PET radiomics, and clinical models. We performed 5-fold cross-validation to train and evaluate our methods on the MICCAI 2022 HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR) dataset. The ensemble prediction results on the testing cohort achieved Dice scores of 0.77 and 0.73 for GTVp and GTVn segmentation, respectively, and a C-index value of 0.67 for RFS prediction. The code is publicly available (this https URL). Our team’s name is AIRT.

摘要:在头颈癌(HNC)患者管理中,自动总肿瘤体积(GTV)分割和准确的治疗前癌症复发预测对协助医生制定个性化管理方案非常重要,有望改善HNC患者的治疗结果和生活质量。在本文中,我们基于HNC患者治疗前的正电子发射断层扫描/计算机断层扫描(PET/CT)组合扫描,开发了一种自动的原发肿瘤(GTVp)和淋巴结(GTVn)分割方法。我们从分割的肿瘤体积中提取影像组学特征,并构建了一个多模态肿瘤无复发生存(RFS)预测模型,该模型融合了来自单独的CT影像组学、PET影像组学和临床模型的预测结果。我们进行了5折交叉验证,在 MICCAI 2022 头颈部肿瘤分割与结果预测挑战赛(HECKTOR)数据集上训练和评估我们的方法。在测试队列上,集成预测结果在 GTVp 和 GTVn 分割上分别取得了 0.77 和 0.73 的 Dice 分数,RFS 预测的 C-index 为 0.67。代码已公开(此 HTTPS URL)。我们团队的名称是 AIRT。
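摘要中报告的分割指标是 Dice 系数(两个二值掩膜重叠度的标准度量)。下面是 Dice 的最小实现示意,掩膜为构造的示例数据:

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A|+|B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

# Toy example: a 4x4 ground-truth region and a prediction shifted by 1 px.
gt = np.zeros((8, 8), dtype=int)
gt[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=int)
pred[3:7, 3:7] = 1
d = dice(pred, gt)   # overlap is 3x3 = 9 px -> 2*9 / (16+16) = 0.5625
```

在3D医学分割中同样的公式逐体素应用于整个体积;eps 防止空掩膜时除零。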

CV-57-标题 3DPCT 3D Point Cloud Transformer with Dual Self-attention

链接: https://arxiv.org/abs/2209.11255
作者: Dening Lu, Kyle Gao, Qian Xie, Linlin Xu, Jonathan Li
备注: 10 pages, 5 figures, 4 tables


Abstract: Transformers have resulted in remarkable achievements in the field of image processing. Inspired by this great success, the application of Transformers to 3D point cloud processing has drawn more and more attention. This paper presents a novel point cloud representational learning network, 3D Point Cloud Transformer with Dual Self-attention (3DPCT) and an encoder-decoder structure. Specifically, 3DPCT has a hierarchical encoder, which contains two local-global dual-attention modules for the classification task (three modules for the segmentation task), with each module consisting of a Local Feature Aggregation (LFA) block and a Global Feature Learning (GFL) block. The GFL block is dual self-attention, with both point-wise and channel-wise self-attention to improve feature extraction. Moreover, in LFA, to better leverage the local information extracted, a novel point-wise self-attention model, named as Point-Patch Self-Attention (PPSA), is designed. The performance is evaluated on both classification and segmentation datasets, containing both synthetic and real-world data. Extensive experiments demonstrate that the proposed method achieved state-of-the-art results on both classification and segmentation tasks.

摘要:Transformer 在图像处理领域取得了显著的成就。受这一巨大成功的启发,Transformer 在3D点云处理中的应用引起了越来越多的关注。本文提出了一个新颖的点云表示学习网络:具有双重自注意力的3D点云 Transformer(3DPCT),以及一个编码器-解码器结构。具体而言,3DPCT 具有一个分层编码器,其中分类任务包含两个局部-全局双注意力模块(分割任务为三个模块),每个模块由一个局部特征聚合(LFA)块和一个全局特征学习(GFL)块组成。GFL 块采用双重自注意力,同时具有逐点和逐通道的自注意力,以改进特征提取。此外,在 LFA 中,为了更好地利用提取的局部信息,设计了一种新颖的逐点自注意力模型,称为 Point-Patch 自注意力(PPSA)。我们在包含合成数据和真实数据的分类与分割数据集上评估了性能。大量实验表明,所提出的方法在分类和分割任务上都取得了最新的结果。
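"逐点"与"逐通道"自注意力的区别在于亲和矩阵作用的维度:前者在 N×N 的点间关系上加权,后者在 C×C 的通道间关系上加权。下面是一个简化的 numpy 示意(单头、无可学习投影,仅说明两条分支的张量形状,并非 3DPCT 的实际模块):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_attention(X):
    """Point-wise self-attention: N x N affinity over the points."""
    A = softmax(X @ X.T / np.sqrt(X.shape[1]), axis=-1)   # (N, N)
    return A @ X                                          # (N, C)

def channel_attention(X):
    """Channel-wise self-attention: C x C affinity over feature channels."""
    A = softmax(X.T @ X / np.sqrt(X.shape[0]), axis=-1)   # (C, C)
    return X @ A                                          # (N, C)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 32))     # 128 points, 32-dim features
Y = point_attention(X) + channel_attention(X)   # dual branches combined
```

两条分支输出形状相同,可直接相加或拼接;实际网络中每条分支还包含可学习的 Q/K/V 投影和残差连接。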

CV-58-标题 Dual-Cycle Self-Supervised Dual-View Fluorescence Microscopy Image Reconstruction using CycleGAN

链接: https://arxiv.org/abs/2209.11729
作者: Tomas Kerepecky, Jiaming Liu, Xue Wen Ng, David W. Piston, Ulugbek S. Kamilov
备注: 7 pages, 5 figures


Abstract: Three-dimensional fluorescence microscopy often suffers from anisotropy, where the resolution along the axial direction is lower than that within the lateral imaging plane. We address this issue by presenting Dual-Cycle, a new framework for joint deconvolution and fusion of dual-view fluorescence images. Inspired by the recent Neuroclear method, Dual-Cycle is designed as a cycle-consistent generative network trained in a self-supervised fashion by combining a dual-view generator and prior-guided degradation model. We validate Dual-Cycle on both synthetic and real data showing its state-of-the-art performance without any external training data.


CV-59-标题 Deep Learning-based Anonymization of Chest Radiographs: A Utility-preserving Measure for Patient Privacy

链接: https://arxiv.org/abs/2209.11531
作者: Kai Packhäuser, Sebastian Gündel, Florian Thamm, Felix Denzinger, Andreas Maier
备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible


Abstract: Robust and reliable anonymization of chest radiographs constitutes an essential step before publishing large datasets of such for research purposes. The conventional anonymization process is carried out by obscuring personal information in the images with black boxes and removing or replacing meta-information. However, such simple measures retain biometric information in the chest radiographs, allowing patients to be re-identified by a linkage attack. Therefore, we see an urgent need to obfuscate the biometric information appearing in the images. To the best of our knowledge, we propose the first deep learning-based approach for the targeted anonymization of chest radiographs while maintaining data utility for diagnostic and machine learning purposes. Our model architecture is a composition of three independent neural networks that, when collectively used, allow for learning a deformation field that is able to impede patient re-identification. The individual influence of each component is investigated with an ablation study. Quantitative results on the ChestX-ray14 dataset show a reduction of patient re-identification from 81.8% to 58.6% in the area under the receiver operating characteristic curve (AUC) with little impact on the abnormality classification performance. This indicates the ability to preserve underlying abnormality patterns while increasing patient privacy. Furthermore, we compare the proposed deep learning-based anonymization approach with differentially private image pixelization, and demonstrate the superiority of our method towards resolving the privacy-utility trade-off for chest radiographs.

摘要:胸部X光片的稳健而可靠的匿名化是出于研究目的发布此类大规模数据集之前的关键步骤。传统的匿名化过程通过用黑框遮盖图像中的个人信息并删除或替换元信息来完成。然而,这类简单措施会在胸部X光片中保留生物特征信息,使患者可能通过链接攻击被重新识别。因此,我们认为迫切需要对图像中出现的生物特征信息进行混淆。据我们所知,我们提出了第一种基于深度学习的方法,在保持数据在诊断和机器学习方面的效用的同时,有针对性地对胸部X光片进行匿名化。我们的模型架构由三个独立的神经网络组成,它们共同使用时可以学习一个能够阻碍患者重新识别的形变场。我们通过消融研究考察了每个组件的单独影响。在ChestX-ray14数据集上的定量结果显示,患者重新识别的受试者工作特征曲线下面积(AUC)从81.8%降至58.6%,而对异常分类性能的影响很小。这表明该方法能够在增强患者隐私的同时保留潜在的异常模式。此外,我们将所提出的基于深度学习的匿名化方法与差分隐私图像像素化进行了比较,证明了我们的方法在解决胸部X光片隐私-效用权衡方面的优越性。

CV-60-标题 Segmentation-based Information Extraction and Amalgamation in Fundus Images for Glaucoma Detection

链接: https://arxiv.org/abs/2209.11456
作者: Yanni Wang, Gang Yang, Dayong Ding, Jianchun Zao


Abstract: Glaucoma is a severe blinding disease, for which automatic detection methods are urgently needed to alleviate the scarcity of ophthalmologists. Many works have proposed to employ deep learning methods that involve the segmentation of optic disc and cup for glaucoma detection, in which the segmentation process is often considered merely as an upstream sub-task. The relationship between fundus images and segmentation masks in terms of joint decision-making in glaucoma assessment is rarely explored. We propose a novel segmentation-based information extraction and amalgamation method for the task of glaucoma detection, which leverages the robustness of segmentation masks without disregarding the rich information in the original fundus images. Experimental results on both private and public datasets demonstrate that our proposed method outperforms all models that utilize solely either fundus images or masks.


CV-61-标题 Modular Degradation Simulation and Restoration for Under-Display Camera

链接: https://arxiv.org/abs/2209.11455
作者: Yang Zhou, Yuda Song, Xin Du


Abstract: Under-display camera (UDC) provides an elegant solution for full-screen smartphones. However, UDC captured images suffer from severe degradation since sensors lie under the display. Although this issue can be tackled by image restoration networks, these networks require large-scale image pairs for training. To this end, we propose a modular network dubbed MPGNet trained using the generative adversarial network (GAN) framework for simulating UDC imaging. Specifically, we note that the UDC imaging degradation process contains brightness attenuation, blurring, and noise corruption. Thus we model each degradation with a characteristic-related modular network, and all modular networks are cascaded to form the generator. Together with a pixel-wise discriminator and supervised loss, we can train the generator to simulate the UDC imaging degradation process. Furthermore, we present a Transformer-style network named DWFormer for UDC image restoration. For practical purposes, we use depth-wise convolution instead of the multi-head self-attention to aggregate local spatial information. Moreover, we propose a novel channel attention module to aggregate global information, which is critical for brightness recovery. We conduct evaluations on the UDC benchmark, and our method surpasses the previous state-of-the-art models by 1.23 dB on the P-OLED track and 0.71 dB on the T-OLED track, respectively.

摘要:屏下摄像头(UDC)为全面屏智能手机提供了优雅的解决方案。然而,由于传感器位于显示屏之下,UDC拍摄的图像存在严重退化。尽管可以通过图像恢复网络解决此问题,但这些网络需要大规模图像对进行训练。为此,我们提出了一个名为MPGNet的模块化网络,使用生成对抗网络(GAN)框架训练,用于模拟UDC成像。具体而言,我们注意到UDC成像退化过程包含亮度衰减、模糊和噪声污染。因此,我们用与各自特性相关的模块化网络对每种退化建模,并将所有模块化网络级联构成生成器。结合逐像素判别器和有监督损失,我们可以训练生成器来模拟UDC成像退化过程。此外,我们提出了一个名为DWFormer的Transformer风格网络用于UDC图像恢复。出于实用考虑,我们使用深度卷积代替多头自注意力来聚合局部空间信息。此外,我们提出了一个新颖的通道注意力模块来聚合全局信息,这对于亮度恢复至关重要。我们在UDC基准上进行了评估,我们的方法分别在P-OLED赛道上以1.23 dB、在T-OLED赛道上以0.71 dB超越了之前的最先进模型。
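The degradation cascade described in the abstract (brightness attenuation → blurring → noise corruption) can be mimicked with hand-written stand-ins. In MPGNet each stage is a learned, characteristic-related module trained adversarially; the fixed functions below only illustrate how such modules cascade into a simulator.

```python
import numpy as np

rng = np.random.default_rng(0)

def attenuate(img, factor=0.7):
    """Brightness-attenuation stand-in (constant scaling)."""
    return img * factor

def blur(img, k=3):
    """Separable box-blur stand-in for the learned blurring module."""
    kernel = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, out)

def add_noise(img, sigma=0.01):
    """Additive Gaussian noise stand-in for the noise-corruption module."""
    return img + rng.normal(0.0, sigma, img.shape)

def simulate_udc(img):
    # Cascade the characteristic-related modules, as the generator does.
    for module in (attenuate, blur, add_noise):
        img = module(img)
    return np.clip(img, 0.0, 1.0)
```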

CV-62-标题 Learning to screen Glaucoma like the ophthalmologists

链接: https://arxiv.org/abs/2209.11431
作者: Junde Wu, Huihui Fang, Fei Li, Huazhu Fu, Yanwu Xu


Abstract: GAMMA Challenge is organized to encourage the AI models to screen the glaucoma from a combination of 2D fundus image and 3D optical coherence tomography volume, like the ophthalmologists.


CV-63-标题 Automated detection of Alzheimer disease using MRI images and deep neural networks - A review

链接: https://arxiv.org/abs/2209.11282
作者: Narotam Singh, Patteshwari.D, Neha Soni, Amita Kapoor
备注: 22 Pages, 5 Figures, 7 Tables


Abstract: Early detection of Alzheimer disease is crucial for deploying interventions and slowing the disease progression. A lot of machine learning and deep learning algorithms have been explored in the past decade with the aim of building an automated detection for Alzheimer. Advancements in data augmentation techniques and advanced deep learning architectures have opened up new frontiers in this field, and research is moving at a rapid speed. Hence, the purpose of this survey is to provide an overview of recent research on deep learning models for Alzheimer disease diagnosis. In addition to categorizing the numerous data sources, neural network architectures, and commonly used assessment measures, we also classify implementation and reproducibility. Our objective is to assist interested researchers in keeping up with the newest developments and in reproducing earlier investigations as benchmarks. In addition, we also indicate future research directions for this topic.


CV-64-标题 Hierarchical Graph Convolutional Network Built by Multiscale Atlases for Brain Disorder Diagnosis Using Functional Connectivity

链接: https://arxiv.org/abs/2209.11232
作者: Mianxin Liu, Han Zhang, Feng Shi, Dinggang Shen


Abstract: Functional connectivity network (FCN) data from functional magnetic resonance imaging (fMRI) is increasingly used for the diagnosis of brain disorders. However, state-of-the-art studies typically build the FCN using a single brain parcellation atlas at a certain spatial scale, which largely neglects functional interactions across different spatial scales in hierarchical manners. In this study, we propose a novel framework to perform multiscale FCN analysis for brain disorder diagnosis. We first use a set of well-defined multiscale atlases to compute multiscale FCNs. Then, we utilize biologically meaningful brain hierarchical relationships among the regions in multiscale atlases to perform nodal pooling across multiple spatial scales, namely “Atlas-guided Pooling”. Accordingly, we propose a Multiscale-Atlases-based Hierarchical Graph Convolutional Network (MAHGCN), built on stacked layers of graph convolution and atlas-guided pooling, for a comprehensive extraction of diagnostic information from multiscale FCNs. Experiments on neuroimaging data from 1792 subjects demonstrate the effectiveness of our proposed method in the diagnosis of Alzheimer’s disease (AD), the prodromal stage of AD (i.e., mild cognitive impairment [MCI]), and autism spectrum disorder (ASD), with accuracies of 88.9%, 78.6%, and 72.7%, respectively. All results show significant advantages of our proposed method over other competing methods. This study not only demonstrates the feasibility of brain disorder diagnosis using resting-state fMRI empowered by deep learning, but also highlights that the functional interactions in the multiscale brain hierarchy are worth being explored and integrated into deep learning network architectures for a better understanding of the neuropathology of brain disorders.

摘要:来自功能磁共振成像(fMRI)的功能连接网络(FCN)数据越来越多地用于脑疾病的诊断。然而,现有最先进的研究通常在某一空间尺度上使用单一脑分区图谱构建FCN,这在很大程度上忽略了不同空间尺度之间按层次组织的功能交互。在本研究中,我们提出了一个新颖的框架,对脑疾病诊断进行多尺度FCN分析。我们首先使用一组定义良好的多尺度图谱计算多尺度FCN。然后,我们利用多尺度图谱各区域之间具有生物学意义的脑层次关系,跨多个空间尺度执行节点池化,即“图谱引导池化”。据此,我们提出了基于多尺度图谱的层次图卷积网络(MAHGCN),它建立在堆叠的图卷积层和图谱引导池化之上,以从多尺度FCN中全面提取诊断信息。在1792名受试者的神经影像数据上的实验证明了我们所提方法在阿尔茨海默病(AD)、AD前驱阶段(即轻度认知障碍[MCI])以及自闭症谱系障碍(ASD)诊断中的有效性,准确率分别为88.9%、78.6%和72.7%。所有结果均显示我们所提方法相对其他竞争方法具有显著优势。这项研究不仅证明了借助深度学习、使用静息态fMRI进行脑疾病诊断的可行性,还强调多尺度脑层次结构中的功能交互值得被探索并整合到深度学习网络架构中,以更好地理解脑疾病的神经病理学。
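“Atlas-guided pooling” amounts to aggregating fine-scale node features into coarse-scale regions according to a fixed parcel assignment. A minimal mean-pooling version, assuming a hypothetical integer assignment vector that maps each fine node to its coarse parent region, might look like:

```python
import numpy as np

def atlas_guided_pool(node_feats, assignment):
    """Mean-pool fine-scale node features (N_fine, C) into coarse regions,
    where assignment[i] is the coarse-region index of fine node i."""
    n_coarse = assignment.max() + 1
    coarse = np.zeros((n_coarse, node_feats.shape[1]))
    for r in range(n_coarse):
        coarse[r] = node_feats[assignment == r].mean(axis=0)
    return coarse
```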

CV-65-标题 A Trio-Method for Retinal Vessel Segmentation using Image Processing

链接: https://arxiv.org/abs/2209.11230
作者: Mahendra Kumar Gourisaria, Vinayak Singh, Manoj Sahni
备注: Accepted at 26th UK Conference on Medical Image Understanding and Analysis (MIUA-2022) (Abstract short paper)


Abstract: Inner retinal neurons are an essential part of the retina, and they are supplied with blood via retinal vessels. This paper primarily focuses on the segmentation of retinal vessels using a triple preprocessing approach. The DRIVE database was taken into consideration and preprocessed with Gabor filtering, Gaussian blur, and edge detection by Sobel and pruning. Segmentation was carried out by two proposed U-Net architectures. Both architectures were compared in terms of all the standard performance metrics. The preprocessing generated varied, interesting results, which impacted the segmentation results produced by the U-Net architectures. This real-time deployment can help in the efficient pre-processing of images with better segmentation and detection.
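Two of the preprocessing steps named above, Gaussian blur and Sobel edge detection, can be sketched with plain NumPy (Gabor filtering and pruning are omitted for brevity). This is an illustrative re-implementation under those assumptions, not the authors' pipeline.

```python
import numpy as np

def filter2d(img, kernel):
    """Naive 2-D filtering with zero padding (same output size)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel for the blur step."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax ** 2 / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def sobel_edges(img):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gy = gx.T
    return np.hypot(filter2d(img, gx), filter2d(img, gy))

def preprocess(img):
    # Blur, then detect vessel edges.
    return sobel_edges(filter2d(img, gaussian_kernel()))
```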



AI-0-标题 Evaluating Agent Interactions Through Episodic Knowledge Graphs

链接: https://arxiv.org/abs/2209.11746
作者: Selene Báez Santamaría, Piek Vossen, Thomas Baier


Abstract: We present a new method based on episodic Knowledge Graphs (eKGs) for evaluating (multimodal) conversational agents in open domains. This graph is generated by interpreting raw signals during conversation and is able to capture the accumulation of knowledge over time. We apply structural and semantic analysis of the resulting graphs and translate the properties into qualitative measures. We compare these measures with existing automatic and manual evaluation metrics commonly used for conversational agents. Our results show that our Knowledge-Graph-based evaluation provides more qualitative insights into interaction and the agent’s behavior.


AI-1-标题 The "Beatrix" Resurrections: Robust Backdoor Detection via Gram Matrices

链接: https://arxiv.org/abs/2209.11715
作者: Wanlun Ma, Derui Wang, Ruoxi Sun, Minhui Xue, Sheng Wen, Yang Xiang
备注: 19 pages, 23 figures. Code availability: this https URL


Abstract: Deep Neural Networks (DNNs) are susceptible to backdoor attacks during training. The model corrupted in this way functions normally, but when triggered by certain patterns in the input, produces a predefined target label. Existing defenses usually rely on the assumption of the universal backdoor setting in which poisoned samples share the same uniform trigger. However, recent advanced backdoor attacks show that this assumption is no longer valid in dynamic backdoors where the triggers vary from input to input, thereby defeating the existing defenses. In this work, we propose a novel technique, Beatrix (backdoor detection via Gram matrix). Beatrix utilizes Gram matrix to capture not only the feature correlations but also the appropriately high-order information of the representations. By learning class-conditional statistics from activation patterns of normal samples, Beatrix can identify poisoned samples by capturing the anomalies in activation patterns. To further improve the performance in identifying target labels, Beatrix leverages kernel-based testing without making any prior assumptions on representation distribution. We demonstrate the effectiveness of our method through extensive evaluation and comparison with state-of-the-art defensive techniques. The experimental results show that our approach achieves an F1 score of 91.1% in detecting dynamic backdoors, while the state of the art can only reach 36.9%.

摘要:深度神经网络(DNN)在训练过程中容易受到后门攻击。以这种方式被破坏的模型在正常情况下功能正常,但当被输入中的特定模式触发时,会产生预定义的目标标签。现有防御通常依赖于通用后门设置的假设,即中毒样本共享同一个统一触发器。然而,最近的高级后门攻击表明,这一假设在动态后门中不再成立:触发器随输入而变化,从而使现有防御失效。在这项工作中,我们提出了一种新颖的技术Beatrix(基于Gram矩阵的后门检测)。Beatrix利用Gram矩阵不仅捕获特征相关性,还捕获表示的适当高阶信息。通过从正常样本的激活模式中学习类条件统计量,Beatrix可以通过捕获激活模式中的异常来识别中毒样本。为了进一步提高识别目标标签的性能,Beatrix利用基于核的检验,而无需对表示分布做任何先验假设。我们通过与最先进防御技术的广泛评估和比较证明了我们方法的有效性。实验结果表明,我们的方法在检测动态后门时达到91.1%的F1得分,而现有最佳方法仅能达到36.9%。
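The Gram-matrix statistics at the heart of Beatrix can be illustrated as follows: for a layer's activations, compute (high-order) Gram matrices and flag samples whose flattened Gram features deviate from class-conditional statistics gathered on clean samples. The sketch shows only this feature computation and a simple deviation score, not the kernel-based test or the full detector; the p-th-root normalization follows common Gram-matrix out-of-distribution practice and is an assumption here.

```python
import numpy as np

def gram_features(acts, orders=(1, 2)):
    """Upper-triangular entries of order-p Gram matrices of a layer's
    activations `acts` (C channels x N values per example)."""
    feats = []
    for p in orders:
        g = (acts ** p) @ (acts ** p).T            # (C, C) order-p Gram matrix
        g = np.sign(g) * np.abs(g) ** (1.0 / p)    # p-th root normalization
        iu = np.triu_indices(g.shape[0])
        feats.append(g[iu])
    return np.concatenate(feats)

def deviation_score(sample_feats, clean_mean, clean_std, eps=1e-8):
    # Mean normalized deviation from clean class-conditional statistics.
    return np.mean(np.abs(sample_feats - clean_mean) / (clean_std + eps))
```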

AI-2-标题 Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation

链接: https://arxiv.org/abs/2209.11679
作者: Chenxu Wang, Fuli Feng, Yang Zhang, Qifan Wang, Xunhan Hu, Xiangnan He


Abstract: Historical interactions are the default choice for recommender model training, which typically exhibit high sparsity, i.e., most user-item pairs are unobserved missing data. A standard choice is treating the missing data as negative training samples and estimating interaction likelihood between user-item pairs along with the observed interactions. In this way, some potential interactions are inevitably mislabeled during training, which will hurt the model fidelity, hindering the model to recall the mislabeled items, especially the long-tail ones. In this work, we investigate the mislabeling issue from a new perspective of aleatoric uncertainty, which describes the inherent randomness of missing data. The randomness pushes us to go beyond merely the interaction likelihood and embrace aleatoric uncertainty modeling. Towards this end, we propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework that consists of a new uncertainty estimator along with a normal recommender model. According to the theory of aleatoric uncertainty, we derive a new recommendation objective to learn the estimator. As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty, which is demonstrated to improve the recommendation performance of less popular items without sacrificing the overall performance. We instantiate AUR on three representative recommender models: Matrix Factorization (MF), LightGCN, and VAE from mainstream model architectures. Extensive results on two real-world datasets validate the effectiveness of AUR w.r.t. better recommendation results, especially on long-tail items.

摘要:历史交互是推荐模型训练的默认选择,它通常表现出高度稀疏性,即大多数用户-物品对是未被观测到的缺失数据。标准做法是将缺失数据视为负训练样本,并与观测到的交互一起估计用户-物品对之间的交互可能性。这样,一些潜在交互在训练中不可避免地被错误标注,这会损害模型保真度,妨碍模型召回被错误标注的物品,尤其是长尾物品。在这项工作中,我们从偶然不确定性(aleatoric uncertainty)这一新视角研究错误标注问题,它刻画了缺失数据的固有随机性。这种随机性促使我们超越单纯的交互可能性,引入偶然不确定性建模。为此,我们提出了一个新的偶然不确定性感知推荐(AUR)框架,它由一个新的不确定性估计器和一个普通推荐模型组成。根据偶然不确定性理论,我们推导了一个新的推荐目标来学习该估计器。由于错误标注的几率反映了一对用户-物品的潜力,AUR根据不确定性进行推荐,实验证明这能在不牺牲整体性能的情况下改进较冷门物品的推荐效果。我们在三个有代表性的推荐模型上实例化AUR:来自主流模型架构的矩阵分解(MF)、LightGCN和VAE。在两个真实世界数据集上的大量结果验证了AUR在取得更好推荐结果方面的有效性,尤其是在长尾物品上。
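A textbook way to attach aleatoric uncertainty to a point prediction is the heteroscedastic Gaussian negative log-likelihood, where the model outputs both a mean score and a log-variance per user-item pair. AUR derives its own objective from aleatoric-uncertainty theory; the formula below is only the standard form, shown to convey the idea that a high predicted variance down-weights the penalty on a possibly mislabeled pair.

```python
import numpy as np

def heteroscedastic_nll(pred_mean, pred_logvar, label):
    """Gaussian NLL (up to a constant) with a learned per-pair log-variance.
    Large predicted variance softens the squared-error penalty, at the cost
    of the log-variance regularization term."""
    var = np.exp(pred_logvar)
    return 0.5 * (pred_logvar + (label - pred_mean) ** 2 / var)
```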

AI-3-标题 The SpeakIn Speaker Verification System for Far-Field Speaker Verification Challenge 2022

链接: https://arxiv.org/abs/2209.11625
作者: Yu Zheng, Jinghan Peng, Yihao Chen, Yajun Zhang, Jialong Wang, Min Liu, Minqiang Xu
备注: 5 pages. arXiv admin note: text overlap with arXiv:2209.10846


Abstract: This paper describes the speaker verification (SV) systems submitted by the SpeakIn team to Task 1 and Task 2 of the Far-Field Speaker Verification Challenge 2022 (FFSVC2022). The SV tasks of the challenge focus on the problems of fully supervised far-field speaker verification (Task 1) and semi-supervised far-field speaker verification (Task 2). In Task 1, we used the VoxCeleb and FFSVC2020 datasets as train sets. For Task 2, we only used the VoxCeleb dataset as the train set. ResNet-based and RepVGG-based architectures were developed for this challenge. A global statistic pooling structure and an MQMHA pooling structure were used to aggregate the frame-level features across time to obtain utterance-level representations. We adopted AM-Softmax and AAM-Softmax to classify the resulting embeddings. We innovatively propose a staged transfer learning method: in the pre-training stage we reserve the speaker weights, with no positive samples to train them; we then fine-tune these weights with both positive and negative samples in the second stage. Compared with the traditional transfer learning strategy, this strategy better improves model performance. The Sub-Mean and AS-Norm backend methods were used to address domain mismatch. In the fusion stage, three models were fused in Task 1 and two models in Task 2. On the FFSVC2022 leaderboard, the EER of our submission is 3.0049% and the corresponding minDCF is 0.2938 in Task 1. In Task 2, the EER and minDCF are 6.2060% and 0.5232, respectively. Our approach leads to excellent performance and ranks 1st in both challenge tasks.


AI-4-标题 involve-MI: Informative Planning with High-Dimensional Non-Parametric Beliefs

链接: https://arxiv.org/abs/2209.11591
作者: Gilad Rotman, Vadim Indelman


Abstract: One of the most complex tasks of decision making and planning is to gather information. This task becomes even more complex when the state is high-dimensional and its belief cannot be expressed with a parametric distribution. Although the state is high-dimensional, in many problems only a small fraction of it might be involved in transitioning the state and generating observations. We exploit this fact to calculate an information-theoretic expected reward, mutual information (MI), over a much lower-dimensional subset of the state, to improve efficiency without sacrificing accuracy. A similar approach was used in previous works, yet specifically for Gaussian distributions, and here we extend it to general distributions. Moreover, we apply the dimensionality reduction to cases in which the new states are augmented to the previous ones, again without sacrificing accuracy. We then develop an estimator for the MI that works in a Sequential Monte Carlo (SMC) manner and avoids the reconstruction of future belief surfaces. Finally, we show how this work is applied to the informative planning optimization problem. The work is then evaluated in a simulation of an active SLAM problem, where the improvement in both accuracy and timing is demonstrated.
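For the special case of a jointly Gaussian belief, the MI between two blocks of the state has the closed form I(a;b) = 0.5 * log(|Sigma_aa| * |Sigma_bb| / |Sigma|), which is the quantity the paper generalizes to non-parametric beliefs and evaluates over a low-dimensional subset of the state. A sketch of the Gaussian special case:

```python
import numpy as np

def gaussian_mi(cov, idx_a, idx_b):
    """Mutual information between two blocks of a jointly Gaussian state,
    given the joint covariance `cov` and index lists for the two blocks."""
    a = np.ix_(idx_a, idx_a)
    b = np.ix_(idx_b, idx_b)
    joint = np.ix_(idx_a + idx_b, idx_a + idx_b)
    _, logdet_a = np.linalg.slogdet(cov[a])
    _, logdet_b = np.linalg.slogdet(cov[b])
    _, logdet_j = np.linalg.slogdet(cov[joint])
    return 0.5 * (logdet_a + logdet_b - logdet_j)
```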