友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天11:30左右邮件定时自动发送。


概览 (2023-05-25)


  • 279篇自然语言处理(NLP: cs.CL)
  • 122篇计算机视觉(CV: cs.CV)
  • 137篇机器学习(ML: cs.LG)
  • 18篇人工智能(AI: cs.AI)
  • 8篇信息检索(IR: cs.IR)
  • 其它主题127篇


NLP-0-标题: Towards Revealing the Mystery behind Chain of Thought: a Theoretical Perspective

链接: https://arxiv.org/abs/2305.15408
作者: Guhao Feng, Yuntian Gu, Bohang Zhang, Haotian Ye, Di He, Liwei Wang
备注: 33 pages


Abstract: Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the capacity of LLMs with CoT in solving fundamental mathematical and decision-making problems. We start by giving an impossibility result showing that any bounded-depth Transformer cannot directly output correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of a constant size suffice to solve both tasks by generating CoT derivations using a commonly-used math language format. Moreover, we show LLMs with CoT are capable of solving a general class of decision-making problems known as Dynamic Programming, thus justifying its power in tackling complex real-world tasks. Finally, extensive experiments on four tasks show that, while Transformers always fail to predict the answers directly, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.

NLP-1-标题: Unit-based Speech-to-Speech Translation Without Parallel Data

链接: https://arxiv.org/abs/2305.15405
作者: Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
备注: 17 pages, 3 figures


Abstract: We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered, discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training an unit-based encoder-decoder language model with a denoising objective (b) training it with word-by-word translated utterance pairs created by aligning monolingual text embedding spaces and © running unsupervised backtranslation bootstrapping off of the initial translation model. Our approach avoids mapping the speech signal into text and uses speech-to-unit and unit-to-speech models instead of automatic speech recognition and text to speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets, finding that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU in German to English and 8.07 in English to German.

NLP-2-标题: AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation ACL2023

链接: https://arxiv.org/abs/2305.15403
作者: Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao
备注: Accepted to ACL 2023


Abstract: Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at this https URL

NLP-3-标题: Comparing Humans and Models on a Similar Scale: Towards Cognitive Gender Bias Evaluation in Coreference Resolution

链接: https://arxiv.org/abs/2305.15389
作者: Gili Lior, Gabriel Stanovsky


Abstract: Spurious correlations were found to be an important factor explaining model performance in various NLP tasks (e.g., gender or racial artifacts), often considered to be ‘‘shortcuts’’ to the actual task. However, humans tend to similarly make quick (and sometimes wrong) predictions based on societal and cognitive presuppositions. In this work we address the question: can we quantify the extent to which model biases reflect human behaviour? Answering this question will help shed light on model performance and provide meaningful comparisons against humans. We approach this question through the lens of the dual-process theory for human decision-making. This theory differentiates between an automatic unconscious (and sometimes biased) ‘‘fast system’’ and a ‘‘slow system’’, which when triggered may revisit earlier automatic reactions. We make several observations from two crowdsourcing experiments of gender bias in coreference resolution, using self-paced reading to study the ‘‘fast’’ system, and question answering to study the ‘‘slow’’ system under a constrained time setting. On real-world data humans make \sim 3% more gender-biased decisions compared to models, while on synthetic data models are \sim 12% more biased.

NLP-4-标题: Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering ACL2023

链接: https://arxiv.org/abs/2305.15387
作者: Avi Caciularu, Matthew E. Peters, Jacob Goldberger, Ido Dagan, Arman Cohan
备注: Accepted at ACL 2023; camera-ready version


Abstract: The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while “peeking” into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model – termed QAmden – and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7%, and significantly outperforms zero-shot GPT-3.5 and GPT-4.

NLP-5-标题: Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR INTERSPEECH2023

链接: https://arxiv.org/abs/2305.15386
作者: Kaushal Santosh Bhogale, Sai Sundaresan, Abhigyan Raman, Tahir Javed, Mitesh M. Khapra, Pratyush Kumar
备注: Accepted in INTERSPEECH 2023


Abstract: Improving ASR systems is necessary to make new LLM-based use-cases accessible to people across the globe. In this paper, we focus on Indian languages, and make the case that diverse benchmarks are required to evaluate and improve ASR systems for Indian languages. To address this, we collate Vistaar as a set of 59 benchmarks across various language and domain combinations, on which we evaluate 3 publicly available ASR systems and 2 commercial systems. We also train IndicWhisper models by fine-tuning the Whisper models on publicly available training datasets across 12 Indian languages totalling to 10.7K hours. We show that IndicWhisper significantly improves on considered ASR systems on the Vistaar benchmark. Indeed, IndicWhisper has the lowest WER in 39 out of the 59 benchmarks, with an average reduction of 4.1 WER. We open-source all datasets, code and models.

NLP-6-标题: Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages

链接: https://arxiv.org/abs/2305.15380
作者: Khalid Alnajjar, Mika Hämäläinen, Jack Rueter
备注: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)


Abstract: In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages with the only requirement being a dictionary between the endangered language and a majority language.

NLP-7-标题: Uncovering and Quantifying Social Biases in Code Generation

链接: https://arxiv.org/abs/2305.15377
作者: Yan Liu, Xiaokang Chen, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-Guang Lou, Pin-Yu Chen, Tsung-Yi Ho


Abstract: With the popularity of automatic code generation tools, such as Copilot, the study of the potential hazards of these tools is gaining importance. In this work, we explore the social bias problem in pre-trained code generation models. We propose a new paradigm to construct code prompts and successfully uncover social biases in code generation models. To quantify the severity of social biases in generated code, we develop a dataset along with three metrics to evaluate the overall social bias and fine-grained unfairness across different demographics. Experimental results on three pre-trained code generation models (Codex, InCoder, and CodeGen) with varying sizes, reveal severe social biases. Moreover, we conduct analysis to provide useful insights for further choice of code generation models with low social bias. (This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.)

NLP-8-标题: ASPER: Answer Set Programming Enhanced Neural Network Models for Joint Entity -Relation Extraction

链接: https://arxiv.org/abs/2305.15374
作者: Trung Hoang Le, Huiping Cao, Tran Cao Son


Abstract: A plethora of approaches have been proposed for joint entity-relation (ER) extraction. Most of these methods largely depend on a large amount of manually annotated training data. However, manual data annotation is time consuming, labor intensive, and error prone. Human beings learn using both data (through induction) and knowledge (through deduction). Answer Set Programming (ASP) has been a widely utilized approach for knowledge representation and reasoning that is elaboration tolerant and adept at reasoning with incomplete information. This paper proposes a new approach, ASP-enhanced Entity-Relation extraction (ASPER), to jointly recognize entities and relations by learning from both data and domain knowledge. In particular, ASPER takes advantage of the factual knowledge (represented as facts in ASP) and derived knowledge (represented as rules in ASP) in the learning process of neural network models. We have conducted experiments on two real datasets and compare our method with three baselines. The results show that our ASPER model consistently outperforms the baselines.

NLP-9-标题: Context-Aware Transformer Pre-Training for Answer Sentence Selection ACL2023

链接: https://arxiv.org/abs/2305.15358
作者: Luca Di Liello, Siddhant Garg, Alessandro Moschitti
备注: Accepted at ACL 2023


Abstract: Answer Sentence Selection (AS2) is a core component for building an accurate Question Answering pipeline. AS2 models rank a set of candidate sentences based on how likely they answer a given question. The state of the art in AS2 exploits pre-trained transformers by transferring them on large annotated datasets, while using local contextual information around the candidate sentence. In this paper, we propose three pre-training objectives designed to mimic the downstream fine-tuning task of contextual AS2. This allows for specializing LMs when fine-tuning for contextual AS2. Our experiments on three public and two large-scale industrial datasets show that our pre-training approaches (applied to RoBERTa and ELECTRA) can improve baseline contextual AS2 accuracy by up to 8% on some datasets.

NLP-10-标题: Learning Answer Generation using Supervision from Automatic Question Answering Evaluators ACL2023

链接: https://arxiv.org/abs/2305.15344
作者: Matteo Gabburo, Siddhant Garg, Rik Koncel-Kedziorski, Alessandro Moschitti
备注: Accepted at ACL 2023


Abstract: Recent studies show that sentence-level extractive QA, i.e., based on Answer Sentence Selection (AS2), is outperformed by Generation-based QA (GenQA) models, which generate answers using the top-k answer sentences ranked by AS2 models (a la retrieval-augmented generation style). In this paper, we propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA). Specifically, we propose three strategies to transfer knowledge from these QA evaluation models to a GenQA model: (i) augmenting training data with answers generated by the GenQA model and labelled by GAVA (either statically, before training, or (ii) dynamically, at every training epoch); and (iii) using the GAVA score for weighting the generator loss during the learning of the GenQA model. We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.

NLP-11-标题: Measuring and Mitigating Constraint Violations of In-Context Learning for Utterance-to-API Semantic Parsing

链接: https://arxiv.org/abs/2305.15338
作者: Shufan Wang, Sebastien Jean, Sailik Sengupta, James Gung, Nikolaos Pappas, Yi Zhang


Abstract: In executable task-oriented semantic parsing, the system aims to translate users’ utterances in natural language to machine-interpretable programs (API calls) that can be executed according to pre-defined API specifications. With the popularity of Large Language Models (LLMs), in-context learning offers a strong baseline for such scenarios, especially in data-limited regimes. However, LLMs are known to hallucinate and therefore pose a formidable challenge in constraining generated content. Thus, it remains uncertain if LLMs can effectively perform task-oriented utterance-to-API generation where respecting API’s structural and task-specific constraints is crucial. In this work, we seek to measure, analyze and mitigate such constraints violations. First, we identify the categories of various constraints in obtaining API-semantics from task-oriented utterances, and define fine-grained metrics that complement traditional ones. Second, we leverage these metrics to conduct a detailed error analysis of constraints violations seen in state-of-the-art LLMs, which motivates us to investigate two mitigation strategies: Semantic-Retrieval of Demonstrations (SRD) and API-aware Constrained Decoding (API-CD). Our experiments show that these strategies are effective at reducing constraints violations and improving the quality of the generated API calls, but require careful consideration given their implementation complexity and latency.

NLP-12-标题: Gorilla: Large Language Model Connected with Massive APIs

链接: https://arxiv.org/abs/2305.15334
作者: Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez


Abstract: Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today’s state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model’s ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla’s code, model, data, and demo are available at this https URL

NLP-13-标题: Visual Programming for Text-to-Image Generation and Evaluation

链接: https://arxiv.org/abs/2305.15328
作者: Jaemin Cho, Abhay Zala, Mohit Bansal
备注: 18 pages; Project website: this https URL


Abstract: As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that our VPGen has improved control in counts/spatial relations/scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations with a single scoring model that is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation. We hope our work encourages future progress on interpretable/explainable generation and evaluation for T2I models. Website: this https URL

NLP-14-标题: ChatGPT : Vision and Challenges

链接: https://arxiv.org/abs/2305.15323
作者: Sukhpal Singh Gill, Rupinder Kaur


Abstract: Artificial intelligence (AI) and machine learning have changed the nature of scientific inquiry in recent years. Of these, the development of virtual assistants has accelerated greatly in the past few years, with ChatGPT becoming a prominent AI language model. In this study, we examine the foundations, vision, research challenges of ChatGPT. This article investigates into the background and development of the technology behind it, as well as its popular applications. Moreover, we discuss the advantages of bringing everything together through ChatGPT and Internet of Things (IoT). Further, we speculate on the future of ChatGPT by considering various possibilities for study and development, such as energy-efficiency, cybersecurity, enhancing its applicability to additional technologies (Robotics and Computer Vision), strengthening human-AI communications, and bridging the technological gap. Finally, we discuss the important ethics and current trends of ChatGPT.

NLP-15-标题: Towards Foundation Models for Relational Databases [Vision Paper] NEURIPS2022

链接: https://arxiv.org/abs/2305.15321
作者: Liane Vogel, Benjamin Hilprecht, Carsten Binnig
备注: Accepted at the Tabular Representation Learning Workshop at NeurIPS 2022 (TRL@NeurIPS2022)


Abstract: Tabular representation learning has recently gained a lot of attention. However, existing approaches only learn a representation from a single table, and thus ignore the potential to learn from the full structure of relational databases, including neighboring tables that can contain important information for a contextualized representation. Moreover, current models are significantly limited in scale, which prevents that they learn from large databases. In this paper, we thus introduce our vision of relational representation learning, that can not only learn from the full relational structure, but also can scale to larger database sizes that are commonly found in real-world. Moreover, we also discuss opportunities and challenges we see along the way to enable this vision and present initial very promising results. Overall, we argue that this direction can lead to foundation models for relational databases that are today only available for text and images.

NLP-16-标题: Science in the Era of ChatGPT Large Language Models and AI: Challenges for Research Ethics Review and How to Respond

链接: https://arxiv.org/abs/2305.15299
作者: Evangelos Pournaras


Abstract: Large language models of artificial intelligence (AI) such as ChatGPT find remarkable but controversial applicability in science and research. This paper reviews epistemological challenges, ethical and integrity risks in science conduct. This is with the aim to lay new timely foundations for a high-quality research ethics review in the era of AI. The role of AI language models as a research instrument and subject is scrutinized along with ethical implications for scientists, participants and reviewers. Ten recommendations shape a response for a more responsible research conduct with AI language models.

NLP-17-标题: Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

链接: https://arxiv.org/abs/2305.15294
作者: Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, Weizhu Chen
备注: Preprint


Abstract: Large language models are powerful text processors and reasoners, but are still subject to limitations including outdated knowledge and hallucinations, which necessitates connecting them to the world. Retrieval-augmented large language models have raised extensive attention for grounding model generation on external knowledge. However, retrievers struggle to capture relevance, especially for queries with complex information needs. Recent work has proposed to improve relevance modeling by having large language models actively involved in retrieval, i.e., to improve retrieval with generation. In this paper, we show that strong performance can be achieved by a method we call Iter-RetGen, which synergizes retrieval and generation in an iterative manner. A model output shows what might be needed to finish a task, and thus provides an informative context for retrieving more relevant knowledge which in turn helps generate a better output in the next iteration. Compared with recent work which interleaves retrieval with generation when producing an output, Iter-RetGen processes all retrieved knowledge as a whole and largely preserves the flexibility in generation without structural constraints. We evaluate Iter-RetGen on multi-hop question answering, fact verification, and commonsense reasoning, and show that it can flexibly leverage parametric knowledge and non-parametric knowledge, and is superior to or competitive with state-of-the-art retrieval-augmented baselines while causing fewer overheads of retrieval and generation. We can further improve performance via generation-augmented retrieval adaptation.

NLP-18-标题: A Simple and Effective Framework for Strict Zero-Shot Hierarchical Classification

链接: https://arxiv.org/abs/2305.15282
作者: Rohan Bhambhoria, Lei Chen, Xiaodan Zhu


Abstract: In recent years, large language models (LLMs) have achieved strong performance on benchmark tasks, especially in zero or few-shot settings. However, these benchmarks often do not adequately address the challenges posed in the real-world, such as that of hierarchical classification. In order to address this challenge, we propose refactoring conventional tasks on hierarchical datasets into a more indicative long-tail prediction task. We observe LLMs are more prone to failure in these cases. To address these limitations, we propose the use of entailment-contradiction prediction in conjunction with LLMs, which allows for strong performance in a strict zero-shot setting. Importantly, our method does not require any parameter updates, a resource-intensive process and achieves strong performance across multiple datasets.

NLP-19-标题: Self-Evolution Learning for Discriminative Language Model Pretrain ing ACL2023

链接: https://arxiv.org/abs/2305.15275
作者: Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
备注: Accepted to Findings of ACL2023


Abstract: Masked language modeling, widely used in discriminative language model (e.g., BERT) pretraining, commonly adopts a random masking strategy. However, random masking does not consider the importance of the different words in the sentence meaning, where some of them are more worthy to be predicted. Therefore, various masking strategies (e.g., entity-level masking) are proposed, but most of them require expensive prior knowledge and generally train from scratch without reusing existing model weights. In this paper, we present Self-Evolution learning (SE), a simple and effective token masking and learning method to fully and wisely exploit the knowledge from data. SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach. Experiments on 10 tasks show that our SE brings consistent and significant improvements (+1.43~2.12 average scores) upon different PLMs. In-depth analyses demonstrate that SE improves linguistic knowledge learning and generalization.

NLP-20-标题: Revisiting Token Dropping Strategy in Efficient BERT Pretrain ing ACL2023

链接: https://arxiv.org/abs/2305.15273
作者: Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao
备注: Accepted to ACL2023 Main Conference


Abstract: Token dropping is a recently-proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training time without degrading much performance on downstream tasks. However, we empirically find that token dropping is prone to a semantic loss problem and falls short in handling semantic-intense tasks. Motivated by this, we propose a simple yet effective semantic-consistent learning method (ScTD) to improve the token dropping. ScTD aims to encourage the model to learn how to preserve the semantic information in the representation space. Extensive experiments on 12 tasks show that, with the help of our ScTD, token dropping can achieve consistent and significant performance gains across all task types and model sizes. More encouragingly, ScTD saves up to 57% of pretraining time and brings up to +1.56% average improvement over the vanilla token dropping.

NLP-21-标题: Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

链接: https://arxiv.org/abs/2305.15269
作者: Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim, He He


Abstract: Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to longer and compositional proofs. However, they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.

NLP-22-标题: EvEval: A Comprehensive Evaluation of Event Semantics for Large Language Models

链接: https://arxiv.org/abs/2305.15268
作者: Zhengwei Tao, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Yanlin Feng, Jia Li, Wenpeng Hu


Abstract: Events serve as fundamental units of occurrence within various contexts. The processing of event semantics in textual information forms the basis of numerous natural language processing (NLP) applications. Recent studies have begun leveraging large language models (LLMs) to address event semantic processing. However, the extent that LLMs can effectively tackle these challenges remains uncertain. Furthermore, the lack of a comprehensive evaluation framework for event semantic processing poses a significant challenge in evaluating these capabilities. In this paper, we propose an overarching framework for event semantic processing, encompassing understanding, reasoning, and prediction, along with their fine-grained aspects. To comprehensively evaluate the event semantic processing abilities of models, we introduce a novel benchmark called EVEVAL. We collect 8 datasets that cover all aspects of event semantic processing. Extensive experiments are conducted on EVEVAL, leading to several noteworthy findings based on the obtained results.

NLP-23-标题: Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model

链接: https://arxiv.org/abs/2305.15265
作者: Zirui Liu, Guanchu Wang, Shaochen Zhong, Zhaozhuo Xu, Daochen Zha, Ruixiang Tang, Zhimeng Jiang, Kaixiong Zhou, Vipin Chaudhary, Shuai Xu, Xia Hu


Abstract: With the rapid growth in model size, fine-tuning the large pre-trained language model has become increasingly difficult due to its extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, as they are crucial for gradient calculation. Notably, neural networks are usually trained using stochastic gradient descent. We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased with reasonable variance. Following this motivation, we propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance, which only requires storing the sub-sampled activations for calculating the gradient. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones. By replacing the linear operation with our approximated one in transformers, we can achieve up to 2.7 \times peak memory reduction with almost no accuracy drop and enables up to 6.4\times larger batch size. Under the same hardware, WTA-CRS enables better down-streaming task performance by applying larger models and/or faster training speed with larger batch sizes.

NLP-24-标题: Revisiting Parallel Context Windows: A Frustratingly Simple Alternative and Chain-of-Thought Deterioration

链接: https://arxiv.org/abs/2305.15262
作者: Kejuan Yang, Xiao Liu, Kaiwen Men, Aohan Zeng, Yuxiao Dong, Jie Tang


Abstract: We identify two crucial limitations in the evaluation of recent parallel-integrated method Parallel Context Windows (PCW), which extends the maximum context lengths of language models, e.g., 2048 for LLaMA, by harnessing window-wise attention and positional embedding techniques. We first show that a simple yet strong baseline, weighted sum ensemble, is missing for the in-context few-shot classification. Moreover, on more challenging Chain-of-Thought (CoT) reasoning (e.g., HotpotQA), PCW would present unexpected deterioration regarding question miscomprehension and false inference. Based on our findings, we suggest that the existing PCW design may not guarantee sufficient improvement and practicality in handling lengthy documents in real-world applications. More community efforts on enabling language models’ long context understanding ability should be paid.

NLP-25-标题: LMs with a Voice: Spoken Language Modeling beyond Speech Tokens

链接: https://arxiv.org/abs/2305.15255
作者: Eliya Nachmani, Alon Levkovitch, Julian Salazar, Chulayutsh Asawaroengchai, Soroosh Mariooryad, RJ Skerry-Ryan, Michelle Tadmor Ramanovich


Abstract: We present SPECTRON, a novel approach to adapting pre-trained language models (LMs) to perform speech continuation. By leveraging pre-trained speech encoders, our model generates both text and speech outputs with the entire system being trained end-to-end operating directly on spectrograms. Training the entire model in the spectrogram domain simplifies our speech continuation system versus existing cascade methods which use discrete speech representations. We further show our method surpasses existing spoken language models both in semantic content and speaker preservation while also benefiting from the knowledge transferred from pre-existing models. Audio samples can be found in our website this https URL

NLP-26-标题: Boosting Cross-lingual Transferability in Multilingual Models via In-Context Learning

链接: https://arxiv.org/abs/2305.15233
作者: Sunkyoung Kim, Dayeon Ki, Yireun Kim, Jinsik Lee
备注: Work In Progress


Abstract: Existing cross-lingual transfer (CLT) prompting methods are only concerned with monolingual demonstration examples in the source language. In this paper, we propose In-CLT, a novel cross-lingual transfer prompting method that leverages both source and target languages to construct the demonstration examples. We conduct comprehensive evaluations on multilingual benchmarks, focusing on question answering tasks. Experiment results show that In-CLT prompt not only improves multilingual models’ cross-lingual transferability, but also demonstrates remarkable unseen language generalization ability. In-CLT prompting, in particular, improves model performance by 10 to 20% points on average when compared to prior cross-lingual transfer approaches. We also observe the surprising performance gain on the other multilingual benchmarks, especially in reasoning tasks. Furthermore, we investigate the relationship between lexical similarity and pre-training corpora in terms of the cross-lingual transfer gap.

NLP-27-标题: SAIL: Search-Augmented Instruction Learning

链接: https://arxiv.org/abs/2305.15225
作者: Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass


Abstract: Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information. In this work, we propose search-augmented instruction learning (SAIL), which grounds the language generation and instruction following abilities on complex search results generated by in-house and external search engines. With an instruction tuning corpus, we collect search results for each training case from different search APIs and domains, and construct a new search-grounded training set containing \textit(instruction, grounding information, response) triplets. We then fine-tune the LLaMA-7B model on the constructed training set. Since the collected results contain unrelated and disputing languages, the model needs to learn to ground on trustworthy search results, filter out distracting passages, and generate the target response. The search result-denoising process entails explicit trustworthy information selection and multi-hop reasoning, since the retrieved passages might be informative but not contain the instruction-following answer. Experiments show that the fine-tuned SAIL-7B model has a strong instruction-following ability, and it performs significantly better on transparency-sensitive tasks, including open-ended question answering and fact checking.

NLP-28-标题: Neural Summarization of Electronic Health Records

链接: https://arxiv.org/abs/2305.15222
作者: Koyena Pal, Seyed Ali Bahrainian, Laura Mercurio, Carsten Eickhoff


Abstract: Hospital discharge documentation is among the most essential, yet time-consuming documents written by medical practitioners. The objective of this study was to automatically generate hospital discharge summaries using neural network summarization models. We studied various data preparation and neural network training techniques that generate discharge summaries. Using nursing notes and discharge summaries from the MIMIC-III dataset, we studied the viability of the automatic generation of various sections of a discharge summary using four state-of-the-art neural network summarization models (BART, T5, Longformer and FLAN-T5). Our experiments indicated that training environments including nursing notes as the source, and discrete sections of the discharge summary as the target output (e.g. “History of Present Illness”) improve language model efficiency and text quality. According to our findings, the fine-tuned BART model improved its ROUGE F1 score by 43.6% against its standard off-the-shelf version. We also found that fine-tuning the baseline BART model with other setups caused different degrees of improvement (up to 80% relative improvement). We also observed that a fine-tuned T5 generally achieves higher ROUGE F1 scores than other fine-tuned models and a fine-tuned FLAN-T5 achieves the highest ROUGE score overall, i.e., 45.6. For majority of the fine-tuned language models, summarizing discharge summary report sections separately outperformed the summarization the entire report quantitatively. On the other hand, fine-tuning language models that were previously instruction fine-tuned showed better performance in summarizing entire reports. This study concludes that a focused dataset designed for the automatic generation of discharge summaries by a language model can produce coherent Discharge Summary sections.

NLP-29-标题: Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning ACL2023

链接: https://arxiv.org/abs/2305.15212
作者: Zhen-Ru Zhang, Chuanqi Tan, Haiyang Xu, Chengyu Wang, Jun Huang, Songfang Huang
备注: Accepted to ACL 2023 (Main conference)


Abstract: Fine-tuning large pre-trained language models on various downstream tasks with whole parameters is prohibitively expensive. Hence, Parameter-efficient fine-tuning has attracted attention that only optimizes a few task-specific parameters with the frozen pre-trained model. In this work, we focus on prefix tuning, which only optimizes continuous prefix vectors (i.e. pseudo tokens) inserted into Transformer layers. Based on the observation that the learned syntax and semantics representation varies a lot at different layers, we argue that the adaptive prefix will be further tailored to each layer than the fixed one, enabling the fine-tuning more effective and efficient. Thus, we propose Adaptive Prefix Tuning (APT) to adjust the prefix in terms of both fine-grained token level and coarse-grained layer level with a gate mechanism. Experiments on the SuperGLUE and NER datasets show the effectiveness of APT. In addition, taking the gate as a probing, we validate the efficiency and effectiveness of the variable prefix.

NLP-30-标题: Sci Review Gen: A Large-scale Dataset for Automatic Literature Review Generation ACL

链接: https://arxiv.org/abs/2305.15186
作者: Tetsu Kasanishi, Masaru Isonuma, Junichiro Mori, Ichiro Sakata
备注: ACL findings 2023 (to be appeared). arXiv admin note: text overlap with arXiv:1810.04020 by other authors


Abstract: Automatic literature review generation is one of the most challenging tasks in natural language processing. Although large language models have tackled literature review generation, the absence of large-scale datasets has been a stumbling block to the progress. We release SciReviewGen, consisting of over 10,000 literature reviews and 690,000 papers cited in the reviews. Based on the dataset, we evaluate recent transformer-based summarization models on the literature review generation task, including Fusion-in-Decoder extended for literature review generation. Human evaluation results show that some machine-generated summaries are comparable to human-written reviews, while revealing the challenges of automatic literature review generation such as hallucinations and a lack of detailed information. Our dataset and code are available at this https URL.

NLP-31-标题: Are Pre-trained Language Models Useful for Model Ensemble in Chinese Grammatical Error Correction? ACL2023

链接: https://arxiv.org/abs/2305.15183
作者: Chenming Tang, Xiuyu Wu, Yunfang Wu
备注: 7 pages, 1 figure. Accepted by ACL 2023 (main conference, short paper)


Abstract: Model ensemble has been in widespread use for Grammatical Error Correction (GEC), boosting model performance. We hypothesize that model ensemble based on the perplexity (PPL) computed by pre-trained language models (PLMs) should benefit the GEC system. To this end, we explore several ensemble strategies based on strong PLMs with four sophisticated single models. However, the performance does not improve but even gets worse after the PLM-based ensemble. This surprising result sets us doing a detailed analysis on the data and coming up with some insights on GEC. The human references of correct sentences is far from sufficient in the test data, and the gap between a correct sentence and an idiomatic one is worth our attention. Moreover, the PLM-based ensemble strategies provide an effective way to extend and improve GEC benchmark data. Our source code is available at this https URL.

NLP-32-标题: HiTIN: Hierarchy-aware Tree Isomorphism Network for Hierarchical Text Classification ACL’23

链接: https://arxiv.org/abs/2305.15182
作者: He Zhu, Chong Zhang, Junjie Huang, Junran Wu, Ke Xu
备注: Accepted by ACL’23


Abstract: Hierarchical text classification (HTC) is a challenging subtask of multi-label classification as the labels form a complex hierarchical structure. Existing dual-encoder methods in HTC achieve weak performance gains with huge memory overheads and their structure encoders heavily rely on domain knowledge. Under such observation, we tend to investigate the feasibility of a memory-friendly model with strong generalization capability that could boost the performance of HTC without prior statistics or label semantics. In this paper, we propose Hierarchy-aware Tree Isomorphism Network (HiTIN) to enhance the text representations with only syntactic information of the label hierarchy. Specifically, we convert the label hierarchy into an unweighted tree structure, termed coding tree, with the guidance of structural entropy. Then we design a structure encoder to incorporate hierarchy-aware information in the coding tree into text representations. Besides the text encoder, HiTIN only contains a few multi-layer perceptions and linear transformations, which greatly saves memory. We conduct experiments on three commonly used datasets and the results demonstrate that HiTIN could achieve better test performance and less memory consumption than state-of-the-art (SOTA) methods.

NLP-33-标题: Pre-training Multi-party Dialogue Models with Latent Discourse Inference ACL2023

链接: https://arxiv.org/abs/2305.15175
作者: Yiyang Li, Xinting Huang, Wei Bi, Hai Zhao
备注: Accepted by ACL 2023


Abstract: Multi-party dialogues are more difficult for models to understand than one-to-one two-party dialogues, since they involve multiple interlocutors, resulting in interweaving reply-to relations and information flows. To step over these obstacles, an effective way is to pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying. However, due to the lack of explicitly annotated discourse labels in multi-party dialogue corpora, previous works fail to scale up the pre-training process by putting aside the unlabeled multi-party conversational data for nothing. To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model by unsupervised latent variable inference methods. Experiments on multiple downstream tasks show that our pre-trained model outperforms strong baselines by large margins and achieves state-of-the-art (SOTA) results, justifying the effectiveness of our method. The official implementation of this paper is available at this https URL.

NLP-34-标题: Topic-Guided Self-Introduction Generation for Social Media Users

链接: https://arxiv.org/abs/2305.15138
作者: Chunpu Xu, Jing Li, Piji Li, Min Yang


Abstract: Millions of users are active on social media. To allow users to better showcase themselves and network with others, we explore the auto-generation of social media self-introduction, a short sentence outlining a user’s personal interests. While most prior work profiles users with tags (e.g., ages), we investigate sentence-level self-introductions to provide a more natural and engaging way for users to know each other. Here we exploit a user’s tweeting history to generate their self-introduction. The task is non-trivial because the history content may be lengthy, noisy, and exhibit various personal interests. To address this challenge, we propose a novel unified topic-guided encoder-decoder (UTGED) framework; it models latent topics to reflect salient user interest, whose topic mixture then guides encoding a user’s history and topic words control decoding their self-introduction. For experiments, we collect a large-scale Twitter dataset, and extensive results show the superiority of our UTGED to the advanced encoder-decoder models without topic modeling.

NLP-35-标题: On Degrees of Freedom in Defining and Testing Natural Language Understanding ACL2023

链接: https://arxiv.org/abs/2305.15130
作者: Saku Sugawara, Shun Tsugita
备注: Accepted to Findings of ACL 2023


Abstract: Natural language understanding (NLU) studies often exaggerate or underestimate the capabilities of systems, thereby limiting the reproducibility of their findings. These erroneous evaluations can be attributed to the difficulty of defining and testing NLU adequately. In this position paper, we reconsider this challenge by identifying two types of researcher degrees of freedom. We revisit Turing’s original interpretation of the Turing test and indicate that an NLU test does not provide an operational definition; it merely provides inductive evidence that the test subject understands the language sufficiently well to meet stakeholder objectives. In other words, stakeholders are free to arbitrarily define NLU through their objectives. To use the test results as inductive evidence, stakeholders must carefully assess if the interpretation of test scores is valid or not. However, designing and using NLU tests involve other degrees of freedom, such as specifying target skills and defining evaluation metrics. As a result, achieving consensus among stakeholders becomes difficult. To resolve this issue, we propose a validity argument, which is a framework comprising a series of validation criteria across test components. By demonstrating that current practices in NLU studies can be associated with those criteria and organizing them into a comprehensive checklist, we prove that the validity argument can serve as a coherent guideline for designing credible test sets and facilitating scientific communication.

NLP-36-标题: Another Dead End for Morphological Tags? Perturbed Inputs and Parsing ACL2023

链接: https://arxiv.org/abs/2305.15119
作者: Alberto Muñoz-Ortiz, David Vilares
备注: Accepted at Findings of ACL 2023


Abstract: The usefulness of part-of-speech tags for parsing has been heavily questioned due to the success of word-contextualized parsers. Yet, most studies are limited to coarse-grained tags and high quality written content; while we know little about their influence when it comes to models in production that face lexical errors. We expand these setups and design an adversarial attack to verify if the use of morphological information by parsers: (i) contributes to error propagation or (ii) if on the other hand it can play a role to correct mistakes that word-only neural parsers make. The results on 14 diverse UD treebanks show that under such attacks, for transition- and graph-based models their use contributes to degrade the performance even faster, while for the (lower-performing) sequence labeling parsers they are helpful. We also show that if morphological tags were utopically robust against lexical perturbations, they would be able to correct parsing mistakes.

NLP-37-标题: The Role of Output Vocabulary in T2T LMs for SPARQL Semantic Parsing ACL2023

链接: https://arxiv.org/abs/2305.15108
作者: Debayan Banerjee, Pranav Ajit Nair, Ricardo Usbeck, Chris Biemann
备注: Accepted as a short paper to ACL 2023 findings


Abstract: In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are pre-dominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.

NLP-38-标题: Fourier Transformer : Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

链接: https://arxiv.org/abs/2305.15099
作者: Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, Zhouhan Lin


Abstract: The transformer model is known to be computationally demanding, and prohibitively costly for long sequences, as the self-attention module uses a quadratic time and space complexity with respect to sequence length. Many researchers have focused on designing new forms of self-attention or introducing new parameters to overcome this limitation, however a large portion of them prohibits the model to inherit weights from large pretrained models. In this work, the transformer’s inefficiency has been taken care of from another perspective. We propose Fourier Transformer, a simple yet effective approach by progressively removing redundancies in hidden sequence using the ready-made Fast Fourier Transform (FFT) operator to perform Discrete Cosine Transformation (DCT). Fourier Transformer is able to significantly reduce computational costs while retain the ability to inherit from various large pretrained models. Experiments show that our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA with significant improvement in both speed and space. For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART and other efficient models. \footnoteOur code is publicly available at \urlthis https URL

NLP-39-标题: Referral Augmentation for Zero-Shot Information Retrieval

链接: https://arxiv.org/abs/2305.15098
作者: Michael Tang, Shunyu Yao, John Yang, Karthik Narasimhan


Abstract: We propose Referral-Augmented Retrieval (RAR), a simple technique that concatenates document indices with referrals, i.e. text from other documents that cite or link to the given document, to provide significant performance gains for zero-shot information retrieval. The key insight behind our method is that referrals provide a more complete, multi-view representation of a document, much like incoming page links in algorithms like PageRank provide a comprehensive idea of a webpage’s importance. RAR works with both sparse and dense retrievers, and outperforms generative text expansion techniques such as DocT5Query and Query2Doc a 37% and 21% absolute improvement on ACL paper retrieval Recall@10 – while also eliminating expensive model training and inference. We also analyze different methods for multi-referral aggregation and show that RAR enables up-to-date information retrieval without re-training.

NLP-40-标题: Dynamic Masking Rate Schedules for MLM Pretrain ing

链接: https://arxiv.org/abs/2305.15096
作者: Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew L. Leavitt


Abstract: Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model’s fixed masking rate of 15%. Our work instead dynamically schedules the masking ratio throughout training. We found that linearly decreasing the masking rate from 30% to 15% over the course of pretraining improves average GLUE accuracy by 0.46% in BERT-base, compared to a standard 15% fixed rate. Further analyses demonstrate that the gains from scheduling come from being exposed to both high and low masking rate regimes. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models and achieve up to a 1.89x speedup in pretraining.

NLP-41-标题: CSTS: Conditional Semantic Textual Similarity

链接: https://arxiv.org/abs/2305.15093
作者: Ameet Deshpande, Carlos E. Jimenez, Howard Chen, Vishvak Murahari, Victoria Graf, Tanmay Rajpurohit, Ashwin Kalyan, Danqi Chen, Karthik Narasimhan


Abstract: Semantic textual similarity (STS) has been a cornerstone task in NLP that measures the degree of similarity between a pair of sentences, with applications in information retrieval, question answering, and embedding methods. However, it is an inherently ambiguous task, with the sentence similarity depending on the specific aspect of interest. We resolve this ambiguity by proposing a novel task called conditional STS (C-STS) which measures similarity conditioned on an aspect elucidated in natural language (hereon, condition). As an example, the similarity between the sentences “The NBA player shoots a three-pointer.” and “A man throws a tennis ball into the air to serve.” is higher for the condition “The motion of the ball.” (both upward) and lower for “The size of the ball.” (one large and one small). C-STS’s advantages are two-fold: (1) it reduces the subjectivity and ambiguity of STS, and (2) enables fine-grained similarity evaluation using diverse conditions. C-STS contains almost 20,000 instances from diverse domains and we evaluate several state-of-the-art models to demonstrate that even the most performant fine-tuning and in-context learning models (GPT-4, Flan, SimCSE) find it challenging, with Spearman correlation scores of <50. We encourage the community to evaluate their models on C-STS to provide a more holistic view of semantic similarity and natural language understanding.

NLP-42-标题: STAR: Boosting Low-Resource Event Extraction by Structure-to-Text Data Generation with Large Language Models

链接: https://arxiv.org/abs/2305.15090
作者: Mingyu Derek Ma, Xiaoxuan Wang, Po-Nien Kung, P. Jeffrey Brantingham, Nanyun Peng, Wei Wang


Abstract: Structure prediction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies, thus they still heavily rely on task-specific training data to obtain reasonable performance. Due to the high cost of human annotation, low-resource event extraction, which requires minimal human cost, is urgently needed in real-world information extraction applications. We propose to synthesize data instances given limited seed demonstrations to boost low-resource event extraction performance. We propose STAR, a structure-to-text data generation method that first generates complicated event structures (Y) and then generates input passages (X), all with Large Language Models. We design fine-grained step-by-step instructions and the error cases and quality issues identified through self-reflection can be self-refined. Our experiments indicate that data generated by STAR can significantly improve the low-resource event extraction performance and they are even more effective than human-curated data points in some cases.

NLP-43-标题: Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples EACL2023

链接: https://arxiv.org/abs/2305.15087
作者: Philipp Sadler, David Schlangen
备注: 9 pages, Accepted to EACL 2023


Abstract: NLP tasks are typically defined extensionally through datasets containing example instantiations (e.g., pairs of image i and text t), but motivated intensionally through capabilities invoked in verbal descriptions of the task (e.g., “t is a description of i, for which the content of i needs to be recognised and understood”). We present Pento-DIARef, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the “Incremental Algorithm”), which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims). Our question then is whether the extensional description (the dataset) is sufficient for a neural model to pick up the underlying regularity and exhibit this capability given the simple task definition of producing expressions from visual inputs. We find that a model supported by a vision detection step and a targeted data generation scheme achieves an almost perfect BLEU@1 score and sentence accuracy, whereas simpler baselines do not.

NLP-44-标题: Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions

链接: https://arxiv.org/abs/2305.15083
作者: Jiahuan Li, Hao Zhou, Shujian Huang, Shanbo Chen, Jiajun Chen


Abstract: Large-scale Pretrained Language Models~(LLMs), such as ChatGPT and GPT4, have shown strong abilities in multilingual translations, without being explicitly trained on parallel corpora. It is interesting how the LLMs obtain their ability to carry out translation instructions for different languages. In this paper, we present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7B, to perform multilingual translation following given instructions. Firstly, we show that the multilingual LLMs have stronger translation abilities than previously demonstrated. For a certain language pair, the performance depends on both the language families and the amount of data used in the pretraining phase. Secondly, we find that LLMs’ ability to carry out translation instructions relies on the understanding of translation instruction and the alignment among different languages. With proper enhancement, LLMs could perform the translation task well even for those language pairs unseen during the instruction tuning phase.

NLP-45-标题: Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

链接: https://arxiv.org/abs/2305.15080
作者: Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, Seunghyun Park


Abstract: Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models exhibit promise in generating abstract captions for images and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach, thus, seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly-generated datasets at this https URL

NLP-46-标题: Contrastive Learning of Sentence Embeddings from Scratch

链接: https://arxiv.org/abs/2305.15077
作者: Junlei Zhang, Zhenzhong Lan, Junxian He
备注: Preprint


Abstract: Contrastive learning has been the dominant approach to train state-of-the-art sentence embeddings. Previous studies have typically learned sentence embeddings either through the use of human-annotated natural language inference (NLI) data or via large-scale unlabeled sentences in an unsupervised manner. However, even in the case of unlabeled data, their acquisition presents challenges in certain domains due to various reasons. To address these issues, we present SynCSE, a contrastive learning framework that trains sentence embeddings with synthesized data. Specifically, we explore utilizing large language models to synthesize the required data samples for contrastive learning, including (1) producing positive and negative annotations given unlabeled sentences (SynCSE-partial), and (2) generating sentences along with their corresponding annotations from scratch (SynCSE-scratch). Experimental results on sentence similarity and reranking tasks indicate that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines, and SynCSE-partial even achieves comparable performance to the supervised models in most settings.

NLP-47-标题: Meta-Learning Online Adaptation of Language Models

链接: https://arxiv.org/abs/2305.15076
作者: Nathan Hu, Eric Mitchell, Christopher D. Manning, Chelsea Finn


Abstract: Large language models encode surprisingly broad knowledge about the world into their parameters. However, the knowledge in static language models can fall out of date, limiting the model’s effective “shelf life.” While online fine-tuning can reduce this degradation, we find that fine-tuning on a stream of documents using standard optimizers such as Adam leads to a disappointingly low level of information uptake. We hypothesize that online fine-tuning does not sufficiently ‘attend’ to important information. That is, the gradient signal from important tokens representing factual information is drowned out by the gradient from inherently noisy tokens, suggesting a dynamic, context-aware learning rate may be beneficial. To test this hypothesis, we meta-train a small, autoregressive model to reweight the language modeling loss for each token during online fine-tuning, with the objective of maximizing the out-of-date base language model’s ability to answer questions about a document after a single weighted gradient step. We call this approach Context-aware Meta-learned Loss Scaling (CaMeLS). Across three different distributions of documents, our experiments find that fine-tuning on streams of thousands of documents with CaMeLS substantially improves knowledge retention compared to standard online fine-tuning. Finally, we find that the meta-learned weights are general, and that a single reweighting model can be used to enhance the online adaptation of many LMs.

NLP-48-标题: Huatuo GPT towards Taming Language Model to Be a Doctor

链接: https://arxiv.org/abs/2305.15075
作者: Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, Haizhou Li


Abstract: In this paper, we present HuatuoGPT, a large language model (LLM) for medical consultation. The core recipe of HuatuoGPT is to leverage both \textitdistilled data from ChatGPT and \textitreal-world data from doctors in the supervised fine-tuned stage. The responses of ChatGPT are usually detailed, well-presented and informative while it cannot perform like a doctor in many aspects, e.g. for integrative diagnosis. We argue that real-world data from doctors would be complementary to distilled data in the sense the former could tame a distilled language model to perform like doctors. To better leverage the strengths of both data, we train a reward model to align the language model with the merits that both data bring, following an RLAIF (reinforced learning from AI feedback) fashion. To evaluate and benchmark the models, we propose a comprehensive evaluation scheme (including automatic and manual metrics). Experimental results demonstrate that HuatuoGPT achieves state-of-the-art results in performing medical consultation among open-source LLMs in GPT-4 evaluation, human evaluation, and medical benchmark datasets. It is worth noting that by using additional real-world data and RLAIF, the distilled language model (i.e., HuatuoGPT) outperforms its teacher model ChatGPT in most cases. Our code, data, and models are publicly available at \urlthis https URL. The online demo is available at \urlthis https URL.

NLP-49-标题: Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

链接: https://arxiv.org/abs/2305.15074
作者: Daman Arora, Himanshu Gaurav Singh, Mausam


Abstract: The performance on Large Language Models (LLMs) on existing reasoning benchmarks has shot up considerably over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem solving abilities of LLMs. We curate 450 challenging pre-engineering mathematics, physics and chemistry problems from the IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation on the GPT series of models reveals that although performance improves with newer models, the best being GPT-4, the highest performance, even after using techniques like Self-Consistency and Chain-of-Thought prompting is less than 40 percent. Our analysis demonstrates that errors in algebraic manipulation and failure in retrieving relevant domain specific concepts are primary contributors to GPT4’s low performance. Given the challenging nature of the benchmark, we hope that it can guide future research in problem solving using LLMs. Our code and dataset is available here.

NLP-50-标题: Annotation Imputation to Individualize Predictions: Initial Studies on Distribution Dynamics and Model Predictions

链接: https://arxiv.org/abs/2305.15070
作者: London Lowmanstone, Ruyuan Wan, Risako Owan, Jaehyung Kim, Dongyeop Kang
备注: 12 pages, 5 figures


Abstract: Annotating data via crowdsourcing is time-consuming and expensive. Owing to these costs, dataset creators often have each annotator label only a small subset of the data. This leads to sparse datasets with examples that are marked by few annotators; if an annotator is not selected to label an example, their opinion regarding it is lost. This is especially concerning for subjective NLP datasets where there is no correct label: people may have different valid opinions. Thus, we propose using imputation methods to restore the opinions of all annotators for all examples, creating a dataset that does not leave out any annotator’s view. We then train and prompt models with data from the imputed dataset (rather than the original sparse dataset) to make predictions about majority and individual annotations. Unfortunately, the imputed data provided by our baseline methods does not improve predictions. However, through our analysis of it, we develop a strong understanding of how different imputation methods impact the original data in order to inform future imputation techniques. We make all of our code and data publicly available.

NLP-51-标题: ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind

链接: https://arxiv.org/abs/2305.15068
作者: Xiaomeng Ma, Lingyu Gao, Qihui Xu
备注: work in progress


Abstract: Theory of Mind (ToM), the capacity to comprehend the mental states of distinct individuals, is essential for numerous practical applications. With the development of large language models, there is a heated debate about whether they are able to perform ToM tasks. Previous studies have used different tasks and prompts to test the ToM on large language models and the results are inconsistent: some studies asserted these models are capable of exhibiting ToM, while others suggest the opposite. In this study, We present ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on Sally-Anne and Smarties tests. We created 30 variations of each test (e.g., changing the person’s name, location, and items). For each variation, we test the model’s understanding of different aspects: reality, belief, 1st order belief, and 2nd order belief. We adapt our data for various tasks by creating unique prompts tailored for each task category: Fill-in-the-Blank, Multiple Choice, True/False, Chain-of-Thought True/False, Question Answering, and Text Completion. If the model has a robust ToM, it should be able to achieve good performance for different prompts across different tests. We evaluated two GPT-3.5 models, text-davinci-003 and gpt-3.5-turbo-0301, with our datasets. Our results indicate that consistent performance in ToM tasks remains a challenge.

NLP-52-标题: Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing

链接: https://arxiv.org/abs/2305.15067
作者: Tianyi Tang, Hongyuan Lu, Yuchen Eleanor Jiang, Haoyang Huang, Dongdong Zhang, Wayne Xin Zhao, Furu Wei


Abstract: Most research about natural language generation (NLG) relies on evaluation benchmarks with limited references for a sample, which may result in poor correlations with human judgements. The underlying reason is that one semantic meaning can actually be expressed in different forms, and the evaluation with a single or few references may not accurately reflect the quality of the model’s hypotheses. To address this issue, this paper presents a novel method, named Para-Ref, to enhance existing evaluation benchmarks by enriching the number of references. We leverage large language models (LLMs) to paraphrase a single reference into multiple high-quality ones in diverse expressions. Experimental results on representative NLG tasks of machine translation, text summarization, and image caption demonstrate that our method can effectively improve the correlation with human evaluation for sixteen automatic evaluation metrics by +7.82% in ratio. We release the code and data at this https URL.

NLP-53-标题: GPT 4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking

链接: https://arxiv.org/abs/2305.15066
作者: Jiayan Guo, Lun Du, Hengyu Liu


Abstract: Large language models~(LLM) like ChatGPT have become indispensable to artificial general intelligence~(AGI), demonstrating excellent performance in various natural language processing tasks. In the real world, graph data is ubiquitous and an essential part of AGI and prevails in domains like social network analysis, bioinformatics and recommender systems. The training corpus of large language models often includes some algorithmic components, which allows them to achieve certain effects on some graph data-related problems. However, there is still little research on their performance on a broader range of graph-structured data. In this study, we conduct an extensive investigation to assess the proficiency of LLMs in comprehending graph data, employing a diverse range of structural and semantic-related tasks. Our analysis encompasses 10 distinct tasks that evaluate the LLMs’ capabilities in graph understanding. Through our study, we not only uncover the current limitations of language models in comprehending graph structures and performing associated reasoning tasks but also emphasize the necessity for further advancements and novel approaches to enhance their graph processing capabilities. Our findings contribute valuable insights towards bridging the gap between language models and graph understanding, paving the way for more effective graph mining and knowledge extraction.

NLP-54-标题: Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning

链接: https://arxiv.org/abs/2305.15065
作者: Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, Nouha Dziri, Jillian Fisher, Bill Yuchen Lin, Skyler Hallinan, Xiang Ren, Sean Welleck, Yejin Choi


Abstract: Large language models excel at a variety of language tasks when prompted with examples or instructions. Yet controlling these models through prompting alone is limited. Tailoring language models through fine-tuning (e.g., via reinforcement learning) can be effective, but it is expensive and requires model access. We propose Inference-time Policy Adapters (IPA), which efficiently tailors a language model such as GPT-3 without fine-tuning it. IPA guides a large base model during decoding time through a lightweight policy adaptor trained to optimize an arbitrary user objective with reinforcement learning. On five challenging text generation tasks, such as toxicity reduction and open-domain generation, IPA consistently brings significant improvements over off-the-shelf language models. It outperforms competitive baseline methods, sometimes even including expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT- 3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our promising results highlight the potential of IPA as a lightweight alternative to tailoring extreme-scale language models.

NLP-55-标题: Prompt Optimization of Large Language Model for Interactive Tasks without Gradient and Demonstrations

链接: https://arxiv.org/abs/2305.15064
作者: Siqi Ouyang, Lei Li
备注: Draft. Work in Progress


Abstract: Large language models (LLMs) have demonstrated remarkable language proficiency, but they face challenges when solving interactive tasks independently. Existing methods either rely on gradient access, which is often inaccessible in state-of-the-art LLMs like GPT-4, or necessitate diverse and high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and ask LLMs to reflect on pros and cons of the current plan based on experience collected with it, to update the plan, and to collect more experiences with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves higher or on par success rates compared to in-context learning (ICL) baselines while requiring less inference cost.

NLP-56-标题: Lawyer LLaMA Technical Report

链接: https://arxiv.org/abs/2305.15062
作者: Quzhe Huang, Mingxu Tao, Zhenwei An, Chen Zhang, Cong Jiang, Zhibin Chen, Zirui Wu, Yansong Feng
备注: Work in progress


Abstract: Large Language Models (LLMs), like LLaMA, have exhibited remarkable performances across various tasks. Nevertheless, when deployed to specific domains such as law or medicine, the models still confront the challenge of a deficiency in domain-specific knowledge and an inadequate capability to leverage that knowledge to resolve domain-related problems. In this paper, we focus on the legal domain and explore how to inject domain knowledge during the continual training stage and how to design proper supervised finetune tasks to help the model tackle practical issues. Moreover, to alleviate the hallucination problem during model’s generation, we add a retrieval module and extract relevant articles before the model answers any queries. Augmenting with the extracted evidence, our model could generate more reliable responses. We release our data and model at this https URL.

NLP-57-标题: Who Wrote this Code? Watermarking for Code Generation

链接: https://arxiv.org/abs/2305.15060
作者: Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, Gunhee Kim


Abstract: Large language models for code have recently shown remarkable performance in generating executable code. However, this rapid advancement has been accompanied by many legal and ethical concerns, such as code licensing issues, code plagiarism, and malware generation, making watermarking machine-generated code a very timely problem. Despite such imminent needs, we discover that existing watermarking and machine-generated text detection methods for LLMs fail to function with code generation tasks properly. Hence, in this work, we propose a new watermarking method, SWEET, that significantly improves upon previous approaches when watermarking machine-generated code. Our proposed method selectively applies watermarking to the tokens with high enough entropy, surpassing a defined threshold. The experiments on code generation benchmarks show that our watermarked code has superior quality compared to code produced by the previous state-of-the-art LLM watermarking method. Furthermore, our watermark method also outperforms DetectGPT for the task of machine-generated code detection.

NLP-58-标题: Learning the String Partial Order

链接: https://arxiv.org/abs/2305.15057
作者: Tianyu Liu, Afra Amini, Mrinmaya Sachan, Ryan Cotterell
备注: 12 pages


Abstract: We show that most structured prediction problems can be solved in linear time and space by considering them as partial orderings of the tokens in the input string. Our method computes real numbers for each token in an input string and sorts the tokens accordingly, resulting in as few as 2 total orders of the tokens in the string. Each total order possesses a set of edges oriented from smaller to greater tokens. The intersection of total orders results in a partial order over the set of input tokens, which is then decoded into a directed graph representing the desired structure. Experiments show that our method achieves 95.4 LAS and 96.9 UAS by using an intersection of 2 total orders, 95.7 LAS and 97.1 UAS with 4 on the English Penn Treebank dependency parsing benchmark. Our method is also the first linear-complexity coreference resolution model and achieves 79.2 F1 on the English OntoNotes benchmark, which is comparable with state of the art.

NLP-59-标题: Reasoning over Hierarchical Question Decomposition Tree for Explainable Question Answering ACL2023

链接: https://arxiv.org/abs/2305.15056
作者: Jiajie Zhang, Shulin Cao, Tingjia Zhang, Xin Lv, Jiaxin Shi, Qi Tian, Juanzi Li, Lei Hou
备注: has been accepted by ACL2023


Abstract: Explainable question answering (XQA) aims to answer a given question and provide an explanation why the answer is selected. Existing XQA methods focus on reasoning on a single knowledge source, e.g., structured knowledge bases, unstructured corpora, etc. However, integrating information from heterogeneous knowledge sources is essential to answer complex questions. In this paper, we propose to leverage question decomposing for heterogeneous knowledge integration, by breaking down a complex question into simpler ones, and selecting the appropriate knowledge source for each sub-question. To facilitate reasoning, we propose a novel two-stage XQA framework, Reasoning over Hierarchical Question Decomposition Tree (RoHT). First, we build the Hierarchical Question Decomposition Tree (HQDT) to understand the semantics of a complex question; then, we conduct probabilistic reasoning over HQDT from root to leaves recursively, to aggregate heterogeneous knowledge at different tree levels and search for a best solution considering the decomposing and answering probabilities. The experiments on complex QA datasets KQA Pro and Musique show that our framework outperforms SOTA methods significantly, demonstrating the effectiveness of leveraging question decomposing for knowledge integration and our RoHT framework.

NLP-60-标题: Understanding Arithmetic Reasoning in Language Models using Causal Mediation Analysis

链接: https://arxiv.org/abs/2305.15054
作者: Alessandro Stolfo, Yonatan Belinkov, Mrinmaya Sachan


Abstract: Mathematical reasoning in large language models (LLMs) has garnered attention in recent research, but there is limited understanding of how these models process and store information related to arithmetic tasks. In this paper, we present a mechanistic interpretation of LLMs for arithmetic-based questions using a causal mediation analysis framework. By intervening on the activations of specific model components and measuring the resulting changes in predicted probabilities, we identify the subset of parameters responsible for specific predictions. We analyze two pre-trained language models with different sizes (2.8B and 6B parameters). Experimental results reveal that a small set of mid-late layers significantly affect predictions for arithmetic-based questions, with distinct activation patterns for correct and wrong predictions. We also investigate the role of the attention mechanism and compare the model’s activation patterns for arithmetic queries with the prediction of factual knowledge. Our findings provide insights into the mechanistic interpretation of LLMs for arithmetic tasks and highlight the specific components involved in arithmetic reasoning.

NLP-61-标题: Decomposing Complex Queries for Tip-of-the-tongue Retrieval

链接: https://arxiv.org/abs/2305.15053
作者: Kevin Lin, Kyle Lo, Joseph E. Gonzalez, Dan Klein


Abstract: When re-finding items, users who forget or are uncertain about identifying details often rely on creative strategies for expressing their information needs – complex queries that describe content elements (e.g., book characters or events), information beyond the document text (e.g., descriptions of book covers), or personal context (e.g., when they read a book). This retrieval setting, called tip of the tongue (TOT), is especially challenging for models heavily reliant on lexical and semantic overlap between query and document text. In this work, we introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results. This approach allows us to take advantage of off-the-shelf retrievers (e.g., CLIP for retrieving images of book covers) or incorporate retriever-specific logic (e.g., date constraints). We show that our framework incorportating query decompositions into retrievers can improve gold book recall up to 7% relative again for Recall@5 on a new collection of 14,441 real-world query-book pairs from an online community for resolving TOT inquiries.

NLP-62-标题: A Monte Carlo Language Model Pipeline for Zero-Shot Sociopolitical Event Extraction

链接: https://arxiv.org/abs/2305.15051
作者: Erica Cai, Brendan O’Connor


Abstract: We consider dyadic zero-shot event extraction (EE) to identify actions between pairs of actors. The \emphzero-shot setting allows social scientists or other non-computational researchers to extract any customized, user-specified set of events without training, resulting in a \emphdyadic event database, allowing insight into sociopolitical relational dynamics among actors and the higher level organizations or countries they represent. Unfortunately, we find that current zero-shot EE methods perform poorly for the task, with issues including word sense ambiguity, modality mismatch, and efficiency. Straightforward application of large language model prompting typically performs even worse. We address these challenges with a new fine-grained, multi-stage generative question-answer method, using a Monte Carlo approach to exploit and overcome the randomness of generative outputs. It performs 90% fewer queries than a previous approach, with strong performance on the widely-used Automatic Content Extraction dataset. Finally, we extend our method to extract affiliations of actor arguments and demonstrate our method and findings on a dyadic international relations case study.

NLP-63-标题: Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation ACL2023

链接: https://arxiv.org/abs/2305.15048
作者: Mete Sertkan, Sophia Althammer, Sebastian Hofstätter
备注: Accepted at ACL 2023 (System Demonstrations)


Abstract: In this paper, we introduce Ranger - a toolkit to facilitate the easy use of effect-size-based meta-analysis for multi-task evaluation in NLP and IR. We observed that our communities often face the challenge of aggregating results over incomparable metrics and scenarios, which makes conclusions and take-away messages less reliable. With Ranger, we aim to address this issue by providing a task-agnostic toolkit that combines the effect of a treatment on multiple tasks into one statistical evaluation, allowing for comparison of metrics and computation of an overall summary effect. Our toolkit produces publication-ready forest plots that enable clear communication of evaluation results over multiple tasks. Our goal with the ready-to-use Ranger toolkit is to promote robust, effect-size-based evaluation and improve evaluation standards in the community. We provide two case studies for common IR and NLP settings to highlight Ranger’s benefits.

NLP-64-标题: Ghostbuster: Detecting Text Ghostwritten by Large Language Models

链接: https://arxiv.org/abs/2305.15047
作者: Vivek Verma, Eve Fleisig, Nicholas Tomlin, Dan Klein


Abstract: We introduce Ghostbuster, a state-of-the-art system for detecting AI-generated text. Our method works by passing documents through a series of weaker language models and running a structured search over possible combinations of their features, then training a classifier on the selected features to determine if the target document was AI-generated. Crucially, Ghostbuster does not require access to token probabilities from the target model, making it useful for detecting text generated by black-box models or unknown model versions. In conjunction with our model, we release three new datasets of human and AI-generated text as detection benchmarks that cover multiple domains (student essays, creative fiction, and news) and task setups: document-level detection, author identification, and a challenge task of paragraph-level detection. Ghostbuster averages 99.1 F1 across all three datasets on document-level detection, outperforming previous approaches such as GPTZero and DetectGPT by up to 32.7 F1.

NLP-65-标题: SETI: Systematicity Evaluation of Textual Inference ACL2023

链接: https://arxiv.org/abs/2305.15045
作者: Xiyan Fu, Anette Frank
备注: Accepted to Findings of ACL2023


Abstract: We propose SETI (Systematicity Evaluation of Textual Inference), a novel and comprehensive benchmark designed for evaluating pre-trained language models (PLMs) for their systematicity capabilities in the domain of textual inference. Specifically, SETI offers three different NLI tasks and corresponding datasets to evaluate various types of systematicity in reasoning processes. In order to solve these tasks, models are required to perform compositional inference based on known primitive constituents. We conduct experiments of SETI on six widely used PLMs. Results show that various PLMs are able to solve unseen compositional inferences when having encountered the knowledge of how to combine primitives, with good performance. However, they are considerably limited when this knowledge is unknown to the model (40-100% points decrease). Furthermore, we find that PLMs can improve drastically once exposed to crucial compositional knowledge in minimalistic shots. These findings position SETI as the first benchmark for measuring the future progress of PLMs in achieving systematicity generalization in the textual inference.

NLP-66-标题: Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

链接: https://arxiv.org/abs/2305.15044
作者: Xiao Pu, Mingqi Gao, Xiaojun Wan


Abstract: Research on automated text summarization relies heavily on human and automatic evaluation. While recent work on human evaluation mainly adopted intrinsic evaluation methods, judging the generic quality of text summaries, e.g. informativeness and coherence, our work focuses on evaluating the usefulness of text summaries with extrinsic methods. We carefully design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment. We carry out experiments using system rankings and user behavior data to evaluate the performance of different summarization models. We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks. The results show that summaries generated by fine-tuned models lead to higher consistency in usefulness across all three tasks, as rankings of fine-tuned summarization systems are close across downstream tasks according to the proposed extrinsic metrics. Summaries generated by models in the zero-shot setting, however, are found to be biased towards the text classification and similarity assessment tasks, due to its general and less detailed summary style. We further evaluate the correlation of 14 intrinsic automatic metrics with human criteria and show that intrinsic automatic metrics perform well in evaluating the usefulness of summaries in the question-answering task, but are less effective in the other two tasks. This highlights the limitations of relying solely on intrinsic automatic metrics in evaluating the performance and usefulness of summaries.

NLP-67-标题: Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

链接: https://arxiv.org/abs/2305.15041
作者: Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West
备注: 8 pages


Abstract: Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.

NLP-68-标题: Active Learning for Natural Language Generation

链接: https://arxiv.org/abs/2305.15040
作者: Yotam Perlitz, Ariel Gera, Michal Shmueli-Scheuer, Dafna Sheinwald, Noam Slonim, Liat Ein-Dor


Abstract: The field of text generation suffers from a severe shortage of labeled data due to the extremely expensive and time consuming process involved in manual annotation. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. However, while AL has been well-researched in the context of text classification, its application to text generation remained largely unexplored. In this paper, we present a first systematic study of active learning for text generation, considering a diverse set of tasks and multiple leading AL strategies. Our results indicate that existing AL strategies, despite their success in classification, are largely ineffective for the text generation scenario, and fail to consistently surpass the baseline of random example selection. We highlight some notable differences between the classification and generation scenarios, and analyze the selection behaviors of existing AL strategies. Our findings motivate exploring novel approaches for applying AL to NLG tasks.

NLP-69-标题: Is GPT -4 a Good Data Analyst?

链接: https://arxiv.org/abs/2305.15038
作者: Liying Cheng, Xingxuan Li, Lidong Bing
备注: 11 pages, 2 figures


Abstract: As large language models (LLMs) have demonstrated their powerful capabilities in plenty of domains and tasks, including context understanding, code generation, language generation, data storytelling, etc., many data analysts may raise concerns if their jobs will be replaced by AI. This controversial topic has drawn a lot of attention in public. However, we are still at a stage of divergent opinions without any definitive conclusion. Motivated by this, we raise the research question of “is GPT-4 a good data analyst?” in this work and aim to answer it by conducting head-to-head comparative studies. In detail, we regard GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains. We propose a framework to tackle the problems by carefully designing the prompts for GPT-4 to conduct experiments. We also design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4. Experimental results show that GPT-4 can achieve comparable performance to humans. We also provide in-depth discussions about our results to shed light on further studies before we reach the conclusion that GPT-4 can replace data analysts.

NLP-70-标题: Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations

链接: https://arxiv.org/abs/2305.15035
作者: Wei-Lin Chen, Cheng-Kuang Wu, Hsin-Hsi Chen
备注: Work in progress


Abstract: Large language models (LMs) have exhibited superior in-context learning (ICL) ability to adopt to target tasks by prompting with a few input-output demonstrations. Towards better ICL, different methods are proposed to select representative demonstrations from existing training corpora. However, such a setting is not aligned with real-world practices, as end-users usually query LMs without accesses to demonstration pools. Inspired by evidence suggesting LMs’ zero-shot capabilities are underrated, and the role of demonstrations are primarily for exposing models’ intrinsic functionalities, we introduce Self-ICL, a simple framework for zero-shot ICL. Given a test input, Self-ICL first prompts the model to generate pseudo-inputs. Next, the model predicts pseudo-labels for the pseudo-inputs via zero-shot prompting. Finally, we construct pseudo-demonstrations from pseudo-input-label pairs, and perform ICL for the test input. Evaluation on BIG-Bench Hard shows Self-ICL steadily surpasses zero-shot and zero-shot chain-of-thought baselines on head-to-head and all-task average performance. Our findings suggest the possibility to bootstrap LMs’ intrinsic capabilities towards better zero-shot performance.

NLP-71-标题: SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models

链接: https://arxiv.org/abs/2305.15033
作者: Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Ming Liu, Bing Qin
备注: Work in progress


Abstract: Despite achieving remarkable performance on various vision-language tasks, Transformer-based pretrained vision-language models (VLMs) still suffer from efficiency issues arising from long inputs and numerous parameters, limiting their real-world applications. However, the huge computation is redundant for most samples and the degree of redundancy and the respective components vary significantly depending on tasks and input instances. In this work, we propose an adaptive acceleration method SmartTrim for VLMs, which adjusts the inference overhead based on the complexity of instances. Specifically, SmartTrim incorporates lightweight trimming modules into the backbone to perform task-specific pruning on redundant inputs and parameters, without the need for additional pre-training or data augmentation. Since visual and textual representations complement each other in VLMs, we propose to leverage cross-modal interaction information to provide more critical semantic guidance for identifying redundant parts. Meanwhile, we introduce a self-distillation strategy that encourages the trimmed model to be consistent with the full-capacity model, which yields further performance gains. Experimental results demonstrate that SmartTrim significantly reduces the computation overhead (2-3 times) of various VLMs with comparable performance (only a 1-2% degradation) on various vision-language tasks. Compared to previous acceleration methods, SmartTrim attains a better efficiency-performance trade-off, demonstrating great potential for application in resource-constrained scenarios.

NLP-72-标题: How to Distill your BERT : An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives ACL2023

链接: https://arxiv.org/abs/2305.15032
作者: Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank
备注: ACL 2023


Abstract: Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.

NLP-73-标题: ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories

链接: https://arxiv.org/abs/2305.15028
作者: Heming Xia, Qingxiu Dong, Lei Li, Jingjing Xu, Ziwei Qin, Zhifang Sui


Abstract: Recently, Pretrained Language Models (PLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current PLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a fine-grained, human-annotated dataset specifically designed for zero-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve into the fundamental visual commonsense knowledge of both unimodal PLMs and VaLMs, uncovering the scaling law and the influence of the backbone model on VaLMs. Furthermore, we investigate the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at this https URL.

NLP-74-标题: Dior-CVAE: Diffusion Priors in Variational Dialog Generation

链接: https://arxiv.org/abs/2305.15025
作者: Tianyu Yang, Thy Thy Tran, Iryna Gurevych


Abstract: Conditional variational autoencoders (CVAEs) have been used recently for diverse response generation, by introducing latent variables to represent the relationship between a dialog context and its potential responses. However, the diversity of the generated responses brought by a CVAE model is limited due to the oversimplified assumption of the isotropic Gaussian prior. We propose, Dior-CVAE, a hierarchical CVAE model with an informative prior produced by a diffusion model. Dior-CVAE derives a series of layer-wise latent variables using attention mechanism and infusing them into decoder layers accordingly. We propose memory dropout in the latent infusion to alleviate posterior collapse. The prior distribution of the latent variables is parameterized by a diffusion model to introduce a multimodal distribution. Overall, experiments on two popular open-domain dialog datasets indicate the advantages of our approach over previous Transformer-based variational dialog models in dialog response generation. We publicly release the code for reproducing Dior-CVAE and all baselines at this https URL.

NLP-75-标题: ChatAgri: Exploring Potentials of ChatGPT on Cross-linguistic Agricultural Text Classification

链接: https://arxiv.org/abs/2305.15024
作者: Biao Zhao, Weiqiang Jin, Javier Del Ser, Guang Yang
备注: 24 pages,10+figures,46references.Both the first two authors, Biao Zhao and Weiqiang Jin, made equal contributions to this work. Corresponding author: Guang Yang


Abstract: In the era of sustainable smart agriculture, a massive amount of agricultural news text is being posted on the Internet, in which massive agricultural knowledge has been accumulated. In this context, it is urgent to explore effective text classification techniques for users to access the required agricultural knowledge with high efficiency. Mainstream deep learning approaches employing fine-tuning strategies on pre-trained language models (PLMs), have demonstrated remarkable performance gains over the past few years. Nonetheless, these methods still face many drawbacks that are complex to solve, including: 1. Limited agricultural training data due to the expensive-cost and labour-intensive annotation; 2. Poor domain transferability, especially of cross-linguistic ability; 3. Complex and expensive large models deployment.Inspired by the extraordinary success brought by the recent ChatGPT (e.g. GPT-3.5, GPT-4), in this work, we systematically investigate and explore the capability and utilization of ChatGPT applying to the agricultural informatization field. …(shown in article)… Code has been released on Github this https URL.

NLP-76-标题: An Efficient Multilingual Language Model Compression through Vocabulary Trimming

链接: https://arxiv.org/abs/2305.15020
作者: Asahi Ushio, Yi Zhou, Jose Camacho-Collados


Abstract: Multilingual language model (LM) have become a powerful tool in NLP especially for non-English languages. Nevertheless, model parameters of multilingual LMs remain large due to the larger embedding matrix of the vocabulary covering tokens in different languages. On the contrary, monolingual LMs can be trained in a target language with the language-specific vocabulary only, but this requires a large budget and availability of reliable corpora to achieve a high-quality LM from scratch. In this paper, we propose vocabulary-trimming (VT), a method to reduce a multilingual LM vocabulary to a target language by deleting irrelevant tokens from its vocabulary. In theory, VT can compress any existing multilingual LM to build monolingual LMs in any language covered by the multilingual LM. In our experiments, we show that VT can retain the original performance of the multilingual LM, while being smaller in size (in general around 50% of the original vocabulary size is enough) than the original multilingual LM. The evaluation is performed over four NLP tasks (two generative and two classification tasks) among four widely used multilingual LMs in seven languages. Finally, we show that this methodology can keep the best of both monolingual and multilingual worlds by keeping a small size as monolingual models without the need for specifically retraining them, and even limiting potentially harmful social biases.

NLP-77-标题: Calc-X: Enriching Arithmetical Chain-of-Thoughts Dataset s by Interaction with Symbolic Systems

链接: https://arxiv.org/abs/2305.15017
作者: Marek Kadlčík, Michal Štefánik


Abstract: This report overviews our ongoing work in enriching chain-of-thoughts datasets requiring arithmetical reasoning with the integration of non-parametric components, such as a calculator. We conduct an analysis of prominent relevant datasets such as GSM8K, Ape210K, AQuA-RAT, and MathQA and propose a machine-processable HTML-like format specifically tailored for working with semi-structured chains. By converting the datasets into this unified format, we enable the effective integration of large language models and symbolic systems, empowering them to tackle arithmetical reasoning tasks more efficiently.

NLP-78-标题: Unlocking Temporal Question Answering for Large Language Models Using Code Execution

链接: https://arxiv.org/abs/2305.15014
作者: Xingxuan Li, Liying Cheng, Qingyu Tan, Hwee Tou Ng, Shafiq Joty, Lidong Bing


Abstract: Large language models (LLMs) have made significant progress in natural language processing (NLP), and are utilized extensively in various applications. Recent works, such as chain-of-thought (CoT), have shown that intermediate reasoning steps can improve the performance of LLMs for complex reasoning tasks, such as math problems and symbolic question-answering tasks. However, we notice the challenge that LLMs face when it comes to temporal reasoning. Our preliminary experiments show that generating intermediate reasoning steps does not always boost the performance of complex temporal question-answering tasks. Therefore, we propose a novel framework that combines the extraction capability of LLMs and the logical reasoning capability of a Python solver to tackle this issue. Extensive experiments and analysis demonstrate the effectiveness of our framework in handling intricate time-bound reasoning tasks.

NLP-79-标题: Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation

链接: https://arxiv.org/abs/2305.15011
作者: Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, Timothy Baldwin


Abstract: Instruction tuning has shown great promise in the field of natural language processing. However, the research on multilingual instruction tuning has been limited due to the scarcity of high-quality instruction-response datasets. To address this gap, we present Bactrian-X, a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. Leveraging this dataset, we train a set of adapters using low-rank adaptation (LoRA), which are lightweight components seamlessly integrated with foundational models. These adapters have a significantly smaller parameter count than the base model, making them easily replaceable and usable as plug-ins for different languages or language groups. Through extensive experiments on 52 languages, we demonstrate the superior performance of our models in various multilingual evaluation settings. Our proposed models outperform both the vanilla models and the existing instruction-tuned models. The code and models are publicly available at this https URL.

NLP-80-标题: Injecting Knowledge into Biomedical Pre-trained Models via Polymorphism and Synonymous Substitution

链接: https://arxiv.org/abs/2305.15010
作者: Hongbo Zhang, Xiang Wan, Benyou Wang


Abstract: Pre-trained language models (PLMs) were considered to be able to store relational knowledge present in the training data. However, some relational knowledge seems to be discarded unsafely in PLMs due to \textbfreport bias: low-frequency relational knowledge might be underexpressed compared to high-frequency one in PLMs. This gives us a hint that relational knowledge might not be redundant to the stored knowledge of PLMs, but rather be complementary. To additionally inject relational knowledge into PLMs, we propose a simple-yet-effective approach to inject relational knowledge into PLMs, which is inspired by three observations (namely, polymorphism, synonymous substitution, and association). In particular, we switch entities in the training corpus to related entities (either hypernyms/hyponyms/synonyms, or arbitrarily-related concepts). Experimental results show that the proposed approach could not only better capture relational knowledge, but also improve the performance in various biomedical downstream tasks. Our model is available in \urlthis https URL.

NLP-81-标题: Are Chatbots Ready for Privacy-Sensitive Applications? An Investigation into Input Regurgitation and Prompt -Induced Sanitization

链接: https://arxiv.org/abs/2305.15008
作者: Aman Priyanshu, Supriti Vijay, Ayush Kumar, Rakshit Naidu, Fatemehsadat Mireshghallah
备注: 12 pages, 9 figures, and 4 tables


Abstract: LLM-powered chatbots are becoming widely adopted in applications such as healthcare, personal assistants, industry hiring decisions, etc. In many of these cases, chatbots are fed sensitive, personal information in their prompts, as samples for in-context learning, retrieved records from a database, or as part of the conversation. The information provided in the prompt could directly appear in the output, which might have privacy ramifications if there is sensitive information there. As such, in this paper, we aim to understand the input copying and regurgitation capabilities of these models during inference and how they can be directly instructed to limit this copying by complying with regulations such as HIPAA and GDPR, based on their internal knowledge of them. More specifically, we find that when ChatGPT is prompted to summarize cover letters of a 100 candidates, it would retain personally identifiable information (PII) verbatim in 57.4% of cases, and we find this retention to be non-uniform between different subgroups of people, based on attributes such as gender identity. We then probe ChatGPT’s perception of privacy-related policies and privatization mechanisms by directly instructing it to provide compliant outputs and observe a significant omission of PII from output.

NLP-82-标题: Sentiment Analysis in the Era of Large Language Models: A Reality Check

链接: https://arxiv.org/abs/2305.15005
作者: Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, Lidong Bing


Abstract: Sentiment analysis (SA) has been a long-standing research area in natural language processing. It can offer rich insights into human sentiments and opinions and has thus seen considerable interest from both academia and industry. With the advent of large language models (LLMs) such as ChatGPT, there is a great potential for their employment on SA problems. However, the extent to which existing LLMs can be leveraged for different sentiment analysis tasks remains unclear. This paper aims to provide a comprehensive investigation into the capabilities of LLMs in performing various sentiment analysis tasks, from conventional sentiment classification to aspect-based sentiment analysis and multifaceted analysis of subjective texts. We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets. Our study reveals that while LLMs demonstrate satisfactory performance in simpler tasks, they lag behind in more complex tasks requiring deeper understanding or structured sentiment information. However, LLMs significantly outperform SLMs in few-shot learning settings, suggesting their potential when annotation resources are limited. We also highlight the limitations of current evaluation practices in assessing LLMs’ SA abilities and propose a novel benchmark, \textscSentiEval, for a more comprehensive and realistic evaluation. Data and code during our investigations are available at \urlthis https URL.

NLP-83-标题: LLMDet: A Large Language Models Detection Tool

链接: https://arxiv.org/abs/2305.15004
作者: Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng, Tat-Seng Chua
备注: 7 pages, 1 figure


Abstract: With the advancement of generative language models, the generated text has come remarkably close to high-quality human-authored text in terms of fluency and diversity. This calls for a highly practical detection tool that can identify the source of text, preferably pinpointing the language model it originates from. However, existing detection tools typically require access to language models and can only differentiate between machine-generated and human-authored text, failing to meet the requirements of rapid detection and text tracing. Therefore, in this paper, we propose an efficient, secure, and scalable detection tool called LLMDet, which calculates the proxy perplexity of text by utilizing the prior information of the model’s next-token probabilities, obtained through pre-training. Subsequently, we use the self-watermarking information of the model, as measured by proxy perplexity, to detect the source of the text. We found that our method demonstrates impressive detection performance while ensuring speed and security, particularly achieving a recognition accuracy of 97.97% for human-authored text. Furthermore, our detection tool also shows promising results in identifying the large language model (e.g., GPT-2, OPT, LLaMA, Vicuna…) responsible for the text. We release the code and processed data at \urlthis https URL.

NLP-84-标题: A RelEntLess Benchmark for Modelling Graded Relations between Named Entities

链接: https://arxiv.org/abs/2305.15002
作者: Asahi Ushio, Jose Camacho Collados, Steven Schockaert


Abstract: Relations such as “is influenced by”, “is known for” or “is a competitor of” are inherently graded: we can rank entity pairs based on how well they satisfy these relations, but it is hard to draw a line between those pairs that satisfy them and those that do not. Such graded relations play a central role in many applications, yet they are typically not covered by existing Knowledge Graphs. In this paper, we consider the possibility of using Large Language Models (LLMs) to fill this gap. To this end, we introduce a new benchmark, in which entity pairs have to be ranked according to how much they satisfy a given graded relation. The task is formulated as a few-shot ranking problem, where models only have access to a description of the relation and five prototypical instances. We use the proposed benchmark to evaluate state-of-the-art relation embedding strategies as well as several recent LLMs, covering both publicly available LLMs and closed models such as GPT-4. Overall, we find a strong correlation between model size and performance, with smaller Language Models struggling to outperform a naive baseline. The results of the largest Flan-T5 and OPT models are remarkably strong, although a clear gap with human performance remains.

NLP-85-标题: The Art of SOCRATIC QUESTIONING: Zero-shot Multimodal Reasoning with Recursive Thinking and Self-Questioning

链接: https://arxiv.org/abs/2305.14999
作者: Jingyuan Qi, Zhiyang Xu, Ying Shen, Minqian Liu, Di Jin, Qifan Wang, Lifu Huang
备注: 15 pages, 12 figure, 2 algorithms


Abstract: Chain-of-Thought prompting (CoT) enables large-scale language models to solve complex reasoning problems by decomposing the problem and tackling it step-by-step. However, Chain-of-Thought is a greedy thinking process that requires the language model to come up with a starting point and generate the next step solely based on previous steps. This thinking process is different from how humans approach a complex problem e.g., we proactively raise sub-problems related to the original problem and recursively answer them. In this work, we propose Socratic Questioning, a divide-and-conquer fashion algorithm that simulates the self-questioning and recursive thinking process. Socratic Questioning is driven by a Self-Questioning module that employs a large-scale language model to propose sub-problems related to the original problem as intermediate steps and Socratic Questioning recursively backtracks and answers the sub-problems until reaches the original problem. We apply our proposed algorithm to the visual question-answering task as a case study and by evaluating it on three public benchmark datasets, we observe a significant performance improvement over all baselines on (almost) all datasets. In addition, the qualitative analysis clearly demonstrates the intermediate thinking steps elicited by Socratic Questioning are similar to the human’s recursively thinking process of a complex reasoning problem.

NLP-86-标题: An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics

链接: https://arxiv.org/abs/2305.14998
作者: Saba Ahmadi, Aishwarya Agrawal


Abstract: Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021) and UMIC (Lee et al., 2021) have been proposed for automatic evaluation of image captions, demonstrating a high correlation with human judgment. In this work, our focus lies in evaluating the robustness of these metrics in scenarios that require distinguishing between two captions with high lexical overlap but very different meanings. Our findings reveal that despite their high correlation with human judgment, both CLIPScore and UMIC struggle to identify fine-grained errors in captions. However, when comparing different types of fine-grained errors, both metrics exhibit limited sensitivity to implausibility of captions and strong sensitivity to lack of sufficient visual grounding. Probing further into the visual grounding aspect, we found that both CLIPScore and UMIC are impacted by the size of image-relevant objects mentioned in the caption, and that CLIPScore is also sensitive to the number of mentions of image-relevant objects in the caption. In terms of linguistic aspects of a caption, we found that both metrics lack the ability to comprehend negation, UMIC is sensitive to caption lengths, and CLIPScore is insensitive to the structure of the sentence. We hope our findings will serve as a valuable guide towards improving reference-free evaluation in image captioning.

NLP-87-标题: The ACL OCL Corpus: advancing Open science in Computational Linguistics

链接: https://arxiv.org/abs/2305.14996
作者: Shaurya Rohatgi, Yanxia Qin, Benjamin Aw, Niranjana Unnithan, Min-Yen Kan


Abstract: We present a scholarly corpus from the ACL Anthology to assist Open scientific research in the Computational Linguistics domain, named as ACL OCL. Compared with previous ARC and AAN versions, ACL OCL includes structured full-texts with logical sections, references to figures, and links to a large knowledge resource (semantic scholar). ACL OCL contains 74k scientific papers, together with 210k figures extracted up to September 2022. To observe the development in the computational linguistics domain, we detect the topics of all OCL papers with a supervised neural model. We observe ‘‘Syntax: Tagging, Chunking and Parsing’’ topic is significantly shrinking and ‘‘Natural Language Generation’’ is resurging. Our dataset is open and available to download from HuggingFace in this https URL.

NLP-88-标题: Ref GPT : Reference -> Truthful & Customized Dialogues Generation by GPT GPT s and for GPTs

链接: https://arxiv.org/abs/2305.14994
作者: Dongjie Yang, Ruifeng Yuan, YuanTao Fan, YiFei Yang, Zili Wang, Shushen Wang, Hai Zhao


Abstract: General chat models, like ChatGPT, have attained impressive capability to resolve a wide range of NLP tasks by tuning Large Language Models (LLMs) with high-quality instruction data. However, collecting human-written high-quality data, especially multi-turn dialogues, is expensive and unattainable for most people. Though previous studies have used powerful LLMs to generate the dialogues automatically, but they all suffer from generating untruthful dialogues because of the LLMs hallucination. Therefore, we propose a method called RefGPT to generate enormous truthful and customized dialogues without worrying about factual errors caused by the model hallucination. RefGPT solves the model hallucination in dialogue generation by restricting the LLMs to leverage the given reference instead of reciting their own knowledge to generate dialogues. Additionally, RefGPT adds detailed controls on every utterances to enable highly customization capability, which previous studies have ignored. On the basis of RefGPT, we also propose two high-quality dialogue datasets generated by GPT-4, namely RefGPT-Fact and RefGPT-Code. RefGPT-Fact is 100k multi-turn dialogue datasets based on factual knowledge and RefGPT-Code is 76k multi-turn dialogue dataset covering a wide range of coding scenarios. Our code and datasets are released in this https URL

NLP-89-标题: How To Control Text Simplification? An Empirical Study of Control Tokens for Meaning Preserving Controlled Simplification

链接: https://arxiv.org/abs/2305.14993
作者: Sweta Agrawal, Marine Carpuat
备注: work in progress


Abstract: Text simplification rewrites text to be more readable for a specific audience, while preserving its meaning. However, determining what makes a text easy to read depends on who are the intended readers. Recent work has introduced a wealth of techniques to control output simplicity, ranging from specifying the desired reading grade level to providing control tokens that directly encode low-level simplification edit operations. However, it remains unclear how to set the input parameters that control simplification in practice. Existing approaches set them at the corpus level, disregarding the complexity of individual source text, and do not directly evaluate them at the instance level. In this work, we conduct an empirical study to understand how different control mechanisms impact the adequacy and simplicity of model outputs. Based on these insights, we introduce a simple method for predicting control tokens at the sentence level to enhance the quality of the simplified text. Predicting control token values using features extracted from the original complex text and a user-specified degree of complexity improves the quality of the simplified outputs over corpus-level search-based heuristics.

NLP-90-标题: Reasoning with Language Model is Planning with World Model

链接: https://arxiv.org/abs/2305.14992
作者: Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, Zhiting Hu


Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal \textitworld model to predict the world \textitstate (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, \underlineR\textiteasoning vi\underlinea \underlineP\textitlanning \textbf(RAP) . RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monto Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and obtains a high-reward reasoning path efficiently with a proper balance between exploration \textitvs. exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.

NLP-91-标题: MuLER: Detailed and Scalable Reference-based Evaluation

链接: https://arxiv.org/abs/2305.14991
作者: Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend


Abstract: We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT) into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis which can lead to targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER’s validity and showcase its usability in MT evaluation, and other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags. However, they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few are not thus correlated (their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.

NLP-92-标题: Dolphin: A Challenging and Diverse Benchmark for Arabic NLG

链接: https://arxiv.org/abs/2305.14989
作者: El Moatez Billah Nagoudi, Ahmed El-Shangiti, AbdelRahim Elmadany, Muhammad Abdul-Mageed


Abstract: We present Dolphin, a novel benchmark that addresses the need for an evaluation framework for the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including text summarization, machine translation, question answering, and dialogue generation, among others. Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits, carefully curated to reflect real-world scenarios and the linguistic richness of Arabic. It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models, promising to enable researchers to push the boundaries of current methodologies. We provide an extensive analysis of Dolphin, highlighting its diversity and identifying gaps in current Arabic NLG research. We also evaluate several Arabic and multilingual models on our benchmark, allowing us to set strong baselines against which researchers can compare.

NLP-93-标题: Large Language Models are Effective Table-to-Text Generators Evaluators and Feedback Providers

链接: https://arxiv.org/abs/2305.14987
作者: Yilun Zhao, Haowei Zhang, Shengyun Si, Linyong Nan, Xiangru Tang, Arman Cohan
备注: work in progress


Abstract: Large language models (LLMs) have shown remarkable ability on controllable text generation. However, the potential of LLMs in generating text from structured tables remains largely under-explored. In this paper, we study the capabilities of LLMs for table-to-text generation tasks, particularly aiming to investigate their performance in generating natural language statements that can be logically entailed by a provided table. First, we investigate how LLMs compare to state-of-the-art table-to-text fine-tuned models, and demonstrate that LLMs can generate statements with higher faithfulness compared with previous state-of-the-art fine-tuned models. Given this finding, we next explore whether LLMs can serve as faithfulness-level automated evaluation metrics. Through human evaluation, we show that evaluation metrics adopted from LLMs correlates better with human judgments compared with existing faithfulness-level metrics. Finally, we demonstrate that LLMs using chain-of-thought prompting can generate high-fidelity natural language feedback for other table-to-text models’ generations, provide insights for future work regarding the distillation of text generation capabilities from LLMs to smaller models.

NLP-94-标题: Ideal GPT : Iteratively Decomposing Vision and Language Reasoning via Large Language Models

链接: https://arxiv.org/abs/2305.14985
作者: Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang
备注: 13 pages, 5 figures


Abstract: The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at this https URL

NLP-95-标题: Benchmarking Arabic AI with Large Language Models

链接: https://arxiv.org/abs/2305.14982
作者: Ahmed Abdelali, Hamdy Mubarak, Shammur Absar Chowdhury, Maram Hasanain, Basel Mousi, Sabri Boughorbel, Yassine El Kheir, Daniel Izham, Fahim Dalvi, Majd Hawasly, Nizi Nazar, Yousseif Elshahawy, Ahmed Ali, Nadir Durrani, Natasa Milic-Frayling, Firoj Alam
备注: Foundation Models, Large Language Models, Arabic NLP, Arabic Speech, Arabic AI, , CHatGPT Evaluation, USM Evaluation, Whisper Evaluation


Abstract: With large Foundation Models (FMs), language technologies (AI in general) are entering a new paradigm: eliminating the need for developing large-scale task-specific datasets and supporting a variety of tasks through set-ups ranging from zero-shot to few-shot learning. However, understanding FMs capabilities requires a systematic benchmarking effort by comparing FMs performance with the state-of-the-art (SOTA) task-specific models. With that goal, past work focused on the English language and included a few efforts with multiple languages. Our study contributes to ongoing research by evaluating FMs performance for standard Arabic NLP and Speech processing, including a range of tasks from sequence tagging to content classification across diverse domains. We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM, addressing 33 unique tasks using 59 publicly available datasets resulting in 96 test setups. For a few tasks, FMs performs on par or exceeds the performance of the SOTA models but for the majority it under-performs. Given the importance of prompt for the FMs performance, we discuss our prompt strategies in detail and elaborate on our findings. Our future work on Arabic AI will explore few-shot prompting, expand the range of tasks, and investigate additional open-source models.

NLP-96-标题: Improving Factuality of Abstractive Summarization without Sacrificing Summary Quality ACL2023

链接: https://arxiv.org/abs/2305.14981
作者: Tanay Dixit, Fei Wang, Muhao Chen
备注: ACL 2023


Abstract: Improving factual consistency of abstractive summarization has been a widely studied topic. However, most of the prior works on training factuality-aware models have ignored the negative effect it has on summary quality. We propose EFACTSUM (i.e., Effective Factual Summarization), a candidate summary generation and ranking technique to improve summary factuality without sacrificing summary quality. We show that using a contrastive learning framework with our refined candidate summaries leads to significant gains on both factuality and similarity-based metrics. Specifically, we propose a ranking strategy in which we effectively combine two metrics, thereby preventing any conflict during training. Models trained using our approach show up to 6 points of absolute improvement over the base model with respect to FactCC on XSUM and 11 points on CNN/DM, without negatively affecting either similarity-based metrics or absractiveness.

NLP-97-标题: GPT AraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP

链接: https://arxiv.org/abs/2305.14976
作者: Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
备注: Work in progress


Abstract: The recent emergence of ChatGPT has brought a revolutionary change in the landscape of NLP. Although ChatGPT has consistently shown impressive performance on English benchmarks, its exact capabilities on most other languages remain largely unknown. To better understand ChatGPT’s capabilities on Arabic, we present a large-scale evaluation of the model on a broad range of Arabic NLP tasks. Namely, we evaluate ChatGPT on 32 diverse natural language understanding and generation tasks on over 60 different datasets. To the best of our knowledge, our work offers the first performance analysis of ChatGPT on Arabic NLP at such a massive scale. Our results show that, despite its success on English benchmarks, ChatGPT trained in-context (few-shot) is consistently outperformed by much smaller dedicated models finetuned on Arabic. These results suggest that there is significant place for improvement for instruction-tuned LLMs such as ChatGPT.

NLP-98-标题: Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

链接: https://arxiv.org/abs/2305.14975
作者: Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning


Abstract: A trustworthy real-world prediction system should be well-calibrated; that is, its confidence in an answer is indicative of the likelihood that the answer is correct, enabling deferral to a more expensive expert in cases of low-confidence predictions. While recent studies have shown that unsupervised pre-training produces large language models (LMs) that are remarkably well-calibrated, the most widely-used LMs in practice are fine-tuned with reinforcement learning with human feedback (RLHF-LMs) after the initial unsupervised pre-training stage, and results are mixed as to whether these models preserve the well-calibratedness of their ancestors. In this paper, we conduct a broad evaluation of computationally feasible methods for extracting confidence scores from LLMs fine-tuned with RLHF. We find that with the right prompting strategy, RLHF-LMs verbalize probabilities that are much better calibrated than the model’s conditional probabilities, enabling fairly well-calibrated predictions. Through a combination of prompting strategy and temperature scaling, we find that we can reduce the expected calibration error of RLHF-LMs by over 50%.

NLP-99-标题: Over Prompt : Enhancing ChatGPT Capabilities through an Efficient In-Context Learning Approach

链接: https://arxiv.org/abs/2305.14973
作者: Jiazheng Li, Runcong Zhao, Yulan He, Lin Gui


Abstract: The exceptional performance of pre-trained large language models has revolutionised various applications, but their adoption in production environments is hindered by prohibitive costs and inefficiencies, particularly when utilising long prompts. This paper proposes OverPrompt, an in-context learning method aimed at improving LLM efficiency and performance by processing multiple inputs in parallel. Evaluated across diverse datasets, OverPrompt enhances task efficiency and integrates a diverse range of examples for improved performance. Particularly, it amplifies fact-checking and sentiment analysis tasks when supplemented with contextual information. Synthetic data grouping further enhances performance, suggesting a viable approach for data augmentation.

NLP-100-标题: Getting Sick After Seeing a Doctor? Diagnosing and Mitigating Knowledge Conflicts in Event Temporal Reasoning

链接: https://arxiv.org/abs/2305.14970
作者: Tianqing Fang, Zhaowei Wang, Wenxuan Zhou, Hongming Zhang, Yangqiu Song, Muhao Chen
备注: 13 pages, 1 figure


Abstract: Event temporal reasoning aims at identifying the temporal relations between two or more events. However, knowledge conflicts arise when there is a mismatch between the actual temporal relations of events in the context and the prior knowledge or biases learned by the model. We first systematically define distinct kinds of bias in event temporal reasoning, which include event relation prior bias, tense bias, narrative bias, and dependency bias, as indicators to study knowledge conflicts. To mitigate such event-related knowledge conflict, we introduce a Counterfactual Data Augmentation based method that can be applied to both Pre-trained Language Models (PLMs) and Large Language Models (LLMs) either as additional training data or demonstrations for In-Context Learning. Experiments suggest the importance of mitigating knowledge conflicts in event temporal reasoning tasks for reducing hallucination and highlight the potential of counterfactual data augmentation for improving model performance.

NLP-101-标题: Tricking LLMs into Disobedience: Understanding Analyzing and Preventing Jailbreaks

链接: https://arxiv.org/abs/2305.14965
作者: Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury


Abstract: Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating the prompts; resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Limited formal studies have been carried out to formalize and analyze these attacks and their mitigations. We bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. We perform a survey of existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT 3.5, OPT, BLOOM, and FLAN-T5-xxl). We further propose a limited set of prompt guards and discuss their effectiveness against known attack types.

NLP-102-标题: Detecting and Characterizing Political Incivility on Social Media

链接: https://arxiv.org/abs/2305.14964
作者: Sagi Penzel, Nir Lotan, Alon Zoizner, Einat Minkov


Abstract: Researchers of political communication study the impact and perceptions of political incivility on social media. Yet, so far, relatively few works attempted to automatically detect and characterize political incivility. In our work, we study political incivility in Twitter, presenting several research contributions. First, we present state-of-the-art incivility detection results using a large dataset, which we collected and labeled via crowd sourcing. Importantly, we distinguish between uncivil political speech that is impolite and intolerant anti-democratic discourse. Applying political incivility detection at large-scale, we derive insights regarding the prevalence of this phenomenon across users, and explore the network characteristics of users who are susceptible to disseminating uncivil political content online. Finally, we propose an approach for modeling social context information about the tweet author alongside the tweet content, showing that this leads to significantly improved performance on the task of political incivility detection. This result holds promise for related tasks, such as hate speech and stance detection.

NLP-103-标题: PESCO: Prompt -enhanced Self Contrastive Learning for Zero-shot Text Classification ACL2023

链接: https://arxiv.org/abs/2305.14963
作者: Yau-Shian Wang, Ta-Chung Chi, Ruohong Zhang, Yiming Yang
备注: accepted by ACL 2023


Abstract: We present PESCO, a novel contrastive learning framework that substantially improves the performance of zero-shot text classification. We formulate text classification as a neural text matching problem where each document is treated as a query, and the system learns the mapping from each query to the relevant class labels by (1) adding prompts to enhance label matching, and (2) using retrieved labels to enrich the training set in a self-training loop of contrastive learning. PESCO achieves state-of-the-art performance on four benchmark text classification datasets. On DBpedia, we achieve 98.5% accuracy without any labeled data, which is close to the fully-supervised result. Extensive experiments and analyses show all the components of PESCO are necessary for improving the performance of zero-shot text classification.

NLP-104-标题: Editing Commonsense Knowledge in GPT

链接: https://arxiv.org/abs/2305.14956
作者: Anshita Gupta, Debanjan Mondal, Akshay Krishna Sheshadri, Wenlong Zhao, Xiang Lorraine Li, Sarah Wiegreffe, Niket Tandon
备注: Code and data is available at this https URL


Abstract: Memory editing methods for updating encyclopedic knowledge in transformers have received increasing attention for their efficacy, specificity, and generalization advantages. However, it remains unclear if such methods can be adapted for the more nuanced domain of commonsense knowledge. We propose MEMIT_CSK , an adaptation of MEMIT to edit commonsense mistakes in GPT-2 Large and XL. We extend editing to various token locations and employ a robust layer selection strategy. Models edited by MEMIT_CSK outperforms the fine-tuning baselines by 10.97% and 10.73% F1 scores on subsets of PEP3k and 20Q. We further propose a novel evaluation dataset, MEMIT-CSK-PROBE, that contains unaffected neighborhood, affected neighborhood, affected paraphrase, and affected reasoning challenges. MEMIT_CSK demonstrates favorable semantic generalization, outperforming fine-tuning baselines by 13.72% and 5.57% overall scores on MEMIT-CSK-PROBE. These results suggest a compelling future direction of incorporating context-specific user feedback concerning commonsense in GPT by direct model editing, rectifying and customizing model behaviors via human-in-the-loop systems.

NLP-105-标题: Adversarial Demonstration Attacks on Large Language Models

链接: https://arxiv.org/abs/2305.14950
作者: Jiongxiao Wang, Zichen Liu, Keun Hee Park, Muhao Chen, Chaowei Xiao
备注: Work in Progress


Abstract: With the emergence of more powerful large language models (LLMs), such as ChatGPT and GPT-4, in-context learning (ICL) has gained significant prominence in leveraging these models for specific tasks by utilizing data-label pairs as precondition prompts. While incorporating demonstrations can greatly enhance the performance of LLMs across various tasks, it may introduce a new security concern: attackers can manipulate only the demonstrations without changing the input to perform an attack. In this paper, we investigate the security concern of ICL from an adversarial perspective, focusing on the impact of demonstrations. We propose an ICL attack based on TextAttack, which aims to only manipulate the demonstration without changing the input to mislead the models. Our results demonstrate that as the number of demonstrations increases, the robustness of in-context learning would decreases. Furthermore, we also observe that adversarially attacked demonstrations exhibit transferability to diverse input examples. These findings emphasize the critical security risks associated with ICL and underscore the necessity for extensive research on the robustness of ICL, particularly given its increasing significance in the advancement of LLMs.

NLP-106-标题: Cross-lingual Data Augmentation for Document-grounded Dialog Systems in Low Resource Languages

链接: https://arxiv.org/abs/2305.14949
作者: Qi Gou, Zehua Xia, Wenzhe Du


Abstract: This paper proposes a framework to address the issue of data scarcity in Document-Grounded Dialogue Systems(DGDS). Our model leverages high-resource languages to enhance the capability of dialogue generation in low-resource languages. Specifically, We present a novel pipeline CLEM (Cross-Lingual Enhanced Model) including adversarial training retrieval (Retriever and Re-ranker), and Fid (fusion-in-decoder) generator. To further leverage high-resource language, we also propose an innovative architecture to conduct alignment across different languages with translated training. Extensive experiment results demonstrate the effectiveness of our model and we achieved 4th place in the DialDoc 2023 Competition. Therefore, CLEM can serve as a solution to resource scarcity in DGDS and provide useful guidance for multi-lingual alignment tasks.

NLP-107-标题: How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench

链接: https://arxiv.org/abs/2305.14947
作者: Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, Robin Jia


Abstract: We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM performance on new experiment configurations? Answering this question has practical implications for LLM users (e.g., deciding which models to try), developers (e.g., prioritizing evaluation on representative tasks), and the research community (e.g., identifying hard-to-predict capabilities that warrant further investigation). We study the performance prediction problem on experiment records from BIG-bench. On a random train-test split, an MLP-based predictor achieves RMSE below 5%, demonstrating the presence of learnable patterns within the experiment records. Further, we formulate the problem of searching for “small-bench,” an informative subset of BIG-bench tasks from which the performance of the full set can be maximally recovered, and find a subset as informative for evaluating new model families as BIG-bench Hard, while being 3x smaller.

NLP-108-标题: Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark

链接: https://arxiv.org/abs/2305.14938
作者: Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, David Jurgens
备注: 24 pages, 7 tables, 5 figures


Abstract: Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks. While LLMs are increasingly deployed in many forms including conversational agents that interact with humans, we lack a grounded benchmark to measure how well LLMs understand \textitsocial language. Here, we introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge which we group into five categories: humor & sarcasm, offensiveness, sentiment & emotion, and trustworthiness. In tests on the benchmark, we demonstrate that current models attain only moderate performance but reveal significant potential for task transfer among different types and categories of tasks, which were predicted from theory. Through zero-shot evaluations, we show that pretrained models already possess some innate but limited capabilities of social language understanding and training on one category of tasks can improve zero-shot testing on others. Our benchmark provides a systematic way to analyze model performance on an important dimension of language and points to clear room for improvement to build more socially-aware LLMs. The associated resources are released at this https URL.

NLP-109-标题: A Fair and In-Depth Evaluation of Existing End-to-End Entity Linking Systems

链接: https://arxiv.org/abs/2305.14937
作者: Hannah Bast, Matthias Hertel, Natalie Prange


Abstract: Existing evaluations of entity linking systems often say little about how the system is going to perform for a particular application. There are four fundamental reasons for this: many benchmarks focus on named entities; it is hard to define which other entities to include; there are ambiguities in entity recognition and entity linking; many benchmarks have errors or artifacts that invite overfitting or lead to evaluation results of limited meaningfulness. We provide a more meaningful and fair in-depth evaluation of a variety of existing end-to-end entity linkers. We characterize the strengths and weaknesses of these linkers and how well the results from the respective publications can be reproduced. Our evaluation is based on several widely used benchmarks, which exhibit the problems mentioned above to various degrees, as well as on two new benchmarks, which address these problems.

NLP-110-标题: Trade-Offs Between Fairness and Privacy in Language Modeling ACL2023

链接: https://arxiv.org/abs/2305.14936
作者: Cleo Matzken, Steffen Eger, Ivan Habernal
备注: Findings of ACL 2023


Abstract: Protecting privacy in contemporary NLP models is gaining in importance. So does the need to mitigate social biases of such models. But can we have both at the same time? Existing research suggests that privacy preservation comes at the price of worsening biases in classification tasks. In this paper, we explore the extent to which this tradeoff really holds when we incorporate both privacy preservation and de-biasing techniques into training text generation models. How does improving the model along one dimension affect the other dimension as well as the utility of the model? We conduct an extensive set of experiments that include bias detection, privacy attacks, language modeling, and performance on downstream tasks.

NLP-111-标题: Modeling Appropriate Language in Argumentation

链接: https://arxiv.org/abs/2305.14935
作者: Timon Ziegenbein, Shahbaz Syed, Felix Lange, Martin Potthast, Henning Wachsmuth


Abstract: Online discussion moderators must make ad-hoc decisions about whether the contributions of discussion participants are appropriate or should be removed to maintain civility. Existing research on offensive language and the resulting tools cover only one aspect among many involved in such decisions. The question of what is considered appropriate in a controversial discussion has not yet been systematically addressed. In this paper, we operationalize appropriate language in argumentation for the first time. In particular, we model appropriateness through the absence of flaws, grounded in research on argument quality assessment, especially in aspects from rhetoric. From these, we derive a new taxonomy of 14 dimensions that determine inappropriate language in online discussions. Building on three argument quality corpora, we then create a corpus of 2191 arguments annotated for the 14 dimensions. Empirical analyses support that the taxonomy covers the concept of appropriateness comprehensively, showing several plausible correlations with argument quality dimensions. Moreover, results of baseline approaches to assessing appropriateness suggest that all dimensions can be modeled computationally on the corpus.

NLP-112-标题: Discriminator-Guided Multi-step Reasoning with Language Models

链接: https://arxiv.org/abs/2305.14934
作者: Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang
备注: 19 pages, 7 figures, and 8 tables


Abstract: In the context of multi-step reasoning, language models (LMs) probabilities are often miscalibrated – solutions with high probabilities are not always correct. Therefore, greedy decoding, which is the standard decoding method for reasoning tasks, often yields incorrect solutions. In addition, methods such as self-consistency and verifiers rely on sampling from the LM distribution and do not tackle the underlying issue. To address this, we introduce Guiding Multi-step ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding approach that nudges the model towards producing correct reasoning steps. GRACE employs a discriminator model, which is trained to differentiate correct steps from invalid ones, to adjust decoding preferences based on the correctness of each reasoning step. Importantly, GRACE does not require fine-tuning or re-training the LMs. When compared with conventional decoding strategies over four popular math reasoning benchmarks, GRACE exhibits significant improvements in both final answer accuracy and step correctness, outperforming both greedy decoding and self-consistency.\footnoteOur code can be found at \urlthis https URL.

NLP-113-标题: In-Context Impersonation Reveals Large Language Models Strengths and Biases

链接: https://arxiv.org/abs/2305.14930
作者: Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata


Abstract: In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Finally, we test whether LLMs’ impersonations are complementary to visual information when describing different categories. We find that impersonation can improve performance: an LLM prompted to be a bird expert describes birds better than one prompted to be a car expert. However, impersonation can also uncover LLMs’ biases: an LLM prompted to be a man describes cars better than one prompted to be a woman. These findings demonstrate that LLMs are capable of taking on diverse roles and that this in-context impersonation can be used to uncover their hidden strengths and biases.

NLP-114-标题: Aligning Language Models to User Opinions

链接: https://arxiv.org/abs/2305.14929
作者: EunJeong Hwang, Bodhisattwa Prasad Majumder, Niket Tandon


Abstract: An important aspect of developing LLMs that interact with humans is to align models’ behavior to their users. It is possible to prompt an LLM into behaving as a certain persona, especially a user group or ideological persona the model captured during its pertaining stage. But, how to best align an LLM with a specific user and not a demographic or ideological group remains an open question. Mining public opinion surveys (by Pew Research), we find that the opinions of a user and their demographics and ideologies are not mutual predictors. We use this insight to align LLMs by modeling both user opinions as well as user demographics and ideology, achieving up to 7 points accuracy gains in predicting public opinions from survey questions across a broad set of topics. In addition to the typical approach of prompting LLMs with demographics and ideology, we discover that utilizing the most relevant past opinions from individual users enables the model to predict user opinions more accurately.

NLP-115-标题: Towards Reliable Misinformation Mitigation: Generalization Uncertainty and GPT -4

链接: https://arxiv.org/abs/2305.14928
作者: Kellin Pelrine, Meilina Reksoprodjo, Caleb Gupta, Joel Christoph, Reihaneh Rabbany


Abstract: Misinformation poses a critical societal challenge, and current approaches have yet to produce an effective solution. We propose focusing on generalization, soft classification, and leveraging recent large language models to create more practical tools in contexts where perfect predictions remain unattainable. We begin by demonstrating that GPT-4 and other language models can outperform existing methods in the literature. Next, we explore their generalization, revealing that GPT-4 and RoBERTa-large exhibit critical differences in failure modes, which offer potential for significant performance improvements. Finally, we show that these models can be employed in soft classification frameworks to better quantify uncertainty. We find that models with inferior hard classification results can achieve superior soft classification performance. Overall, this research lays groundwork for future tools that can drive real-world progress on misinformation.

NLP-116-标题: Universal Self-adaptive Prompt ing

链接: https://arxiv.org/abs/2305.14926
作者: Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Martin Eisenschlos, Sercan O. Arik, Tomas Pfister
备注: 10 pages, 3 figures, 4 tables (19 pages, 5 figures and 9 tables including references and appendices)


Abstract: A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through prompt-based and/or in-context learning. However, while highly coveted and being the most general, zero-shot performances in LLMs are still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods in general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data & an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a possible NLP task into one of the three possible task types, and then uses a corresponding selector to select the most suitable queries & zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate zero-shot USP with two PaLM models, and demonstrate performances that are considerably stronger than standard zero-shot baselines and are comparable to or even superior than few-shot baselines across more than 20 natural language understanding (NLU) and natural language generation (NLG) tasks.

NLP-117-标题: Frugal Prompt ing for Dialog Models

链接: https://arxiv.org/abs/2305.14919
作者: Bishal Santra, Sakya Basak, Abhinandan De, Manish Gupta, Pawan Goyal
备注: First two authors have equal contribution


Abstract: The use of large language models (LLMs) in natural language processing (NLP) tasks is rapidly increasing, leading to changes in how researchers approach problems in the field. To fully utilize these models’ abilities, a better understanding of their behavior for different input protocols is required. With LLMs, users can directly interact with the models through a text-based interface to define and solve various tasks. Hence, understanding the conversational abilities of these LLMs, which may not have been specifically trained for dialog modeling, is also important. This study examines different approaches for building dialog systems using LLMs by considering various aspects of the prompt. As part of prompt tuning, we experiment with various ways of providing instructions, exemplars, current query and additional context. The research also analyzes the representations of dialog history that have the optimal usable-information density. Based on the findings, the paper suggests more compact ways of providing dialog history information while ensuring good performance and reducing model’s inference-API costs. The research contributes to a better understanding of how LLMs can be effectively used for building interactive systems.

NLP-118-标题: Structural Ambiguity and its Disambiguation in Language Model Based Parsers: the Case of Dutch Clause Relativization

链接: https://arxiv.org/abs/2305.14917
作者: Gijs Wijnholds, Michael Moortgat


Abstract: This paper addresses structural ambiguity in Dutch relative clauses. By investigating the task of disambiguation by grounding, we study how the presence of a prior sentence can resolve relative clause ambiguities. We apply this method to two parsing architectures in an attempt to demystify the parsing and language model components of two present-day neural parsers. Results show that a neurosymbolic parser, based on proof nets, is more open to data bias correction than an approach based on universal dependencies, although both setups suffer from a comparable initial data bias.

NLP-119-标题: CoLaDa: A Collaborative Label Denoising Framework for Cross-lingual Named Entity Recognition ACL2023

链接: https://arxiv.org/abs/2305.14913
作者: Tingting Ma, Qianhui Wu, Huiqiang Jiang, Börje F. Karlsson, Tiejun Zhao, Chin-Yew Lin
备注: ACL 2023. Our code is available at this https URL


Abstract: Cross-lingual named entity recognition (NER) aims to train an NER system that generalizes well to a target language by leveraging labeled data in a given source language. Previous work alleviates the data scarcity problem by translating source-language labeled data or performing knowledge distillation on target-language unlabeled data. However, these methods may suffer from label noise due to the automatic labeling process. In this paper, we propose CoLaDa, a Collaborative Label Denoising Framework, to address this problem. Specifically, we first explore a model-collaboration-based denoising scheme that enables models trained on different data sources to collaboratively denoise pseudo labels used by each other. We then present an instance-collaboration-based strategy that considers the label consistency of each token’s neighborhood in the representation space for denoising. Experiments on different benchmark datasets show that the proposed CoLaDa achieves superior results compared to previous methods, especially when generalizing to distant languages.

NLP-120-标题: From Shortcuts to Triggers: Backdoor Defense with Denoised PoE

链接: https://arxiv.org/abs/2305.14910
作者: Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen
备注: Work in Progress


Abstract: Language models are often at risk of diverse backdoor attacks, especially data poisoning. Thus, it is important to investigate defense solutions for addressing them. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers, leaving a universal defense against various backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks, to defend various backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning the backdoor shortcuts. To address the label flip caused by backdoor attackers, DPoE incorporates a denoising design. Experiments on SST-2 dataset show that DPoE significantly improves the defense performance against various types of backdoor triggers including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE is also effective under a more challenging but practical setting that mixes multiple types of trigger.

NLP-121-标题: PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions

链接: https://arxiv.org/abs/2305.14908
作者: Anthony Chen, Panupong Pasupat, Sameer Singh, Hongrae Lee, Kelvin Guu


Abstract: The remarkable capabilities of large language models have been accompanied by a persistent drawback: the generation of false and unsubstantiated claims commonly known as “hallucinations”. To combat this issue, recent research has introduced approaches that involve editing and attributing the outputs of language models, particularly through prompt-based editing. However, the inference cost and speed of using large language models for editing currently bottleneck prompt-based methods. These bottlenecks motivate the training of compact editors, which is challenging due to the scarcity of training data for this purpose. To overcome these challenges, we exploit the power of large language models to introduce corruptions (i.e., noise) into text and subsequently fine-tune compact editors to denoise the corruptions by incorporating relevant evidence. Our methodology is entirely unsupervised and provides us with faux hallucinations for training in any domain. Our Petite Unsupervised Research and Revision model, PURR, not only improves attribution over existing editing methods based on fine-tuning and prompting, but also achieves faster execution times by orders of magnitude.

NLP-122-标题: Coverage-based Example Selection for In-Context Learning

链接: https://arxiv.org/abs/2305.14907
作者: Shivanshu Gupta, Sameer Singh, Matt Gardner


Abstract: In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires demonstrations that are informative about the test instance. The standard approach of independently selecting the most similar examples selects redundant demonstrations while overlooking important information. This work proposes a framework for assessing the informativeness of demonstrations based on their coverage of salient aspects (e.g., reasoning patterns) of the test input. Using this framework, we show that contextual token embeddings effectively capture these salient aspects, and their recall measured using BERTScore-Recall (BSR) yields a reliable measure of informativeness. Further, we extend recall metrics like BSR to propose their set versions to find maximally informative sets of demonstrations. On 6 complex compositional generation tasks and 7 diverse LLMs, we show that Set-BSR outperforms the standard similarity-based approach by up to 16% on average and, despite being learning-free, often surpasses methods that leverage task or LLM-specific training.

NLP-123-标题: Identifying Informational Sources in News Articles

链接: https://arxiv.org/abs/2305.14904
作者: Alexander Spangher, Nanyun Peng, Jonathan May, Emilio Ferrara
备注: 13 pages


Abstract: News articles are driven by the informational sources journalists use in reporting. Modeling when, how and why sources get used together in stories can help us better understand the information we consume and even help journalists with the task of producing it. In this work, we take steps toward this goal by constructing the largest and widest-ranging annotated dataset, to date, of informational sources used in news writing. We show that our dataset can be used to train high-performing models for information detection and source attribution. We further introduce a novel task, source prediction, to study the compositionality of sources in news articles. We show good performance on this task, which we argue is an important proof for narrative science exploring the internal structure of news articles and aiding in planning-based language generation, and an important step towards a source-recommendation system to aid journalists.

NLP-124-标题: M4: Multi-generator Multi-domain and Multi-lingual Black-Box Machine-Generated Text Detection

链接: https://arxiv.org/abs/2305.14902
作者: Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Alham Fikri Aji, Preslav Nakov
备注: 11 pages


Abstract: Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries, but this has also resulted in concerns regarding the potential misuse of such texts in journalism, educational, and academic context. In this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. We first introduce a large-scale benchmark M4, which is multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Using the dataset, we experiment with a number of methods and we show that it is challenging for detectors to generalize well on unseen examples if they are either from different domains or are generated by different large language models. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and there is a lot of room for improvement. We believe that our dataset M4, which covers different generators, domains and languages, will enable future research towards more robust approaches for this pressing societal problem. The M4 dataset is available at this https URL.

NLP-125-标题: Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering

链接: https://arxiv.org/abs/2305.14901
作者: Wang Zhu, Jesse Thomason, Robin Jia
备注: 12 pages, 2 figures


Abstract: We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions. We propose Chain-of-Questions, a framework that trains a model to generate sub-questions and sub-answers one at a time by leveraging human annotated question decomposition meaning representation (QDMR). The key technical challenge is that QDMR only contains sub-questions but not answers to those sub-questions, so we treat sub-answers as latent variables and optimize them using a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods by 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA adversarial set, thus demonstrating the effectiveness and robustness of our framework.

NLP-126-标题: PIVOINE: Instruction Tuning for Open-world Information Extraction

链接: https://arxiv.org/abs/2305.14898
作者: Keming Lu, Xiaoman Pan, Kaiqiang Song, Hongming Zhang, Dong Yu, Jianshu Chen


Abstract: We consider the problem of Open-world Information Extraction (Open-world IE), which extracts comprehensive entity profiles from unstructured texts. Different from the conventional closed-world setting of Information Extraction (IE), Open-world IE considers a more general situation where entities and relations could be beyond a predefined ontology. More importantly, we seek to develop a large language model (LLM) that is able to perform Open-world IE to extract desirable entity profiles characterized by (possibly fine-grained) natural language instructions. We achieve this by finetuning LLMs using instruction tuning. In particular, we construct INSTRUCTOPENWIKI, a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions. We finetune the pretrained BLOOM models on INSTRUCTOPENWIKI and obtain PIVOINE, an LLM for Open-world IE with strong instruction-following capabilities. Our experiments demonstrate that PIVOINE significantly outperforms traditional closed-world methods and other LLM baselines, displaying impressive generalization capabilities on both unseen instructions and out-of-ontology cases. Consequently, PIVOINE emerges as a promising solution to tackle the open-world challenge in IE effectively.

NLP-127-标题: Text encoders are performance bottlenecks in contrastive vision-language models

链接: https://arxiv.org/abs/2305.14897
作者: Amita Kamath, Jack Hessel, Kai-Wei Chang


Abstract: Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach doesn’t require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP’s text encoder falls short on object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multi-modal matching performance on ControlledImCaps: a new evaluation benchmark we collect+release consisting of fine-grained compositional images+captions. Specifically – our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive vision+language models. We release data+code.

NLP-128-标题: Extracting Psychological Indicators Using Question Answering

链接: https://arxiv.org/abs/2305.14891
作者: Luka Pavlović
备注: 4 pages, 0 figures


Abstract: In this work, we propose a method for extracting text spans that may indicate one of the BIG5 psychological traits using a question-answering task with examples that have no answer for the asked question. We utilized the RoBERTa model fine-tuned on SQuAD 2.0 dataset. The model was further fine-tuned utilizing comments from Reddit. We examined the effect of the percentage of examples with no answer in the training dataset on the overall performance. The results obtained in this study are in line with the SQuAD 2.0 benchmark and present a good baseline for further research.

NLP-129-标题: Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective

链接: https://arxiv.org/abs/2305.14889
作者: Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao


Abstract: We address the fundamental challenge in Natural Language Generation (NLG) model evaluation, the design and validation of evaluation metrics. Recognizing the limitations of existing metrics and issues with human judgment, we propose using measurement theory, the foundation of test design, as a framework for conceptualizing and evaluating the validity and reliability of NLG evaluation metrics. This approach offers a systematic method for defining “good” metrics, developing robust metrics, and assessing metric performance. In this paper, we introduce core concepts in measurement theory in the context of NLG evaluation and key methods to evaluate the performance of NLG metrics. Through this framework, we aim to promote the design, evaluation, and interpretation of valid and reliable metrics, ultimately contributing to the advancement of robust and effective NLG models in real-world settings.

NLP-130-标题: Privacy Implications of Retrieval-Based Language Models

链接: https://arxiv.org/abs/2305.14888
作者: Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, Danqi Chen


Abstract: Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts, by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly k NN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that k NN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks. When privacy information is targeted and readily detected in the text, we find that a simple sanitization step would completely eliminate the risks, while decoupling query and key encoders achieves an even better utility-privacy trade-off. Otherwise, we consider strategies of mixing public and private data in both datastore and encoder training. While these methods offer modest improvements, they leave considerable room for future work. Together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based LMs. Our code is available at: this https URL .

NLP-131-标题: Interpretable by Design Visual Question Answering

链接: https://arxiv.org/abs/2305.14882
作者: Xingyu Fu, Ben Zhou, Sihao Chen, Mark Yatskar, Dan Roth
备注: Multimodal, Vision and Language


Abstract: Model interpretability has long been a hard problem for the AI community especially in the multimodal setting, where vision and language need to be aligned and reasoned at the same time. In this paper, we specifically focus on the problem of Visual Question Answering (VQA). While previous researches try to probe into the network structures of black-box multimodal models, we propose to tackle the problem from a different angle – to treat interpretability as an explicit additional goal. Given an image and question, we argue that an interpretable VQA model should be able to tell what conclusions it can get from which part of the image, and show how each statement help to arrive at an answer. We introduce InterVQA: Interpretable-by-design VQA, where we design an explicit intermediate dynamic reasoning structure for VQA problems and enforce symbolic reasoning that only use the structure for final answer prediction to take place. InterVQA produces high-quality explicit intermediate reasoning steps, while maintaining similar to the state-of-the-art (sota) end-task performance.

NLP-132-标题: ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games

链接: https://arxiv.org/abs/2305.14879
作者: Ruoyao Wang, Graham Todd, Eric Yuan, Ziang Xiao, Marc-Alexandre Côté, Peter Jansen
备注: 10 pages


Abstract: In this work we examine the ability of language models to generate explicit world models of scientific and common-sense reasoning tasks by framing this as a problem of generating text-based games. To support this, we introduce ByteSized32, a corpus of 32 highly-templated text games written in Python totaling 24k lines of code, each centered around a particular task, and paired with a set of 16 unseen text game specifications for evaluation. We propose a suite of automatic and manual metrics for assessing simulation validity, compliance with task specifications, playability, winnability, and alignment with the physical world. In a single-shot evaluation of GPT-4 on this simulation-as-code-generation task, we find it capable of producing runnable games in 27% of cases, highlighting the difficulty of this challenge task. We discuss areas of future improvement, including GPT-4’s apparent capacity to perform well at simulating near canonical task solutions, with performance dropping off as simulations include distractors or deviate from canonical solutions in the action space.

NLP-133-标题: Leveraging GPT -4 for Automatic Translation Post-Editing

链接: https://arxiv.org/abs/2305.14878
作者: Vikas Raunak, Amr Sharaf, Hany Hassan Awadallah, Arul Menezes


Abstract: While Neural Machine Translation (NMT) represents the leading approach to Machine Translation (MT), the outputs of NMT models still require translation post-editing to rectify errors and enhance quality, particularly under critical settings. In this work, we formalize the task of translation post-editing with Large Language Models (LLMs) and explore the use of GPT-4 to automatically post-edit NMT outputs across several language pairs. Our results demonstrate that GPT-4 is adept at translation post-editing and produces meaningful edits even when the target language is not English. Notably, we achieve state-of-the-art performance on WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing, as evaluated by state-of-the-art MT quality metrics.

NLP-134-标题: Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis

链接: https://arxiv.org/abs/2305.14877
作者: Sohee Yang, Jonghyeon Kim, Joel Jang, Seonghyeon Ye, Hyunji Lee, Minjoon Seo


Abstract: Large Language Models (LLMs) have demonstrated great capabilities in solving a wide range of tasks in a resource-efficient manner through prompting, which does not require task-specific training, but suffers from performance fluctuation when there are multiple prompt candidates. Previous works have introduced gradient-free probability-based prompt selection methods that aim to choose the optimal prompt among the candidates for a given task but fail to provide a comprehensive and fair comparison between each other. In this paper, we propose a unified framework to interpret and evaluate the existing probability-based prompt selection methods by performing extensive experiments on 13 common NLP tasks. We find that all existing methods can be unified into some variant of the method that maximizes the mutual information between the input and the corresponding model output (denoted as MI). Using the finding, we develop several variants of MI and increases the effectiveness of the best prompt selection method from 87.79% to 94.98%, measured as the ratio of the performance of the selected prompt to that of the optimal oracle prompt. Furthermore, we propose a novel calibration method called Calibration by Marginalization (CBM) that is orthogonal to existing methods and helps increase the prompt selection effectiveness of the best method by 99.44%. The code and datasets used in our work will be released at this https URL.

NLP-135-标题: From Words to Wires: Generating Functioning Electronic Devices from Natural Language Descriptions

链接: https://arxiv.org/abs/2305.14874
作者: Peter Jansen
备注: 13 pages, 4 figures


Abstract: In this work, we show that contemporary language models have a previously unknown skill – the capacity for electronic circuit design from high-level textual descriptions, akin to code generation. We introduce two benchmarks: Pins100, assessing model knowledge of electrical components, and Micro25, evaluating a model’s capability to design common microcontroller circuits and code in the Arduino ecosystem that involve input, output, sensors, motors, protocols, and logic – with models such as GPT-4 and Claude-V1 achieving between 60% to 96% Pass@1 on generating full devices. We include six case studies of using language models as a design assistant for moderately complex devices, such as a radiation-powered random number generator, an emoji keyboard, a visible spectrometer, and several assistive devices, while offering a qualitative analysis performance, outlining evaluation challenges, and suggesting areas of development to improve complex circuit design and practical utility. With this work, we aim to spur research at the juncture of natural language processing and electronic design.

NLP-136-标题: ClusterLLM: Large Language Models as a Guide for Text Clustering

链接: https://arxiv.org/abs/2305.14871
作者: Yuwei Zhang, Zihan Wang, Jingbo Shang


Abstract: We introduce ClusterLLM, a novel text clustering framework that leverages feedback from an instruction-tuned large language model, such as ChatGPT. Compared with traditional unsupervised methods that builds upon “small” embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the emergent capability of LLM even if its embeddings are inaccessible; and (2) it understands the user’s preference on clustering through textual instruction and/or a few annotated data. First, we prompt ChatGPT for insights on clustering perspective by constructing hard triplet questions , where A, B and C are similar data points that belong to different clusters according to small embedder. We empirically show that this strategy is both effective for fine-tuning small embedder and cost-efficient to query ChatGPT. Second, we prompt ChatGPT for helps on clustering granularity by carefully designed pairwise questions , and tune the granularity from cluster hierarchies that is the most consistent with the ChatGPT answers. Extensive experiments on 14 datasets show that ClusterLLM consistently improves clustering quality, at an average cost of ~ 0.6 per dataset.

NLP-137-标题: CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering

链接: https://arxiv.org/abs/2305.14869
作者: Weiqi Wang, Tianqing Fang, Wenxuan Ding, Baixuan Xu, Xin Liu, Yangqiu Song, Antoine Bosselut


Abstract: The task of zero-shot commonsense question answering evaluates models on their capacity to reason about general scenarios beyond those presented in specific datasets. Existing approaches for tackling this task leverage external knowledge from CommonSense Knowledge Bases (CSKBs) by pretraining the model on synthetic QA pairs constructed from CSKBs. In these approaches, negative examples (distractors) are formulated by randomly sampling from CSKBs using fairly primitive keyword constraints. However, two bottlenecks limit these approaches: the inherent incompleteness of CSKBs limits the semantic coverage of synthetic QA pairs, and the lack of human annotations makes the sampled negative examples potentially uninformative and contradictory. To tackle these limitations above, we propose Conceptualization-Augmented Reasoner (CAR), a zero-shot commonsense question-answering framework that fully leverages the power of conceptualization. Specifically, CAR abstracts a commonsense knowledge triple to many higher-level instances, which increases the coverage of CSKB and expands the ground-truth answer space, reducing the likelihood of selecting false-negative distractors. Extensive experiments demonstrate that CAR more robustly generalizes to answering questions about zero-shot commonsense scenarios than existing methods, including large language models, such as GPT3.5 and ChatGPT. Our codes, data, and model checkpoints are available at this https URL.

NLP-138-标题: Large Language Model Distillation Doesnt Need a Teacher

链接: https://arxiv.org/abs/2305.14864
作者: Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, Iz Beltagy
备注: 10 pages, 3 figures, 5 tables


Abstract: Knowledge distillation trains a smaller student model to match the output distribution of a larger teacher to maximize the end-task performance under computational constraints. However, existing literature on language model distillation primarily focuses on compressing encoder-only models that are then specialized by task-specific supervised finetuning. We need to rethink this setup for more recent large language models with tens to hundreds of billions of parameters. Task-specific finetuning is impractical at this scale, and model performance is often measured using zero/few-shot prompting. Thus, in this work, we advocate for task-agnostic zero-shot evaluated distillation for large language models without access to end-task finetuning data. We propose a teacher-free task-agnostic distillation method, which uses a truncated version of the larger model for initialization, and continues pretraining this model using a language modeling objective. Our teacher-free method shines in a distillation regime where it is infeasible to fit both the student and teacher into the GPU memory. Despite its simplicity, our method can effectively reduce the model size by 50%, matching or outperforming the vanilla distillation method on perplexity and accuracy on 13 zero-shot end-tasks while being 1.5x computationally efficient.

NLP-139-标题: Utility-Probability Duality of Neural Networks

链接: https://arxiv.org/abs/2305.14859
作者: Huang Bojun, Fei Yuan


Abstract: It is typically understood that the training of modern neural networks is a process of fitting the probability distribution of desired output. However, recent paradoxical observations in a number of language generation tasks let one wonder if this canonical probability-based explanation can really account for the empirical success of deep learning. To resolve this issue, we propose an alternative utility-based explanation to the standard supervised learning procedure in deep learning. The basic idea is to interpret the learned neural network not as a probability model but as an ordinal utility function that encodes the preference revealed in training data. In this perspective, training of the neural network corresponds to a utility learning process. Specifically, we show that for all neural networks with softmax outputs, the SGD learning dynamic of maximum likelihood estimation (MLE) can be seen as an iteration process that optimizes the neural network toward an optimal utility function. This utility-based interpretation can explain several otherwise-paradoxical observations about the neural networks thus trained. Moreover, our utility-based theory also entails an equation that can transform the learned utility values back to a new kind of probability estimation with which probability-compatible decision rules enjoy dramatic (double-digits) performance improvements. These evidences collectively reveal a phenomenon of utility-probability duality in terms of what modern neural networks are (truly) modeling: We thought they are one thing (probabilities), until the unexplainable showed up; changing mindset and treating them as another thing (utility values) largely reconcile the theory, despite remaining subtleties regarding its original (probabilistic) identity.

NLP-140-标题: BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer

链接: https://arxiv.org/abs/2305.14857
作者: Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, Hannaneh Hajishirzi
备注: The data and code is available at this https URL


Abstract: Despite remarkable advancements in few-shot generalization in natural language processing, most models are developed and evaluated primarily in English. To facilitate research on few-shot cross-lingual transfer, we introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples and instructions. BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer across a broad range of tasks and languages. Using BUFFET, we perform thorough evaluations of state-of-the-art multilingual large language models with different transfer methods, namely in-context learning and fine-tuning. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer. In particular, ChatGPT with in-context learning often performs worse than much smaller mT5-base models fine-tuned on English task data and few-shot in-language examples. Our analysis suggests various avenues for future research in few-shot cross-lingual transfer, such as improved pretraining, understanding, and future evaluations.

NLP-141-标题: Drafting Event Schemas using Language Models

链接: https://arxiv.org/abs/2305.14847
作者: Anisha Gunjal, Greg Durrett


Abstract: Past work has studied event prediction and event language modeling, sometimes mediated through structured representations of knowledge in the form of event schemas. Such schemas can lead to explainable predictions and forecasting of unseen events given incomplete information. In this work, we look at the process of creating such schemas to describe complex events. We use large language models (LLMs) to draft schemas directly in natural language, which can be further refined by human curators as necessary. Our focus is on whether we can achieve sufficient diversity and recall of key events and whether we can produce the schemas in a sufficiently descriptive style. We show that large language models are able to achieve moderate recall against schemas taken from two different datasets, with even better results when multiple prompts and multiple samples are combined. Moreover, we show that textual entailment methods can be used for both matching schemas to instances of events as well as evaluating overlap between gold and predicted schemas. Our method paves the way for easier distillation of event knowledge from large language model into schemas.

NLP-142-标题: Meta-Learning For Vision-and-Language Cross-lingual Transfer

链接: https://arxiv.org/abs/2305.14843
作者: Hanxu Hu, Frank Keller


Abstract: Current pre-trained vison-language models (PVLMs) achieve excellent performance on a range of multi-modal datasets. Recent work has aimed at building multilingual models, and a range of novel multilingual multi-modal datasets have been proposed. Current PVLMs typically perform poorly on these datasets when used for multi-modal zero-shot or few-shot cross-lingual transfer, especially for low-resource languages. To alleviate this problem, we propose a novel meta-learning fine-tuning framework. Our framework makes current PVLMs rapidly adaptive to new languages in vision-language scenarios by designing MAML in a cross-lingual multi-modal manner. Experiments show that our method boosts the performance of current state-of-the-art PVLMs in both zero-shot and few-shot cross-lingual transfer on a range of vision-language understanding tasks and datasets (XVNLI, xGQA, MaRVL, xFlicker&Co

NLP-143-标题: Exploring Sentiment Analysis Techniques in Natural Language Processing: A Comprehensive Review

链接: https://arxiv.org/abs/2305.14842
作者: Karthick Prasad Gunasekaran


Abstract: Sentiment analysis (SA) is the automated process of detecting and understanding the emotions conveyed through written text. Over the past decade, SA has gained significant popularity in the field of Natural Language Processing (NLP). With the widespread use of social media and online platforms, SA has become crucial for companies to gather customer feedback and shape their marketing strategies. Additionally, researchers rely on SA to analyze public sentiment on various topics. In this particular research study, a comprehensive survey was conducted to explore the latest trends and techniques in SA. The survey encompassed a wide range of methods, including lexicon-based, graph-based, network-based, machine learning, deep learning, ensemble-based, rule-based, and hybrid techniques. The paper also addresses the challenges and opportunities in SA, such as dealing with sarcasm and irony, analyzing multi-lingual data, and addressing ethical concerns. To provide a practical case study, Twitter was chosen as one of the largest online social media platforms. Furthermore, the researchers shed light on the diverse application areas of SA, including social media, healthcare, marketing, finance, and politics. The paper also presents a comparative and comprehensive analysis of existing trends and techniques, datasets, and evaluation metrics. The ultimate goal is to offer researchers and practitioners a systematic review of SA techniques, identify existing gaps, and suggest possible improvements. This study aims to enhance the efficiency and accuracy of SA processes, leading to smoother and error-free outcomes.

NLP-144-标题: PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts ACL2023

链接: https://arxiv.org/abs/2305.14839
作者: Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, Yongbin Li
备注: ACL 2023


Abstract: Perceiving multi-modal information and fulfilling dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, due to the limited availability of multi-modal dialogue data, there is still scarce research on multi-modal dialogue pre-training. Yet another intriguing challenge emerges from the encompassing nature of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, it is essential for designed multi-modal dialogue models to possess sufficient flexibility to adapt to such scenarios. This paper proposes \textbfPaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It utilizes a combination of several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method where old experts from the past can assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialog benchmarks.

NLP-145-标题: ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

链接: https://arxiv.org/abs/2305.14838
作者: Chenyang Le, Yao Qian, Long Zhou, Shujie Liu, Michael Zeng, Xuedong Huang


Abstract: Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.

NLP-146-标题: SummIt: Iterative Text Summarization via ChatGPT

链接: https://arxiv.org/abs/2305.14835
作者: Haopeng Zhang, Xiao Liu, Jiawei Zhang
备注: work in progress


Abstract: Existing text summarization systems have made significant progress in recent years but typically generates summaries in a single step. The one-shot summarization setting is sometimes inadequate, however, as the generated summary may contain hallucinations or overlook important details related to the reader’s interests. In this paper, we address this limitation by proposing SummIt, an iterative text summarization framework based on large language models like ChatGPT. Our framework enables the model to refine the generated summary iteratively through self-evaluation and feedback, closely resembling the iterative process humans undertake when drafting and revising summaries. We also explore using in-context learning to guide the rationale generation and summary refinement. Furthermore, we explore the potential benefits of integrating knowledge and topic extractors into the framework to enhance summary faithfulness and controllability. We evaluate the performance of our framework on three benchmark summarization datasets through empirical and qualitative analyses. We also conduct a human evaluation to validate the effectiveness of the model’s refinements and find a potential issue of over-correction. Our code is available at \urlthis https URL.

NLP-147-标题: Towards Few-shot Entity Recognition in Document Images: A Graph Neural Network Approach Robust to Image Manipulation

链接: https://arxiv.org/abs/2305.14828
作者: Prashant Krishnan, Zilong Wang, Yangkun Wang, Jingbo Shang


Abstract: Recent advances of incorporating layout information, typically bounding box coordinates, into pre-trained language models have achieved significant performance in entity recognition from document images. Using coordinates can easily model the absolute position of each token, but they might be sensitive to manipulations in document images (e.g., shifting, rotation or scaling), especially when the training data is limited in few-shot settings. In this paper, we propose to further introduce the topological adjacency relationship among the tokens, emphasizing their relative position information. Specifically, we consider the tokens in the documents as nodes and formulate the edges based on the topological heuristics from the k-nearest bounding boxes. Such adjacency graphs are invariant to affine transformations including shifting, rotations and scaling. We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings, leading to a novel model LAGER. Extensive experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings and also demonstrate better robustness to manipulations.

NLP-148-标题: Pre-training Intent-Aware Encoders for Zero- and Few-Shot Intent Classification

链接: https://arxiv.org/abs/2305.14827
作者: Mujeen Sung, James Gung, Elman Mansimov, Nikolaos Pappas, Raphael Shu, Salvatore Romeo, Yi Zhang, Vittorio Castelli


Abstract: Intent classification (IC) plays an important role in task-oriented dialogue systems as it identifies user intents from given utterances. However, models trained on limited annotations for IC often suffer from a lack of generalization to unseen intent classes. We propose a novel pre-training method for text encoders that uses contrastive learning with intent psuedo-labels to produce embeddings that are well-suited for IC tasks. By applying this pre-training strategy, we also introduce the pre-trained intent-aware encoder (PIE). Specifically, we first train a tagger to identify key phrases within utterances that are crucial for interpreting intents. We then use these extracted phrases to create examples for pre-training a text encoder in a contrastive manner. As a result, our PIE model achieves up to 5.4% and 4.0% higher accuracy than the previous state-of-the-art pre-trained sentence encoder for the N-way zero- and one-shot settings on four IC datasets.

NLP-149-标题: Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners

链接: https://arxiv.org/abs/2305.14825
作者: Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, Muhan Zhang


Abstract: The emergent few-shot reasoning capabilities of Large Language Models (LLMs) have excited the natural language and machine learning community over recent years. Despite of numerous successful applications, the underlying mechanism of such in-context capabilities still remains unclear. In this work, we hypothesize that the learned \textitsemantics of language tokens do the most heavy lifting during the reasoning process. Different from human’s symbolic reasoning process, the semantic representations of LLMs could create strong connections among tokens, thus composing a superficial logical chain. To test our hypothesis, we decouple semantics from the language reasoning process and evaluate three kinds of reasoning abilities, i.e., deduction, induction and abduction. Our findings reveal that semantics play a vital role in LLMs’ in-context reasoning – LLMs perform significantly better when semantics are consistent with commonsense but struggle to solve symbolic or counter-commonsense reasoning tasks by leveraging in-context new knowledge. The surprising observations question whether modern LLMs have mastered the inductive, deductive and abductive reasoning abilities as in human intelligence, and motivate research on unveiling the magic existing within the black-box LLMs. On the whole, our analysis provides a novel perspective on the role of semantics in developing and evaluating language models’ reasoning abilities. Code is available at \urlthis https URL.

NLP-150-标题: Mitigating Temporal Misalignment by Discarding Outdated Facts

链接: https://arxiv.org/abs/2305.14824
作者: Michael J.Q. Zhang, Eunsol Choi


Abstract: While large language models are able to retain vast amounts of world knowledge seen during pretraining, such knowledge is prone to going out of date and is nontrivial to update. Furthermore, these models are often used under temporal misalignment, tasked with answering questions about the present, despite having only been trained on data collected in the past. To mitigate the effects of temporal misalignment, we propose fact duration prediction: the task of predicting how long a given fact will remain true. In our experiments, we demonstrate how identifying facts that are prone to rapid change can help models avoid from reciting outdated information and identify which predictions require seeking out up-to-date knowledge sources. We also show how modeling fact duration improves calibration for knowledge-intensive tasks, such as open-retrieval question answering, under temporal misalignment by discarding volatile facts. Our data and code will be released publicly at this https URL.

NLP-151-标题: Machine Reading Comprehension using Case-based Reasoning

链接: https://arxiv.org/abs/2305.14815
作者: Dung Thai, Dhruv Agarwal, Mudit Chaudhary, Rajarshi Das, Manzil Zaheer, Jay-Yoon Lee, Hannaneh Hajishirzi, Andrew McCallum
备注: 9 pages, 2 figures


Abstract: We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds on the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a target question, CBR-MRC retrieves a set of similar questions from a memory of observed cases and predicts an answer by selecting the span in the target context that is most similar to the contextualized representations of answers in the retrieved cases. The semi-parametric nature of our approach allows CBR-MRC to attribute a prediction to the specific set of cases used during inference, making it a desirable choice for building reliable and debuggable QA systems. We show that CBR-MRC achieves high test accuracy comparable with large reader models, outperforming baselines by 11.5 and 8.4 EM on NaturalQuestions and NewsQA, respectively. Further, we also demonstrate the ability of CBR-MRC in identifying not just the correct answer tokens but also the span with the most relevant supporting evidence. Lastly, we observe that contexts for certain question types show higher lexical diversity than others and find CBR-MRC to be robust to these variations while performance using fully-parametric methods drops.

NLP-152-标题: AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content

链接: https://arxiv.org/abs/2305.14806
作者: Shuyang Cao, Lu Wang


Abstract: Long document summarization systems are critical for domains with lengthy and jargonladen text, yet they present significant challenges to researchers and developers with limited computing resources. Existing solutions mainly focus on efficient attentions or divide-and-conquer strategies. The former reduces theoretical time complexity, but is still memory-heavy. The latter methods sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory-efficient nature of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) External memory mechanisms track previously encoded document segments and their corresponding summaries, to enhance global document understanding and summary coherence. (2) Global salient content is further identified beforehand to augment each document segment to support its summarization. Extensive experiments on diverse genres of text, including government reports, transcripts, scientific papers, and novels, show that AWESOME produces summaries with improved informativeness, faithfulness, and coherence than competitive baselines on longer documents, while having a similar or smaller GPU memory footprint.

NLP-153-标题: Estimating Large Language Model Capabilities without Labeled Test Data

链接: https://arxiv.org/abs/2305.14802
作者: Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, Robin Jia
备注: 14 pages, 4 figures


Abstract: Large Language Models (LLMs) have exhibited an impressive ability to perform in-context learning (ICL) from only a few examples, but the success of ICL varies widely from task to task. Thus, it is important to quickly determine whether ICL is applicable to a new task, but directly evaluating ICL accuracy can be expensive in situations where test data is expensive to annotate – the exact situations where ICL is most appealing. In this paper, we propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task given only unlabeled data for that task. To perform ICL accuracy estimation, we propose a method that trains a meta-model using LLM confidence scores as features. We compare our method to several strong accuracy estimation baselines on a new benchmark that covers 4 LLMs and 3 task collections. On average, the meta-model improves over all baselines and achieves the same estimation performance as directly evaluating on 40 labeled test examples per task, across the total 12 settings. We encourage future work to improve on our methods and evaluate on our ICL accuracy estimation benchmark to deepen our understanding of when ICL works.

NLP-154-标题: MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions

链接: https://arxiv.org/abs/2305.14795
作者: Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, Danqi Chen
备注: Our code and datasets are available at this https URL


Abstract: The information stored in large language models (LLMs) falls out of date quickly, and retraining from scratch is often not an option. This has recently given rise to a range of techniques for injecting new facts through updating model weights. Current evaluation paradigms are extremely limited, mainly validating the recall of edited facts, but changing one fact should cause rippling changes to the model’s related beliefs. If we edit the UK Prime Minister to now be Rishi Sunak, then we should get a different answer to Who is married to the British Prime Minister? In this work, we present a benchmark MQuAKE (Multi-hop Question Answering for Knowledge Editing) comprising multi-hop questions that assess whether edited models correctly answer questions where the answer should change as an entailed consequence of edited facts. While we find that current knowledge-editing approaches can recall edited facts accurately, they fail catastrophically on the constructed multi-hop questions. We thus propose a simple memory-based approach, MeLLo, which stores all edited facts externally while prompting the language model iteratively to generate answers that are consistent with the edited facts. While MQuAKE remains challenging, we show that MeLLo scales well with LLMs (up to 175B) and outperforms previous model editors by a large margin.

NLP-155-标题: Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak Supervision for Text Classification

链接: https://arxiv.org/abs/2305.14794
作者: Chengyu Dong, Zihan Wang, Jingbo Shang


Abstract: Recent advances in weakly supervised text classification mostly focus on designing sophisticated methods to turn high-level human heuristics into quality pseudo-labels. In this paper, we revisit the seed matching-based method, which is arguably the simplest way to generate pseudo-labels, and show that its power was greatly underestimated. We show that the limited performance of seed matching is largely due to the label bias injected by the simple seed-match rule, which prevents the classifier from learning reliable confidence for selecting high-quality pseudo-labels. Interestingly, simply deleting the seed words present in the matched input texts can mitigate the label bias and help learn better confidence. Subsequently, the performance achieved by seed matching can be improved significantly, making it on par with or even better than the state-of-the-art. Furthermore, to handle the case when the seed words are not made known, we propose to simply delete the word tokens in the input text randomly with a high deletion ratio. Remarkably, seed matching equipped with this random deletion method can often achieve even better performance than that with seed deletion.

NLP-156-标题: Faithful Low-Resource Data-to-Text Generation through Cycle Training ACL2023

链接: https://arxiv.org/abs/2305.14793
作者: Zhuoer Wang, Marcus Collins, Nikhita Vedula, Simone Filice, Shervin Malmasi, Oleg Rokhlenko
备注: 19 pages, 4 figures, ACL 2023


Abstract: Methods to generate text from structured data have advanced significantly in recent years, primarily due to fine-tuning of pre-trained language models on large datasets. However, such models can fail to produce output faithful to the input data, particularly on out-of-domain data. Sufficient annotated data is often not available for specific domains, leading us to seek an unsupervised approach to improve the faithfulness of output text. Since the problem is fundamentally one of consistency between the representations of the structured data and text, we evaluate the effectiveness of cycle training in this work. Cycle training uses two models which are inverses of each other: one that generates text from structured data, and one which generates the structured data from natural language text. We show that cycle training, when initialized with a small amount of supervised data (100 samples in our case), achieves nearly the same performance as fully supervised approaches for the data-to-text generation task on the WebNLG, E2E, WTQ, and WSQL datasets. We perform extensive empirical analysis with automated evaluation metrics and a newly designed human evaluation schema to reveal different cycle training strategies’ effectiveness of reducing various types of generation errors. Our code is publicly available at this https URL.

NLP-157-标题: Large Language Models as Counterfactual Generator: Strengths and Weaknesses

链接: https://arxiv.org/abs/2305.14791
作者: Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, Tieyun Qian


Abstract: Large language models (LLMs) have demonstrated remarkable performance in a range of natural language understanding and generation tasks. Yet, their ability to generate counterfactuals, which can be used for areas like data augmentation, remains under-explored. This study aims to investigate the counterfactual generation capabilities of LLMs and analysis factors that influence this ability. First, we evaluate how effective are LLMs in counterfactual generation through data augmentation experiments for small language models (SLMs) across four tasks: sentiment analysis, natural language inference, named entity recognition, and relation extraction. While LLMs show promising enhancements in various settings, they struggle in complex tasks due to their self-limitations and the lack of logical guidance to produce counterfactuals that align with commonsense. Second, our analysis reveals the pivotal role of providing accurate task definitions and detailed step-by-step instructions to LLMs in generating counterfactuals. Interestingly, we also find that LLMs can generate reasonable counterfactuals even with unreasonable demonstrations, which illustrates that demonstrations are primarily to regulate the output format.This study provides the first comprehensive insight into counterfactual generation abilities of LLMs, and offers a novel perspective on utilizing LLMs for data augmentation to enhance SLMs.

NLP-158-标题: Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation Corpus and Benchmark

链接: https://arxiv.org/abs/2305.14790
作者: Feng Jiang, Weihao Liu, Xiaomin Chu, Peifeng Li, Qiaoming Zhu, Haizhou Li


Abstract: Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings. Such a process unveils the discourse topic structure of a document that benefits quickly grasping and understanding the overall context of the document from a higher level. However, research and applications in this field have been restrained due to the lack of proper paragraph-level topic representations and large-scale, high-quality corpora in Chinese compared to the success achieved in English. Addressing these issues, we introduce a hierarchical paragraph-level topic structure representation with title, subheading, and paragraph that comprehensively models the document discourse topic structure. In addition, we ensure a more holistic representation of topic distribution within the document by using sentences instead of keywords to represent sub-topics. Following this representation, we construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), four times larger than the previously largest one. We also employ a two-stage man-machine collaborative annotation method to ensure the high quality of the corpus both in form and semantics. Finally, we validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) by several strong baselines, and its efficacy has been preliminarily confirmed on the downstream task: discourse parsing. The representation, corpus, and benchmark we established will provide a solid foundation for future studies.

NLP-159-标题: Adapting Language Models to Compress Contexts

链接: https://arxiv.org/abs/2305.14788
作者: Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen


Abstract: Transformer-based language models (LMs) are powerful and widely-applicable tools, but their usefulness is constrained by a finite context window and the expensive computational cost of processing long text documents. We propose to adapt pre-trained LMs into AutoCompressors. These models are capable of compressing long contexts into compact summary vectors, which are then accessible to the model as soft prompts. Summary vectors are trained with an unsupervised objective, whereby long documents are processed in segments and summary vectors from all previous segments are used in language modeling. We fine-tune OPT models on sequences of up to 30,720 tokens and show that AutoCompressors can utilize long contexts to improve perplexity. We evaluate AutoCompressors on in-context learning by compressing task demonstrations. We find that summary vectors are good substitutes for plain-text demonstrations, increasing accuracy while reducing inference cost. Finally, we explore the benefits of pre-computing summary vectors for large corpora by applying summary vectors to retrieval-augmented language modeling. Overall, AutoCompressors emerge as a simple and inexpensive solution for extending the context window of LMs while speeding up inference over long contexts.

NLP-160-标题: ChatGPT and Simple Linguistic Inferences: Blind Spots and Blinds

链接: https://arxiv.org/abs/2305.14785
作者: Victoria Basmov, Yoav Goldberg, Reut Tsarfaty


Abstract: This paper sheds light on the limitations of ChatGPT’s understanding capabilities, focusing on simple inference tasks that are typically easy for humans but appear to be challenging for the model. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. We present expert-designed evaluation sets for these inference types and conduct experiments in a zero-shot setup. Our results show that the model struggles with these types of inferences, exhibiting moderate to low accuracy. Moreover, while ChatGPT demonstrates knowledge of the underlying linguistic concepts when prompted directly, it often fails to incorporate this knowledge to make correct inferences. Even more strikingly, further experiments show that embedding the premise under presupposition triggers or non-factive verbs causes the model to predict entailment more frequently regardless of the correct semantic label. Overall these results suggest that, despite GPT’s celebrated language understanding capacity, ChatGPT has blindspots with respect to certain types of entailment, and that certain entailment-cancelling features act as ``blinds’’ overshadowing the semantics of the embedded premise. Our analyses emphasize the need for further research into the linguistic comprehension and reasoning capabilities of LLMs, in order to improve their reliability, and establish their trustworthiness for real-world applications.

NLP-161-标题: Anthropomorphization of AI: Opportunities and Risks

链接: https://arxiv.org/abs/2305.14784
作者: Ameet Deshpande, Tanmay Rajpurohit, Karthik Narasimhan, Ashwin Kalyan


Abstract: Anthropomorphization is the tendency to attribute human-like traits to non-human entities. It is prevalent in many social contexts – children anthropomorphize toys, adults do so with brands, and it is a literary device. It is also a versatile tool in science, with behavioral psychology and evolutionary biology meticulously documenting its consequences. With widespread adoption of AI systems, and the push from stakeholders to make it human-like through alignment techniques, human voice, and pictorial avatars, the tendency for users to anthropomorphize it increases significantly. We take a dyadic approach to understanding this phenomenon with large language models (LLMs) by studying (1) the objective legal implications, as analyzed through the lens of the recent blueprint of AI bill of rights and the (2) subtle psychological aspects customization and anthropomorphization. We find that anthropomorphized LLMs customized for different user bases violate multiple provisions in the legislative blueprint. In addition, we point out that anthropomorphization of LLMs affects the influence they can have on their users, thus having the potential to fundamentally change the nature of human-AI interaction, with potential for manipulation and negative influence. With LLMs being hyper-personalized for vulnerable groups like children and patients among others, our work is a timely and important contribution. We propose a conservative strategy for the cautious use of anthropomorphization to improve trustworthiness of AI systems.

NLP-162-标题: Disentangled Phonetic Representation for Chinese Spelling Correction ACL2023

链接: https://arxiv.org/abs/2305.14783
作者: Zihong Liang, Xiaojun Quan, Qifan Wang
备注: Accepted to ACL 2023 Main Conference


Abstract: Chinese Spelling Correction (CSC) aims to detect and correct erroneous characters in Chinese texts. Although efforts have been made to introduce phonetic information (Hanyu Pinyin) in this task, they typically merge phonetic representations with character representations, which tends to weaken the representation effect of normal texts. In this work, we propose to disentangle the two types of features to allow for direct interaction between textual and phonetic information. To learn useful phonetic representations, we introduce a pinyin-to-character objective to ask the model to predict the correct characters based solely on phonetic information, where a separation mask is imposed to disable attention from phonetic input to text. To avoid overfitting the phonetics, we further design a self-distillation module to ensure that semantic information plays a major role in the prediction. Extensive experiments on three CSC benchmarks demonstrate the superiority of our method in using phonetic information.

NLP-163-标题: Text Conditional Alt-Text Generation for Twitter Images

链接: https://arxiv.org/abs/2305.14779
作者: Nikita Srivatsan, Sofia Samaniego, Omar Florez, Taylor Berg-Kirkpatrick


Abstract: In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. This task is more than just a special case of image captioning, as alt-text is both more literally descriptive and context-specific. Also critically, images posted to Twitter are often accompanied by user-written text that despite not necessarily describing the image may provide useful context that if properly leveraged can be informative – e.g. the tweet may name an uncommon object in the image that the model has not previously seen. We address this with a CLIP prefix model that extracts an embedding of the image and passes it to a mapping network that outputs a short sequence in word embedding space, or a ``prefix’', to which we also concatenate the text from the tweet itself. This lets the model condition on both visual and textual information from the post. The combined multimodal prefix is then fed as a prompt to a pretrained language model which autoregressively completes the sequence to generate the alt-text. While prior work has used similar methods for captioning, ours is the first to our knowledge that incorporates textual information from the associated social media post into the prefix as well, and we further demonstrate through ablations that utility of these two information sources stacks. We put forward a new dataset scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation, and show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work.

NLP-164-标题: Measuring the Knowledge Acquisition-Utilization Gap in Pretrain ed Language Models

链接: https://arxiv.org/abs/2305.14775
作者: Amirhossein Kazemnejad, Mehdi Rezagholizadeh, Prasanna Parthasarathi, Sarath Chandar


Abstract: While pre-trained language models (PLMs) have shown evidence of acquiring vast amounts of knowledge, it remains unclear how much of this parametric knowledge is actually usable in performing downstream tasks. We propose a systematic framework to measure parametric knowledge utilization in PLMs. Our framework first extracts knowledge from a PLM’s parameters and subsequently constructs a downstream task around this extracted knowledge. Performance on this task thus depends exclusively on utilizing the model’s possessed knowledge, avoiding confounding factors like insufficient signal. As an instantiation, we study factual knowledge of PLMs and measure utilization across 125M to 13B parameter PLMs. We observe that: (1) PLMs exhibit two gaps - in acquired vs. utilized knowledge, (2) they show limited robustness in utilizing knowledge under distribution shifts, and (3) larger models close the acquired knowledge gap but the utilized knowledge gap remains. Overall, our study provides insights into PLMs’ capabilities beyond their acquired knowledge.

NLP-165-标题: A Controllable QA-based Framework for Decontextualization

链接: https://arxiv.org/abs/2305.14772
作者: Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo
备注: 11 pages, 3 figures, 6 tables


Abstract: Many real-world applications require surfacing extracted snippets to users, whether motivated by assistive tools for literature surveys or document cross-referencing, or needs to mitigate and recover from model generated inaccuracies., Yet, these passages can be difficult to consume when divorced from their original document context. In this work, we explore the limits of LLMs to perform decontextualization of document snippets in user-facing scenarios, focusing on two real-world settings - question answering and citation context previews for scientific documents. We propose a question-answering framework for decontextualization that allows for better handling of user information needs and preferences when determining the scope of rewriting. We present results showing state-of-the-art LLMs under our framework remain competitive with end-to-end approaches. We also explore incorporating user preferences into the system, finding our framework allows for controllability.

NLP-166-标题: SSD-2: Scaling and Inference-time Fusion of Diffusion Language Models

链接: https://arxiv.org/abs/2305.14771
作者: Xiaochuang Han, Sachin Kumar, Yulia Tsvetkov, Marjan Ghazvininejad


Abstract: Diffusion-based language models (LMs) have been shown to be competent generative models that are easy to control at inference and are a promising alternative to autoregressive LMs. While autoregressive LMs have benefited immensely from scaling and instruction-based learning, existing studies on diffusion LMs have been conducted on a relatively smaller scale. Starting with a recently proposed diffusion model SSD-LM, in this work we explore methods to scale it from 0.4B to 13B parameters, proposing several techniques to improve its training and inference efficiency. We call the new model SSD-2. We further show that this model can be easily finetuned to follow instructions. Finally, leveraging diffusion models’ capability at inference-time control, we show that SSD-2 facilitates novel ensembles with 100x smaller models that can be customized and deployed by individual users. We find that compared to autoregressive models, the collaboration between diffusion models is more effective, leading to higher-quality and more relevant model responses due to their ability to incorporate bi-directional contexts.

NLP-167-标题: Using Natural Language Explanations to Rescale Human Judgments

链接: https://arxiv.org/abs/2305.14770
作者: Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett
备注: Data available at this https URL


Abstract: The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over the judgments of multiple crowdworkers. However, different annotators may have different interpretations of labeling schemes unless given extensive training, and for subjective NLP tasks, even trained expert annotators can diverge heavily. We show that these nuances can be captured by high quality natural language explanations, and propose a method to rescale ordinal annotation in the presence of disagreement using LLMs. Specifically, we feed Likert ratings and corresponding natural language explanations into an LLM and prompt it to produce a numeric score. This score should reflect the underlying assessment of the example by the annotator. The presence of explanations allows the LLM to homogenize ratings across annotators in spite of scale usage differences. We explore our technique in the context of a document-grounded question answering task on which large language models achieve near-human performance. Among questions where annotators identify incompleteness in the answers, our rescaling improves correlation between nearly all annotator pairs, improving pairwise correlation on these examples by an average of 0.2 Kendall’s tau.

NLP-168-标题: BeamSearchQA: Large Language Models are Strong Zero-Shot QA Solver

链接: https://arxiv.org/abs/2305.14766
作者: Hao Sun, Xiao Liu, Yeyun Gong, Yan Zhang, Nan Duan
备注: Work in progress


Abstract: Open-domain question answering is a crucial task that often requires accessing external information. Existing methods typically adopt a single-turn retrieve-then-read approach, where relevant documents are first retrieved, and questions are then answered based on the retrieved information. However, there are cases where answering a question requires implicit knowledge that is not directly retrievable from the question itself. In this work, we propose a novel question-answering pipeline called eamSearchQA. Our approach leverages large language models(LLMs) to iteratively generate new questions about the original question, enabling an iterative reasoning process. By iteratively refining and expanding the scope of the question, our method aims to capture and utilize hidden knowledge that may not be directly obtainable through retrieval. We evaluate our approach on the widely-used open-domain NQ and WebQ datasets. The experimental results demonstrate that BeamSearchQA significantly outperforms other zero-shot baselines, indicating its effectiveness in tackling the challenges of open-domain question answering.

NLP-169-标题: Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models

链接: https://arxiv.org/abs/2305.14763
作者: Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz


Abstract: The escalating debate on AI’s capabilities warrants developing reliable metrics to assess machine “intelligence”. Recently, many anecdotal examples were used to suggest that newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM); however, prior work reached conflicting conclusions regarding those abilities. We investigate the extent of LLMs’ N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust. We further examine the factors impacting performance on N-ToM tasks and discover that LLMs struggle with adversarial examples, indicating reliance on shallow heuristics rather than robust ToM abilities. We caution against drawing conclusions from anecdotal examples, limited benchmark testing, and using human-designed psychological tests to evaluate models.

NLP-170-标题: UniChart: A Universal Vision-language Pretrain ed Model for Chart Comprehension and Reasoning

链接: https://arxiv.org/abs/2305.14761
作者: Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, Shafiq Joty


Abstract: Charts are very popular for analyzing data, visualizing key insights and answering complex reasoning questions about data. To facilitate chart-based data analysis using natural language, several downstream tasks have been introduced recently such as chart question answering and chart summarization. However, most of the methods that solve these tasks use pretraining on language or vision-language tasks that do not attempt to explicitly model the structure of the charts (e.g., how data is visually encoded and how chart elements are related to each other). To address this, we first build a large corpus of charts covering a wide variety of topics and visual styles. We then present UniChart, a pretrained model for chart comprehension and reasoning. UniChart encodes the relevant text, data, and visual elements of charts and then uses a chart-grounded text decoder to generate the expected output in natural language. We propose several chart-specific pretraining tasks that include: (i) low-level tasks to extract the visual elements (e.g., bars, lines) and data from charts, and (ii) high-level tasks to acquire chart understanding and reasoning skills. We find that pretraining the model on a large corpus with chart-specific low- and high-level tasks followed by finetuning on three down-streaming tasks results in state-of-the-art performance on three downstream tasks.

NLP-171-标题: Bi-Drop: Generalizable Fine-tuning for Pre-trained Language Models via Adaptive Subnetwork Optimization

链接: https://arxiv.org/abs/2305.14760
作者: Shoujie Tong, Heming Xia, Damai Dai, Tianyu Liu, Binghuai Lin, Yunbo Cao, Zhifang Sui
备注: Work in progress. Co-first authors with equal contributions


Abstract: Pretrained language models have achieved remarkable success in a variety of natural language understanding tasks. Nevertheless, finetuning large pretrained models on downstream tasks is susceptible to overfitting if the training set is limited, which will lead to diminished performance. In this work, we propose a dynamic fine-tuning strategy for pretrained language models called Bi-Drop. It utilizes the gradient information of various sub-models generated by dropout to update the model parameters selectively. Experiments on the GLUE benchmark show that Bi-Drop outperforms previous fine-tuning methods by a considerable margin, and exhibits consistent superiority over vanilla fine-tuning across various pretrained models. Furthermore, empirical results indicate that Bi-Drop yields substantial improvements in the multiple task or domain transfer, data imbalance, and low-resource scenarios, demonstrating superb generalization ability and robustness.

NLP-172-标题: Human-Centered Metrics for Dialog System Evaluation

链接: https://arxiv.org/abs/2305.14757
作者: Salvatore Giorgi, Shreya Havaldar, Farhan Ahmed, Zuhaib Akhtar, Shalaka Vaidya, Gary Pan, Lyle H. Ungar, H. Andrew Schwartz, Joao Sedoc


Abstract: We present metrics for evaluating dialog systems through a psychologically-grounded “human” lens: conversational agents express a diversity of both states (short-term factors like emotions) and traits (longer-term factors like personality) just as people do. These interpretable metrics consist of five measures from established psychology constructs that can be applied both across dialogs and on turns within dialogs: emotional entropy, linguistic style and emotion matching, as well as agreeableness and empathy. We compare these human metrics against 6 state-of-the-art automatic metrics (e.g. BARTScore and BLEURT) on 7 standard dialog system data sets. We also introduce a novel data set, the Three Bot Dialog Evaluation Corpus, which consists of annotated conversations from ChatGPT, GPT-3, and BlenderBot. We demonstrate the proposed human metrics offer novel information, are uncorrelated with automatic metrics, and lead to increased accuracy beyond existing automatic metrics for predicting crowd-sourced dialog judgements. The interpretability and unique signal of our proposed human-centered framework make it a valuable tool for evaluating and improving dialog systems.

NLP-173-标题: Dont Take This Out of Context! On the Need for Contextual Models and Evaluations for Stylistic Rewriting

链接: https://arxiv.org/abs/2305.14755
作者: Akhila Yerukola, Xuhui Zhou, Maarten Sap
备注: may 24 submission


Abstract: Most existing stylistic text rewriting methods operate on a sentence level, but ignoring the broader context of the text can lead to generic, ambiguous, and incoherent rewrites. In this paper, we propose the integration of preceding textual context into both the rewriting and evaluation stages of stylistic text rewriting, focusing on formality, toxicity, and sentiment transfer tasks. We conduct a comparative evaluation of rewriting through few-shot prompting of GPT-3.5 and GPT NeoX, comparing non-contextual rewrites to contextual rewrites. Our experiments show that humans often prefer contextual rewrites over non-contextual ones, but automatic metrics (e.g., BLEU, sBERT) do not. To bridge this gap, we propose context-infused versions of common automatic metrics, and show that these better reflect human preferences. Overall, our paper highlights the importance of integrating preceding textual context into both the rewriting and evaluation stages of stylistic text rewriting.

NLP-174-标题: DialogVCS: Robust Natural Language Understanding in Dialogue System Upgrade

链接: https://arxiv.org/abs/2305.14751
作者: Zefan Cai, Xin Zheng, Tianyu Liu, Xu Wang, Haoran Meng, Jiaqi Han, Gang Yuan, Binghuai Lin, Baobao Chang, Yunbo Cao
备注: work in progress. The first three authors contribute equally


Abstract: In the constant updates of the product dialogue systems, we need to retrain the natural language understanding (NLU) model as new data from the real users would be merged into the existent data accumulated in the last updates. Within the newly added data, new intents would emerge and might have semantic entanglement with the existing intents, e.g. new intents that are semantically too specific or generic are actually subset or superset of some existing intents in the semantic space, thus impairing the robustness of the NLU model. As the first attempt to solve this problem, we setup a new benchmark consisting of 4 Dialogue Version Control dataSets (DialogVCS). We formulate the intent detection with imperfect data in the system update as a multi-label classification task with positive but unlabeled intents, which asks the models to recognize all the proper intents, including the ones with semantic entanglement, in the inference. We also propose comprehensive baseline models and conduct in-depth analyses for the benchmark, showing that the semantically entangled intents can be effectively recognized with an automatic workflow.

NLP-175-标题: Mastering the ABCDs of Complex Questions: Answer-Based Claim Decomposition for Fine-grained Self-Evaluation

链接: https://arxiv.org/abs/2305.14750
作者: Nishant Balepur, Jie Huang, Samraj Moorjani, Hari Sundaram, Kevin Chen-Chuan Chang
备注: In progress preprint


Abstract: When answering complex questions, large language models (LLMs) may produce answers that do not satisfy all criteria of the question. While existing self-evaluation techniques aim to detect if such answers are correct, these techniques are unable to determine which criteria of the question are satisfied by the generated answers. To address this issue, we propose answer-based claim decomposition (ABCD), a prompting strategy that decomposes questions into a series of true/false claims that can be used to verify which criteria of the input question an answer satisfies. Using the decomposed ABCD claims, we perform fine-grained self-evaluation. Through preliminary experiments on three datasets, including a newly-collected challenge dataset ObscureQA, we find that GPT-3.5 has some ability to determine to what extent its answer satisfies the criteria of the input question, and can give insights into the errors and knowledge gaps of the model.

NLP-176-标题: ECHo: Event Causality Inference via Human-centric Reasoning

链接: https://arxiv.org/abs/2305.14740
作者: Yuxi Xie, Guanzhen Li, Min-Yen Kan
备注: Please find data and code at this https URL


Abstract: We introduce ECHo, a diagnostic dataset of event causality inference grounded in visual-and-linguistic social scenarios. ECHo employs real-world human-centric deductive information collected from crime drama, bridging the gap in multimodal reasoning towards higher social intelligence through the elicitation of intermediate Theory-of-Mind (ToM). We propose a unified framework aligned with the Chain-of-Thought (CoT) paradigm to assess the reasoning capability of current AI systems. This ToM-enhanced CoT pipeline can accommodate and integrate various large foundation models in zero-shot visual-and-linguistic understanding. With this framework, we scrutinize the advanced large language and multimodal models via three complementary human-centric ECHo tasks. Further analysis demonstrates ECHo as a challenging dataset to expose imperfections and inconsistencies in reasoning.

NLP-177-标题: Trusting Your Evidence: Hallucinate Less with Context-aware Decoding

链接: https://arxiv.org/abs/2305.14739
作者: Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, Scott Wen-tau Yih


Abstract: Language models (LMs) often struggle to pay enough attention to the input context, and generate texts that are unfaithful or contain hallucinations. To mitigate this issue, we present context-aware decoding (CAD), which follows a contrastive output distribution that amplifies the difference between the output probabilities when a model is used with and without context. Our experiments show that CAD, without additional training, significantly improves the faithfulness of different LM families, including OPT, GPT, LLaMA and FLAN-T5 for summarization tasks (e.g., 14.3% gain for LLaMA in factuality metrics). Furthermore, CAD is particularly effective in overriding a model’s prior knowledge when it contradicts the provided context, leading to substantial improvements in tasks where resolving the knowledge conflict is essential.

NLP-178-标题: Centering the Margins: Outlier-Based Identification of Harmed Populations in Toxicity Detection

链接: https://arxiv.org/abs/2305.14735
作者: Vyoma Raman, Eve Fleisig, Dan Klein


Abstract: A standard method for measuring the impacts of AI on marginalized communities is to determine performance discrepancies between specified demographic groups. These approaches aim to address harms toward vulnerable groups, but they obscure harm patterns faced by intersectional subgroups or shared across demographic groups. We instead operationalize “the margins” as data points that are statistical outliers due to having demographic attributes distant from the “norm” and measure harms toward these outliers. We propose a Group-Based Performance Disparity Index (GPDI) that measures the extent to which a subdivision of a dataset into subgroups identifies those facing increased harms. We apply our approach to detecting disparities in toxicity detection and find that text targeting outliers is 28% to 86% more toxic for all types of toxicity examined. We also discover that model performance is consistently worse for demographic outliers, with disparities in error between outliers and non-outliers ranging from 28% to 71% across toxicity types. Our outlier-based analysis has comparable or higher GPDI than traditional subgroup-based analyses, suggesting that outlier analysis enhances identification of subgroups facing greater harms. Finally, we find that minoritized racial and religious groups are most associated with outliers, which suggests that outlier analysis is particularly beneficial for identifying harms against those groups.

NLP-179-标题: Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation

链接: https://arxiv.org/abs/2305.14734
作者: Bashar Alhafni, Go Inoue, Christian Khairallah, Nizar Habash


Abstract: Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC by using two newly developed Transformer-based pretrained sequence-to-sequence models. We address the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. Moreover, we also investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve state-of-the-art results on two Arabic GEC shared tasks datasets and establish a strong benchmark on a newly created dataset.

NLP-180-标题: SenteCon: Leveraging Lexicons to Learn Human-Interpretable Language Representations ACL2023

链接: https://arxiv.org/abs/2305.14728
作者: Victoria Lin, Louis-Philippe Morency
备注: Accepted to Findings of ACL 2023


Abstract: Although deep language representations have become the dominant form of language featurization in recent years, in many settings it is important to understand a model’s decision-making process. This necessitates not only an interpretable model but also interpretable features. In particular, language must be featurized in a way that is interpretable while still characterizing the original text well. We present SenteCon, a method for introducing human interpretability in deep language representations. Given a passage of text, SenteCon encodes the text as a layer of interpretable categories in which each dimension corresponds to the relevance of a specific category. Our empirical evaluations indicate that encoding language with SenteCon provides high-level interpretability at little to no cost to predictive performance on downstream tasks. Moreover, we find that SenteCon outperforms existing interpretable language representations with respect to both its downstream performance and its agreement with human characterizations of the text.

NLP-181-标题: In-Context Demonstration Selection with Cross Entropy Difference

链接: https://arxiv.org/abs/2305.14726
作者: Dan Iter, Reid Pryzant, Ruochen Xu, Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu


Abstract: Large language models (LLMs) can use in-context demonstrations to improve performance on zero-shot tasks. However, selecting the best in-context examples is challenging because model performance can vary widely depending on the selected examples. We present a cross-entropy difference (CED) method for selecting in-context demonstrations. Our method is based on the observation that the effectiveness of in-context demonstrations negatively correlates with the perplexity of the test example by a language model that was finetuned on that demonstration. We utilize parameter efficient finetuning to train small models on training data that are used for computing the cross-entropy difference between a test example and every candidate in-context demonstration. This metric is used to rank and select in-context demonstrations independently for each test input. We evaluate our method on a mix-domain dataset that combines 8 benchmarks, representing 4 text generation tasks, showing that CED for in-context demonstration selection can improve performance for a variety of LLMs.

NLP-182-标题: AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes

链接: https://arxiv.org/abs/2305.14725
作者: Barry Menglong Yao, Yu Chen, Qifan Wang, Sijia Wang, Minqian Liu, Zhiyang Xu, Licheng Yu, Lifu Huang
备注: 12 pages, 4 figures


Abstract: We propose attribute-aware multimodal entity linking, where the input is a mention described with a text and image, and the goal is to predict the corresponding target entity from a multimodal knowledge base (KB) where each entity is also described with a text description, a visual image and a set of attributes and values. To support this research, we construct AMELI, a large-scale dataset consisting of 18,472 reviews and 35,598 products. To establish baseline performance on AMELI, we experiment with the current state-of-the-art multimodal entity linking approaches and our enhanced attribute-aware model and demonstrate the importance of incorporating the attribute information into the entity linking process. To be best of our knowledge, we are the first to build benchmark dataset and solutions for the attribute-aware multimodal entity linking task. Datasets and codes will be made publicly available.

NLP-183-标题: I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors ACL2023

链接: https://arxiv.org/abs/2305.14724
作者: Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, Marianna Apidianaki, Smaranda Muresan
备注: ACL 2023 (Findings)


Abstract: Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Similar to linguistic metaphors, they convey meaning implicitly through symbolism and juxtaposition of the symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models, such as DALL \cdot E 2, since it requires the ability to model implicit meaning and compositionality. We propose to solve the task through the collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models.Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this this http URL evaluate the utility of our Human-AI collaboration framework and the quality of our dataset, we perform both an intrinsic human-based evaluation and an extrinsic evaluation using visual entailment as a downstream task.

NLP-184-标题: CuRIAM: Corpus re Interpretation and Metalanguage in U.S. Supreme Court Opinions

链接: https://arxiv.org/abs/2305.14719
作者: Michael Kranzlein, Nathan Schneider, Kevin Tobia


Abstract: Most judicial decisions involve the interpretation of legal texts; as such, judicial opinion requires the use of language as a medium to comment on or draw attention to other language. Language used this way is called metalanguage. We develop an annotation schema for categorizing types of legal metalanguage and apply our schema to a set of U.S. Supreme Court opinions, yielding a corpus totaling 59k tokens. We remark on several patterns observed in the kinds of metalanguage used by the justices.

NLP-185-标题: Improving Language Models with Advantage-based Offline Policy Gradients

链接: https://arxiv.org/abs/2305.14718
作者: Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl


Abstract: Improving language model generations according to some user-defined quality or style constraints is challenging. Typical approaches include learning on additional human-written data, filtering low-quality'' data using heuristics and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signals, whereas data collection and RLHF constantly require additional human-written or LM exploration data which can be costly to obtain. A natural question to ask is Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?‘’ To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients for learning language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments with five different language generation tasks using models of varying sizes and multiple rewards show that models trained with LoL-RL can consistently outperform the best supervised learning models. We also release our experimental code. this https URL

NLP-186-标题: Exploiting Correlations Between Contexts and Definitions with Multiple Definition Modeling

链接: https://arxiv.org/abs/2305.14717
作者: Linhan Zhang, Qian Chen, Wen Wang, Yuxin Jiang, Bing Li, Wei Wang, Xin Cao


Abstract: Definition modeling is an important task in advanced natural language applications such as understanding and conversation. Since its introduction, it focus on generating one definition for a target word or phrase in a given context, which we refer to as Single Definition Modeling (SDM). However, this approach does not adequately model the correlations and patterns among different contexts and definitions of words. In addition, the creation of a training dataset for SDM requires significant human expertise and effort. In this paper, we carefully design a new task called Multiple Definition Modeling (MDM) that pool together all contexts and definition of target words. We demonstrate the ease of creating a model as well as multiple training sets automatically. % In the experiments, we demonstrate and analyze the benefits of MDM, including improving SDM’s performance by using MDM as the pretraining task and its comparable performance in the zero-shot setting.

NLP-187-标题: GlobalBench: A Benchmark for Global Progress in Natural Language Processing

链接: https://arxiv.org/abs/2305.14716
作者: Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, Graham Neubig
备注: Preprint, 9 pages


Abstract: Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have focused on a limited number of tasks and languages. In contrast, GlobalBench is an ever-expanding collection that aims to dynamically track progress on all NLP datasets in all languages. Rather than solely measuring accuracy, GlobalBench also tracks the estimated per-speaker utility and equity of technology across all languages, providing a multi-faceted view of how language technology is serving people of the world. Furthermore, GlobalBench is designed to identify the most under-served languages, and rewards research efforts directed towards those languages. At present, the most under-served languages are the ones with a relatively high population, but nonetheless overlooked by composite multilingual benchmarks (like Punjabi, Portuguese, and Wu Chinese). Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.

NLP-188-标题: Gender Biases in Automatic Evaluation Metrics: A Case Study on Image Captioning

链接: https://arxiv.org/abs/2305.14711
作者: Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng


Abstract: Pretrained model-based evaluation metrics have demonstrated strong performance with high correlations with human judgments in various natural language generation tasks such as image captioning. Despite the impressive results, their impact on fairness is under-explored – it is widely acknowledged that pretrained models can encode societal biases, and utilizing them for evaluation purposes may inadvertently manifest and potentially amplify biases. In this paper, we conduct a systematic study in gender biases of model-based evaluation metrics with a focus on image captioning tasks. Specifically, we first identify and quantify gender biases in different evaluation metrics regarding profession, activity, and object concepts. Then, we demonstrate the negative consequences of using these biased metrics, such as favoring biased generation models in deployment and propagating the biases to generation models through reinforcement learning. We also present a simple but effective alternative to reduce gender biases by combining n-gram matching-based and pretrained model-based evaluation metrics.

NLP-189-标题: Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

链接: https://arxiv.org/abs/2305.14710
作者: Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen


Abstract: Instruction-tuned models are trained on crowdsourcing datasets with task instructions to achieve superior performance. However, in this work we raise security concerns about this training paradigm. Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions among thousands of gathered data and control model behavior through data poisoning, without even the need of modifying data instances or labels themselves. Through such instruction attacks, the attacker can achieve over 90% attack success rate across four commonly used NLP datasets, and cause persistent backdoors that are easily transferred to 15 diverse datasets zero-shot. In this way, the attacker can directly apply poisoned instructions designed for one dataset on many other datasets. Moreover, the poisoned model cannot be cured by continual learning. Lastly, instruction attacks show resistance to existing inference-time defense. These findings highlight the need for more robust defenses against data poisoning attacks in instructiontuning models and underscore the importance of ensuring data quality in instruction crowdsourcing.

NLP-190-标题: The student becomes the master: Matching GPT 3 on Scientific Factual Error Correction

链接: https://arxiv.org/abs/2305.14707
作者: Dhananjay Ashok, Atharva Kulkarni, Hai Pham, Barnabás Póczos


Abstract: Due to the prohibitively high cost of creating error correction datasets, most Factual Claim Correction methods rely on a powerful verification model to guide the correction process. This leads to a significant drop in performance in domains like Scientific Claim Correction, where good verification models do not always exist. In this work, we introduce a claim correction system that makes no domain assumptions and does not require a verifier but is able to outperform existing methods by an order of magnitude – achieving 94% correction accuracy on the SciFact dataset, and 62.5% on the SciFact-Open dataset, compared to the next best methods 0.5% and 1.50% respectively. Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset that can be used for fully supervised training and regularization. We additionally use a claim-aware decoding procedure to improve the quality of corrected claims. Our method is competitive with the very LLM that was used to generate the annotated dataset – with GPT3.5 achieving 89.5% and 60% correction accuracy on SciFact and SciFact-Open, despite using 1250 times as many parameters as our model.

NLP-191-标题: Flan-MoE: Scaling Instruction-Finetuned Language Models with Sparse Mixture of Experts

链接: https://arxiv.org/abs/2305.14705
作者: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou
备注: Preprint


Abstract: The explosive growth of language models and their applications have led to an increased demand for efficient and scalable methods. In this paper, we introduce Flan-MoE, a set of Instruction-Finetuned Sparse Mixture-of-Expert (MoE) models. We show that naively finetuning MoE models on a task-specific dataset (in other words, no instruction-finetuning) often yield worse performance compared to dense models of the same computational complexity. However, our Flan-MoE outperforms dense models under multiple experiment settings: instruction-finetuning only and instruction-finetuning followed by task-specific finetuning. This shows that instruction-finetuning is an essential stage for MoE models. Specifically, our largest model, Flan-MoE-32B, surpasses the performance of Flan-PaLM-62B on four benchmarks, while utilizing only one-third of the FLOPs. The success of Flan-MoE encourages rethinking the design of large-scale, high-performance language models, under the setting of task-agnostic learning.

NLP-192-标题: Analyzing Influential Factors in Human Preference Judgments via GPT -4

链接: https://arxiv.org/abs/2305.14702
作者: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Fei Liu


Abstract: Pairwise human judgments are pivotal in guiding large language models (LLMs) to generate outputs that align with human preferences. They are also often used in summarization evaluation, complementing existing automatic metrics. Despite their significance, however, there has been limited research probing these pairwise human judgments. The collective impact and respective weights of factors such as informativeness, coherence, fluency, and factual consistency remain elusive. The impact of hidden factors on the final judgment is also unclear. In this paper, we conduct an in-depth examination of a dataset of pairwise human judgments released by OpenAI. Utilizing the Bradley-Terry-Luce model, we identify key factors that could potentially influence human judgments. Our research uncovers the inherent preferences embedded in human judgments and suggests strategies to boost sample efficiency. Finally, we provide insights on the construction of balanced datasets for human judgment evaluations, a crucial step in shaping the behaviors of future LLMs.

NLP-193-标题: Modeling rapid language learning by distilling Bayesian priors into artificial neural networks

链接: https://arxiv.org/abs/2305.14701
作者: R. Thomas McCoy, Thomas L. Griffiths
备注: 21 pages plus references; 4 figures


Abstract: Humans can learn languages from remarkably little experience. Developing computational models that explain this ability has been a major challenge in cognitive science. Bayesian models that build in strong inductive biases - factors that guide generalization - have been successful at explaining how humans might generalize from few examples in controlled settings but are usually too restrictive to be tractably applied to more naturalistic data. By contrast, neural networks have flexible representations that allow them to learn well from naturalistic data but require many more examples than humans receive. We show that learning from limited naturalistic data is possible with an approach that combines the strong inductive biases of a Bayesian model with the flexible representations of a neural network. This approach works by distilling a Bayesian model’s biases into a neural network. Like a Bayesian model, the resulting system can learn formal linguistic patterns from a small number of examples. Like a neural network, it can also learn aspects of English syntax from a corpus of natural language - and it outperforms a standard neural network at acquiring the linguistic phenomena of recursion and priming. Bridging the divide between Bayesian models and neural networks makes it possible to handle a broader range of learning scenarios than either approach can handle on its own.

NLP-194-标题: SELFOOD: Self-Supervised Out-Of-Distribution Detection via Learning to Rank

链接: https://arxiv.org/abs/2305.14696
作者: Dheeraj Mekala, Adithya Samavedhi, Chengyu Dong, Jingbo Shang


Abstract: Deep neural classifiers trained with cross-entropy loss (CE loss) often suffer from poor calibration, necessitating the task of out-of-distribution (OOD) detection. Traditional supervised OOD detection methods require expensive manual annotation of in-distribution and OOD samples. To address the annotation bottleneck, we introduce SELFOOD, a self-supervised OOD detection method that requires only in-distribution samples as supervision. We cast OOD detection as an inter-document intra-label (IDIL) ranking problem and train the classifier with our pairwise ranking loss, referred to as IDIL loss. Specifically, given a set of in-distribution documents and their labels, for each label, we train the classifier to rank the softmax scores of documents belonging to that label to be higher than the scores of documents that belong to other labels. Unlike CE loss, our IDIL loss function reaches zero when the desired confidence ranking is achieved and gradients are backpropagated to decrease probabilities associated with incorrect labels rather than continuously increasing the probability of the correct label. Extensive experiments with several classifiers on multiple classification datasets demonstrate the effectiveness of our method in both coarse- and fine-grained settings.

NLP-195-标题: A Causal View of Entity Bias in (Large) Language Models

链接: https://arxiv.org/abs/2305.14695
作者: Fei Wang, Wenjie Mo, Yiwei Wang, Wenxuan Zhou, Muhao Chen
备注: Work in progress


Abstract: Entity bias widely affects pretrained (large) language models, causing them to excessively rely on (biased) parametric knowledge to make unfaithful predictions. Although causality-inspired methods have shown great potential to mitigate entity bias, it is hard to precisely estimate the parameters of underlying causal models in practice. The rise of black-box LLMs also makes the situation even worse, because of their inaccessible parameters and uncalibrated logits. To address these problems, we propose a specific structured causal model (SCM) whose parameters are comparatively easier to estimate. Building upon this SCM, we propose causal intervention techniques to mitigate entity bias for both white-box and black-box settings. The proposed causal intervention perturbs the original entity with neighboring entities. This intervention reduces specific biasing information pertaining to the original entity while still preserving sufficient common predictive information from similar entities. When evaluated on the relation extraction task, our training-time intervention significantly improves the F1 score of RoBERTa by 5.7 points on EntRED, in which spurious shortcuts between entities and labels are removed. Meanwhile, our in-context intervention effectively reduces the knowledge conflicts between parametric knowledge and contextual knowledge in GPT-3.5 and improves the F1 score by 9.14 points on a challenging test set derived from Re-TACRED.

NLP-196-标题: Have Large Language Models Developed a Personality?: Applicability of Self-Assessment Tests in Measuring Personality in LLMs

链接: https://arxiv.org/abs/2305.14693
作者: Xiaoyang Song, Akshat Gupta, Kiyan Mohebbizadeh, Shujie Hu, Anant Singh


Abstract: Have Large Language Models (LLMs) developed a personality? The short answer is a resounding “We Don’t Know!”. In this paper, we show that we do not yet have the right tools to measure personality in language models. Personality is an important characteristic that influences behavior. As LLMs emulate human-like intelligence and performance in various tasks, a natural question to ask is whether these models have developed a personality. Previous works have evaluated machine personality through self-assessment personality tests, which are a set of multiple-choice questions created to evaluate personality in humans. A fundamental assumption here is that human personality tests can accurately measure personality in machines. In this paper, we investigate the emergence of personality in five LLMs of different sizes ranging from 1.5B to 30B. We propose the Option-Order Symmetry property as a necessary condition for the reliability of these self-assessment tests. Under this condition, the answer to self-assessment questions is invariant to the order in which the options are presented. We find that many LLMs personality test responses do not preserve option-order symmetry. We take a deeper look at LLMs test responses where option-order symmetry is preserved to find that in these cases, LLMs do not take into account the situational statement being tested and produce the exact same answer irrespective of the situation being tested. We also identify the existence of inherent biases in these LLMs which is the root cause of the aforementioned phenomenon and makes self-assessment tests unreliable. These observations indicate that self-assessment tests are not the correct tools to measure personality in LLMs. Through this paper, we hope to draw attention to the shortcomings of current literature in measuring personality in LLMs and call for developing tools for machine personality measurement.

NLP-197-标题: Expert Prompt ing: Instructing Large Language Models to be Distinguished Experts

链接: https://arxiv.org/abs/2305.14688
作者: Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, Zhendong Mao


Abstract: The answering quality of an aligned large language model (LLM) can be drastically improved if treated with proper crafting of prompts. In this paper, we propose ExpertPrompting to elicit the potential of LLMs to answer as distinguished experts. We first utilize In-Context Learning to automatically synthesize detailed and customized descriptions of the expert identity for each specific instruction, and then ask LLMs to provide answer conditioned on such agent background. Based on this augmented prompting strategy, we produce a new set of instruction-following data using GPT-3.5, and train a competitive open-source chat assistant called ExpertLLaMA. We employ GPT4-based evaluation to show that 1) the expert data is of significantly higher quality than vanilla answers, and 2) ExpertLLaMA outperforms existing open-source opponents and achieves 96% of the original ChatGPT’s capability. All data and the ExpertLLaMA model will be made publicly available at \urlthis https URL.

NLP-198-标题: TACR: A Table-alignment-based Cell-selection and Reasoning Model for Hybrid Question-Answering ACL2023

链接: https://arxiv.org/abs/2305.14682
作者: Jian Wu, Yicheng Xu, Yan Gao, Jian-Guang Lou, Börje F. Karlsson, Manabu Okumura
备注: Accepted at Findings of ACL 2023


Abstract: Hybrid Question-Answering (HQA), which targets reasoning over tables and passages linked from table cells, has witnessed significant research in recent years. A common challenge in HQA and other passage-table QA datasets is that it is generally unrealistic to iterate over all table rows, columns, and linked passages to retrieve evidence. Such a challenge made it difficult for previous studies to show their reasoning ability in retrieving answers. To bridge this gap, we propose a novel Table-alignment-based Cell-selection and Reasoning model (TACR) for hybrid text and table QA, evaluated on the HybridQA and WikiTableQuestions datasets. In evidence retrieval, we design a table-question-alignment enhanced cell-selection method to retrieve fine-grained evidence. In answer reasoning, we incorporate a QA module that treats the row containing selected cells as context. Experimental results over the HybridQA and WikiTableQuestions (WTQ) datasets show that TACR achieves state-of-the-art results on cell selection and outperforms fine-grained evidence retrieval baselines on HybridQA, while achieving competitive performance on WTQ. We also conducted a detailed analysis to demonstrate that being able to align questions to tables in the cell-selection stage can result in important gains from experiments of over 90% table row and column selection accuracy, meanwhile also improving output explainability.

NLP-199-标题: Emergent inabilities? Inverse scaling over the course of pretrain ing

链接: https://arxiv.org/abs/2305.14681
作者: James A. Michaelov, Benjamin K. Bergen


Abstract: Does inverse scaling only occur as a function of model parameter size, or can it also occur over the course of training? We carry out an exploratory study investigating whether, over the course of training on the language modeling task, the performance of language models at specific tasks can decrease while general performance remains high. We find that for two tasks from the Inverse Scaling Challenge - quote-repetition and redefine-math - this is indeed the case. Specifically, we find that for Pythia (Biderman et al., 2023) models with a higher number of parameters, performance decreases over the course of training at these two tasks, despite these models showing standard (positive) scaling overall. This highlights the importance of testing model performance at all relevant benchmarks any time they are trained on additional data, even if their overall performance improves.

NLP-200-标题: GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

链接: https://arxiv.org/abs/2305.14676
作者: Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu Chen, Ahmed Hassan Awadallah, Damien Jose, Xiang Ren
备注: Preprint


Abstract: Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language tasks including grounding and generation tasks has been under-explored; existing few-shot VL models struggle to handle tasks that involve object grounding and multiple images such as visual commonsense reasoning or NLVR2. In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances. Specifically, GRILL learns object grounding and localization by exploiting object-text alignments, which enables it to transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses the state-of-the-art few-shot methods.

NLP-201-标题: Quantifying Character Similarity with Vision Transformer s

链接: https://arxiv.org/abs/2305.14672
作者: Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, Melissa Dell


Abstract: Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions are more likely, that improve the accuracy of string matching. However, such lists do not exist for many settings, skewing research with linked datasets towards a few high-resource contexts that are not representative of the diversity of human societies. This study develops an extensible way to measure character substitution costs for OCR’ed documents, by employing large-scale self-supervised training of vision transformers (ViT) with augmented digital fonts. For each language written with the CJK script, we contrastively learn a metric space where different augmentations of the same character are represented nearby. In this space, homoglyphic characters - those with similar appearance such as O'' and 0’’ - have similar vector representations. Using the cosine distance between characters’ representations as the substitution cost in an edit distance matching algorithm significantly improves record linkage compared to other widely used string matching methods, as OCR errors tend to be homoglyphic in nature. Homoglyphs can plausibly capture character visual similarity across any script, including low-resource settings. We illustrate this by creating homoglyph sets for 3,000 year old ancient Chinese characters, which are highly pictorial. Fascinatingly, a ViT is able to capture relationships in how different abstract concepts were conceptualized by ancient societies, that have been noted in the archaeological literature.

NLP-202-标题: Diffusion Models in NLP: A Survey

链接: https://arxiv.org/abs/2305.14671
作者: Hao Zou, Zae Myung Kim, Dongyeop Kang


Abstract: This survey paper provides a comprehensive review of the use of diffusion models in natural language processing (NLP). Diffusion models are a class of mathematical models that aim to capture the diffusion of information or signals across a network or manifold. In NLP, diffusion models have been used in a variety of applications, such as natural language generation, sentiment analysis, topic modeling, and machine translation. This paper discusses the different formulations of diffusion models used in NLP, their strengths and limitations, and their applications. We also perform a thorough comparison between diffusion models and alternative generative models, specifically highlighting the autoregressive (AR) models, while also examining how diverse architectures incorporate the Transformer in conjunction with diffusion models. Compared to AR models, diffusion models have significant advantages for parallel generation, text interpolation, token-level controls such as syntactic structures and semantic contents, and robustness. Exploring further permutations of integrating Transformers into diffusion models would be a valuable pursuit. Also, the development of multimodal diffusion models and large-scale diffusion language models with notable capabilities for few-shot learning would be important directions for the future advance of diffusion models in NLP.

NLP-203-标题: You Are What You Annotate: Towards Better Models through Annotator Representations

链接: https://arxiv.org/abs/2305.14663
作者: Naihao Deng, Siyang Liu, Xinliang Frederick Zhang, Winston Wu, Lu Wang, Rada Mihalcea
备注: 19 pages


Abstract: Annotator disagreement is ubiquitous in natural language processing (NLP) tasks. There are multiple reasons for such disagreements, including the subjectivity of the task, difficult cases, unclear guidelines, and so on. Rather than simply aggregating labels to obtain data annotations, we instead propose to explicitly account for the annotator idiosyncrasies and leverage them in the modeling process. We create representations for the annotators (annotator embeddings) and their annotations (annotation embeddings) with learnable matrices associated with each. Our approach significantly improves model performance on various NLP benchmarks by adding fewer than 1% model parameters. By capturing the unique tendencies and subjectivity of individual annotators, our embeddings help democratize AI and ensure that AI models are inclusive of diverse viewpoints.

NLP-204-标题: Complex Mathematical Symbol Definition Structures: A Dataset and Model for Coordination Resolution in Definition Extraction

链接: https://arxiv.org/abs/2305.14660
作者: Anna Martin-Boyle, Andrew Head, Kyle Lo, Risham Sidhu, Marti A. Hearst, Dongyeop Kang
备注: 9 pages, 4 figures


Abstract: Mathematical symbol definition extraction is important for improving scholarly reading interfaces and scholarly information extraction (IE). However, the task poses several challenges: math symbols are difficult to process as they are not composed of natural language morphemes; and scholarly papers often contain sentences that require resolving complex coordinate structures. We present SymDef, an English language dataset of 5,927 sentences from full-text scientific papers where each sentence is annotated with all mathematical symbols linked with their corresponding definitions. This dataset focuses specifically on complex coordination structures such as “respectively” constructions, which often contain overlapping definition spans. We also introduce a new definition extraction method that masks mathematical symbols, creates a copy of each sentence for each symbol, specifies a target symbol, and predicts its corresponding definition spans using slot filling. Our experiments show that our definition extraction model significantly outperforms RoBERTa and other strong IE baseline systems by 10.9 points with a macro F1 score of 84.82. With our dataset and model, we can detect complex definitions in scholarly documents to make scientific writing more readable.

NLP-205-标题: InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction

链接: https://arxiv.org/abs/2305.14659
作者: Ishani Mondal, Michelle Yuan, Anandhavelu N, Aparna Garimella, Francis Ferraro, Andrew Blair-Stanek, Benjamin Van Durme, Jordan Boyd-Graber


Abstract: Learning template based information extraction from documents is a crucial yet difficult task. Prior template-based IE approaches assume foreknowledge of the domain templates; however, real-world IE do not have pre-defined schemas and it is a figure-out-as you go phenomena. To quickly bootstrap templates in a real-world setting, we need to induce template slots from documents with zero or minimal supervision. Since the purpose of question answering intersect with the goal of information extraction, we use automatic question generation to induce template slots from the documents and investigate how a tiny amount of a proxy human-supervision on-the-fly (termed as InteractiveIE) can further boost the performance. Extensive experiments on biomedical and legal documents, where obtaining training data is expensive, reveal encouraging trends of performance improvement using InteractiveIE over AI-only baseline.

NLP-206-标题: Evaluate What You Cant Evaluate: Unassessable Generated Responses Quality

链接: https://arxiv.org/abs/2305.14658
作者: Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze
备注: preprint


Abstract: LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using reference-free evaluators based on LLMs. Reference-free evaluators are more suitable for open-ended examples with different semantics responses. But not all examples are open-ended. For closed-ended examples with unique correct semantic response, reference-free evaluators will still consider it high quality when giving a response that is inconsistent with the facts and the semantic of reference. In order to comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets KdConv-ADV and DSTC7-ADV based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they requires evaluators to be able to reasonably evaluate closed-ended examples with the help of external knowledge or even its own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using eference-free evaluators based on LLMs to evaluate the quality of dialogue responses.

NLP-207-标题: Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion ACL2023

链接: https://arxiv.org/abs/2305.14652
作者: Shaoxaing Wu, Damai Dai, Ziwei Qin, Tianyu Liu, Binghuai Lin, Yunbo Cao, Zhifang Sui
备注: Accept at ACL2023


Abstract: Video multimodal fusion aims to integrate multimodal signals in videos, such as visual, audio and text, to make a complementary prediction with multiple modalities contents. However, unlike other image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both visual and audio modalities. Prior denoising methods like forget gate are coarse in the granularity of noise filtering. They often suppress the redundant and noisy information at the risk of losing critical information. Therefore, we propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism to filter out noise and redundancy with a restrained receptive field. On the other hand, we use a mutual information maximization module to regulate the filter-out module to preserve key information within different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. It proves that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available at this https URL.

NLP-208-标题: Revisit and Outstrip Entity Alignment: A Perspective of Generative Models

链接: https://arxiv.org/abs/2305.14651
作者: Lingbing Guo, Zhuo Chen, Jiaoyan Chen, Huajun Chen
备注: Under review


Abstract: Recent embedding-based methods have achieved great successes on exploiting entity alignment from knowledge graph (KG) embeddings of multiple modals. In this paper, we study embedding-based entity alignment (EEA) from a perspective of generative models. We show that EEA is a special problem where the main objective is analogous to that in a typical generative model, based on which we theoretically prove the effectiveness of the recently developed generative adversarial network (GAN)-based EEA methods. We then reveal that their incomplete objective limits the capacity on both entity alignment and entity synthesis (i.e., generating new entities). We mitigate this problem by introducing a generative EEA (abbr., GEEA) framework with the proposed mutual variational autoencoder (M-VAE) as the generative model. M-VAE can convert an entity from one KG to another and generate new entities from random noise vectors. We demonstrate the power of GEEA with theoretical analysis and empirical experiments on both entity alignment and entity synthesis tasks.

NLP-209-标题: Meta- review Generation with Checklist-guided Iterative Introspection

链接: https://arxiv.org/abs/2305.14647
作者: Qi Zeng, Mankeerat Sidhu, Hou Pong Chan, Lu Wang, Heng Ji


Abstract: Opinions in the scientific domain can be divergent, leading to controversy or consensus among reviewers. However, current opinion summarization datasets mostly focus on product review domains, which do not account for this variability under the assumption that the input opinions are non-controversial. To address this gap, we propose the task of scientific opinion summarization, where research paper reviews are synthesized into meta-reviews. To facilitate this task, we introduce a new ORSUM dataset covering 10,989 paper meta-reviews and 40,903 paper reviews from 39 conferences. Furthermore, we propose the Checklist-guided Iterative Introspection (CGI ^2 ) approach, which breaks down the task into several stages and iteratively refines the summary under the guidance of questions from a checklist. We conclude that (1) human-written summaries are not always reliable since many do not follow the guideline, and (2) the combination of task decomposition and iterative self-refinement shows promising discussion involvement ability and can be applied to other complex text generation using black-box LLM.

NLP-210-标题: Iteratively Improving Biomedical Entity Linking and Event Extraction via Hard Expectation-Maximization

链接: https://arxiv.org/abs/2305.14645
作者: Xiaochu Li, Minqian Liu, Zhiyang Xu, Lifu Huang
备注: 15 pages, 3 figures, 10 tables


Abstract: Biomedical entity linking and event extraction are two crucial tasks to support text understanding and retrieval in the biomedical domain. These two tasks intrinsically benefit each other: entity linking disambiguates the biomedical concepts by referring to external knowledge bases and the domain knowledge further provides additional clues to understand and extract the biological processes, while event extraction identifies a key trigger and entities involved to describe each biological process which also captures the structural context to better disambiguate the biomedical entities. However, previous research typically solves these two tasks separately or in a pipeline, leading to error propagation. What’s more, it’s even more challenging to solve these two tasks together as there is no existing dataset that contains annotations for both tasks. To solve these challenges, we propose joint biomedical entity linking and event extraction by regarding the event structures and entity references in knowledge bases as latent variables and updating the two task-specific models in a hard Expectation-Maximization (EM) fashion: (1) predicting the missing variables for each partially annotated dataset based on the current two task-specific models, and (2) updating the parameters of each model on the corresponding pseudo completed dataset. Experimental results on two benchmark datasets: Genia 2011 for event extraction and BC4GO for entity linking, show that our joint framework significantly improves the model for each individual task and outperforms the strong baselines for both tasks. We will make the code and model checkpoints publicly available once the paper is accepted.

NLP-211-标题: CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation ACL2023

链接: https://arxiv.org/abs/2305.14635
作者: Yan Zhou, Qingkai Fang, Yang Feng
备注: ACL 2023 main conference


Abstract: End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose Cross-modal Mixup via Optimal Transport CMOT to overcome the modality gap. We find the alignment between speech and text sequences via optimal transport and then mix up the sequences from different modalities at a token level using the alignment. Experiments on the MuST-C ST benchmark demonstrate that CMOT achieves an average BLEU of 30.0 in 8 translation directions, outperforming previous methods. Further analysis shows CMOT can adaptively find the alignment between modalities, which helps alleviate the modality gap between speech and text. Code is publicly available at this https URL.

NLP-212-标题: Testing Causal Models of Word Meaning in GPT -3 and -4

链接: https://arxiv.org/abs/2305.14630
作者: Sam Musker, Ellie Pavlick
备注: Unabridged version. Code available at this https URL


Abstract: Large Language Models (LLMs) have driven extraordinary improvements in NLP. However, it is unclear how such models represent lexical concepts-i.e., the meanings of the words they use. This paper evaluates the lexical representations of GPT-3 and GPT-4 through the lens of HIPE theory, a theory of concept representations which focuses on representations of words describing artifacts (such as “mop”, “pencil”, and “whistle”). The theory posits a causal graph that relates the meanings of such words to the form, use, and history of the objects to which they refer. We test LLMs using the same stimuli originally used by Chaigneau et al. (2004) to evaluate the theory in humans, and consider a variety of prompt designs. Our experiments concern judgements about causal outcomes, object function, and object naming. We find no evidence that GPT-3 encodes the causal structure hypothesized by HIPE, but do find evidence that GPT-4 encodes such structure. The results contribute to a growing body of research characterizing the representational capacity of large language models.

NLP-213-标题: Mixture of Prompt Experts for Generalizable and Interpretable Question Answering

链接: https://arxiv.org/abs/2305.14628
作者: Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, Jordan Boyd-Graber


Abstract: One of the ultimate quests of question answering (QA) is to deploy a system that can answer any type of question from the users, and refrain from answering when it does not know the answer. While recent advancements in scaling large language models (LLMs) brought significant improvements on various QA datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. In this paper, we first provide empirical evidence that state-of-the-art LLMs such as Codex suffer from poor generalizability on question types beyond those seen in the prompt. To address this, we propose a Mixture-of-Prompt-Experts (MOPE) system that ensembles multiple specialized LLMs. We first implement each specialized model based on the same backbone model (Codex) but with prompts optimized for different reasoning categories including factual, multihop, mathematical, and commonsense reasoning. By strategically selecting the best specialized model for each given question, our MOPE system significantly outperforms any single specialized model on a collection of 12 QA datasets from four reasoning types. Moreover, the attribution and agreement among specialized expert models offer greater interpretability, allowing for better selective question answering. Our human study further confirms that presenting the expert predictions and answer selection process helps annotators more accurately decide when to trust the system’s output. We release all code and data to facilitate future work.

NLP-214-标题: Enabling Large Language Models to Generate Text with Citations

链接: https://arxiv.org/abs/2305.14627
作者: Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen
备注: Code and data are available at this https URL


Abstract: Large language models (LLMs) have emerged as a widely-used tool for information seeking, but their generated outputs are prone to hallucination. In this work, we aim to enable LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare with different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs’ Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We build automatic metrics along three dimensions – fluency, correctness, and citation quality – and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvements – for example, on the ELI5 dataset, even the best model has 49% of its generations lacking complete citation support. Our extensive analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.

NLP-215-标题: KNN-LM Does Not Improve Open-ended Text Generation

链接: https://arxiv.org/abs/2305.14625
作者: Shufan Wang, Yixiao Song, Andrew Drozdov, Aparna Garimella, Varun Manjunatha, Mohit Iyyer


Abstract: In this paper, we study the generation quality of interpolation-based retrieval-augmented language models (LMs). These methods, best exemplified by the KNN-LM, interpolate the LM’s predicted distribution of the next word with a distribution formed from the most relevant retrievals for a given prefix. While the KNN-LM and related methods yield impressive decreases in perplexity, we discover that they do not exhibit corresponding improvements in open-ended generation quality, as measured by both automatic evaluation metrics (e.g., MAUVE) and human evaluations. Digging deeper, we find that interpolating with a retrieval distribution actually increases perplexity compared to a baseline Transformer LM for the majority of tokens in the WikiText-103 test set, even though the overall perplexity is lower due to a smaller number of tokens for which perplexity dramatically decreases after interpolation. However, when decoding a long sequence at inference time, significant improvements on this smaller subset of tokens are washed out by slightly worse predictions on most tokens. Furthermore, we discover that the entropy of the retrieval distribution increases faster than that of the base LM as the generated sequence becomes longer, which indicates that retrieval is less reliable when using model-generated text as queries (i.e., is subject to exposure bias). We hope that our analysis spurs future work on improved decoding algorithms and interpolation strategies for retrieval-augmented language models.

NLP-216-标题: Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models

链接: https://arxiv.org/abs/2305.14623
作者: Miaoran Li, Baolin Peng, Zhu Zhang


Abstract: Fact-checking is an essential task in NLP that is commonly utilized for validating the factual accuracy of claims. Prior work has mainly focused on fine-tuning pre-trained languages models on specific datasets, which can be computationally intensive and time-consuming. With the rapid development of large language models (LLMs), such as ChatGPT and GPT-3, researchers are now exploring their in-context learning capabilities for a wide range of tasks. In this paper, we aim to assess the capacity of LLMs for fact-checking by introducing Self-Checker, a framework comprising a set of plug-and-play modules that facilitate fact-checking by purely prompting LLMs in an almost zero-shot setting. This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments. Empirical results demonstrate the potential of Self-Checker in utilizing LLMs for fact-checking. However, there is still significant room for improvement compared to SOTA fine-tuned models, which suggests that LLM adoption could be a promising approach for future fact-checking research.

NLP-217-标题: EXnet: Efficient In-context Learning for Data-less Text classification

链接: https://arxiv.org/abs/2305.14622
作者: Debaditya Shome, Kuldeep Yadav


Abstract: Large pre-trained language models (PLMs) have made significant progress in encoding world knowledge and spawned a new set of learning paradigms including zero-shot, few-shot, and in-context learning. Many language tasks can be modeled as a set of prompts (for example, is this text about geography?) and language models can provide binary answers, i.e., Yes or No. There is evidence to suggest that the next-word prediction used by many PLMs does not align well with zero-shot paradigms. Therefore, PLMs are fine-tuned as a question-answering system. In-context learning extends zero-shot learning by incorporating prompts and examples, resulting in increased task accuracy. Our paper presents EXnet, a model specifically designed to perform in-context learning without any limitations on the number of examples. We argue that in-context learning is an effective method to increase task accuracy, and providing examples facilitates cross-task generalization, especially when it comes to text classification tasks. With extensive experiments, we show that even our smallest model (15M parameters) generalizes to several unseen classification tasks and domains.

NLP-218-标题: Abductive Commonsense Reasoning Exploiting Mutually Exclusive Explanations ACL’23

链接: https://arxiv.org/abs/2305.14618
作者: Wenting Zhao, Justin T. Chiu, Claire Cardie, Alexander M. Rush
备注: accepted at ACL’23


Abstract: Abductive reasoning aims to find plausible explanations for an event. This style of reasoning is critical for commonsense tasks where there are often multiple plausible explanations. Existing approaches for abductive reasoning in natural language processing (NLP) often rely on manually generated annotations for supervision; however, such annotations can be subjective and biased. Instead of using direct supervision, this work proposes an approach for abductive commonsense reasoning that exploits the fact that only a subset of explanations is correct for a given context. The method uses posterior regularization to enforce a mutual exclusion constraint, encouraging the model to learn the distinction between fluent explanations and plausible ones. We evaluate our approach on a diverse set of abductive reasoning datasets; experimental results show that our approach outperforms or is comparable to directly applying pretrained language models in a zero-shot manner and other knowledge-augmented zero-shot methods.

NLP-219-标题: COMET-M: Reasoning about Multiple Events in Complex Sentences

链接: https://arxiv.org/abs/2305.14617
作者: Sahithya Ravi, Raymond Ng, Vered Shwartz


Abstract: Understanding the speaker’s intended meaning often involves drawing commonsense inferences to reason about what is not stated explicitly. In multi-event sentences, it requires understanding the relationships between events based on contextual knowledge. We propose COMET-M (Multi-Event), an event-centric commonsense model capable of generating commonsense inferences for a target event within a complex sentence. COMET-M builds upon COMET (Bosselut et al., 2019), which excels at generating event-centric inferences for simple sentences, but struggles with the complexity of multi-event sentences prevalent in natural text. To overcome this limitation, we curate a multi-event inference dataset of 35K human-written inferences. We trained COMET-M on the human-written inferences and also created baselines using automatically labeled examples. Experimental results demonstrate the significant performance improvement of COMET-M over COMET in generating multi-event inferences. Moreover, COMET-M successfully produces distinct inferences for each target event, taking the complete context into consideration. COMET-M holds promise for downstream tasks involving natural text such as coreference resolution, dialogue, and story understanding.

NLP-220-标题: Exploring the Grounding Issues in Image Caption

链接: https://arxiv.org/abs/2305.14616
作者: Pin-Er Chen, Hsin-Yu Chou, Po-Ya Angela Wang, Yu-Hsiang Tseng, Shu-Kai Hsieh
备注: 10 pages, 10 figures


Abstract: This paper explores the grounding issue concerning multimodal semantic representation from a computational cognitive-linguistic view. Five perceptual properties of groundedness are annotated and analyzed: Affordance, Perceptual salience, Object number, Gaze cueing, and Ecological Niche Association (ENA). We annotated selected images from the Flickr30k dataset with exploratory analyses and statistical modeling of their captions. Our findings suggest that a comprehensive understanding of an object or event requires cognitive attention, semantic distinctions in linguistic expression, and multimodal construction. During this construction process, viewers integrate situated meaning and affordance into multimodal semantics, which is consolidated into image captions used in the image-text dataset incorporating visual and textual elements. Our findings suggest that situated meaning and affordance grounding are critical for grounded natural language understanding systems to generate appropriate responses and show the potential to advance the understanding of human construal in diverse situations.

NLP-221-标题: Selectively Answering Ambiguous Questions

链接: https://arxiv.org/abs/2305.14613
作者: Jeremy R. Cole, Michael J.Q. Zhang, Daniel Gillick, Julian Martin Eisenschlos, Bhuwan Dhingra, Jacob Eisenstein
备注: 10 pages, 5 figures, 2 pages of appendix


Abstract: Trustworthy language models should abstain from answering questions when they do not know the answer. However, the answer to a question can be unknown for a variety of reasons. Prior research has focused on the case in which the question is clear and the answer is unambiguous but possibly unknown. However, the answer to a question can also be unclear due to uncertainty of the questioner’s intent or context. We investigate question answering from this perspective, focusing on answering a subset of questions with a high degree of accuracy, from a set of questions in which many are inherently ambiguous. In this setting, we find that the most reliable approach to calibration involves quantifying repetition within a set of sampled model outputs, rather than the model’s likelihood or self-verification as used in prior work. % We find this to be the case across different types of uncertainty, varying model scales and both with or without instruction tuning. Our results suggest that sampling-based confidence scores help calibrate answers to relatively unambiguous questions, with more dramatic improvements on ambiguous questions.

NLP-222-标题: This Land is Your My Land: Evaluating Geopolitical Biases in Language Models

链接: https://arxiv.org/abs/2305.14610
作者: Bryan Li, Chris Callison-Burch


Abstract: We introduce the notion of geopolitical bias – a tendency to report different geopolitical knowledge depending on the linguistic context. As a case study, we consider territorial disputes between countries. For example, for the widely contested Spratly Islands, would an LM be more likely to say they belong to China if asked in Chinese, vs. to the Philippines if asked in Tagalog? To evaluate if such biases exist, we first collect a dataset of territorial disputes from Wikipedia, then associate each territory with a set of multilingual, multiple-choice questions. This dataset, termed BorderLines, consists of 250 territories with questions in 45 languages. We pose these question sets to language models, and analyze geopolitical bias in their responses through several proposed quantitative metrics. The metrics compare between responses in different question languages as well as to the actual geopolitical situation. The phenomenon of geopolitical bias is a uniquely cross-lingual evaluation, contrasting with prior work’s monolingual (mostly English) focus on bias evaluation. Its existence shows that the knowledge of LMs, unlike multilingual humans, is inconsistent across languages.

NLP-223-标题: OpenPI2.0: An Improved Dataset for Entity Tracking in Texts

链接: https://arxiv.org/abs/2305.14603
作者: Li Zhang, Hainiu Xu, Abhinav Kommula, Niket Tandon, Chris Callison-Burch


Abstract: Representing texts as information about entities has long been deemed effective in event reasoning. We propose OpenPI2.0, an improved dataset for tracking entity states in procedural texts. OpenPI2.0 features not only canonicalized entities that facilitate evaluation, but also salience annotations including both manual labels and automatic predictions. Regarding entity salience, we provide a survey on annotation subjectivity, modeling feasibility, and downstream applications in tasks such as question answering and classical planning.

NLP-224-标题: Learning Semantic Role Labeling from Compatible Label Sequences

链接: https://arxiv.org/abs/2305.14600
作者: Tao Li, Ghazaleh Kazeminejad, Susan W. Brown, Martha Palmer, Vivek Srikumar


Abstract: This paper addresses the question of how to efficiently learn from disjoint, compatible label sequences. We argue that the compatible structures between disjoint label sets help model learning and inference. We verify this hypothesis on the task of semantic role labeling (SRL), specifically, tagging a sentence with two role sequences: VerbNet arguments and PropBank arguments. Prior work has shown that cross-task interaction improves performance. However, the two tasks are still separately decoded, running the risk of generating structurally inconsistent label sequences (as per lexicons like SEMLINK). To eliminate this issue, we first propose a simple and effective setup that jointly handles VerbNet and PropBank labels as one sequence. With this setup, we show that enforcing SEMLINK constraints during decoding constantly improves the overall F1. With special input constructions, our joint model infers VerbNet arguments from PropBank arguments with over 99% accuracy. We also propose a constrained marginal model that uses SEMLINK information during training to further benefit from the large amounts of PropBank-only data. Our models achieve state-of-the-art F1’s on VerbNet and PropBank argument labeling on the CoNLL05 dataset with strong out-of-domain generalization.

NLP-225-标题: Bridging Continuous and Discrete Spaces: Interpretable Sentence Representation Learning via Compositional Operations

链接: https://arxiv.org/abs/2305.14599
作者: James Y. Huang, Wenlin Yao, Kaiqiang Song, Hongming Zhang, Muhao Chen, Dong Yu


Abstract: Traditional sentence embedding models encode sentences into vector representations to capture useful properties such as the semantic similarity between sentences. However, in addition to similarity, sentence semantics can also be interpreted via compositional operations such as sentence fusion or difference. It is unclear whether the compositional semantics of sentences can be directly reflected as compositional operations in the embedding space. To more effectively bridge the continuous embedding and discrete text spaces, we explore the plausibility of incorporating various compositional properties into the sentence embedding space that allows us to interpret embedding transformations as compositional sentence operations. We propose InterSent, an end-to-end framework for learning interpretable sentence embeddings that supports compositional sentence operations in the embedding space. Our method optimizes operator networks and a bottleneck encoder-decoder model to produce meaningful and interpretable sentence embeddings. Experimental results demonstrate that our method significantly improves the interpretability of sentence embeddings on four textual generation tasks over existing approaches while maintaining strong performance on traditional semantic similarity tasks.

NLP-226-标题: Voices of Her: Analyzing Gender Differences in the AI Publication World

链接: https://arxiv.org/abs/2305.14597
作者: Yiwen Ding, Jiarui Liu, Zhiheng Lyu, Kun Zhang, Bernhard Schoelkopf, Zhijing Jin, Rada Mihalcea


Abstract: While several previous studies have analyzed gender bias in research, we are still missing a comprehensive analysis of gender differences in the AI community, covering diverse topics and different development trends. Using the AI Scholar dataset of 78K researchers in the field of AI, we identify several gender differences: (1) Although female researchers tend to have fewer overall citations than males, this citation difference does not hold for all academic-age groups; (2) There exist large gender homophily in co-authorship on AI papers; (3) Female first-authored papers show distinct linguistic styles, such as longer text, more positive emotion words, and more catchy titles than male first-authored papers. Our analysis provides a window into the current demographic trends in our AI community, and encourages more gender equality and diversity in the future. Our code and data are at this https URL.

NLP-227-标题: Attentiveness to Answer Choices Doesnt Always Entail High QA Accuracy

链接: https://arxiv.org/abs/2305.14596
作者: Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal


Abstract: When large language models (LMs) are applied in zero- or few-shot settings to discriminative tasks such as multiple-choice questions, their attentiveness (i.e., probability mass) is spread across many vocabulary tokens that are not valid choices. Such a spread across multiple surface forms with identical meaning is thought to cause an underestimation of a model’s true performance, referred to as the “surface form competition” (SFC) hypothesis. This has motivated the introduction of various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC or attentiveness? Are there direct ways of increasing attentiveness on valid choices? Does increasing attentiveness always improve task accuracy? We propose a mathematical formalism for studying this phenomenon, provide a metric for quantifying attentiveness, and identify a simple method for increasing it – namely, in-context learning with even just one example containing answer choices. The formalism allows us to quantify SFC and bound its impact. Our experiments on three diverse datasets and six LMs reveal several surprising findings. For example, encouraging models to generate a valid answer choice can, in fact, be detrimental to task performance for some LMs, and prior probability normalization methods are less effective (sometimes even detrimental) to instruction-tuned LMs. We conclude with practical insights for effectively using prompted LMs for multiple-choice tasks.

NLP-228-标题: Instruction Tuning with Lexicons for Zero-Shot Style Classification

链接: https://arxiv.org/abs/2305.14592
作者: Ruohao Guo, Wei Xu, Alan Ritter


Abstract: Style is used to convey authors’ intentions and attitudes. Despite the success of large pre-trained language models on style classification, prior work relies on fine-tuning with labeled examples. Prompting large language models to classify style without fine-tuning is challenging because language styles can be difficult to define. In this study, we investigate the effectiveness of style lexicons as a means for instructing language models how to identify new styles that are unseen during training. Our experiments show that lexicon-based instructions improve transfer zero-shot performance significantly. We will release our code and data.

NLP-229-标题: ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers

链接: https://arxiv.org/abs/2305.14591
作者: Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, Lei Li


Abstract: Large language models (LLMs) excel at implementing code from functionality descriptions, but struggle with algorithmic problems that require not only implementation but also identification of the suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the creation and verify their correctness. ALGO first generates a probably correct but possibly slow reference oracle by prompting an LLM to exhaustively enumerate all the combinations of relevant variables. This oracle is then utilized to guide an arbitrary search strategy in exploring the algorithm space and to verify the algorithms synthesized. Our study shows that the LLM-generated oracles are correct for 88% of the cases. With the oracles as verifiers, ALGO can be integrated with any existing code generation model in a model-agnostic manner to enhance its performance. Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. We can also get 1.3x better pass rate over the ChatGPT Code Interpreter on unseen problems.

NLP-230-标题: RE2: Region-Aware Relation Extraction from Visually Rich Documents

链接: https://arxiv.org/abs/2305.14590
作者: Pritika Ramu, Sijia Wang, Lalla Mouatadid, Joy Rimchala, Lifu Huang


Abstract: Current research in form understanding predominantly relies on large pre-trained language models, necessitating extensive data for pre-training. However, the importance of layout structure (i.e., the spatial relationship between the entity blocks in the visually rich document) to relation extraction has been overlooked. In this paper, we propose REgion-Aware Relation Extraction (RE ^2 ) that leverages region-level spatial structure among the entity blocks to improve their relation prediction. We design an edge-aware graph attention network to learn the interaction between entities while considering their spatial relationship defined by their region-level representations. We also introduce a constraint objective to regularize the model towards consistency with the inherent constraints of the relation extraction task. Extensive experiments across various datasets, languages and domains demonstrate the superiority of our proposed approach.

NLP-231-标题: Evaluating end-to-end entity linking on domain-specific knowledge bases: Learning about ancient technologies from museum collections

链接: https://arxiv.org/abs/2305.14588
作者: Sebastian Cadavid-Sanchez, Khalil Kacem, Rafael Aparecido Martins Frade, Johannes Boehm, Thomas Chaney, Danial Lashkari, Daniel Simig


Abstract: To study social, economic, and historical questions, researchers in the social sciences and humanities have started to use increasingly large unstructured textual datasets. While recent advances in NLP provide many tools to efficiently process such data, most existing approaches rely on generic solutions whose performance and suitability for domain-specific tasks is not well understood. This work presents an attempt to bridge this domain gap by exploring the use of modern Entity Linking approaches for the enrichment of museum collection data. We collect a dataset comprising of more than 1700 texts annotated with 7,510 mention-entity pairs, evaluate some off-the-shelf solutions in detail using this dataset and finally fine-tune a recent end-to-end EL model on this data. We show that our fine-tuned model significantly outperforms other approaches currently available in this domain and present a proof-of-concept use case of this model. We release our dataset and our best model.

NLP-232-标题: Contextualized Topic Coherence Metrics

链接: https://arxiv.org/abs/2305.14587
作者: Hamed Rahimi, Jacob Louis Hoover, David Mimno, Hubert Naacke, Camelia Constantin, Bernd Amann


Abstract: The recent explosion in work on neural topic modeling has been criticized for optimizing automated topic evaluation metrics at the expense of actual meaningful topic identification. But human annotation remains expensive and time-consuming. We propose LLM-based methods inspired by standard human topic evaluations, in a family of metrics called Contextualized Topic Coherence (CTC). We evaluate both a fully automated version as well as a semi-automated CTC that allows human-centered evaluation of coherence while maintaining the efficiency of automated methods. We evaluate CTC relative to five other metrics on six topic models and find that it outperforms automated topic coherence methods, works well on short documents, and is not susceptible to meaningless but high-scoring topics.

NLP-233-标题: Making the Implicit Explicit: Implicit Content as a First Class Citizen in NLP

链接: https://arxiv.org/abs/2305.14583
作者: Alexander Hoyle, Rupak Sarkar, Pranav Goel, Philip Resnik


Abstract: Language is multifaceted. A given utterance can be re-expressed in equivalent forms, and its implicit and explicit content support various logical and pragmatic inferences. When processing an utterance, we consider these different aspects, as mediated by our interpretive goals – understanding that “it’s dark in here” may be a veiled direction to turn on a light. Nonetheless, NLP methods typically operate over the surface form alone, eliding this nuance. In this work, we represent language with language, and direct an LLM to decompose utterances into logical and plausible inferences. The reduced complexity of the decompositions makes them easier to embed, opening up novel applications. Variations on our technique lead to state-of-the-art improvements on sentence embedding benchmarks, a substantive application in computational political science, and to a novel construct-discovery process, which we validate with human annotations.

NLP-234-标题: Evaluating OpenAIs Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person

链接: https://arxiv.org/abs/2305.14580
作者: Lucas Rafael Stefanel Gris, Ricardo Marcacini, Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Sandra Maria Aluísio


Abstract: Automatic speech recognition (ASR) systems play a key role in applications involving human-machine interactions. Despite their importance, ASR models for the Portuguese language proposed in the last decade have limitations in relation to the correct identification of punctuation marks in automatic transcriptions, which hinder the use of transcriptions by other systems, models, and even by humans. However, recently Whisper ASR was proposed by OpenAI, a general-purpose speech recognition model that has generated great expectations in dealing with such limitations. This chapter presents the first study on the performance of Whisper for punctuation prediction in the Portuguese language. We present an experimental evaluation considering both theoretical aspects involving pausing points (comma) and complete ideas (exclamation, question, and fullstop), as well as practical aspects involving transcript-based topic modeling - an application dependent on punctuation marks for promising performance. We analyzed experimental results from videos of Museum of the Person, a virtual museum that aims to tell and preserve people’s life histories, thus discussing the pros and cons of Whisper in a real-world scenario. Although our experiments indicate that Whisper achieves state-of-the-art results, we conclude that some punctuation marks require improvements, such as exclamation, semicolon and colon.

NLP-235-标题: Connecting the Dots: What Graph-Based Text Representations Work Best for Text Classification using Graph Neural Networks?

链接: https://arxiv.org/abs/2305.14578
作者: Margarita Bugueño, Gerard de Melo
备注: 17 pages, 2 figures, 15 tables. The Appendix starts on page 12


Abstract: Given the success of Graph Neural Networks (GNNs) for structure-aware machine learning, numerous studies have explored their application to text classification, as an alternative to traditional feature representation models. However, most studies considered just a specific domain and validated on data with particular characteristics. This work presents an extensive empirical investigation of graph-based text representation methods proposed for text classification, identifying practical implications and open challenges in the field. We compare several GNN architectures as well as BERT across five datasets, encompassing short and also long documents. The results show that: i) graph performance is highly related to the textual input features and domain, ii) despite its outstanding performance, BERT has difficulties converging when dealing with short texts, iii) graph methods are particularly beneficial for longer documents.

NLP-236-标题: Difference-Masking: Choosing What to Mask in Continued Pretrain ing

链接: https://arxiv.org/abs/2305.14577
作者: Alex Wilf, Syeda Nahida Akter, Leena Mathur, Paul Pu Liang, Sheryl Mathew, Mengrou Shou, Eric Nyberg, Louis-Philippe Morency


Abstract: Self-supervised learning (SSL) and the objective of masking-and-predicting in particular have led to promising SSL performance on a variety of downstream tasks. However, while most approaches randomly mask tokens, there is strong intuition from the field of education that deciding what to mask can substantially improve learning outcomes. We introduce Difference-Masking, an approach that automatically chooses what to mask during continued pretraining by considering what makes an unlabelled target domain different from the pretraining domain. Empirically, we find that Difference-Masking outperforms baselines on continued pretraining settings across four diverse language and multimodal video tasks. The cross-task applicability of Difference-Masking supports the effectiveness of our framework for SSL pretraining in language, vision, and other domains.

NLP-237-标题: Parameter-Efficient Language Model Tuning with Active Learning in Low-Resource Settings

链接: https://arxiv.org/abs/2305.14576
作者: Josip Jukić, Jan Šnajder


Abstract: Pre-trained language models (PLMs) have ignited a surge in demand for effective fine-tuning techniques, particularly in low-resource domains and languages. Active learning (AL), a set of algorithms designed to decrease labeling costs by minimizing label complexity, has shown promise in confronting the labeling bottleneck. Concurrently, adapter modules, designed for parameter-efficient fine-tuning (PEFT), have showcased notable potential in low-resource settings. However, the interplay between AL and adapter-based PEFT remains unexplored. In our study, we empirically investigate PEFT behavior with AL in low-resource settings for text classification tasks. Our findings affirm the superiority of PEFT over full-fine tuning (FFT) in low-resource settings and demonstrate that this advantage persists in AL setups. Finally, we delve into the properties of PEFT and FFT through the lens of forgetting dynamics and instance-level representations, linking them to AL instance selection behavior and the stability of PEFT. Our research underscores the synergistic potential of AL, PEFT, and TAPT in low-resource settings, paving the way for advancements in efficient and effective fine-tuning.

NLP-238-标题: Detecting and Mitigating Indirect Stereotypes in Word Embeddings

链接: https://arxiv.org/abs/2305.14574
作者: Erin George, Joyce Chew, Deanna Needell
备注: 15 pages


Abstract: Societal biases in the usage of words, including harmful stereotypes, are frequently learned by common word embedding methods. These biases manifest not only between a word and an explicit marker of its stereotype, but also between words that share related stereotypes. This latter phenomenon, sometimes called "indirect bias,‘’ has resisted prior attempts at debiasing. In this paper, we propose a novel method called Biased Indirect Relationship Modification (BIRM) to mitigate indirect bias in distributional word embeddings by modifying biased relationships between words before embeddings are learned. This is done by considering how the co-occurrence probability of a given pair of words changes in the presence of words marking an attribute of bias, and using this to average out the effect of a bias attribute. To evaluate this method, we perform a series of common tests and demonstrate that measures of bias in the word embeddings are reduced in exchange for minor reduction in the semantic quality of the embeddings. In addition, we conduct novel tests for measuring indirect stereotypes by extending the Word Embedding Association Test (WEAT) with new test sets for indirect binary gender stereotypes. With these tests, we demonstrate the presence of more subtle stereotypes not addressed by previous work. The proposed method is able to reduce the presence of some of these new stereotypes, serving as a crucial next step towards non-stereotyped word embeddings.

NLP-239-标题: From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding ACL2023

链接: https://arxiv.org/abs/2305.14571
作者: Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, Cha Zhang
备注: Accepted to ACL 2023 Main Conference


Abstract: Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process known as tokenization relies on a pre-built vocabulary of words or sub-word morphemes. This fixed vocabulary limits the model’s robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach: one at the word level and another at the sequence level. Concretely, we design an intra-word module that uses a shallow Transformer architecture to learn word representations from their characters, and a deep inter-word Transformer module that contextualizes each word representation by attending to the entire word sequence. Our model thus directly operates on character sequences with explicit awareness of word boundaries, but without biased sub-word or word-level vocabulary. Experiments on various downstream tasks show that our method outperforms strong baselines. We also demonstrate that our hierarchical model is robust to textual corruption and domain shift.

NLP-240-标题: Few-shot Unified Question Answering: Tuning Models or Prompt s?

链接: https://arxiv.org/abs/2305.14569
作者: Srijan Bansal, Semih Yavuz, Bo Pang, Meghana Bhat, Yingbo Zhou


Abstract: Question-answering (QA) tasks often investigate specific question types, knowledge domains, or reasoning skills, leading to specialized models catering to specific categories of QA tasks. While recent research has explored the idea of unified QA models, such models are usually explored for high-resource scenarios and require re-training to extend their capabilities. To overcome these drawbacks, the paper explores the potential of two paradigms of tuning, model, and prompts, for unified QA under a low-resource setting. The paper provides an exhaustive analysis of their applicability using 16 QA datasets, revealing that prompt tuning can perform as well as model tuning in a few-shot setting with a good initialization. The study also shows that parameter-sharing results in superior few-shot performance, simple knowledge transfer techniques for prompt initialization can be effective, and prompt tuning achieves a significant performance boost from pre-training in a low-resource regime. The research offers insights into the advantages and limitations of prompt tuning for unified QA in a few-shot setting, contributing to the development of effective and efficient systems in low-resource scenarios.

NLP-241-标题: PEARL: Prompt ing Large Language Models to Plan and Execute Actions Over Long Documents

链接: https://arxiv.org/abs/2305.14564
作者: Simeng Sun, Yang Liu, Shuohang Wang, Chenguang Zhu, Mohit Iyyer


Abstract: Strategies such as chain-of-thought prompting improve the performance of large language models (LLMs) on complex reasoning tasks by decomposing input examples into intermediate steps. However, it remains unclear how to apply such methods to reason over long input documents, in which both the decomposition and the output of each intermediate step are non-trivial to obtain. In this work, we propose PEARL, a prompting framework to improve reasoning over long documents, which consists of three stages: action mining, plan formulation, and plan execution. More specifically, given a question about a long document, PEARL decomposes the question into a sequence of actions (e.g., SUMMARIZE, FIND_EVENT, FIND_RELATION) and then executes them over the document to obtain the answer. Each stage of PEARL is implemented via zero-shot or few-shot prompting of LLMs (in our work, GPT-4) with minimal human input. We evaluate PEARL on a challenging subset of the QuALITY dataset, which contains questions that require complex reasoning over long narrative texts. PEARL outperforms zero-shot and chain-of-thought prompting on this dataset, and ablation experiments show that each stage of PEARL is critical to its performance. Overall, PEARL is a first step towards leveraging LLMs to reason over long documents.

NLP-242-标题: Unraveling ChatGPT : A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations

链接: https://arxiv.org/abs/2305.14556
作者: Tiziano Labruna, Sofia Brenna, Andrea Zaninello, Bernardo Magnini


Abstract: Large pre-trained language models have exhibited unprecedented capabilities in producing high-quality text via prompting techniques. This fact introduces new possibilities for data collection and annotation, particularly in situations where such data is scarce, complex to gather, expensive, or even sensitive. In this paper, we explore the potential of these models to generate and annotate goal-oriented dialogues, and conduct an in-depth analysis to evaluate their quality. Our experiments employ ChatGPT, and encompass three categories of goal-oriented dialogues (task-oriented, collaborative, and explanatory), two generation modes (interactive and one-shot), and two languages (English and Italian). Based on extensive human-based evaluations, we demonstrate that the quality of generated dialogues and annotations is on par with those generated by humans.

NLP-243-标题: All Roads Lead to Rome? Exploring the Invariance of Transformer s Representations

链接: https://arxiv.org/abs/2305.14555
作者: Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell


Abstract: Transformer models bring propelling advances in various NLP tasks, thus inducing lots of interpretability research on the learned representations of the models. However, we raise a fundamental question regarding the reliability of the representations. Specifically, we investigate whether transformers learn essentially isomorphic representation spaces, or those that are sensitive to the random seeds in their pretraining process. In this work, we formulate the Bijection Hypothesis, which suggests the use of bijective methods to align different models’ representation spaces. We propose a model based on invertible neural networks, BERT-INN, to learn the bijection more effectively than other existing bijective methods such as the canonical correlation analysis (CCA). We show the advantage of BERT-INN both theoretically and through extensive experiments, and apply it to align the reproduced BERT embeddings to draw insights that are meaningful to the interpretability research. Our code is at this https URL.

NLP-244-标题: Sources of Hallucination by Large Language Models on Inference Tasks

链接: https://arxiv.org/abs/2305.14552
作者: Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, Mark Steedman


Abstract: Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization, yet this capability is under-explored. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two factors which predict much of their performance, and propose that these are major sources of hallucination in generative LLM. First, the most influential factor is memorization of the training data. We show that models falsely label NLI test samples as entailing when the hypothesis is attested in the training text, regardless of the premise. We further show that named entity IDs are used as “indices” to access the memorized data. Second, we show that LLMs exploit a further corpus-based heuristic using the relative frequencies of words. We show that LLMs score significantly worse on NLI test samples which do not conform to these factors than those which do; we also discuss a tension between the two factors, and a performance trade-off.

NLP-245-标题: Interpretable Automatic Fine-grained Inconsistency Detection in Text Summarization ACL

链接: https://arxiv.org/abs/2305.14548
作者: Hou Pong Chan, Qi Zeng, Heng Ji
备注: Accepted by ACL Findings 2023. Code and data are available at this https URL


Abstract: Existing factual consistency evaluation approaches for text summarization provide binary predictions and limited insights into the weakness of summarization systems. Therefore, we propose the task of fine-grained inconsistency detection, the goal of which is to predict the fine-grained types of factual errors in a summary. Motivated by how humans inspect factual inconsistency in summaries, we propose an interpretable fine-grained inconsistency detection model, FineGrainFact, which explicitly represents the facts in the documents and summaries with semantic frames extracted by semantic role labeling, and highlights the related semantic frames to predict inconsistency. The highlighted semantic frames help verify predicted error types and correct inconsistent summaries. Experiment results demonstrate that our model outperforms strong baselines and provides evidence to support or refute the summary.

NLP-246-标题: LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

链接: https://arxiv.org/abs/2305.14540
作者: Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu


Abstract: With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs’ ability to reason about facts and detect inconsistencies when they occur.

NLP-247-标题: Cascaded Beam Search: Plug-and-Play Terminology-Forcing For Neural Machine Translation

链接: https://arxiv.org/abs/2305.14538
作者: Frédéric Odermatt, Béni Egressy, Roger Wattenhofer
备注: 14 pages, 7 figures


Abstract: This paper presents a plug-and-play approach for translation with terminology constraints. Terminology constraints are an important aspect of many modern translation pipelines. In both specialized domains and newly emerging domains (such as the COVID-19 pandemic), accurate translation of technical terms is crucial. Recent approaches often train models to copy terminologies from the input into the output sentence by feeding the target terminology along with the input. But this requires expensive training whenever the underlying language model is changed or the system should specialize to a new domain. We propose Cascade Beam Search, a plug-and-play terminology-forcing approach that requires no training. Cascade Beam Search has two parts: 1) logit manipulation to increase the probability of target terminologies and 2) a cascading beam setup based on grid beam search, where beams are grouped by the number of terminologies they contain. We evaluate the performance of our approach by competing against the top submissions of the WMT21 terminology translation task. Our plug-and-play approach performs on par with the winning submissions without using a domain-specific language model and with no additional training.

NLP-248-标题: MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems

链接: https://arxiv.org/abs/2305.14536
作者: Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan
备注: Jakub Macina, Nico Daheim, and Sankalan Pal Chowdhury contributed equally to this work. Code and dataset available: this https URL


Abstract: Although automatic dialogue tutors hold great potential in making education personalized and more accessible, research on such systems has been hampered by a lack of sufficiently large and high-quality datasets. However, collecting such datasets remains challenging, as recording tutoring sessions raises privacy concerns and crowdsourcing leads to insufficient data quality. To address this problem, we propose a framework to semi-synthetically generate such dialogues by pairing real teachers with a large language model (LLM) scaffolded to represent common student errors. In this paper, we describe our ongoing efforts to use this framework to collect MathDial, a dataset of currently ca. 1.5k tutoring dialogues grounded in multi-step math word problems. We show that our dataset exhibits rich pedagogical properties, focusing on guiding students using sense-making questions to let them explore problems. Moreover, we outline that MathDial and its grounding annotations can be used to finetune language models to be more effective tutors (and not just solvers) and highlight remaining challenges that need to be addressed by the research community. We will release our dataset publicly to foster research in this socially important area of NLP.

NLP-249-标题: Detecting Propaganda Techniques in Code-Switched Social Media Text

链接: https://arxiv.org/abs/2305.14534
作者: Muhammad Umar Salman, Asif Hanif, Shady Shehata, Preslav Nakov


Abstract: Propaganda is a form of communication intended to influence the opinions and the mindset of the public to promote a particular agenda. With the rise of social media, propaganda has spread rapidly, leading to the need for automatic propaganda detection systems. Most work on propaganda detection has focused on high-resource languages, such as English, and little effort has been made to detect propaganda for low-resource languages. Yet, it is common to find a mix of multiple languages in social media communication, a phenomenon known as code-switching. Code-switching combines different languages within the same text, which poses a challenge for automatic systems. With this in mind, here we propose the novel task of detecting propaganda techniques in code-switched text. To support this task, we create a corpus of 1,030 texts code-switching between English and Roman Urdu, annotated with 20 propaganda techniques, which we make publicly available. We perform a number of experiments contrasting different experimental setups, and we find that it is important to model the multilinguality directly (rather than using translation) as well as to use the right fine-tuning strategy. The code and the dataset are publicly available at this https URL

NLP-250-标题: How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation

链接: https://arxiv.org/abs/2305.14533
作者: Huda Khayrallah, Zuhaib Akhtar, Edward Cohen, João Sedoc


Abstract: We release MMSMR, a Massively Multi-System MultiReference dataset to enable future work on metrics and evaluation for dialog. Automatic metrics for dialogue evaluation should be robust proxies for human judgments; however, the verification of robustness is currently far from satisfactory. To quantify the robustness correlation and understand what is necessary in a test set, we create and release an 8-reference dialog dataset by extending single-reference evaluation sets and introduce this new language learning conversation dataset. We then train 1750 systems and evaluate them on our novel test set and the DailyDialog dataset. We release the novel test set, and model hyper parameters, inference outputs, and metric scores for each system on a variety of datasets.

NLP-251-标题: Eliminating Spurious Correlations from Pre-trained Models via Data Mixing

链接: https://arxiv.org/abs/2305.14521
作者: Yihao Xue, Ali Payani, Yu Yang, Baharan Mirzasoleiman


Abstract: Machine learning models pre-trained on large datasets have achieved remarkable convergence and robustness properties. However, these models often exploit spurious correlations between certain attributes and labels, which are prevalent in the majority of examples within specific categories but are not predictive of these categories in general. The learned spurious correlations may persist even after fine-tuning on new data, which degrades models’ performance on examples that do not exhibit the spurious correlation. In this work, we propose a simple and highly effective method to eliminate spurious correlations from pre-trained models. The key idea of our method is to leverage a small set of examples with spurious attributes, and balance the spurious attributes across all classes via data mixing. We theoretically confirm the effectiveness of our method, and empirically demonstrate its state-of-the-art performance on various vision and NLP tasks, including eliminating spurious correlations from pre-trained ResNet50 on Waterbirds and CelebA, adversarially pre-trained ResNet50 on ImageNet, and BERT pre-trained on CivilComments.

NLP-252-标题: Deduction under Perturbed Evidence: Probing Student Simulation Capabilities of Large Language Models

链接: https://arxiv.org/abs/2305.14507
作者: Shashank Sonkar, Richard G. Baraniuk


Abstract: We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must reason over manipulated or falsified evidence present in their prompts, which can result in false conclusions that are valid only under the manipulated evidence. Our goal with DUPE is to determine whether LLMs can arrive at these false conclusions and identify whether the dominant factor influencing the deduction process is the encoded data in the parameters or the manipulated evidence in the prompts. To evaluate the DUPE capabilities of LLMs, we create a DUPEd version of the StrategyQA dataset, where facts are manipulated to reverse the answer to the question. Our findings show that even the most advanced GPT models struggle to reason on manipulated facts - showcasing poor DUPE skills - with accuracy dropping by 45% compared to the original dataset. We also investigate prompt settings inspired from student simulation models, which mitigate the accuracy drop to some extent. Our findings have practical implications for understanding the performance of LLMs in real-world applications such as student simulation models that involve reasoning over inaccurate information.

NLP-253-标题: RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning

链接: https://arxiv.org/abs/2305.14502
作者: Alexander Scarlatos, Andrew Lan


Abstract: Many recent developments in large language models focus on prompting them to perform specific tasks. One effective prompting method is in-context learning, where the model performs a (possibly new) generation/prediction task given one (or more) examples. Past work has shown that the choice of examples can make a large impact on task performance. However, finding good examples is not straightforward since the definition of a representative group of examples can vary greatly depending on the task. While there are many existing methods for selecting in-context examples, they generally score examples independently, ignoring the dependency between them and the order in which they are provided to the large language model. In this work, we propose Retrieval for In-Context Learning (RetICL), a learnable method for modeling and optimally selecting examples sequentially for in-context learning. We frame the problem of sequential example selection as a Markov decision process, design an example retriever model using an LSTM, and train it using proximal policy optimization (PPO). We validate RetICL on math problem solving datasets and show that it outperforms both heuristic and learnable baselines, and achieves state-of-the-art accuracy on the TabMWP dataset. We also use case studies to show that RetICL implicitly learns representations of math problem solving strategies.

NLP-254-标题: NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders

链接: https://arxiv.org/abs/2305.14499
作者: Livio Baldini Soares, Daniel Gillick, Jeremy R. Cole, Tom Kwiatkowski


Abstract: Neural document rerankers are extremely effective in terms of accuracy. However, the best models require dedicated hardware for serving, which is costly and often not feasible. To avoid this serving-time requirement, we present a method of capturing up to 86% of the gains of a Transformer cross-attention model with a lexicalized scoring function that only requires 10-6% of the Transformer’s FLOPs per document and can be served using commodity CPUs. When combined with a BM25 retriever, this approach matches the quality of a state-of-the art dual encoder retriever, that still requires an accelerator for query encoding. We introduce NAIL (Non-Autoregressive Indexing with Language models) as a model architecture that is compatible with recent encoder-decoder and decoder-only large language models, such as T5, GPT-3 and PaLM. This model architecture can leverage existing pre-trained checkpoints and can be fine-tuned for efficiently constructing document representations that do not require neural processing of queries.

NLP-255-标题: Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement

链接: https://arxiv.org/abs/2305.14497
作者: Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Tao Gui, Qi Zhang, Xuanjing Huang
备注: Preprint


Abstract: Prompting methods such as Chain-of-Thought (CoT) have shed new light on enhancing the reasoning capabilities of large language models, and researchers have extensively explored the generation process of rationales and answers. However, they have overlooked the potential challenges posed by the poor quality of reasoning problems, which may influence the reasoning performance significantly. In this work, we propose Self-Polish (SP), a novel method that facilitates the model’s problem-solving process by prompting them to progressively refine the given problems to be more comprehensible and solvable. Specifically, the method teaches models to eliminate irrelevant information, rearrange the logic structure and organize local conditions into new ones parallelly. SP is orthogonal to all other prompting methods, making it convenient to integrate with state-of-the-art techniques for further improvement. We conduct thorough experiments on five benchmarks to illustrate the effectiveness of the proposed method. For example, with Text-davinci-003, our method boosts the performance of standard few-shot prompting by 8.0% on GSM8K and 17.8% on MultiArith; it also improves the performance of CoT by 6.0% on GSM8K and 6.0% on MathQA, respectively. Furthermore, our method also showcases impressive performance on robustness evaluation.

NLP-256-标题: Prompt position really matters in few-shot and zero-shot NLU tasks

链接: https://arxiv.org/abs/2305.14493
作者: Junyu Mao, Stuart E. Middleton, Mahesan Niranjan


Abstract: Prompt-based models have made remarkable advancements in the fields of zero-shot and few-shot learning, attracting a lot of attention from researchers. Developing an effective prompt template plays a critical role. However, prior studies have mainly focused on prompt vocabulary selection or embedding initialization with the reserved prompt position fixed. In this empirical study, we conduct the most comprehensive analysis to date of prompt position option for natural language understanding tasks. Our findings quantify the substantial impact prompt position has on model performance. We observe that the prompt position used in prior studies is often sub-optimal for both zero-shot and few-shot settings. These findings suggest prompt position optimisation as an interesting research direction alongside the existing focus on prompt engineering.

NLP-257-标题: Sociocultural Norm Similarities and Differences via Situational Alignment and Explainable Textual Entailment

链接: https://arxiv.org/abs/2305.14492
作者: Sky CH-Wang, Arkadiy Saakyan, Oliver Li, Zhou Yu, Smaranda Muresan


Abstract: Designing systems that can reason across cultures requires that they are grounded in the norms of the contexts in which they operate. However, current research on developing computational models of social norms has primarily focused on American society. Here, we propose a novel approach to discover and compare descriptive social norms across Chinese and American cultures. We demonstrate our approach by leveraging discussions on a Chinese Q&A platform-Zhihu-and the existing SocialChemistry dataset as proxies for contrasting cultural axes, align social situations cross-culturally, and extract social norms from texts using in-context learning. Embedding Chain-of-Thought prompting in a human-AI collaborative framework, we build a high-quality dataset of 3,069 social norms aligned with social situations across Chinese and American cultures alongside corresponding free-text explanations. To test the ability of models to reason about social norms across cultures, we introduce the task of explainable social norm entailment, showing that existing models under 3B parameters have significant room for improvement in both automatic and human evaluation. Further analysis of cross-cultural norm differences based on our dataset shows empirical alignment with the social orientations framework, revealing several situational and descriptive nuances in norms across these cultures.

NLP-258-标题: Are Large Language Models Robust Zero-shot Coreference Resolvers?

链接: https://arxiv.org/abs/2305.14489
作者: Nghia T. Le, Alan Ritter


Abstract: Recent progress in domain adaptation for coreference resolution relies on continued training using annotated data from target domains. At the same time, pre-trained large language models (LMs) have exhibited strong zero- and few-shot learning abilities across a wide range of NLP tasks including pronoun resolution. While this demonstrates evidence of coreference ability, previous work has mostly studied this ability using simple sentence-level datasets such as the Winograd Schema Challenge. In this work, we assess the feasibility of zero-shot learning for coreference resolution by evaluating instruction-tuned language models on more difficult, linguistically-complex coreference benchmarks (e.g., CoNLL-2012). We demonstrate that zero-shot prompting outperforms current unsupervised coreference systems. Further investigations reveal the robust zero-shot generalization ability of instruction-tuned LMs across a wide range of domains, languages, and time periods, as well as a strong reliance on high-quality mention detection systems.

NLP-259-标题: Language Model Self-improvement by Reinforcement Learning Contemplation

链接: https://arxiv.org/abs/2305.14483
作者: Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, Yang Yu


Abstract: Large Language Models (LLMs) have exhibited remarkable performance across various natural language processing (NLP) tasks. However, fine-tuning these models often necessitates substantial supervision, which can be expensive and time-consuming to obtain. This paper introduces a novel unsupervised method called LanguageModel Self-Improvement by Reinforcement Learning Contemplation (SIRLC) that improves LLMs without reliance on external labels. Our approach is grounded in the observation that it is simpler for language models to assess text quality than to generate text. Building on this insight, SIRLC assigns LLMs dual roles as both student and teacher. As a student, the LLM generates answers to unlabeled questions, while as a teacher, it evaluates the generated text and assigns scores accordingly. The model parameters are updated using reinforcement learning to maximize the evaluation score. We demonstrate that SIRLC can be applied to various NLP tasks, such as reasoning problems, text generation, and machine translation. Our experiments show that SIRLC effectively improves LLM performance without external supervision, resulting in a 5.6% increase in answering accuracy for reasoning tasks and a rise in BERTScore from 0.82 to 0.86 for translation tasks. Furthermore, SIRLC can be applied to models of different sizes, showcasing its broad applicability.

NLP-260-标题: Is a Prestigious Job the same as a Prestigious Country? A Case Study on Multilingual Sentence Embeddings and European Countries

链接: https://arxiv.org/abs/2305.14482
作者: Jindřich Libovický
备注: 10 pages, 1 figure


Abstract: We study how multilingual sentence representations capture European countries and how this differs across European languages. We prompt the models with templated sentences that we machine-translate into 12 European languages and analyze the most prominent dimensions in the embeddings. Our analysis reveals that the most prominent country feature in the embedding is its economic strength in terms of GPD. When prompted specifically for job prestige, the embedding space clearly distinguishes high and low-prestige jobs. The occupational dimension is uncorrelated with the most dominant country dimensions for three out of four studied models. One model: Distilled Multilingual Universal Sentence Encoder, however, exhibited a connection between occupational prestige and country of origin, which is a potential source of nationality-based discrimination. Our findings are consistent across languages and, to some extent, with the exception mentioned above, across studied representation models.

NLP-261-标题: FOCUS: Effective Embedding Initialization for Specializing Pretrain ed Multilingual Models on a Single Language

链接: https://arxiv.org/abs/2305.14481
作者: Konstantin Dobler, Gerard de Melo


Abstract: Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models in low-resource languages. To accommodate the new language, the pretrained vocabulary and embeddings need to be adapted. Previous work on embedding initialization for such adapted vocabularies has mostly focused on monolingual source models. In this paper, we investigate the multilingual source model setting and propose FOCUS - Fast Overlapping Token Combinations Using Sparsemax, a novel embedding initialization method that outperforms previous work when adapting XLM-R. FOCUS represents newly added tokens as combinations of tokens in the overlap of the pretrained and new vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary token embedding space. Our implementation of FOCUS is publicly available on GitHub.

NLP-262-标题: BAND: Biomedical Alert News Dataset

链接: https://arxiv.org/abs/2305.14480
作者: Zihao Fu, Meiru Zhang, Zaiqiao Meng, Yannan Shen, Anya Okhmatovskaia, David Buckeridge, Nigel Collier


Abstract: Infectious disease outbreaks continue to pose a significant threat to human health and well-being. To improve disease surveillance and understanding of disease spread, several surveillance systems have been developed to monitor daily news alerts and social media. However, existing systems lack thorough epidemiological analysis in relation to corresponding alerts or news, largely due to the scarcity of well-annotated reports data. To address this gap, we introduce the Biomedical Alert News Dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions. These questions necessitate the model’s expert reasoning abilities, thereby offering valuable insights into the outbreak of the disease. The BAND dataset brings new challenges to the NLP world, requiring better disguise capability of the content and the ability to infer important information. We provide several benchmark tasks, including Named Entity Recognition (NER), Question Answering (QA), and Event Extraction (EE), to show how existing models are capable of handling these tasks in the epidemiology domain. To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions, making it a valuable resource for epidemiologists and NLP researchers alike.

NLP-263-标题: CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains

链接: https://arxiv.org/abs/2305.14471
作者: Xuanyu Zhang, Bingbing Li, Qing Yang


Abstract: Generative chat models, such as ChatGPT and GPT-4, have revolutionized natural language generation (NLG) by incorporating instructions and human feedback to achieve significant performance improvements. However, the lack of standardized evaluation benchmarks for chat models, particularly for Chinese and domain-specific models, hinders their assessment and progress. To address this gap, we introduce the Chinese Generative Chat Evaluation (CGCE) benchmark, focusing on general and financial domains. The CGCE benchmark encompasses diverse tasks, including 200 questions in the general domain and 150 specific professional questions in the financial domain. Manual scoring evaluates factors such as accuracy, coherence, expression clarity, and completeness. The CGCE benchmark provides researchers with a standardized framework to assess and compare Chinese generative chat models, fostering advancements in NLG research.

NLP-264-标题: Run Like a Girl! Sports-Related Gender Bias in Language and Vision

链接: https://arxiv.org/abs/2305.14468
作者: Sophia Harrison, Eleonora Gualdoni, Gemma Boleda


Abstract: Gender bias in Language and Vision datasets and models has the potential to perpetuate harmful stereotypes and discrimination. We analyze gender bias in two Language and Vision datasets. Consistent with prior work, we find that both datasets underrepresent women, which promotes their invisibilization. Moreover, we hypothesize and find that a bias affects human naming choices for people playing sports: speakers produce names indicating the sport (e.g. ‘tennis player’ or ‘surfer’) more often when it is a man or a boy participating in the sport than when it is a woman or a girl, with an average of 46% vs. 35% of sports-related names for each gender. A computational model trained on these naming data reproduces the bias. We argue that both the data and the model result in representational harm against women.

NLP-265-标题: Towards Massively Multi-domain Multilingual Readability Assessment

链接: https://arxiv.org/abs/2305.14463
作者: Tarek Naous, Michael J. Ryan, Mohit Chandra, Wei Xu


Abstract: We present ReadMe++, a massively multi-domain multilingual dataset for automatic readability assessment. Prior work on readability assessment has been mostly restricted to the English language and one or two text domains. Additionally, the readability levels of sentences used in many previous datasets are assumed on the document-level other than sentence-level, which raises doubt about the quality of previous evaluations. We address those gaps in the literature by providing an annotated dataset of 6,330 sentences in Arabic, English, and Hindi collected from 64 different domains of text. Unlike previous datasets, ReadMe++ offers more domain and language diversity and is manually annotated at a sentence level using the Common European Framework of Reference for Languages (CEFR) and through a Rank-and-Rate annotation framework that reduces subjectivity in annotation. Our experiments demonstrate that models fine-tuned using ReadMe++ achieve strong cross-lingual transfer capabilities and generalization to unseen domains. ReadMe++ will be made publicly available to the research community.

NLP-266-标题: Enhancing Generation through Summarization Duality and Explicit Outline Control

链接: https://arxiv.org/abs/2305.14459
作者: Yunzhe Li, Qian Chen, Weixiang Yan, Wen Wang, Qinglin Zhang, Hari Sundaram
备注: 6 pages


Abstract: Automatically open-ended long text generation poses significant challenges due to semantic incoherence and plot implausibility. Previous works usually alleviate this problem through outlines in the form of short phrases or abstractive signals by designing unsupervised tasks, which tend to be unstable and weakly interpretable. Assuming that a summary serves as a mature outline, we introduce a two-stage, summary-enhanced outline supervised generation framework. This framework leverages the dual characteristics of the summarization task to improve outline prediction, resulting in more explicit and plausible outlines. Furthermore, we identify an underutilization issue in outline-based generation with both standard pretrained language models (e.g., GPT-2, BART) and large language models (e.g., Vicuna, ChatGPT). To address this, we propose a novel explicit outline control method for more effective utilization of generated outlines.

NLP-267-标题: Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA

链接: https://arxiv.org/abs/2305.14458
作者: David Heineman, Yao Dou, Mounica Maddela, Wei Xu


Abstract: Large language models (e.g., GPT-3.5) are uniquely capable of producing highly rated text simplification, yet current human evaluation methods fail to provide a clear understanding of systems’ specific strengths and weaknesses. To address this limitation, we introduce SALSA, an edit-based human annotation framework that enables holistic and fine-grained text simplification evaluation. We develop twenty one linguistically grounded edit types, covering the full spectrum of success and failure across dimensions of conceptual, syntactic and lexical simplicity. Using SALSA, we collect 12K edit annotations on 700 simplifications, revealing discrepancies in the distribution of transformation approaches performed by fine-tuned models, few-shot LLMs and humans, and finding GPT-3.5 performs more quality edits than humans, but still exhibits frequent errors. Using our fine-grained annotations, we develop LENS-SALSA, a reference-free automatic simplification metric, trained to predict sentence- and word-level quality simultaneously. Additionally, we introduce word-level quality estimation for simplification and report promising baseline results. Our training material, annotation toolkit, and data are released at this http URL.

NLP-268-标题: Pre-training Language Models for Comparative Reasoning

链接: https://arxiv.org/abs/2305.14457
作者: Mengxia Yu, Zhihan Zhang, Wenhao Yu, Meng Jiang


Abstract: In this paper, we propose a novel framework to pre-train language models for enhancing their abilities of comparative reasoning over texts. While recent research has developed models for NLP tasks that require comparative reasoning, they suffer from costly manual data labeling and limited generalizability to different tasks. Our approach involves a scalable method for collecting data for text-based entity comparison, which leverages both structured and unstructured data, and the design of three novel pre-training tasks. Evaluation on a range of downstream tasks including comparative question answering, question generation, and summarization shows that our pre-training framework significantly improves the comparative reasoning abilities of language models, especially under low-resource conditions. This work also releases the first integrated benchmark for comparative reasoning over texts.

NLP-269-标题: Having Beer after Prayer? Measuring Cultural Bias in Large Language Models

链接: https://arxiv.org/abs/2305.14456
作者: Tarek Naous, Michael J. Ryan, Wei Xu


Abstract: Are language models culturally biased? It is important that language models conform to the cultural aspects of the communities they serve. However, we show in this paper that language models suffer from a significant bias towards Western culture when handling and generating text in Arabic, often preferring, and producing Western-fitting content as opposed to the relevant Arab content. We quantify this bias through a likelihood scoring-based metric using naturally occurring contexts that we collect from online social media. Our experiments reveal that both Arabic monolingual and multilingual models exhibit bias towards Western culture in eight different cultural aspects: person names, food, clothing, location, literature, beverage, religion, and sports. Models also tend to exhibit more bias when prompted with Arabic sentences that are more linguistically aligned with English. These findings raise concerns about the cultural relevance of current language models. Our analyses show that providing culture-indicating tokens or culturally-relevant demonstrations to the model can help in debiasing.

NLP-270-标题: On Robustness of Finetuned Transformer -based NLP Models

链接: https://arxiv.org/abs/2305.14453
作者: Pavan Kalyan Reddy Neerudu, Subba Reddy Oota, Mounika Marreddy, Venkateswara Rao Kagita, Manish Gupta
备注: 16 pages, 8 figures


Abstract: Transformer-based pretrained models like BERT, GPT-2 and T5 have been finetuned for a large number of natural language processing (NLP) tasks, and have been shown to be very effective. However, while finetuning, what changes across layers in these models with respect to pretrained checkpoints is under-studied. Further, how robust are these models to perturbations in input text? Does the robustness vary depending on the NLP task for which the models have been finetuned? While there exists some work on studying robustness of BERT finetuned for a few NLP tasks, there is no rigorous study which compares this robustness across encoder only, decoder only and encoder-decoder models. In this paper, we study the robustness of three language models (BERT, GPT-2 and T5) with eight different text perturbations on the General Language Understanding Evaluation (GLUE) benchmark. Also, we use two metrics (CKA and STIR) to quantify changes between pretrained and finetuned language model representations across layers. GPT-2 representations are more robust than BERT and T5 across multiple types of input perturbation. Although models exhibit good robustness broadly, dropping nouns, verbs or changing characters are the most impactful. Overall, this study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models which should be kept in mind when passing inputs.

NLP-271-标题: Is Information Extraction Solved by ChatGPT ? An Analysis of Performance Evaluation Criteria Robustness and Errors

链接: https://arxiv.org/abs/2305.14450
作者: Ridong Han, Tao Peng, Chaohao Yang, Benyou Wang, Lu Liu, Xiang Wan
备注: 23 pages, version 1.0


Abstract: ChatGPT has stimulated the research boom in the field of large language models. In this paper, we assess the capabilities of ChatGPT from four perspectives including Performance, Evaluation Criteria, Robustness and Error Types. Specifically, we first evaluate ChatGPT’s performance on 17 datasets with 14 IE sub-tasks under the zero-shot, few-shot and chain-of-thought scenarios, and find a huge performance gap between ChatGPT and SOTA results. Next, we rethink this gap and propose a soft-matching strategy for evaluation to more accurately reflect ChatGPT’s performance. Then, we analyze the robustness of ChatGPT on 14 IE sub-tasks, and find that: 1) ChatGPT rarely outputs invalid responses; 2) Irrelevant context and long-tail target types greatly affect ChatGPT’s performance; 3) ChatGPT cannot understand well the subject-object relationships in RE task. Finally, we analyze the errors of ChatGPT, and find that “unannotated spans” is the most dominant error type. This raises concerns about the quality of annotated data, and indicates the possibility of annotating data with ChatGPT. The data and code are released at Github site.

NLP-272-标题: Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions ACL

链接: https://arxiv.org/abs/2305.14441
作者: Zhihan Zhang, Wenhao Yu, Zheng Ning, Mingxuan Ju, Meng Jiang
备注: Accepted at TACL. This is a pre-MIT Press publication version


Abstract: Contrast consistency, the ability of a model to make consistently correct predictions in the presence of perturbations, is an essential aspect in NLP. While studied in tasks such as sentiment analysis and reading comprehension, it remains unexplored in open-domain question answering (OpenQA) due to the difficulty of collecting perturbed questions that satisfy factuality requirements. In this work, we collect minimally edited questions as challenging contrast sets to evaluate OpenQA models. Our collection approach combines both human annotation and large language model generation. We find that the widely used dense passage retriever (DPR) performs poorly on our contrast sets, despite fitting the training set well and performing competitively on standard test sets. To address this issue, we introduce a simple and effective query-side contrastive loss with the aid of data augmentation to improve DPR training. Our experiments on the contrast sets demonstrate that DPR’s contrast consistency is improved without sacrificing its accuracy on the standard test sets.

NLP-273-标题: Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction

链接: https://arxiv.org/abs/2305.14434
作者: Yew Ken Chia, Hui Chen, Wei Han, Guizhen Chen, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing


Abstract: Aspect Sentiment Triplet Extraction (ASTE) is a subtask of Aspect-Based Sentiment Analysis (ABSA) that considers each opinion term, their expressed sentiment, and the corresponding aspect targets. However, existing methods are limited to the in-domain setting with two domains. Hence, we propose a domain-expanded benchmark to address the in-domain, out-of-domain and cross-domain settings. We support the new benchmark by annotating more than 4000 data samples for two new domains based on hotel and cosmetics reviews. Our analysis of five existing methods shows that while there is a significant gap between in-domain and out-of-domain performance, generative methods have a strong potential for domain generalization. Our datasets, code implementation and models are available at this https URL .

NLP-274-标题: Image Manipulation via Multi-Hop Instructions – A New Dataset and Weakly-Supervised Neuro-Symbolic Approach

链接: https://arxiv.org/abs/2305.14410
作者: Harman Singh, Poorva Garg, Mohit Gupta, Kevin Shah, Arnab Kumar Mondal, Dinesh Khandelwal, Parag Singla, Dinesh Garg


Abstract: We are interested in image manipulation via natural language text – a task that is useful for multiple AI applications but requires complex reasoning over multi-modal spaces. We extend recently proposed Neuro Symbolic Concept Learning (NSCL), which has been quite effective for the task of Visual Question Answering (VQA), for the task of image manipulation. Our system referred to as NeuroSIM can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NeuroSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising of object attributes and manipulation operations, that guides its execution. We create a new dataset for the task, and extensive experiments demonstrate that NeuroSIM is highly competitive with or beats SOTA baselines that make use of supervised data for manipulation.

NLP-275-标题: AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

链接: https://arxiv.org/abs/2305.14387
作者: Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto


Abstract: Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 45x cheaper than crowdworkers and display high agreement with humans. Second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. Third, we contribute reference implementations for several methods (PPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003. We release all components of AlpacaFarm at this https URL.

NLP-276-标题: Let GPT be a Math Tutor: Teaching Math Word Problem Solvers with Customized Exercise Generation

链接: https://arxiv.org/abs/2305.14386
作者: Zhenwen Liang, Wenhao Yu, Tanmay Rajpurohit, Peter Clark, Xiangliang Zhang, Ashwin Kaylan


Abstract: In this paper, we present a novel approach for distilling math word problem solving capabilities from large language models (LLMs) into smaller, more efficient student models. Our approach is designed to consider the student model’s weaknesses and foster a tailored learning experience by generating targeted exercises aligned with educational science principles, such as knowledge tracing and personalized learning. Concretely, we let GPT-3 be a math tutor and run two steps iteratively: 1) assessing the student model’s current learning status on a GPT-generated exercise book, and 2) improving the student model by training it with tailored exercise samples generated by GPT-3. Experimental results reveal that our approach outperforms LLMs (e.g., GPT-3 and PaLM) in accuracy across three distinct benchmarks while employing significantly fewer parameters. Furthermore, we provide a comprehensive analysis of the various components within our methodology to substantiate their efficacy.

NLP-277-标题: Finding the Pillars of Strength for Multi-Head Attention ACL2023

链接: https://arxiv.org/abs/2305.14380
作者: Jinjie Ni, Rui Mao, Zonglin Yang, Han Lei, Erik Cambria
备注: In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2023)


Abstract: Recent studies have revealed some issues of Multi-Head Attention (MHA), e.g., redundancy and over-parameterization. Specifically, the heads of MHA were originally designed to attend to information from different representation subspaces, whereas prior studies found that some attention heads likely learn similar features and can be pruned without harming performance. Inspired by the minimum-redundancy feature selection, we assume that focusing on the most representative and distinctive features with minimum resources can mitigate the above issues and lead to more effective and efficient MHAs. In particular, we propose Grouped Head Attention, trained with a self-supervised group constraint that group attention heads, where each group focuses on an essential but distinctive feature subset. We additionally propose a Voting-to-Stay procedure to remove redundant heads, thus achieving a transformer with lighter weights. Moreover, our method achieves significant performance gains on three well-established tasks while considerably compressing parameters.

NLP-278-标题: On the Transferability of Whisper-based Representations for “In-the-Wild” Cross-Task Downstream Speech Applications

链接: https://arxiv.org/abs/2305.14546
作者: Vamsikrishna Chemudupati, Marzieh Tahaei, Heitor Guimaraes, Arthur Pimentel, Anderson Avila, Mehdi Rezagholizadeh, Boxing Chen, Tiago Falk


Abstract: Large self-supervised pre-trained speech models have achieved remarkable success across various speech-processing tasks. The self-supervised training of these models leads to universal speech representations that can be used for different downstream tasks, ranging from automatic speech recognition (ASR) to speaker identification. Recently, Whisper, a transformer-based model was proposed and trained on large amount of weakly supervised data for ASR; it outperformed several state-of-the-art self-supervised models. Given the superiority of Whisper for ASR, in this paper we explore the transferability of the representation for four other speech tasks in SUPERB benchmark. Moreover, we explore the robustness of Whisper representation for ``in the wild’’ tasks where speech is corrupted by environment noise and room reverberation. Experimental results show Whisper achieves promising results across tasks and environmental conditions, thus showing potential for cross-task real-world deployment.


ML-0-标题: Differentially-Private Decision Trees with Probabilistic Robustness to Data Poisoning

链接: https://arxiv.org/abs/2305.15394
作者: Daniël Vos, Jelle Vos, Tianyu Li, Zekeriya Erkin, Sicco Verwer


Abstract: Decision trees are interpretable models that are well-suited to non-linear learning problems. Much work has been done on extending decision tree learning algorithms with differential privacy, a system that guarantees the privacy of samples within the training data. However, current state-of-the-art algorithms for this purpose sacrifice much utility for a small privacy benefit. These solutions create random decision nodes that reduce decision tree accuracy or spend an excessive share of the privacy budget on labeling leaves. Moreover, many works do not support or leak information about feature values when data is continuous. We propose a new method called PrivaTree based on private histograms that chooses good splits while consuming a small privacy budget. The resulting trees provide a significantly better privacy-utility trade-off and accept mixed numerical and categorical data without leaking additional information. Finally, while it is notoriously hard to give robustness guarantees against data poisoning attacks, we prove bounds for the expected success rates of backdoor attacks against differentially-private learners. Our experimental results show that PrivaTree consistently outperforms previous works on predictive accuracy and significantly improves robustness against backdoor attacks compared to regular decision trees.

ML-1-标题: On the Minimax Regret for Online Learning with Feedback Graphs

链接: https://arxiv.org/abs/2305.15383
作者: Khaled Eldowa, Emmanuel Esposito, Tommaso Cesari, Nicolò Cesa-Bianchi


Abstract: In this work, we improve on the upper and lower bounds for the regret of online learning with strongly observable undirected feedback graphs. The best known upper bound for this problem is \mathcalO\bigl(\sqrt\alpha T\ln K\bigr) , where K is the number of actions, \alpha is the independence number of the graph, and T is the time horizon. The \sqrt\ln K factor is known to be necessary when \alpha = 1 (the experts case). On the other hand, when \alpha = K (the bandits case), the minimax rate is known to be \Theta\bigl(\sqrtKT\bigr) , and a lower bound \Omega\bigl(\sqrt\alpha T\bigr) is known to hold for any \alpha . Our improved upper bound \mathcalO\bigl(\sqrt\alpha T(1+\ln(K/\alpha))\bigr) holds for any \alpha and matches the lower bounds for bandits and experts, while interpolating intermediate cases. To prove this result, we use FTRL with q -Tsallis entropy for a carefully chosen value of q \in [1/2, 1) that varies with \alpha . The analysis of this algorithm requires a new bound on the variance term in the regret. We also show how to extend our techniques to time-varying graphs, without requiring prior knowledge of their independence numbers. Our upper bound is complemented by an improved \Omega\bigl(\sqrt\alpha T(\ln K)/(\ln\alpha)\bigr) lower bound for all \alpha > 1 , whose analysis relies on a novel reduction to multitask learning. This shows that a logarithmic factor is necessary as soon as \alpha < K .

ML-2-标题: Stochastic Unrolled Federated Learning

链接: https://arxiv.org/abs/2305.15371
作者: Samar Hadou, Navid NaderiAlizadeh, Alejandro Ribeiro


Abstract: Algorithm unrolling has emerged as a learning-based optimization paradigm that unfolds truncated iterative algorithms in trainable neural-network optimizers. We introduce Stochastic UnRolled Federated learning (SURF), a method that expands algorithm unrolling to a federated learning scenario. Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers to find a descent direction and the decentralized nature of federated learning. We circumvent the former challenge by feeding stochastic mini-batches to each unrolled layer and imposing descent constraints to mitigate the randomness induced by using mini-batches. We address the latter challenge by unfolding the distributed gradient descent (DGD) algorithm in a graph neural network (GNN)-based unrolled architecture, which preserves the decentralized nature of training in federated learning. We theoretically prove that our proposed unrolled optimizer converges to a near-optimal region infinitely often. Through extensive numerical experiments, we also demonstrate the effectiveness of the proposed framework in collaborative training of image classifiers.

ML-3-标题: Inverse Preference Learning: Preference-based RL without a Reward Function

链接: https://arxiv.org/abs/2305.15363
作者: Joey Hejna, Dorsa Sadigh


Abstract: Reward functions are difficult to design and often hard to align with human intent. Preference-based Reinforcement Learning (RL) algorithms address these problems by learning reward functions from human feedback. However, the majority of preference-based RL methods naïvely combine supervised reward models with off-the-shelf RL algorithms. Contemporary approaches have sought to improve performance and query complexity by using larger and more complex reward architectures such as transformers. Instead of using highly complex architectures, we develop a new and parameter-efficient algorithm, Inverse Preference Learning (IPL), specifically designed for learning from offline preference data. Our key insight is that for a fixed policy, the Q -function encodes all information about the reward function, effectively making them interchangeable. Using this insight, we completely eliminate the need for a learned reward function. Our resulting algorithm is simpler and more parameter-efficient. Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters. Our code is publicly released.

ML-4-标题: A Virtual Reality Tool for Representing Visualizing and Updating Deep Learning Models

链接: https://arxiv.org/abs/2305.15353
作者: Hannes Kath, Bengt Lüers, Thiago S. Gouvêa, Daniel Sonntag


Abstract: Deep learning is ubiquitous, but its lack of transparency limits its impact on several potential application areas. We demonstrate a virtual reality tool for automating the process of assigning data inputs to different categories. A dataset is represented as a cloud of points in virtual space. The user explores the cloud through movement and uses hand gestures to categorise portions of the cloud. This triggers gradual movements in the cloud: points of the same category are attracted to each other, different groups are pushed apart, while points are globally distributed in a way that utilises the entire space. The space, time, and forces observed in virtual reality can be mapped to well-defined machine learning concepts, namely the latent space, the training epochs and the backpropagation. Our tool illustrates how the inner workings of deep neural networks can be made tangible and transparent. We expect this approach to accelerate the autonomous development of deep learning applications by end users in novel areas.

ML-5-标题: Optimal Rates for Bandit Nonstochastic Control

链接: https://arxiv.org/abs/2305.15352
作者: Y. Jennifer Sun, Stephen Newman, Elad Hazan


Abstract: Linear Quadratic Regulator (LQR) and Linear Quadratic Gaussian (LQG) control are foundational and extensively researched problems in optimal control. We investigate LQR and LQG problems with semi-adversarial perturbations and time-varying adversarial bandit loss functions. The best-known sublinear regret algorithm of~\citegradu2020non has a T^\frac34 time horizon dependence, and its authors posed an open question about whether a tight rate of \sqrtT could be achieved. We answer in the affirmative, giving an algorithm for bandit LQR and LQG which attains optimal regret (up to logarithmic factors) for both known and unknown systems. A central component of our method is a new scheme for bandit convex optimization with memory, which is of independent interest.

ML-6-标题: Black-Box Variational Inference Converges

链接: https://arxiv.org/abs/2305.15349
作者: Kyurae Kim, Kaiwen Wu, Jisu Oh, Yian Ma, Jacob R. Gardner
备注: under review


Abstract: We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Also, our analysis reveals that certain algorithm design choices commonly employed in practice, particularly, nonlinear parameterizations of the scale of the variational approximation, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations, and thus achieves the strongest known convergence rate guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.

ML-7-标题: READ: Recurrent Adaptation of Large Transformer s

链接: https://arxiv.org/abs/2305.15348
作者: Sid Wang, John Nguyen, Ke Li, Carole-Jean Wu


Abstract: Fine-tuning large-scale Transformers has led to the explosion of many AI applications across Natural Language Processing and Computer Vision tasks. However, fine-tuning all pre-trained model parameters becomes impractical as the model size and number of tasks increase. Parameter-efficient transfer learning (PETL) methods aim to address these challenges. While effective in reducing the number of trainable parameters, PETL methods still require significant energy and computational resources to fine-tune. In this paper, we introduce \textbfREcurrent \textbfADaption (READ) – a lightweight and memory-efficient fine-tuning method – to overcome the limitations of the current PETL approaches. Specifically, READ inserts a small RNN network alongside the backbone model so that the model does not have to back-propagate through the large backbone network. Through comprehensive empirical evaluation of the GLUE benchmark, we demonstrate READ can achieve a 56% reduction in the training memory consumption and an 84% reduction in the GPU energy usage while retraining high model quality compared to full-tuning. Additionally, the model size of READ does not grow with the backbone model size, making it a highly scalable solution for fine-tuning large Transformers.

ML-8-标题: Is Your Model “MADD”? A Novel Metric to Evaluate Algorithmic Fairness for Predictive Student Models

链接: https://arxiv.org/abs/2305.15342
作者: Mélina Verger, Sébastien Lallé, François Bouchet, Vanda Luengo
备注: 12 pages, conference


Abstract: Predictive student models are increasingly used in learning environments due to their ability to enhance educational outcomes and support stakeholders in making informed decisions. However, predictive models can be biased and produce unfair outcomes, leading to potential discrimination against some students and possible harmful long-term implications. This has prompted research on fairness metrics meant to capture and quantify such biases. Nonetheless, so far, existing fairness metrics used in education are predictive performance-oriented, focusing on assessing biased outcomes across groups of students, without considering the behaviors of the models nor the severity of the biases in the outcomes. Therefore, we propose a novel metric, the Model Absolute Density Distance (MADD), to analyze models’ discriminatory behaviors independently from their predictive performance. We also provide a complementary visualization-based analysis to enable fine-grained human assessment of how the models discriminate between groups of students. We evaluate our approach on the common task of predicting student success in online courses, using several common predictive classification models on an open educational dataset. We also compare our metric to the only predictive performance-oriented fairness metric developed in education, ABROCA. Results on this dataset show that: (1) fair predictive performance does not guarantee fair models’ behaviors and thus fair outcomes, (2) there is no direct relationship between data bias and predictive performance bias nor discriminatory behaviors bias, and (3) trained on the same data, models exhibit different discriminatory behaviors, according to different sensitive features too. We thus recommend using the MADD on models that show satisfying predictive performance, to gain a finer-grained understanding on how they behave and to refine models selection and their usage.

ML-9-标题: A Deep Generative Model for Interactive Data Annotation through Direct Manipulation in Latent Space

链接: https://arxiv.org/abs/2305.15337
作者: Hannes Kath, Thiago S. Gouvêa, Daniel Sonntag


Abstract: The impact of machine learning (ML) in many fields of application is constrained by lack of annotated data. Among existing tools for ML-assisted data annotation, one little explored tool type relies on an analogy between the coordinates of a graphical user interface and the latent space of a neural network for interaction through direct manipulation. In the present work, we 1) expand the paradigm by proposing two new analogies: time and force as reflecting iterations and gradients of network training; 2) propose a network model for learning a compact graphical representation of the data that takes into account both its internal structure and user provided annotations; and 3) investigate the impact of model hyperparameters on the learned graphical representations of the data, identifying candidate model variants for a future user study.

ML-10-标题: Breaking the Curse of Quality Saturation with User-Centric Ranking KDD2023

链接: https://arxiv.org/abs/2305.15333
作者: Zhuokai Zhao, Yang Yang, Wenyu Wang, Chihuang Liu, Yu Shi, Wenjie Hu, Haotian Zhang, Shuang Yang
备注: accepted to publish at KDD 2023


Abstract: A key puzzle in search, ads, and recommendation is that the ranking model can only utilize a small portion of the vastly available user interaction data. As a result, increasing data volume, model size, or computation FLOPs will quickly suffer from diminishing returns. We examined this problem and found that one of the root causes may lie in the so-called item-centric'' formulation, which has an unbounded vocabulary and thus uncontrolled model complexity. To mitigate quality saturation, we introduce an alternative formulation named user-centric ranking’', which is based on a transposed view of the dyadic user-item interaction data. We show that this formulation has a promising scaling property, enabling us to train better-converged models on substantially larger data sets.

ML-11-标题: No-Regret Online Prediction with Strategic Experts

链接: https://arxiv.org/abs/2305.15331
作者: Omid Sadeghi, Maryam Fazel


Abstract: We study a generalization of the online binary prediction with expert advice framework where at each round, the learner is allowed to pick m\geq 1 experts from a pool of K experts and the overall utility is a modular or submodular function of the chosen experts. We focus on the setting in which experts act strategically and aim to maximize their influence on the algorithm’s predictions by potentially misreporting their beliefs about the events. Among others, this setting finds applications in forecasting competitions where the learner seeks not only to make predictions by aggregating different forecasters but also to rank them according to their relative performance. Our goal is to design algorithms that satisfy the following two requirements: 1) \textitIncentive-compatible : Incentivize the experts to report their beliefs truthfully, and 2) \textitNo-regret : Achieve sublinear regret with respect to the true beliefs of the best fixed set of m experts in hindsight. Prior works have studied this framework when m=1 and provided incentive-compatible no-regret algorithms for the problem. We first show that a simple reduction of our problem to the m=1 setting is neither efficient nor effective. Then, we provide algorithms that utilize the specific structure of the utility functions to achieve the two desired goals.

ML-12-标题: The Crucial Role of Normalization in Sharpness-Aware Minimization

链接: https://arxiv.org/abs/2305.15287
作者: Yan Dai, Kwangjun Ahn, Suvrit Sra


Abstract: Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically and empirically study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: i) it helps in stabilizing the algorithm; and ii) it enables the algorithm to drift along a continuum (manifold) of minima – a property identified by recent theoretical works that is the key to better performance. We further argue that these two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM. Our conclusions are backed by various experiments.

ML-13-标题: Replicable Reinforcement Learning

链接: https://arxiv.org/abs/2305.15284
作者: Eric Eaton, Marcel Hussing, Michael Kearns, Jessica Sorrell


Abstract: The replicability crisis in the social, behavioral, and data sciences has led to the formulation of algorithm frameworks for replicability – i.e., a requirement that an algorithm produce identical outputs (with high probability) when run on two different samples from the same underlying distribution. While still in its infancy, provably replicable algorithms have been developed for many fundamental tasks in machine learning and statistics, including statistical query learning, the heavy hitters problem, and distribution testing. In this work we initiate the study of replicable reinforcement learning, providing a provably replicable algorithm for parallel value iteration, and a provably replicable version of R-max in the episodic setting. These are the first formal replicability results for control problems, which present different challenges for replication than batch learning settings.

ML-14-标题: Successor-Predecessor Intrinsic Exploration

链接: https://arxiv.org/abs/2305.15277
作者: Changmin Yu, Neil Burgess, Maneesh Sahani, Sam Gershman


Abstract: Exploration is essential in reinforcement learning, particularly in environments where external rewards are sparse. Here we focus on exploration with intrinsic rewards, where the agent transiently augments the external rewards with self-generated intrinsic rewards. Although the study of intrinsic rewards has a long history, existing methods focus on composing the intrinsic reward based on measures of future prospects of states, ignoring the information contained in the retrospective structure of transition sequences. Here we argue that the agent can utilise retrospective information to generate explorative behaviour with structure-awareness, facilitating efficient exploration based on global instead of local information. We propose Successor-Predecessor Intrinsic Exploration (SPIE), an exploration algorithm based on a novel intrinsic reward combining prospective and retrospective information. We show that SPIE yields more efficient and ethologically plausible exploratory behaviour in environments with sparse rewards and bottleneck states than competing methods. We also implement SPIE in deep reinforcement learning agents, and show that the resulting agent achieves stronger empirical performance than existing methods on sparse-reward Atari games.

ML-15-标题: Robust Sparse Mean Estimation via Incremental Learning

链接: https://arxiv.org/abs/2305.15276
作者: Jianhao Ma, Rui Ray Chen, Yinghui He, Salar Fattahi, Wei Hu


Abstract: In this paper, we study the problem of robust sparse mean estimation, where the goal is to estimate a k -sparse mean from a collection of partially corrupted samples drawn from a heavy-tailed distribution. Existing estimators face two critical challenges in this setting. First, they are limited by a conjectured computational-statistical tradeoff, implying that any computationally efficient algorithm needs \tilde\Omega(k^2) samples, while its statistically-optimal counterpart only requires \tilde O(k) samples. Second, the existing estimators fall short of practical use as they scale poorly with the ambient dimension. This paper presents a simple mean estimator that overcomes both challenges under moderate conditions: it runs in near-linear time and memory (both with respect to the ambient dimension) while requiring only \tilde O(k) samples to recover the true mean. At the core of our method lies an incremental learning phenomenon: we introduce a simple nonconvex framework that can incrementally learn the top- k nonzero elements of the mean while keeping the zero elements arbitrarily small. Unlike existing estimators, our method does not need any prior knowledge of the sparsity level k . We prove the optimality of our estimator by providing a matching information-theoretic lower bound. Finally, we conduct a series of simulations to corroborate our theoretical findings. Our code is available at this https URL.

ML-16-标题: Training Energy-Based Normalizing Flow with Score-Matching Objectives

链接: https://arxiv.org/abs/2305.15267
作者: Chen-Hao Chao, Wei-Fang Sun, Yen-Chang Hsu, Zsolt Kira, Chun-Yi Lee


Abstract: In this paper, we establish a connection between the parameterization of flow-based and energy-based generative models, and present a new flow-based modeling approach called energy-based normalizing flow (EBFlow). We demonstrate that by optimizing EBFlow with score-matching objectives, the computation of Jacobian determinants for linear transformations can be entirely bypassed. This feature enables the use of arbitrary linear layers in the construction of flow-based models without increasing the computational time complexity of each training iteration from \mathcalO(D^2L) to \mathcalO(D^3L) for an L -layered model that accepts D -dimensional inputs. This makes the training of EBFlow more efficient than the commonly-adopted maximum likelihood training method. In addition to the reduction in runtime, we enhance the training stability and empirical performance of EBFlow through a number of techniques developed based on our analysis on the score-matching methods. The experimental results demonstrate that our approach achieves a significant speedup compared to maximum likelihood estimation, while outperforming prior efficient training techniques with a noticeable margin in terms of negative log-likelihood (NLL).

ML-17-标题: Collaborative World Models: An Online-Offline Transfer RL Approach

链接: https://arxiv.org/abs/2305.15260
作者: Qi Wang, Junming Yang, Yunbo Wang, Xin Jin, Wenjun Zeng, Xiaokang Yang


Abstract: Training visual reinforcement learning (RL) models in offline datasets is challenging due to overfitting issues in representation learning and overestimation problems in value function. In this paper, we propose a transfer learning method called Collaborative World Models (CoWorld) to improve the performance of visual RL under offline conditions. The core idea is to use an easy-to-interact, off-the-shelf simulator to train an auxiliary RL model as the online ``test bed’’ for the offline policy learned in the target domain, which provides a flexible constraint for the value function – Intuitively, we want to mitigate the overestimation problem of value functions outside the offline data distribution without impeding the exploration of actions with potential advantages. Specifically, CoWorld performs domain-collaborative representation learning to bridge the gap between online and offline hidden state distributions. Furthermore, it performs domain-collaborative behavior learning that enables the source RL agent to provide target-aware value estimation, allowing for effective offline policy regularization. Experiments show that CoWorld significantly outperforms existing methods in offline visual control tasks in DeepMind Control and Meta-World.

ML-18-标题: Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees

链接: https://arxiv.org/abs/2305.15249
作者: Sharan Vaswani, Amirreza Kazemi, Reza Babanezhad, Nicolas Le Roux
备注: 44 pages


Abstract: Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient method as the actor and value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic, AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO) and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.

ML-19-标题: Machine Unlearning: its nature scope and importance for a “delete culture”

链接: https://arxiv.org/abs/2305.15242
作者: Luciano Floridi


Abstract: The article explores the cultural shift from recording to deleting information in the digital age and its implications on privacy, intellectual property (IP), and Large Language Models like ChatGPT. It begins by defining a delete culture where information, in principle legal, is made unavailable or inaccessible because unacceptable or undesirable, especially but not only due to its potential to infringe on privacy or IP. Then it focuses on two strategies in this context: deleting, to make information unavailable; and blocking, to make it inaccessible. The article argues that both strategies have significant implications, particularly for machine learning (ML) models where information is not easily made unavailable. However, the emerging research area of Machine Unlearning (MU) is highlighted as a potential solution. MU, still in its infancy, seeks to remove specific data points from ML models, effectively making them ‘forget’ completely specific information. If successful, MU could provide a feasible means to manage the overabundance of information and ensure a better protection of privacy and IP. However, potential ethical risks, such as misuse, overuse, and underuse of MU, should be systematically studied to devise appropriate policies.

ML-20-标题: Ethics and Deep Learning

链接: https://arxiv.org/abs/2305.15239
作者: Travis LaCroix, Simon J. D. Prince
备注: Copyright in this Work has been licensed exclusively to The MIT Press, this https URL, which will be releasing the final version to the public in 2023. All inquiries regarding rights should be addressed to The MIT Press, Rights and Permissions Department


Abstract: This article appears as chapter 21 of Prince (2023, Understanding Deep Learning); a complete draft of the textbook is available here: this http URL. This chapter considers potential harms arising from the design and use of AI systems. These include algorithmic bias, lack of explainability, data privacy violations, militarization, fraud, and environmental concerns. The aim is not to provide advice on being more ethical. Instead, the goal is to express ideas and start conversations in key areas that have received attention in philosophy, political science, and the broader social sciences.

ML-21-标题: On the road to more accurate mobile cellular traffic predictions

链接: https://arxiv.org/abs/2305.15234
作者: Natalia Vassileva Vesselinova


Abstract: The main contribution reported in the paper is a novel paradigm through which mobile cellular traffic forecasting is made substantially more accurate. Specifically, by incorporating freely available road metrics we characterise the data generation process and spatial dependencies. Therefore, this provides a means for improving the forecasting estimates. We employ highway flow and average speed variables together with a cellular network traffic metric in a light learning structure to predict the short-term future load on a cell covering a segment of a highway. This is in sharp contrast to prior art that mainly studies urban scenarios (with pedestrian and limited vehicular speeds) and develops machine learning approaches that use exclusively network metrics and meta information to make mid-term and long-term predictions. The learning structure can be used at a cell or edge level, and can find application in both federated and centralised learning.

ML-22-标题: Short and Straight: Geodesics on Differentiable Manifolds

链接: https://arxiv.org/abs/2305.15228
作者: Daniel Kelshaw, Luca Magri


Abstract: Manifolds discovered by machine learning models provide a compact representation of the underlying data. Geodesics on these manifolds define locally length-minimising curves and provide a notion of distance, which are key for reduced-order modelling, statistical inference, and interpolation. In this work, we first analyse existing methods for computing length-minimising geodesics. We find that these are not suitable for obtaining valid paths, and thus, geodesic distances. We remedy these shortcomings by leveraging numerical tools from differential geometry, which provide the means to obtain Hamiltonian-conserving geodesics. Second, we propose a model-based parameterisation for distance fields and geodesic flows on continuous manifolds. Our approach exploits a manifold-aware extension to the Eikonal equation, eliminating the need for approximations or discretisation. Finally, we develop a curvature-based training mechanism, sampling and scaling points in regions of the manifold exhibiting larger values of the Ricci scalar. This sampling and scaling approach ensures that we capture regions of the manifold subject to higher degrees of geodesic deviation. Our proposed methods provide principled means to compute valid geodesics and geodesic distances on manifolds. This work opens opportunities for latent-space interpolation, optimal control, and distance computation on differentiable manifolds.

ML-23-标题: Shadow Cones: Unveiling Partial Orders in Hyperbolic Space

链接: https://arxiv.org/abs/2305.15215
作者: Tao Yu, Toni J.B. Liu, Albert Tseng, Christopher De Sa


Abstract: Hyperbolic space has been shown to produce superior low-dimensional embeddings of hierarchical structures that are unattainable in Euclidean space. Building upon this, the entailment cone formulation of Ganea et al. uses geodesically convex cones to embed partial orderings in hyperbolic space. However, these entailment cones lack intuitive interpretations due to their definitions via complex concepts such as tangent vectors and the exponential map in Riemannian space. In this paper, we present shadow cones, an innovative framework that provides a physically intuitive interpretation for defining partial orders on general manifolds. This is achieved through the use of metaphoric light sources and object shadows, inspired by the sun-earth-moon relationship. Shadow cones consist of two primary classes: umbral and penumbral cones. Our results indicate that shadow cones offer robust representation and generalization capabilities across a variety of datasets, such as WordNet and ConceptNet, thereby outperforming the top-performing entailment cones. Our findings indicate that shadow cones offer an innovative, general approach to geometrically encode partial orders, enabling better representation and analysis of datasets with hierarchical structures.

ML-24-标题: Relating Implicit Bias and Adversarial Attacks through Intrinsic Dimension

链接: https://arxiv.org/abs/2305.15203
作者: Lorenzo Basile, Nikos Karantzas, Alberto D’Onofrio, Luca Bortolussi, Alex Rodriguez, Fabio Anselmi


Abstract: Despite their impressive performance in classification, neural networks are known to be vulnerable to adversarial attacks. These attacks are small perturbations of the input data designed to fool the model. Naturally, a question arises regarding the potential connection between the architecture, settings, or properties of the model and the nature of the attack. In this work, we aim to shed light on this problem by focusing on the implicit bias of the neural network, which refers to its inherent inclination to favor specific patterns or outcomes. Specifically, we investigate one aspect of the implicit bias, which involves the essential Fourier frequencies required for accurate image classification. We conduct tests to assess the statistical relationship between these frequencies and those necessary for a successful attack. To delve into this relationship, we propose a new method that can uncover non-linear correlations between sets of coordinates, which, in our case, are the aforementioned frequencies. By exploiting the entanglement between intrinsic dimension and correlation, we provide empirical evidence that the network bias in Fourier space and the target frequencies of adversarial attacks are closely tied.

ML-25-标题: Feature-aligned N-BEATS with Sinkhorn divergence

链接: https://arxiv.org/abs/2305.15196
作者: Myeongho Jeon, Myungjoo Kang, Joonhun Lee, Kyunghyun Park


Abstract: In this study, we propose Feature-aligned N-BEATS as a domain generalization model for univariate time series forecasting problems. The proposed model is an extension of the doubly residual stacking architecture of N-BEATS (Oreshkin et al. [34]) into a representation learning framework. The model is a new structure that involves marginal feature probability measures (i.e., pushforward measures of multiple source domains) induced by the intricate composition of residual operators of N-BEATS in each stack and aligns them stack-wise via an entropic regularized Wasserstein distance referred to as the Sinkhorn divergence (Genevay et al. [14]). The loss function consists of a typical forecasting loss for multiple source domains and an alignment loss calculated with the Sinkhorn divergence, which allows the model to learn invariant features stack-wise across multiple source data sequences while retaining N-BEATS’s interpretable design. We conduct a comprehensive experimental evaluation of the proposed approach and the results demonstrate the model’s forecasting and generalization capabilities in comparison with methods based on the original N-BEATS.

ML-26-标题: Adaptive Policy Learning to Additional Tasks

链接: https://arxiv.org/abs/2305.15193
作者: Wenjian Hao, Zehui Lu, Zihao Liang, Tianyu Zhou, Shaoshuai Mou


Abstract: This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task. A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman’s principle of optimality with the policy gradient approach to improve the convergence rate. This paper provides theoretical analysis which guarantees the convergence rate and sample complexity of \mathcalO(1/T) and \mathcalO(1/\epsilon) , respectively, where T denotes the number of iterations and \epsilon denotes the accuracy of the resulting stationary policy. Furthermore, several challenging numerical simulations, including cartpole, lunar lander, and robot arm, are provided to show that APG obtains similar performance compared to existing deterministic policy gradient methods while utilizing much less data and converging at a faster rate.

ML-27-标题: Black-Box vs. Gray-Box: A Case Study on Learning Table Tennis Ball Trajectory Prediction with Spin and Impacts

链接: https://arxiv.org/abs/2305.15189
作者: Jan Achterhold, Philip Tobuschat, Hao Ma, Dieter Buechler, Michael Muehlebach, Joerg Stueckler
备注: Accepted for publication at the 5th Annual Conference on Learning for Dynamics and Control (L4DC) 2023. With supplementary material


Abstract: In this paper, we present a method for table tennis ball trajectory filtering and prediction. Our gray-box approach builds on a physical model. At the same time, we use data to learn parameters of the dynamics model, of an extended Kalman filter, and of a neural model that infers the ball’s initial condition. We demonstrate superior prediction performance of our approach over two black-box approaches, which are not supplied with physical prior knowledge. We demonstrate that initializing the spin from parameters of the ball launcher using a neural network drastically improves long-time prediction performance over estimating the spin purely from measured ball positions. An accurate prediction of the ball trajectory is crucial for successful returns. We therefore evaluate the return performance with a pneumatic artificial muscular robot and achieve a return rate of 29/30 (97.7%).

ML-28-标题: Policy Learning based on Deep Koopman Representation

链接: https://arxiv.org/abs/2305.15188
作者: Wenjian Hao, Paulo C. Heredia, Bowen Huang, Zehui Lu, Zihao Liang, Shaoshuai Mou


Abstract: This paper proposes a policy learning algorithm based on the Koopman operator theory and policy gradient approach, which seeks to approximate an unknown dynamical system and search for optimal policy simultaneously, using the observations gathered through interaction with the environment. The proposed algorithm has two innovations: first, it introduces the so-called deep Koopman representation into the policy gradient to achieve a linear approximation of the unknown dynamical system, all with the purpose of improving data efficiency; second, the accumulated errors for long-term tasks induced by approximating system dynamics are avoided by applying Bellman’s principle of optimality. Furthermore, a theoretical analysis is provided to prove the asymptotic convergence of the proposed algorithm and characterize the corresponding sampling complexity. These conclusions are also supported by simulations on several challenging benchmark environments.

ML-29-标题: Using Models Based on Cognitive Theory to Predict Human Behavior in Traffic: A Case Study

链接: https://arxiv.org/abs/2305.15187
作者: Julian F. Schumann, Aravinda Ramakrishnan Srinivasan, Jens Kober, Gustav Markkula, Arkady Zgonnikov
备注: 6 pages, 2 figures


Abstract: The development of automated vehicles has the potential to revolutionize transportation, but they are currently unable to ensure a safe and time-efficient driving style. Reliable models predicting human behavior are essential for overcoming this issue. While data-driven models are commonly used to this end, they can be vulnerable in safety-critical edge cases. This has led to an interest in models incorporating cognitive theory, but as such models are commonly developed for explanatory purposes, this approach’s effectiveness in behavior prediction has remained largely untested so far. In this article, we investigate the usefulness of the \emphCommotions model – a novel cognitively plausible model incorporating the latest theories of human perception, decision-making, and motor control – for predicting human behavior in gap acceptance scenarios, which entail many important traffic interactions such as lane changes and intersections. We show that this model can compete with or even outperform well-established data-driven prediction models across several naturalistic datasets. These results demonstrate the promise of incorporating cognitive theory in behavior prediction models for automated vehicles.

ML-30-标题: Mixture of Experts with Uncertainty Voting for Imbalanced Deep Regression Problems

链接: https://arxiv.org/abs/2305.15178
作者: Yuchang Jiang, Vivien Sainte Fare Garnot, Konrad Schindler, Jan Dirk Wegner


Abstract: Data imbalance is ubiquitous when applying machine learning to real-world problems, particularly regression problems. If training data are imbalanced, the learning is dominated by the densely covered regions of the target distribution, consequently, the learned regressor tends to exhibit poor performance in sparsely covered regions. Beyond standard measures like over-sampling or re-weighting, there are two main directions to handle learning from imbalanced data. For regression, recent work relies on the continuity of the distribution; whereas for classification there has been a trend to employ mixture-of-expert models and let some ensemble members specialize in predictions for the sparser regions. Here, we adapt the mixture-of-experts approach to the regression setting. A main question when using this approach is how to fuse the predictions from multiple experts into one output. Drawing inspiration from recent work on probabilistic deep learning, we propose to base the fusion on the aleatoric uncertainties of individual experts, thus obviating the need for a separate aggregation module. In our method, dubbed MOUV, each expert predicts not only an output value but also its uncertainty, which in turn serves as a statistically motivated criterion to rely on the right experts. We compare our method with existing alternatives on multiple public benchmarks and show that MOUV consistently outperforms the prior art, while at the same time producing better calibrated uncertainty estimates. Our code is available at link-upon-publication.

ML-31-标题: Simultaneous identification of models and parameters of scientific simulators

链接: https://arxiv.org/abs/2305.15174
作者: Cornelius Schröder, Jakob H. Macke


Abstract: Many scientific models are composed of multiple discrete components, and scien tists often make heuristic decisions about which components to include. Bayesian inference provides a mathematical framework for systematically selecting model components, but defining prior distributions over model components and developing associated inference schemes has been challenging. We approach this problem in an amortized simulation-based inference framework: We define implicit model priors over a fixed set of candidate components and train neural networks to infer joint probability distributions over both, model components and associated parameters from simulations. To represent distributions over model components, we introduce a conditional mixture of multivariate binary distributions in the Grassmann formalism. Our approach can be applied to any compositional stochastic simulator without requiring access to likelihood evaluations. We first illustrate our method on a simple time series model with redundant components and show that it can retrieve joint posterior distribution over a set of symbolic expressions and their parameters while accurately capturing redundancy with strongly correlated posteriors. We then apply our approach to drift-diffusion models, a commonly used model class in cognitive neuroscience. After validating the method on synthetic data, we show that our approach explains experimental data as well as previous methods, but that our fully probabilistic approach can help to discover multiple data-consistent model configurations, as well as reveal non-identifiable model components and parameters. Our method provides a powerful tool for data-driven scientific inquiry which will allow scientists to systematically identify essential model components and make uncertainty-informed modelling decisions.

ML-32-标题: Personalized DP-SGD using Sampling Mechanisms

链接: https://arxiv.org/abs/2305.15165
作者: Geon Heo, Junseok Seo, Steven Euijong Whang
备注: 10 pages, 5 figures


Abstract: Personalized privacy becomes critical in deep learning for Trustworthy AI. While Differentially Private Stochastic Gradient Descent (DP-SGD) is widely used in deep learning methods supporting privacy, it provides the same level of privacy to all individuals, which may lead to overprotection and low utility. In practice, different users may require different privacy levels, and the model can be improved by using more information about the users with lower privacy requirements. There are also recent works on differential privacy of individuals when using DP-SGD, but they are mostly about individual privacy accounting and do not focus on satisfying different privacy levels. We thus extend DP-SGD to support a recent privacy notion called ( \Phi , \Delta )-Personalized Differential Privacy (( \Phi , \Delta )-PDP), which extends an existing PDP concept called \Phi -PDP. Our algorithm uses a multi-round personalized sampling mechanism and embeds it within the DP-SGD iterations. Experiments on real datasets show that our algorithm outperforms DP-SGD and simple combinations of DP-SGD with existing PDP mechanisms in terms of model performance and efficiency due to its embedded sampling mechanism.

ML-33-标题: Towards More Suitable Personalization in Federated Learning via Decentralized Partial Model Training

链接: https://arxiv.org/abs/2305.15157
作者: Yifan Shi, Yingqi Liu, Yan Sun, Zihao Lin, Li Shen, Xueqian Wang, Dacheng Tao
备注: 26 pages


Abstract: Personalized federated learning (PFL) aims to produce the greatest personalized model for each client to face an insurmountable problem–data heterogeneity in real FL systems. However, almost all existing works have to face large communication burdens and the risk of disruption if the central server fails. Only limited efforts have been used in a decentralized way but still suffers from inferior representation ability due to sharing the full model with its neighbors. Therefore, in this paper, we propose a personalized FL framework with a decentralized partial model training called DFedAlt. It personalizes the “right” components in the modern deep models by alternately updating the shared and personal parameters to train partially personalized models in a peer-to-peer manner. To further promote the shared parameters aggregation process, we propose DFedSalt integrating the local Sharpness Aware Minimization (SAM) optimizer to update the shared parameters. It adds proper perturbation in the direction of the gradient to overcome the shared model inconsistency across clients. Theoretically, we provide convergence analysis of both algorithms in the general non-convex setting for decentralized partial model training in PFL. Our experiments on several real-world data with various data partition settings demonstrate that (i) decentralized training is more suitable for partial personalization, which results in state-of-the-art (SOTA) accuracy compared with the SOTA PFL baselines; (ii) the shared parameters with proper perturbation make partial personalized FL more suitable for decentralized training, where DFedSalt achieves most competitive performance.

ML-34-标题: Momentum Provably Improves Error Feedback!

链接: https://arxiv.org/abs/2305.15155
作者: Ilyas Fatkhullin, Alexander Tyurin, Peter Richtárik


Abstract: Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when untreated, the errors caused by compression propagate, and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both theoretically, and in practice: the application of Polyak’s momentum to the latest incarnation of EF due to Richtárik et al. [2021] known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak’s momentum.

ML-35-标题: Theoretically Principled Federated Learning for Balancing Privacy and Utility

链接: https://arxiv.org/abs/2305.15148
作者: Xiaojin Zhang, Wenjie Li, Kai Chen, Shutao Xia, Qiang Yang


Abstract: We propose a general learning framework for the protection mechanisms that protects privacy via distorting model parameters, which facilitates the trade-off between privacy and utility. The algorithm is applicable to arbitrary privacy measurements that maps from the distortion to a real value. It can achieve personalized utility-privacy trade-off for each model parameter, on each client, at each communication round in federated learning. Such adaptive and fine-grained protection can improve the effectiveness of privacy-preserved federated learning. Theoretically, we show that gap between the utility loss of the protection hyperparameter output by our algorithm and that of the optimal protection hyperparameter is sub-linear in the total number of iterations. The sublinearity of our algorithm indicates that the average gap between the performance of our algorithm and that of the optimal performance goes to zero when the number of iterations goes to infinity. Further, we provide the convergence rate of our proposed algorithm. We conduct empirical results on benchmark datasets to verify that our method achieves better utility than the baseline methods under the same privacy budget.

ML-36-标题: From Tempered to Benign Overfitting in ReLU Neural Networks

链接: https://arxiv.org/abs/2305.15141
作者: Guy Kornowski, Gilad Yehudai, Ohad Shamir
备注: 43 pages


Abstract: Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on “benign overfitting”, where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as “tempered overfitting”, where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions. Thus, we show that the input dimension has a crucial role on the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand.

ML-37-标题: Beyond Individual Input for Deep Anomaly Detection on Tabular Data

链接: https://arxiv.org/abs/2305.15121
作者: Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan


Abstract: Anomaly detection is crucial in various domains, such as finance, healthcare, and cybersecurity. In this paper, we propose a novel deep anomaly detection method for tabular data that leverages Non-Parametric Transformers (NPTs), a model initially proposed for supervised tasks, to capture both feature-feature and sample-sample dependencies. In a reconstruction-based framework, we train the NPT model to reconstruct masked features of normal samples. We use the model’s ability to reconstruct the masked features during inference to generate an anomaly score. To the best of our knowledge, our proposed method is the first to combine both feature-feature and sample-sample dependencies for anomaly detection on tabular datasets. We evaluate our method on an extensive benchmark of tabular datasets and demonstrate that our approach outperforms existing state-of-the-art methods based on both the F1-Score and AUROC. Moreover, our work opens up new research directions for exploring the potential of NPTs for other tasks on tabular data.

ML-38-标题: Fairness in Streaming Submodular Maximization over a Matroid Constraint ICML23

链接: https://arxiv.org/abs/2305.15118
作者: Marwa El Halabi, Federico Fusco, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski
备注: Accepted to ICML 23


Abstract: Streaming submodular maximization is a natural model for the task of selecting a representative subset from a large-scale dataset. If datapoints have sensitive attributes such as gender or race, it becomes important to enforce fairness to avoid bias and discrimination. This has spurred significant interest in developing fair machine learning algorithms. Recently, such algorithms have been developed for monotone submodular maximization under a cardinality constraint. In this paper, we study the natural generalization of this problem to a matroid constraint. We give streaming algorithms as well as impossibility results that provide trade-offs between efficiency, quality and fairness. We validate our findings empirically on a range of well-known real-world applications: exemplar-based clustering, movie recommendation, and maximum coverage in social networks.

ML-39-标题: FedZero: Leveraging Renewable Excess Energy in Federated Learning

链接: https://arxiv.org/abs/2305.15092
作者: Philipp Wiesner, Ramin Khalili, Dennis Grinwald, Pratik Agrawal, Lauritz Thamsen, Odej Kao


Abstract: Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. Although the scheduling of workloads based on the availability of low-carbon energy has received considerable attention in recent years, it has not yet been investigated in the context of FL. However, FL is a highly promising use case for carbon-aware computing, as training jobs constitute of energy-intensive batch processes scheduled in geo-distributed environments. We propose FedZero, a FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure to effectively reduce the training’s operational carbon emissions to zero. Based on energy and load forecasts, FedZero leverages the spatio-temporal availability of excess energy by cherry-picking clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges considerably faster under the mentioned constraints than state-of-the-art approaches, is highly scalable, and is robust against forecasting errors.

ML-40-标题: Test like you Train in Implicit Deep Learning

链接: https://arxiv.org/abs/2305.15042
作者: Zaccharie Ramzi, Pierre Ablin, Gabriel Peyré, Thomas Moreau


Abstract: Implicit deep learning has recently gained popularity with applications ranging from meta-learning to Deep Equilibrium Networks (DEQs). In its general formulation, it relies on expressing some components of deep learning pipelines implicitly, typically via a root equation called the inner problem. In practice, the solution of the inner problem is approximated during training with an iterative procedure, usually with a fixed number of inner iterations. During inference, the inner problem needs to be solved with new data. A popular belief is that increasing the number of inner iterations compared to the one used during training yields better performance. In this paper, we question such an assumption and provide a detailed theoretical analysis in a simple setting. We demonstrate that overparametrization plays a key role: increasing the number of iterations at test time cannot improve performance for overparametrized networks. We validate our theory on an array of implicit deep-learning problems. DEQs, which are typically overparametrized, do not benefit from increasing the number of iterations at inference while meta-learning, which is typically not overparametrized, benefits from it.

ML-41-标题: An Unsupervised Method for Estimating Class Separability of Dataset s with Application to LLMs Fine-Tuning

链接: https://arxiv.org/abs/2305.15016
作者: Najah Ghalyan, Kostis Gourgoulias, Yash Satsangi, Sean Moran, Maxime Labonne, Joseph Sabelja


Abstract: This paper proposes an unsupervised method that leverages topological characteristics of data manifolds to estimate class separability of the data without requiring labels. Experiments conducted in this paper on several datasets demonstrate a clear correlation and consistency between the class separability estimated by the proposed method with supervised metrics like Fisher Discriminant Ratio~(FDR) and cross-validation of a classifier, which both require labels. This can enable implementing learning paradigms aimed at learning from both labeled and unlabeled data, like semi-supervised and transductive learning. This would be particularly useful when we have limited labeled data and a relatively large unlabeled dataset that can be used to enhance the learning process. The proposed method is implemented for language model fine-tuning with automated stopping criterion by monitoring class separability of the embedding-space manifold in an unsupervised setting. The proposed methodology has been first validated on synthetic data, where the results show a clear consistency between class separability estimated by the proposed method and class separability computed by FDR. The method has been also implemented on both public and internal data. The results show that the proposed method can effectively aid – without the need for labels – a decision on when to stop or continue the fine-tuning of a language model and which fine-tuning iteration is expected to achieve a maximum classification performance through quantification of the class separability of the embedding manifold.

ML-42-标题: Local SGD Accelerates Convergence by Exploiting Second Order Information of the Loss Function

链接: https://arxiv.org/abs/2305.15013
作者: Linxuan Pan, Shenghui Song


Abstract: With multiple iterations of updates, local statistical gradient descent (L-SGD) has been proven to be very effective in distributed machine learning schemes such as federated learning. In fact, many innovative works have shown that L-SGD with independent and identically distributed (IID) data can even outperform SGD. As a result, extensive efforts have been made to unveil the power of L-SGD. However, existing analysis failed to explain why the multiple local updates with small mini-batches of data (L-SGD) can not be replaced by the update with one big batch of data and a larger learning rate (SGD). In this paper, we offer a new perspective to understand the strength of L-SGD. We theoretically prove that, with IID data, L-SGD can effectively explore the second order information of the loss function. In particular, compared with SGD, the updates of L-SGD have much larger projection on the eigenvectors of the Hessian matrix with small eigenvalues, which leads to faster convergence. Under certain conditions, L-SGD can even approach the Newton method. Experiment results over two popular datasets validate the theoretical results.

ML-43-标题: Adversarial robustness of amortized Bayesian inference

链接: https://arxiv.org/abs/2305.14984
作者: Manuel Glöckler, Michael Deistler, Jakob H. Macke


Abstract: Bayesian inference usually requires running potentially costly inference procedures separately for every new observation. In contrast, the idea of amortized Bayesian inference is to initially invest computational cost in training an inference network on simulated data, which can subsequently be used to rapidly perform inference (i.e., to return estimates of posterior distributions) for new observations. This approach has been applied to many real-world models in the sciences and engineering, but it is unclear how robust the approach is to adversarial perturbations in the observed data. Here, we study the adversarial robustness of amortized Bayesian inference, focusing on simulation-based estimation of multi-dimensional posterior distributions. We show that almost unrecognizable, targeted perturbations of the observations can lead to drastic changes in the predicted posterior and highly unrealistic posterior predictive samples, across several benchmark tasks and a real-world example from neuroscience. We propose a computationally efficient regularization scheme based on penalizing the Fisher information of the conditional density estimator, and show how it improves the adversarial robustness of amortized Bayesian inference.

ML-44-标题: Probabilistic Exponential Integrators

链接: https://arxiv.org/abs/2305.14978
作者: Nathanael Bosch, Philipp Hennig, Filip Tronarp


Abstract: Probabilistic solvers provide a flexible and efficient framework for simulation, uncertainty quantification, and inference in dynamical systems. However, like standard solvers, they suffer performance penalties for certain stiff systems, where small steps are required not for reasons of numerical accuracy but for the sake of stability. This issue is greatly alleviated in semi-linear problems by the probabilistic exponential integrators developed in this paper. By including the fast, linear dynamics in the prior, we arrive at a class of probabilistic integrators with favorable properties. Namely, they are proven to be L-stable, and in a certain case reduce to a classic exponential integrator – with the added benefit of providing a probabilistic account of the numerical error. The method is also generalized to arbitrary non-linear systems by imposing piece-wise semi-linearity on the prior via Jacobians of the vector field at the previous estimates, resulting in probabilistic exponential Rosenbrock methods. We evaluate the proposed methods on multiple stiff differential equations and demonstrate their improved stability and efficiency over established probabilistic solvers. The present contribution thus expands the range of problems that can be effectively tackled within probabilistic numerics.

ML-45-标题: Block-local learning with probabilistic latent representations

链接: https://arxiv.org/abs/2305.14974
作者: David Kappel, Khaleelulla Khan Nazeer, Cabrel Teguemne Fokam, Christian Mayr, Anand Subramoney


Abstract: The ubiquitous backpropagation algorithm requires sequential updates across blocks of a network, introducing a locking problem. Moreover, backpropagation relies on the transpose of weight matrices to calculate updates, introducing a weight transport problem across blocks. Both these issues prevent efficient parallelisation and horizontal scaling of models across devices. We propose a new method that introduces a twin network that propagates information backwards from the targets to the input to provide auxiliary local losses. Forward and backward propagation can work in parallel and with different sets of weights, addressing the problems of weight transport and locking. Our approach derives from a statistical interpretation of end-to-end training which treats activations of network layers as parameters of probability distributions. The resulting learning framework uses these parameters locally to assess the matching between forward and backward information. Error backpropagation is then performed locally within each block, leading to `block-local’ learning. Several previously proposed alternatives to error backpropagation emerge as special cases of our model. We present results on various tasks and architectures, including transformers, demonstrating state-of-the-art performance using block-local learning. These results provide a new principled framework to train very large networks in a distributed setting and can also be applied in neuromorphic systems.

ML-46-标题: Focus Your Attention (with Adaptive IIR Filters)

链接: https://arxiv.org/abs/2305.14952
作者: Shahar Lutati, Itamar Zimerman, Lior Wolf
备注: 11 pages, 4 figures


Abstract: We present a new layer in which dynamic (i.e.,input-dependent) Infinite Impulse Response (IIR) filters of order two are used to process the input sequence prior to applying conventional attention. The input is split into chunks, and the coefficients of these filters are determined based on previous chunks to maintain causality. Despite their relatively low order, the causal adaptive filters are shown to focus attention on the relevant sequence elements. The layer performs on-par with state of the art networks, with a fraction of the parameters and with time complexity that is sub-quadratic with input size. The obtained layer is favorable to layers such as Heyna, GPT2, and Mega, both with respect to the number of parameters and the obtained level of performance on multiple long-range sequence problems.

ML-47-标题: SVDinsTN: An Integrated Method for Tensor Network Representation with Efficient Structure Search

链接: https://arxiv.org/abs/2305.14912
作者: Yu-Bang Zheng, Xi-Le Zhao, Junhua Zeng, Chao Li, Qibin Zhao, Heng-Chao Li, Ting-Zhu Huang


Abstract: Tensor network (TN) representation is a powerful technique for data analysis and machine learning. It practically involves a challenging TN structure search (TN-SS) problem, which aims to search for the optimal structure to achieve a compact representation. Existing TN-SS methods mainly adopt a bi-level optimization method that leads to excessive computational costs due to repeated structure evaluations. To address this issue, we propose an efficient integrated (single-level) method named SVD-inspired TN decomposition (SVDinsTN), eliminating the need for repeated tedious structure evaluation. By inserting a diagonal factor for each edge of the fully-connected TN, we calculate TN cores and diagonal factors simultaneously, with factor sparsity revealing the most compact TN structure. Experimental results on real-world data demonstrate that SVDinsTN achieves approximately 10^2\sim10^3 times acceleration in runtime compared to the existing TN-SS methods while maintaining a comparable level of representation ability.

ML-48-标题: Reconstructive Neuron Pruning for Backdoor Defense ICML23

链接: https://arxiv.org/abs/2305.14876
作者: Yige Li, Xixiang Lyu, Xingjun Ma, Nodens Koren, Lingjuan Lyu, Bo Li, Yu-Gang Jiang
备注: Accepted by ICML23


Abstract: Deep neural networks (DNNs) have been found to be vulnerable to backdoor attacks, raising security concerns about their deployment in mission-critical applications. While existing defense methods have demonstrated promising results, it is still not clear how to effectively remove backdoor-associated neurons in backdoored DNNs. In this paper, we propose a novel defense called \emphReconstructive Neuron Pruning (RNP) to expose and prune backdoor neurons via an unlearning and then recovering process. Specifically, RNP first unlearns the neurons by maximizing the model’s error on a small subset of clean samples and then recovers the neurons by minimizing the model’s error on the same data. In RNP, unlearning is operated at the neuron level while recovering is operated at the filter level, forming an asymmetric reconstructive learning procedure. We show that such an asymmetric process on only a few clean samples can effectively expose and prune the backdoor neurons implanted by a wide range of attacks, achieving a new state-of-the-art defense performance. Moreover, the unlearned model at the intermediate step of our RNP can be directly used to improve other backdoor defense tasks including backdoor removal, trigger recovery, backdoor label detection, and backdoor sample detection. Code is available at \urlthis https URL.

ML-49-标题: Timeseries-aware Uncertainty Wrappers for Uncertainty Quantification of Information-Fusion-Enhanced AI Models based on Machine Learning DSN

链接: https://arxiv.org/abs/2305.14872
作者: Janek Groß, Michael Kläs, Lisa Jöckel, Pascal Gerber
备注: 8 pages, 7 figures, VERDI workshop collocated with the DSN conference 2023


Abstract: As the use of Artificial Intelligence (AI) components in cyber-physical systems is becoming more common, the need for reliable system architectures arises. While data-driven models excel at perception tasks, model outcomes are usually not dependable enough for safety-critical applications. In this work,we present a timeseries-aware uncertainty wrapper for dependable uncertainty estimates on timeseries data. The uncertainty wrapper is applied in combination with information fusion over successive model predictions in time. The application of the uncertainty wrapper is demonstrated with a traffic sign recognition use case. We show that it is possible to increase model accuracy through information fusion and additionally increase the quality of uncertainty estimates through timeseries-aware input quality features.

ML-50-标题: Pre-RMSNorm and Pre-CRMSNorm Transformer s: Equivalent and Efficient Pre-LN Transformer s

链接: https://arxiv.org/abs/2305.14858
作者: Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, David Z. Pan
备注: 15 pages, 5 tables, code available at this https URL


Abstract: Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers. While LayerNorm recenters and rescales input vectors, RMSNorm only rescales the vectors by their RMS value. Despite being more computationally efficient, RMSNorm may compromise the representation ability of Transformers. There is currently no consensus regarding the preferred normalization technique, as some models employ LayerNorm while others utilize RMSNorm, especially in recent large language models. It is challenging to convert Transformers with one normalization to the other type. While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherent redundant mean information in the main branch of Pre-LN Transformers, we can reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose the Compressed RMSNorm (CRMSNorm) and Pre-CRMSNorm Transformer based on a lossless compression of the zero-mean vectors. We formally establish the equivalence of Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in both training and inference. It implies that Pre-LN Transformers can be substituted with Pre-©RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by up to 10%.

ML-51-标题: SWAMP: Sparse Weight Averaging with Multiple Particles for Iterative Magnitude Pruning

链接: https://arxiv.org/abs/2305.14852
作者: Moonseok Choi, Hyungi Lee, Giung Nam, Juho Lee


Abstract: Given the ever-increasing size of modern neural networks, the significance of sparse architectures has surged due to their accelerated inference speeds and minimal memory demands. When it comes to global pruning techniques, Iterative Magnitude Pruning (IMP) still stands as a state-of-the-art algorithm despite its simple nature, particularly in extremely sparse regimes. In light of the recent finding that the two successive matching IMP solutions are linearly connected without a loss barrier, we propose Sparse Weight Averaging with Multiple Particles (SWAMP), a straightforward modification of IMP that achieves performance comparable to an ensemble of two IMP solutions. For every iteration, we concurrently train multiple sparse models, referred to as particles, using different batch orders yet the same matching ticket, and then weight average such models to produce a single mask. We demonstrate that our method consistently outperforms existing baselines across different sparsities through extensive experiments on various data and neural network structures.

ML-52-标题: Building Transportation Foundation Model via Generative Graph Transformer

链接: https://arxiv.org/abs/2305.14826
作者: Xuhong Wang, Ding Wang, Liang Chen, Yilun Lin


Abstract: Efficient traffic management is crucial for maintaining urban mobility, especially in densely populated areas where congestion, accidents, and delays can lead to frustrating and expensive commutes. However, existing prediction methods face challenges in terms of optimizing a single objective and understanding the complex composition of the transportation system. Moreover, they lack the ability to understand the macroscopic system and cannot efficiently utilize big data. In this paper, we propose a novel approach, Transportation Foundation Model (TFM), which integrates the principles of traffic simulation into traffic prediction. TFM uses graph structures and dynamic graph generation algorithms to capture the participatory behavior and interaction of transportation system actors. This data-driven and model-free simulation method addresses the challenges faced by traditional systems in terms of structural complexity and model accuracy and provides a foundation for solving complex transportation problems with real data. The proposed approach shows promising results in accurately predicting traffic outcomes in an urban transportation setting.

ML-53-标题: Can Copyright be Reduced to Privacy?

链接: https://arxiv.org/abs/2305.14822
作者: Niva Elkin-Koren, Uri Hacohen, Roi Livni, Shay Moran


Abstract: There is an increasing concern that generative AI models may produce outputs that are remarkably similar to the copyrighted input content on which they are trained. This worry has escalated as the quality and complexity of generative models have immensely improved, and the availability of large datasets containing copyrighted material has increased. Researchers are actively exploring strategies to mitigate the risk of producing infringing samples, and a recent line of work suggests to employ techniques such as differential privacy and other forms of algorithmic stability to safeguard copyrighted content. In this work, we examine the question whether algorithmic stability techniques such as differential privacy are suitable to ensure the responsible use of generative models without inadvertently violating copyright laws. We argue that there are fundamental differences between privacy and copyright that should not be overlooked. In particular we highlight that although algorithmic stability may be perceived as a practical tool to detect copying, it does not necessarily equate to copyright protection. Therefore, if it is adopted as standard for copyright infringement, it may undermine copyright law intended purposes.

ML-54-标题: Provable Offline Reinforcement Learning with Human Feedback

链接: https://arxiv.org/abs/2305.14816
作者: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun


Abstract: In this paper, we investigate the problem of offline reinforcement learning with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.

ML-55-标题: What functions can Graph Neural Networks compute on random graphs? The role of Positional Encoding

链接: https://arxiv.org/abs/2305.14814
作者: Nicolas Keriven (CNRS, IRISA), Samuel Vaiter (CNRS, LJAD)


Abstract: We aim to deepen the theoretical understanding of Graph Neural Networks (GNNs) on large graphs, with a focus on their expressive power. Existing analyses relate this notion to the graph isomorphism problem, which is mostly relevant for graphs of small sizes, or studied graph classification or regression tasks, while prediction tasks on nodes are far more relevant on large graphs. Recently, several works showed that, on very general random graphs models, GNNs converge to certains functions as the number of nodes grows. In this paper, we provide a more complete and intuitive description of the function space generated by equivariant GNNs for node-tasks, through general notions of convergence that encompass several previous examples. We emphasize the role of input node features, and study the impact of node Positional Encodings (PEs), a recent line of work that has been shown to yield state-of-the-art results in practice. Through the study of several examples of PEs on large random graphs, we extend previously known universality results to significantly more general models. Our theoretical results hint at some normalization tricks, which is shown numerically to have a positive impact on GNN generalization on synthetic and real data. Our proofs contain new concentration inequalities of independent interest.

ML-56-标题: Zero-shot Task Preference Addressing Enabled by Imprecise Bayesian Continual Learning

链接: https://arxiv.org/abs/2305.14782
作者: Pengyuan Lu, Michele Caprio, Eric Eaton, Insup Lee


Abstract: Like generic multi-task learning, continual learning has the nature of multi-objective optimization, and therefore faces a trade-off between the performance of different tasks. That is, to optimize for the current task distribution, it may need to compromise performance on some tasks to improve on others. This means there exist multiple models that are each optimal at different times, each addressing a distinct task-performance trade-off. Researchers have discussed how to train particular models to address specific preferences on these trade-offs. However, existing algorithms require additional sample overheads – a large burden when there are multiple, possibly infinitely many, preferences. As a response, we propose Imprecise Bayesian Continual Learning (IBCL). Upon a new task, IBCL (1) updates a knowledge base in the form of a convex hull of model parameter distributions and (2) obtains particular models to address preferences with zero-shot. That is, IBCL does not require any additional training overhead to construct preference-addressing models from its knowledge base. We show that models obtained by IBCL have guarantees in identifying the preferred parameters. Moreover, experiments show that IBCL is able to locate the Pareto set of parameters given a preference, maintain similar to better performance than baseline methods, and significantly reduce training overhead via zero-shot preference addressing.

ML-57-标题: A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification

链接: https://arxiv.org/abs/2305.14752
作者: Yiannis Charalambous, Norbert Tihanyi, Ridhi Jain, Youcheng Sun, Mohamed Amine Ferrag, Lucas C. Cordeiro


Abstract: In this paper we present a novel solution that combines the capabilities of Large Language Models (LLMs) with Formal Verification strategies to verify and automatically repair software vulnerabilities. Initially, we employ Bounded Model Checking (BMC) to locate the software vulnerability and derive a counterexample. The counterexample provides evidence that the system behaves incorrectly or contains a vulnerability. The counterexample that has been detected, along with the source code, are provided to the LLM engine. Our approach involves establishing a specialized prompt language for conducting code debugging and generation to understand the vulnerability’s root cause and repair the code. Finally, we use BMC to verify the corrected version of the code generated by the LLM. As a proof of concept, we create ESBMC-AI based on the Efficient SMT-based Context-Bounded Model Checker (ESBMC) and a pre-trained Transformer model, specifically gpt-3.5-turbo, to detect and fix errors in C programs. Our experimentation involved generating a dataset comprising 1000 C code samples, each consisting of 20 to 50 lines of code. Notably, our proposed method achieved an impressive success rate of up to 80% in repairing vulnerable code encompassing buffer overflow and pointer dereference failures. We assert that this automated approach can effectively incorporate into the software development lifecycle’s continuous integration and deployment (CI/CD) process.

ML-58-标题: Multi-State RNA Design with Geometric Multi-Graph Neural Networks

链接: https://arxiv.org/abs/2305.14749
作者: Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, Pietro Liò


Abstract: Computational RNA design has broad applications across synthetic biology and therapeutic development. Fundamental to the diverse biological functions of RNA is its conformational flexibility, enabling single sequences to adopt a variety of distinct 3D states. Currently, computational biomolecule design tasks are often posed as inverse problems, where sequences are designed based on adopting a single desired structural conformation. In this work, we propose gRNAde, a geometric RNA design pipeline that operates on sets of 3D RNA backbone structures to explicitly account for and reflect RNA conformational diversity in its designs. We demonstrate the utility of gRNAde for improving native sequence recovery over single-state approaches on a new large-scale 3D RNA design dataset, especially for multi-state and structurally diverse RNAs. Our code is available at this https URL

ML-59-标题: Applications of Machine Learning in Detecting Afghan Fake Banknotes

链接: https://arxiv.org/abs/2305.14745
作者: Hamida Ashna, Ziaullah Momand


Abstract: Fake currency, unauthorized imitation money lacking government approval, constitutes a form of fraud. Particularly in Afghanistan, the prevalence of fake currency poses significant challenges and detrimentally impacts the economy. While banks and commercial establishments employ authentication machines, the public lacks access to such systems, necessitating a program that can detect counterfeit banknotes accessible to all. This paper introduces a method using image processing to identify counterfeit Afghan banknotes by analyzing specific security features. Extracting first and second order statistical features from input images, the WEKA machine learning tool was employed to construct models and perform classification with Random Forest, PART, and Naïve Bayes algorithms. The Random Forest algorithm achieved exceptional accuracy of 99% in detecting fake Afghan banknotes, indicating the efficacy of the proposed method as a solution for identifying counterfeit currency.

ML-60-标题: On the Generalization of Diffusion Model

链接: https://arxiv.org/abs/2305.14712
作者: Mingyang Yi, Jiacheng Sun, Zhenguo Li


Abstract: The diffusion probabilistic generative models are widely used to generate high-quality data. Though they can synthetic data that does not exist in the training set, the rationale behind such generalization is still unexplored. In this paper, we formally define the generalization of the generative model, which is measured by the mutual information between the generated data and the training set. The definition originates from the intuition that the model which generates data with less correlation to the training set exhibits better generalization ability. Meanwhile, we show that for the empirical optimal diffusion model, the data generated by a deterministic sampler are all highly related to the training set, thus poor generalization. This result contradicts the observation of the trained diffusion model’s (approximating empirical optima) extrapolation ability (generating unseen data). To understand this contradiction, we empirically verify the difference between the sufficiently trained diffusion model and the empirical optima. We found, though obtained through sufficient training, there still exists a slight difference between them, which is critical to making the diffusion model generalizable. Moreover, we propose another training objective whose empirical optimal solution has no potential generalization problem. We empirically show that the proposed training objective returns a similar model to the original one, which further verifies the generalization ability of the trained diffusion model.

ML-61-标题: Regret Matching: (In)Stability and Fast Convergence in Games

链接: https://arxiv.org/abs/2305.14709
作者: Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo


Abstract: Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its predictive version can be unstable, which might cause other players to suffer large regret. We then provide two fixes: restarting and chopping off the positive orthant that RM+ works in. We show that these fixes are sufficient to get O(T^1/4) individual regret and O(1) social regret in normal-form games via RM+ with predictions. We also apply our stabilizing techniques to clairvoyant updates in the uncoupled learning setting for RM+ and prove desirable results akin to recent works for Clairvoyant online mirror descent. Our experiments show the advantages of our algorithms over vanilla RM±based algorithms in matrix and extensive-form games.

ML-62-标题: PruMUX: Augmenting Data Multiplexing with Model Compression ACL2023

链接: https://arxiv.org/abs/2305.14706
作者: Yushan Su, Vishvak Murahari, Karthik Narasimhan, Kai Li
备注: Findings of the Association for Computational Linguistics (ACL 2023)


Abstract: As language models increase in size by the day, methods for efficient inference are critical to leveraging their capabilities for various applications. Prior work has investigated techniques like model pruning, knowledge distillation, and data multiplexing to increase model throughput without sacrificing accuracy. In this paper, we combine two such methods – structured pruning and data multiplexing – to compound the speedup gains obtained by either method. Our approach, PruMUX, obtains up to 7.5-29.5X throughput improvement over BERT-base model with accuracy threshold from 80% to 74%. We further study various combinations of parameters (such as sparsity and multiplexing factor) in the two techniques to provide a comprehensive analysis of the tradeoff between accuracy and throughput in the resulting models. We then propose Auto-PruMUX, a meta-level model that can predict the high-performance parameters for pruning and multiplexing given a desired accuracy loss budget, providing a practical method to leverage the combination effectively.

ML-63-标题: An Evaluation on Practical Batch Bayesian Sampling Algorithms for Online Adaptive Traffic Experimentation

链接: https://arxiv.org/abs/2305.14704
作者: Zezhong Zhang, Ted Yuan


Abstract: To speed up online testing, adaptive traffic experimentation through multi-armed bandit algorithms is rising as an essential complementary alternative to the fixed horizon A/B testing. Based on recent research on best arm identification and statistical inference with adaptively collected data, this paper derives and evaluates four Bayesian batch bandit algorithms (NB-TS, WB-TS, NB-TTTS, WB-TTTS), which are combinations of two ways of weighting batches (Naive Batch and Weighted Batch) and two Bayesian sampling strategies (Thompson Sampling and Top-Two Thompson Sampling) to adaptively determine traffic allocation. These derived Bayesian sampling algorithms are practically based on summary batch statistics of a reward metric for pilot experiments, where one of the combination WB-TTTS in this paper seems to be newly discussed. The comprehensive evaluation on the four Bayesian sampling algorithms covers trustworthiness, sensitivity and regret of a testing methodology. Moreover, the evaluation includes 4 real-world eBay experiments and 40 reproducible synthetic experiments to reveal the learnings, which covers both stationary and non-stationary situations. Our evaluation reveals that, (a) There exist false positives inflation with equivalent best arms, while seldom discussed in literatures; (b) To control false positives, connections between convergence of posterior optimal probabilities and neutral posterior reshaping are discovered; © WB-TTTS shows competitive recall, higher precision, and robustness against non-stationary trend; (d) NB-TS outperforms on minimizing regret trials except on precision and robustness; (e) WB-TTTS is a promising alternative if regret of A/B Testing is affordable, otherwise NB-TS is still a powerful choice with regret consideration for pilot experiments.

ML-64-标题: Can Transformer s Learn to Solve Problems Recursively?

链接: https://arxiv.org/abs/2305.14699
作者: Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer


Abstract: Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification proofs through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the ``shortcut" algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.

ML-65-标题: Generalizing Importance Weighting to A Universal Solver for Distribution Shift Problems

链接: https://arxiv.org/abs/2305.14690
作者: Tongtong Fang, Nan Lu, Gang Niu, Masashi Sugiyama


Abstract: Distribution shift (DS) may have two levels: the distribution itself changes, and the support (i.e., the set where the probability density is non-zero) also changes. When considering the support change between the training and test distributions, there can be four cases: (i) they exactly match; (ii) the training support is wider (and thus covers the test support); (iii) the test support is wider; (iv) they partially overlap. Existing methods are good at cases (i) and (ii), while cases (iii) and (iv) are more common nowadays but still under-explored. In this paper, we generalize importance weighting (IW), a golden solver for cases (i) and (ii), to a universal solver for all cases. Specifically, we first investigate why IW may fail in cases (iii) and (iv); based on the findings, we propose generalized IW (GIW) that could handle cases (iii) and (iv) and would reduce to IW in cases (i) and (ii). In GIW, the test support is split into an in-training (IT) part and an out-of-training (OOT) part, and the expected risk is decomposed into a weighted classification term over the IT part and a standard classification term over the OOT part, which guarantees the risk consistency of GIW. Then, the implementation of GIW consists of three components: (a) the split of validation data is carried out by the one-class support vector machine, (b) the first term of the empirical risk can be handled by any IW algorithm given training data and IT validation data, and © the second term just involves OOT validation data. Experiments demonstrate that GIW is a universal solver for DS problems, outperforming IW methods in cases (iii) and (iv).

ML-66-标题: On progressive sharpening flat minima and generalisation

链接: https://arxiv.org/abs/2305.14683
作者: Lachlan Ewen MacDonald, Jack Valmadre, Simon Lucey


Abstract: We present a new approach to understanding the relationship between loss curvature and generalisation in deep learning. Specifically, we use existing empirical analyses of the spectrum of deep network loss Hessians to ground an ansatz tying together the loss Hessian and the input-output Jacobian of a deep neural network. We then prove a series of theoretical results which quantify the degree to which the input-output Jacobian of a model approximates its Lipschitz norm over a data distribution, and deduce a novel generalisation bound in terms of the empirical Jacobian. We use our ansatz, together with our theoretical results, to give a new account of the recently observed progressive sharpening phenomenon, as well as the generalisation properties of flat minima. Experimental evidence is provided to validate our claims.

ML-67-标题: Revenge of MLP in Sequential Recommendation

链接: https://arxiv.org/abs/2305.14675
作者: Yiheng Jiang, Yuanbo Xu


Abstract: Sequential recommendation models sequences of historical user-item interactive behaviors (or referred as token) to better infer dynamic preferences. Fueled by the improved neural network architectures such as RNN, CNN and Transformer, this field has enjoyed rapid performance boost in the past years. Recent progress on all-MLP models lights on an efficient method with less intensive computation, token-mixing MLP, to learn the transformation patterns among historical behaviors. However, due to the inherent fully-connection design that allows the unrestricted cross-token communication and ignores the chronological order, we find that directly applying token-mixing MLP into sequential recommendation leads to subpar performance. In this paper, we present a purely MLP-based sequential recommendation architecture TriMLP with a novel \underlineTriangular Mixer where the modified \underlineMLP endows tokens with ordered interactions. As the cross-token interaction in MLP is actually matrix multiplication, Triangular Mixer drops the lower-triangle neurons in the weight matrix and thus blocks the connections from future tokens, which prevents information leakage and improves prediction capability under the standard auto-regressive training fashion. To further model long and short-term preferences on fine-grained level, the mixer adopts a dual-branch structure based on the delicate MLP described above, namely global and local mixing, to separately capture the sequential long-range dependencies and local patterns. Empirical study on 9 different scale datasets (contain 50K\textasciitilde20M behaviors) of various benchmarks, including MovieLens, Amazon and Tenrec, demonstrates that TriMLP attains promising and stable accuracy/efficiency trade-off, i.e., averagely surpasses several state-of-the-art baselines by 5.32% and saves 8.44% inference time cost.

ML-68-标题: RSRM: Reinforcement Symbolic Regression Machine

链接: https://arxiv.org/abs/2305.14656
作者: Yilong Xu, Yang Liu, Hao Sun
备注: 18 pages


Abstract: In nature, the behaviors of many complex systems can be described by parsimonious math equations. Automatically distilling these equations from limited data is cast as a symbolic regression process which hitherto remains a grand challenge. Keen efforts in recent years have been placed on tackling this issue and demonstrated success in symbolic regression. However, there still exist bottlenecks that current methods struggle to break when the discrete search space tends toward infinity and especially when the underlying math formula is intricate. To this end, we propose a novel Reinforcement Symbolic Regression Machine (RSRM) that masters the capability of uncovering complex math equations from only scarce data. The RSRM model is composed of three key modules: (1) a Monte Carlo tree search (MCTS) agent that explores optimal math expression trees consisting of pre-defined math operators and variables, (2) a Double Q-learning block that helps reduce the feasible search space of MCTS via properly understanding the distribution of reward, and (3) a modulated sub-tree discovery block that heuristically learns and defines new math operators to improve representation ability of math expression trees. Biding of these modules yields the state-of-the-art performance of RSRM in symbolic regression as demonstrated by multiple sets of benchmark examples. The RSRM model shows clear superiority over several representative baseline models.

ML-69-标题: Learning Survival Distribution with Implicit Survival Function

链接: https://arxiv.org/abs/2305.14655
作者: Yu Ling, Weimin Tan, Bo Yan


Abstract: Survival analysis aims at modeling the relationship between covariates and event occurrence with some untracked (censored) samples. In implementation, existing methods model the survival distribution with strong assumptions or in a discrete time space for likelihood estimation with censorship, which leads to weak generalization. In this paper, we propose Implicit Survival Function (ISF) based on Implicit Neural Representation for survival distribution estimation without strong assumptions,and employ numerical integration to approximate the cumulative distribution function for prediction and optimization. Experimental results show that ISF outperforms the state-of-the-art methods in three public datasets and has robustness to the hyperparameter controlling estimation precision.

ML-70-标题: A Joint Time-frequency Domain Transformer for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2305.14649
作者: Yushu Chen, Shengzhuo Liu, Jinzhe Yang, Hao Jing, Wenlai Zhao, Guangwen Yang


Abstract: To enhance predicting performance while minimizing computational demands, this paper introduces a joint time-frequency domain Transformer (JTFT) for multivariate forecasting. The method exploits the sparsity of time series in the frequency domain using a small number of learnable frequencies to extract temporal dependencies effectively. Alongside the frequency domain representation, a fixed number of the most recent data points are directly encoded in the time domain, bolstering the learning of local relationships and mitigating the adverse effects of non-stationarity. JTFT achieves linear complexity since the length of the internal representation remains independent of the input sequence length. Additionally, a low-rank attention layer is proposed to efficiently capture cross-dimensional dependencies and prevent performance degradation due to the entanglement of temporal and channel-wise modeling. Experiments conducted on six real-world datasets demonstrate that JTFT outperforms state-of-the-art methods.

ML-71-标题: KARNet: Kalman Filter Augmented Recurrent Neural Network for Learning World Models in Autonomous Driving Tasks

链接: https://arxiv.org/abs/2305.14644
作者: Hemanth Manjunatha, Andrey Pak, Dimitar Filev, Panagiotis Tsiotras
备注: arXiv admin note: substantial text overlap with arXiv:2205.08712


Abstract: Autonomous driving has received a great deal of attention in the automotive industry and is often seen as the future of transportation. The development of autonomous driving technology has been greatly accelerated by the growth of end-to-end machine learning techniques that have been successfully used for perception, planning, and control tasks. An important aspect of autonomous driving planning is knowing how the environment evolves in the immediate future and taking appropriate actions. An autonomous driving system should effectively use the information collected from the various sensors to form an abstract representation of the world to maintain situational awareness. For this purpose, deep learning models can be used to learn compact latent representations from a stream of incoming data. However, most deep learning models are trained end-to-end and do not incorporate any prior knowledge (e.g., from physics) of the vehicle in the architecture. In this direction, many works have explored physics-infused neural network (PINN) architectures to infuse physics models during training. Inspired by this observation, we present a Kalman filter augmented recurrent neural network architecture to learn the latent representation of the traffic flow using front camera images only. We demonstrate the efficacy of the proposed model in both imitation and reinforcement learning settings using both simulated and real-world datasets. The results show that incorporating an explicit model of the vehicle (states estimated using Kalman filtering) in the end-to-end learning significantly increases performance.

ML-72-标题: Newton-Cotes Graph Neural Networks: On the Time Evolution of Dynamic Systems

链接: https://arxiv.org/abs/2305.14642
作者: Lingbing Guo, Weiqing Wang, Zhuo Chen, Ningyu Zhang, Zequn Sun, Yixuan Lai, Qiang Zhang, Huajun Chen
备注: Under review


Abstract: Reasoning system dynamics is one of the most important analytical approaches for many scientific studies. With the initial state of a system as input, the recent graph neural networks (GNNs)-based methods are capable of predicting the future state distant in time with high accuracy. Although these methods have diverse designs in modeling the coordinates and interacting forces of the system, we show that they actually share a common paradigm that learns the integration of the velocity over the interval between the initial and terminal coordinates. However, their integrand is constant w.r.t. time. Inspired by this observation, we propose a new approach to predict the integration based on several velocity estimations with Newton-Cotes formulas and prove its effectiveness theoretically. Extensive experiments on several benchmarks empirically demonstrate consistent and significant improvement compared with the state-of-the-art methods.

ML-73-标题: Graphy Analysis Using a GPU-based Parallel Algorithm: Quantum Clustering

链接: https://arxiv.org/abs/2305.14641
作者: Zhe Wang, ZhiJie He, Ding Liu


Abstract: The article introduces a new method for applying Quantum Clustering to graph structures. Quantum Clustering (QC) is a novel density-based unsupervised learning method that determines cluster centers by constructing a potential function. In this method, we use the Graph Gradient Descent algorithm to find the centers of clusters. GPU parallelization is utilized for computing potential values. We also conducted experiments on five widely used datasets and evaluated using four indicators. The results show superior performance of the method. Finally, we discuss the influence of \sigma on the experimental results.

ML-74-标题: Inverse Reinforcement Learning with the Average Reward Criterion

链接: https://arxiv.org/abs/2305.14608
作者: Feiyang Wu, Jingyang Ke, Anqi Wu


Abstract: We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion. The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent. Previous IRL methods assume that the expert is trained in a discounted environment, and the discount factor is known. This work alleviates this assumption by proposing an average-reward framework with efficient learning algorithms. We develop novel stochastic first-order methods to solve the IRL problem under the average-reward setting, which requires solving an Average-reward Markov Decision Process (AMDP) as a subproblem. To solve the subproblem, we develop a Stochastic Policy Mirror Descent (SPMD) method under general state and action spaces that needs \mathcalO(1/\varepsilon) steps of gradient computation. Equipped with SPMD, we propose the Inverse Policy Mirror Descent (IPMD) method for solving the IRL problem with a \mathcalO(1/\varepsilon^2) complexity. To the best of our knowledge, the aforementioned complexity results are new in IRL. Finally, we corroborate our analysis with numerical experiments using the MuJoCo benchmark and additional control tasks.

ML-75-标题: Operationalizing Counterfactual Metrics: Incentives Ranking and Information Asymmetry

链接: https://arxiv.org/abs/2305.14595
作者: Serena Wang, Stephen Bates, P. M. Aronow, Michael I. Jordan


Abstract: From the social sciences to machine learning, it has been well documented that metrics to be optimized are not always aligned with social welfare. In healthcare, Dranove et al. [12] showed that publishing surgery mortality metrics actually harmed the welfare of sicker patients by increasing provider selection behavior. Using a principal-agent model, we directly study the incentive misalignments that arise from such average treated outcome metrics, and show that the incentives driving treatment decisions would align with maximizing total patient welfare if the metrics (i) accounted for counterfactual untreated outcomes and (ii) considered total welfare instead of average welfare among treated patients. Operationalizing this, we show how counterfactual metrics can be modified to satisfy desirable properties when used for ranking. Extending to realistic settings when the providers observe more about patients than the regulatory agencies do, we bound the decay in performance by the degree of information asymmetry between the principal and the agent. In doing so, our model connects principal-agent information asymmetry with unobserved heterogeneity in causal inference.

ML-76-标题: torchgfn: A PyTorch GFlowNet library

链接: https://arxiv.org/abs/2305.14594
作者: Salem Lahlou, Joseph D. Viviano, Victor Schmidt


Abstract: The increasing popularity of generative flow networks (GFlowNets or GFNs) is accompanied with a proliferation of code sources. This hinders the implementation of new features, such as training losses, that can readily be compared to existing ones, on a set of common environments. In addition to slowing down research in the field of GFlowNets, different code bases use different conventions, that might be confusing for newcomers. torchgfn is a library built on top of PyTorch, that aims at addressing both problems. It provides user with a simple API for environments, and useful abstractions for samplers and losses. Multiple examples are provided, replicating published results. The code is available in this https URL.

ML-77-标题: Robust Explanations for Deep Neural Networks via Pseudo Neural Tangent Kernel Surrogate Models

链接: https://arxiv.org/abs/2305.14585
作者: Andrew Engel, Zhichao Wang, Natalie S. Frank, Ioana Dumitriu, Sutanay Choudhury, Anand Sarwate, Tony Chiang
备注: 9 pages, 4 figures, 3 tables


Abstract: One of the ways recent progress has been made on explainable AI has been via explain-by-example strategies, specifically, through data attribution tasks. The feature spaces used to attribute decisions to training data, however, have not been compared against one another as to whether they form a truly representative surrogate model of the neural network (NN). Here, we demonstrate the efficacy of surrogate linear feature spaces to neural networks through two means: (1) we establish that a normalized psuedo neural tangent kernel (pNTK) is more correlated to the neural network decision functions than embedding based and influence based alternatives in both computer vision and large language model architectures; (2) we show that the attributions created from the normalized pNTK more accurately select perturbed training data in a data poisoning attribution task than these alternatives. Based on these observations, we conclude that kernel linear models are effective surrogate models across multiple classification architectures and that pNTK-based kernels are the most appropriate surrogate feature space of all kernels studied.

ML-78-标题: Interpretation of Time-Series Deep Models: A Survey

链接: https://arxiv.org/abs/2305.14582
作者: Ziqi Zhao, Yucheng Shi, Shushan Wu, Fan Yang, Wenzhan Song, Ninghao Liu
备注: 18 pages, 3 figures, 1 table


Abstract: Deep learning models developed for time-series associated tasks have become more widely researched nowadays. However, due to the unintuitive nature of time-series data, the interpretability problem – where we understand what is under the hood of these models – becomes crucial. The advancement of similar studies in computer vision has given rise to many post-hoc methods, which can also shed light on how to explain time-series models. In this paper, we present a wide range of post-hoc interpretation methods for time-series models based on backpropagation, perturbation, and approximation. We also want to bring focus onto inherently interpretable models, a novel category of interpretation where human-understandable information is designed within the models. Furthermore, we introduce some common evaluation metrics used for the explanations, and propose several directions of future researches on the time-series interpretability problem. As a highlight, our work summarizes not only the well-established interpretation methods, but also a handful of fairly recent and under-developed techniques, which we hope to capture their essence and spark future endeavours to innovate and improvise.

ML-79-标题: GiPH: Generalizable Placement Learning for Adaptive Heterogeneous Computing

链接: https://arxiv.org/abs/2305.14562
作者: Yi Hu, Chaoran Zhang, Edward Andert, Harshul Singh, Aviral Shrivastava, James Laudon, Yanqi Zhou, Bob Iannucci, Carlee Joe-Wong
备注: to be published in Proceedings of Machine Learning and Systems 5 (MLSys 2023)


Abstract: Careful placement of a computational application within a target device cluster is critical for achieving low application completion time. The problem is challenging due to its NP-hardness and combinatorial nature. In recent years, learning-based approaches have been proposed to learn a placement policy that can be applied to unseen applications, motivated by the problem of placing a neural network across cloud servers. These approaches, however, generally assume the device cluster is fixed, which is not the case in mobile or edge computing settings, where heterogeneous devices move in and out of range for a particular application. We propose a new learning approach called GiPH, which learns policies that generalize to dynamic device clusters via 1) a novel graph representation gpNet that efficiently encodes the information needed for choosing a good placement, and 2) a scalable graph neural network (GNN) that learns a summary of the gpNet information. GiPH turns the placement problem into that of finding a sequence of placement improvements, learning a policy for selecting this sequence that scales to problems of arbitrary size. We evaluate GiPH with a wide range of task graphs and device clusters and show that our learned policy rapidly find good placements for new problem instances. GiPH finds placements with up to 30.5% lower completion times, searching up to 3X faster than other search-based placement policies.

ML-80-标题: Negative Feedback Training: A Novel Concept to Improve Robustness of NVCiM DNN Accelerators

链接: https://arxiv.org/abs/2305.14561
作者: Yifan Qin, Zheyu Yan, Wujie Wen, Xiaobo Sharon Hu, Yiyu Shi


Abstract: Compute-in-Memory (CiM) utilizing non-volatile memory (NVM) devices presents a highly promising and efficient approach for accelerating deep neural networks (DNNs). By concurrently storing network weights and performing matrix operations within the same crossbar structure, CiM accelerators offer DNN inference acceleration with minimal area requirements and exceptional energy efficiency. However, the stochasticity and intrinsic variations of NVM devices often lead to performance degradation, such as reduced classification accuracy, compared to expected outcomes. Although several methods have been proposed to mitigate device variation and enhance robustness, most of them rely on overall modulation and lack constraints on the training process. Drawing inspiration from the negative feedback mechanism, we introduce a novel training approach that uses a multi-exit mechanism as negative feedback to enhance the performance of DNN models in the presence of device variation. Our negative feedback training method surpasses state-of-the-art techniques by achieving an impressive improvement of up to 12.49% in addressing DNN robustness against device variation.

ML-81-标题: Sequence Modeling is a Robust Contender for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2305.14550
作者: Prajjwal Bhargava, Rohan Chitnis, Alborz Geramifard, Shagun Sodhani, Amy Zhang


Abstract: Offline reinforcement learning (RL) allows agents to learn effective, return-maximizing policies from a static dataset. Three major paradigms for offline RL are Q-Learning, Imitation Learning, and Sequence Modeling. A key open question is: which paradigm is preferred under what conditions? We study this question empirically by exploring the performance of representative algorithms – Conservative Q-Learning (CQL), Behavior Cloning (BC), and Decision Transformer (DT) – across the commonly used D4RL and Robomimic benchmarks. We design targeted experiments to understand their behavior concerning data suboptimality and task complexity. Our key findings are: (1) Sequence Modeling requires more data than Q-Learning to learn competitive policies but is more robust; (2) Sequence Modeling is a substantially better choice than both Q-Learning and Imitation Learning in sparse-reward and low-quality data settings; and (3) Sequence Modeling and Imitation Learning are preferable as task horizon increases, or when data is obtained from suboptimal human demonstrators. Based on the overall strength of Sequence Modeling, we also investigate architectural choices and scaling trends for DT on Atari and D4RL and make design recommendations. We find that scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.

ML-82-标题: Bulk-Switching Memristor-based Compute-In-Memory Module for Deep Neural Network Training

链接: https://arxiv.org/abs/2305.14547
作者: Yuting Wu, Qiwen Wang, Ziyu Wang, Xinxin Wang, Buvna Ayyagari, Siddarth Krishnan, Michael Chudzik, Wei D. Lu


Abstract: The need for deep neural network (DNN) models with higher performance and better functionality leads to the proliferation of very large models. Model training, however, requires intensive computation time and energy. Memristor-based compute-in-memory (CIM) modules can perform vector-matrix multiplication (VMM) in situ and in parallel, and have shown great promises in DNN inference applications. However, CIM-based model training faces challenges due to non-linear weight updates, device variations, and low-precision in analog computing circuits. In this work, we experimentally implement a mixed-precision training scheme to mitigate these effects using a bulk-switching memristor CIM module. Lowprecision CIM modules are used to accelerate the expensive VMM operations, with high precision weight updates accumulated in digital units. Memristor devices are only changed when the accumulated weight update value exceeds a pre-defined threshold. The proposed scheme is implemented with a system-on-chip (SoC) of fully integrated analog CIM modules and digital sub-systems, showing fast convergence of LeNet training to 97.73%. The efficacy of training larger models is evaluated using realistic hardware parameters and shows that that analog CIM modules can enable efficient mix-precision DNN training with accuracy comparable to full-precision software trained models. Additionally, models trained on chip are inherently robust to hardware variations, allowing direct mapping to CIM inference chips without additional re-training.

ML-83-标题: Disincentivizing Polarization in Social Networks

链接: https://arxiv.org/abs/2305.14537
作者: Christian Borgs, Jennifer Chayes, Christian Ikeokwu, Ellen Vitercik


Abstract: On social networks, algorithmic personalization drives users into filter bubbles where they rarely see content that deviates from their interests. We present a model for content curation and personalization that avoids filter bubbles, along with algorithmic guarantees and nearly matching lower bounds. In our model, the platform interacts with n users over T timesteps, choosing content for each user from k categories. The platform receives stochastic rewards as in a multi-arm bandit. To avoid filter bubbles, we draw on the intuition that if some users are shown some category of content, then all users should see at least a small amount of that content. We first analyze a naive formalization of this intuition and show it has unintended consequences: it leads to ``tyranny of the majority’’ with the burden of diversification borne disproportionately by those with minority interests. This leads us to our model which distributes this burden more equitably. We require that the probability any user is shown a particular type of content is at least \gamma times the average probability all users are shown that type of content. Full personalization corresponds to \gamma = 0 and complete homogenization corresponds to \gamma = 1 ; hence, \gamma encodes a hard cap on the level of personalization. We also analyze additional formulations where the platform can exceed its cap but pays a penalty proportional to its constraint violation. We provide algorithmic guarantees for optimizing recommendations subject to these constraints. These include nearly matching upper and lower bounds for the entire range of \gamma \in [0,1] showing that the reward of a multi-agent variant of UCB is nearly optimal. Using real-world preference data, we empirically verify that under our model, users share the burden of diversification with only minor utility loss under our constraints.

ML-84-标题: Uncertainty Quantification over Graph with Conformalized Graph Neural Networks

链接: https://arxiv.org/abs/2305.14535
作者: Kexin Huang, Ying Jin, Emmanuel Candes, Jure Leskovec


Abstract: Graph Neural Networks (GNNs) are powerful machine learning prediction models on graph-structured data. However, GNNs lack rigorous uncertainty estimates, limiting their reliable deployment in settings where the cost of errors is significant. We propose conformalized GNN (CF-GNN), extending conformal prediction (CP) to graph-based models for guaranteed uncertainty estimates. Given an entity in the graph, CF-GNN produces a prediction set/interval that provably contains the true label with pre-defined coverage probability (e.g. 90%). We establish a permutation invariance condition that enables the validity of CP on graph data and provide an exact characterization of the test-time coverage. Moreover, besides valid coverage, it is crucial to reduce the prediction set size/interval length for practical use. We observe a key connection between non-conformity scores and network structures, which motivates us to develop a topology-aware output correction model that learns to update the prediction and produces more efficient prediction sets/intervals. Extensive experiments show that CF-GNN achieves any pre-defined target marginal coverage while significantly reducing the prediction set/interval size by up to 74% over the baselines. It also empirically achieves satisfactory conditional coverage over various raw and network features.

ML-85-标题: Basis Function Encoding of Numerical Features in Factorization Machines for Improved Accuracy

链接: https://arxiv.org/abs/2305.14528
作者: Alex Shtoff, Elie Abboud, Rotem Stram, Oren Somekh


Abstract: Factorization machine (FM) variants are widely used for large scale real-time content recommendation systems, since they offer an excellent balance between model accuracy and low computational costs for training and inference. These systems are trained on tabular data with both numerical and categorical columns. Incorporating numerical columns poses a challenge, and they are typically incorporated using a scalar transformation or binning, which can be either learned or chosen a-priori. In this work, we provide a systematic and theoretically-justified way to incorporate numerical features into FM variants by encoding them into a vector of function values for a set of functions of one’s choice. We view factorization machines as approximators of segmentized functions, namely, functions from a field’s value to the real numbers, assuming the remaining fields are assigned some given constants, which we refer to as the segment. From this perspective, we show that our technique yields a model that learns segmentized functions of the numerical feature spanned by the set of functions of one’s choice, namely, the spanning coefficients vary between segments. Hence, to improve model accuracy we advocate the use of functions known to have strong approximation power, and offer the B-Spline basis due to its well-known approximation power, availability in software libraries, and efficiency. Our technique preserves fast training and inference, and requires only a small modification of the computational graph of an FM model. Therefore, it is easy to incorporate into an existing system to improve its performance. Finally, we back our claims with a set of experiments, including synthetic, performance evaluation on several data-sets, and an A/B test on a real online advertising system which shows improved performance.

ML-86-标题: CongFu: Conditional Graph Fusion for Drug Synergy Prediction

链接: https://arxiv.org/abs/2305.14517
作者: Oleksii Tsepa, Bohdan Naida, Bo Wang


Abstract: Drug synergy, characterized by the amplified combined effect of multiple drugs, presents a critical phenomenon for optimizing therapeutic outcomes. However, limited data on drug synergy, arising from the vast number of possible drug combinations and computational costs, motivate the need for predictive methods. In this work, we introduce CongFu, a novel Conditional Graph Fusion Layer, designed to predict drug synergy. CongFu employs an attention mechanism and a bottleneck to extract local graph contexts and conditionally fuse graph data within a global context. Its modular architecture enables flexible replacement of layer modules, including readouts and graph encoders, facilitating customization for diverse applications. To evaluate the performance of CongFu, we conduct comprehensive experiments on four datasets, encompassing three distinct setups for drug synergy prediction. Remarkably, CongFu achieves state-of-the-art results on 11 out of 12 benchmark datasets, demonstrating its ability to capture intricate patterns of drug synergy. Through extensive ablation studies, we validate the significance of individual layer components, affirming their contributions to overall predictive performance. By addressing the challenge of predicting drug synergy in untested drug pairs, CongFu opens new avenues for optimizing drug combinations and advancing personalized medicine.

ML-87-标题: Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

链接: https://arxiv.org/abs/2305.14516
作者: Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, Tushar Krishna


Abstract: Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g. MLPerf, play an essential role in enabling fair comparison across different software and hardware stacks especially once systems are fully designed and deployed. However, the pace of AI innovation demands a more agile methodology to benchmark creation and usage by simulators and emulators for future system co-design. We propose Chakra, an open graph schema for standardizing workload specification capturing key operations and dependencies, also known as Execution Trace (ET). In addition, we propose a complementary set of tools/capabilities to enable collection, generation, and adoption of Chakra ETs by a wide range of simulators, emulators, and benchmarks. For instance, we use generative AI models to learn latent statistical properties across thousands of Chakra ETs and use these models to synthesize Chakra ETs. These synthetic ETs can obfuscate key proprietary information and also target future what-if scenarios. As an example, we demonstrate an end-to-end proof-of-concept that converts PyTorch ETs to Chakra ETs and uses this to drive an open-source training system simulator (ASTRA-sim). Our end-goal is to build a vibrant industry-wide ecosystem of agile benchmarks and tools to drive future AI system co-design.

ML-88-标题: Knowledge Graphs Querying SIGMOD

链接: https://arxiv.org/abs/2305.14485
作者: Arijit Khan
备注: accepted at ACM SIGMOD Record 2023


Abstract: Knowledge graphs (KGs) such as DBpedia, Freebase, YAGO, Wikidata, and NELL were constructed to store large-scale, real-world facts as (subject, predicate, object) triples – that can also be modeled as a graph, where a node (a subject or an object) represents an entity with attributes, and a directed edge (a predicate) is a relationship between two entities. Querying KGs is critical in web search, question answering (QA), semantic search, personal assistants, fact checking, and recommendation. While significant progress has been made on KG construction and curation, thanks to deep learning recently we have seen a surge of research on KG querying and QA. The objectives of our survey are two-fold. First, research on KG querying has been conducted by several communities, such as databases, data mining, semantic web, machine learning, information retrieval, and natural language processing (NLP), with different focus and terminologies; and also in diverse topics ranging from graph databases, query languages, join algorithms, graph patterns matching, to more sophisticated KG embedding and natural language questions (NLQs). We aim at uniting different interdisciplinary topics and concepts that have been developed for KG querying. Second, many recent advances on KG and query embedding, multimodal KG, and KG-QA come from deep learning, IR, NLP, and computer vision domains. We identify important challenges of KG querying that received less attention by graph databases, and by the DB community in general, e.g., incomplete KG, semantic matching, multimodal data, and NLQs. We conclude by discussing interesting opportunities for the data management community, for instance, KG as a unified data model and vector-based query processing.

ML-89-标题: A Block-Coordinate Approach of Multi-level Optimization with an Application to Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2305.14477
作者: Serge Gratton, Valentin Mercier, Elisa Riccietti, Philippe L. Toint


Abstract: Multi-level methods are widely used for the solution of large-scale problems, because of their computational advantages and exploitation of the complementarity between the involved sub-problems. After a re-interpretation of multi-level methods from a block-coordinate point of view, we propose a multi-level algorithm for the solution of nonlinear optimization problems and analyze its evaluation complexity. We apply it to the solution of partial differential equations using physics-informed neural networks (PINNs) and show on a few test problems that the approach results in better solutions and significant computational savings

ML-90-标题: Fourier Neural Operators for Arbitrary Resolution Climate Data Downscaling ICLR2023

链接: https://arxiv.org/abs/2305.14452
作者: Qidong Yang, Alex Hernandez-Garcia, Paula Harder, Venkatesh Ramesh, Prasanna Sattegeri, Daniela Szwarcman, Campbell D. Watson, David Rolnick
备注: Presented at the ICLR 2023 workshop on “Tackling Climate Change with Machine Learning”


Abstract: Climate simulations are essential in guiding our understanding of climate change and responding to its effects. However, it is computationally expensive to resolve complex climate processes at high spatial resolution. As one way to speed up climate simulations, neural networks have been used to downscale climate variables from fast-running low-resolution simulations, but high-resolution training data are often unobtainable or scarce, greatly limiting accuracy. In this work, we propose a downscaling method based on the Fourier neural operator. It trains with data of a small upsampling factor and then can zero-shot downscale its input to arbitrary unseen high resolution. Evaluated both on ERA5 climate model data and on the Navier-Stokes equation solution data, our downscaling model significantly outperforms state-of-the-art convolutional and generative adversarial downscaling models, both in standard single-resolution downscaling and in zero-shot generalization to higher upsampling factors. Furthermore, we show that our method also outperforms state-of-the-art data-driven partial differential equation solvers on Navier-Stokes equations. Overall, our work bridges the gap between simulation of a physical process and interpolation of low-resolution output, showing that it is possible to combine both approaches and significantly improve upon each other.

ML-91-标题: Kernel Interpolation with Sparse Grids NEURIPS

链接: https://arxiv.org/abs/2305.14451
作者: Mohit Yadav, Daniel Sheldon, Cameron Musco
备注: Accepted at Neural Information Processing Systems (NeurIPS) 2022


Abstract: Structured kernel interpolation (SKI) accelerates Gaussian process (GP) inference by interpolating the kernel covariance function using a dense grid of inducing points, whose corresponding kernel matrix is highly structured and thus amenable to fast linear algebra. Unfortunately, SKI scales poorly in the dimension of the input points, since the dense grid size grows exponentially with the dimension. To mitigate this issue, we propose the use of sparse grids within the SKI framework. These grids enable accurate interpolation, but with a number of points growing more slowly with dimension. We contribute a novel nearly linear time matrix-vector multiplication algorithm for the sparse grid kernel matrix. Next, we describe how sparse grids can be combined with an efficient interpolation scheme based on simplices. With these changes, we demonstrate that SKI can be scaled to higher dimensions while maintaining accuracy.

ML-92-标题: Graph Meets LLM: A Novel Approach to Collaborative Filtering for Robust Conversational Understanding

链接: https://arxiv.org/abs/2305.14449
作者: Zheng Chen, Ziyan Jiang, Fan Yang


Abstract: Conversational AI systems (e.g. Alexa, Siri, Google Assistant, etc.) need to understand queries with defects to ensure robust conversational understanding and reduce user frictions. The defective queries are often induced by user ambiguities and mistakes, or errors in the automatic speech recognition (ASR) and natural language understanding (NLU). Personalized query rewriting (personalized QR) targets reducing defects in the torso and tail user query traffic, and it typically relies on an index of past successful user interactions with the conversational AI. This paper presents our “Collaborative Query Rewriting” approach that focuses on rewriting novel user interactions unseen in the user history. This approach builds a “user Feedback Interaction Graph” (FIG) consisting of historical user-entity interactions, and leverages multi-hop customer affinity to enrich each user’s index (i.e. the Collaborative User Index) that would help cover future unseen defective queries. To counteract the precision degradation from the enlarged index, we introduced additional transformer layers to the L1 retrieval model and added multi-hop affinity and guardrail features to the L2 re-ranking model. Given the production constraints of storage cost and runtime retrieval latency, managing the size of the Collaborative User Index is important. As the user index can be pre-computed, we explored using a Large Language Model (LLM) for multi-hop customer affinity retrieval on the Video/Music domains. In particular, this paper looked into the Dolly-V2 7B model. Given limited user index size, We found the user index derived from fine-tuned Dolly-V2 generation significantly enhanced coverage of unseen user interactions. Consequently, this boosted QR performance on unseen user interactions compared to the graph traversal based user index.

ML-93-标题: Deep Learning based Forecasting: a case study from the online fashion industry

链接: https://arxiv.org/abs/2305.14406
作者: Manuel Kunz, Stefan Birr, Mones Raslan, Lei Ma, Zhen Li, Adele Gouttes, Mateusz Koren, Tofigh Naghibi, Johannes Stephan, Mariia Bulycheva, Matthias Grzeschik, Armin Kekić, Michael Narodovitch, Kashif Rasul, Julian Sieber, Tim Januschowski


Abstract: Demand forecasting in the online fashion industry is particularly amendable to global, data-driven forecasting models because of the industry’s set of particular challenges. These include the volume of data, the irregularity, the high amount of turn-over in the catalog and the fixed inventory assumption. While standard deep learning forecasting approaches cater for many of these, the fixed inventory assumption requires a special treatment via controlling the relationship between price and demand closely. In this case study, we describe the data and our modelling approach for this forecasting problem in detail and present empirical results that highlight the effectiveness of our approach.

ML-94-标题: NeuralMatrix: Moving Entire Neural Networks to General Matrix Multiplication for Efficient Inference NEURIPS2023

链接: https://arxiv.org/abs/2305.14405
作者: Ruiqi Sun, Jie Zhao, Xin He, Yiran Li, An Zou
备注: 12 pages, 4 figures, Submitted to 37th Conference on Neural Information Processing Systems (NeurIPS 2023)


Abstract: In this study, we introduce NeuralMatrix, a novel framework that enables the computation of versatile deep neural networks (DNNs) on a single general matrix multiplication (GEMM) accelerator. The proposed approach overcomes the specificity limitations of ASIC-based accelerators while achieving application-specific acceleration levels compared to general-purpose processors such as CPUs and GPUs. We address the challenges of mapping both linear and nonlinear operations in DNN computation to general matrix multiplications and the impact of using a GEMM accelerator on DNN inference accuracy. Extensive experiments are conducted on various DNN models from three popular categories (i.e., CNN, Transformers, and GNN) as illustrative backbone models. Our results demonstrate that DNNs suffer only up to a 2.02% accuracy loss after being converted to general matrix multiplication, while achieving 113x to 19.44x improvements in throughput per power compared to CPUs and GPUs.

ML-95-标题: Improving Speech Emotion Recognition Performance using Differentiable Architecture Search

链接: https://arxiv.org/abs/2305.14402
作者: Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Björn Schuller


Abstract: Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interactions. Deep Learning (DL) has improved the performance of SER models by improving model complexity. However, designing DL architectures requires prior experience and experimental evaluations. Encouragingly, Neural Architecture Search (NAS) allows automatic search for an optimum DL model. In particular, Differentiable Architecture Search (DARTS) is an efficient method of using NAS to search for optimised models. In this paper, we propose DARTS for a joint CNN and LSTM architecture for improving SER performance. Our choice of the CNN LSTM coupling is inspired by results showing that similar models offer improved performance. While SER researchers have considered CNNs and RNNs separately, the viability of using DARTs jointly for CNN and LSTM still needs exploration. Experimenting with the IEMOCAP dataset, we demonstrate that our approach outperforms best-reported results using DARTS for SER.

ML-96-标题: Review ing Evolution of Learning Functions and Semantic Information Measures for Understanding Deep Learning

链接: https://arxiv.org/abs/2305.14397
作者: Chenguang Lu
备注: 34 pages, 9 figures. published in Entropy, 2023


Abstract: A new trend in deep learning, represented by Mutual Information Neural Estimation (MINE) and Information Noise Contrast Estimation (InfoNCE), is emerging. In this trend, similarity functions and Estimated Mutual Information (EMI) are used as learning and objective functions. Coincidentally, EMI is essentially the same as Semantic Mutual Information (SeMI) proposed by the author 30 years ago. This paper first reviews the evolutionary histories of semantic information measures and learning functions. Then, it briefly introduces the author’s semantic information G theory with the rate-fidelity function R(G) (G denotes SeMI, and R(G) extends R(D)) and its applications to multi-label learning, the maximum Mutual Information (MI) classification, and mixture models. Then it discusses how we should understand the relationship between SeMI and Shan-non’s MI, two generalized entropies (fuzzy entropy and coverage entropy), Autoencoders, Gibbs distributions, and partition functions from the perspective of the R(G) function or the G theory. An important conclusion is that mixture models and Restricted Boltzmann Machines converge because SeMI is maximized, and Shannon’s MI is minimized, making information efficiency G/R close to 1. A potential opportunity is to simplify deep learning by using Gaussian channel mixture models for pre-training deep neural networks’ latent layers without considering gradients. It also discusses how the SeMI measure is used as the reward function (reflecting purposiveness) for reinforcement learning. The G theory helps interpret deep learning but is far from enough. Combining semantic information theory and deep learning will accelerate their development.

ML-97-标题: FITNESS: A Causal De-correlation Approach for Mitigating Bias in Machine Learning Software

链接: https://arxiv.org/abs/2305.14396
作者: Ying Xiao, Shangwen Wang, Sicen Liu, Dingyuan Xue, Xian Zhan, Yepang Liu
备注: 12 pages, 7 figures and 6 tables


Abstract: Software built on top of machine learning algorithms is becoming increasingly prevalent in a variety of fields, including college admissions, healthcare, insurance, and justice. The effectiveness and efficiency of these systems heavily depend on the quality of the training datasets. Biased datasets can lead to unfair and potentially harmful outcomes, particularly in such critical decision-making systems where the allocation of resources may be affected. This can exacerbate discrimination against certain groups and cause significant social disruption. To mitigate such unfairness, a series of bias-mitigating methods are proposed. Generally, these studies improve the fairness of the trained models to a certain degree but with the expense of sacrificing the model performance. In this paper, we propose FITNESS, a bias mitigation approach via de-correlating the causal effects between sensitive features (e.g., the sex) and the label. Our key idea is that by de-correlating such effects from a causality perspective, the model would avoid making predictions based on sensitive features and thus fairness could be improved. Furthermore, FITNESS leverages multi-objective optimization to achieve a better performance-fairness trade-off. To evaluate the effectiveness, we compare FITNESS with 7 state-of-the-art methods in 8 benchmark tasks by multiple metrics. Results show that FITNESS can outperform the state-of-the-art methods on bias mitigation while preserve the model’s performance: it improved the model’s fairness under all the scenarios while decreased the model’s performance under only 26.67% of the scenarios. Additionally, FITNESS surpasses the Fairea Baseline in 96.72% cases, outperforming all methods we compared.

ML-98-标题: Unsupervised Spiking Neural Network Model of Prefrontal Cortex to study Task Switching with Synaptic deficiency

链接: https://arxiv.org/abs/2305.14394
作者: Ashwin Viswanathan Kannan, Goutam Mylavarapu, Johnson P Thomas


Abstract: In this study, we build a computational model of Prefrontal Cortex (PFC) using Spiking Neural Networks (SNN) to understand how neurons adapt and respond to tasks switched under short and longer duration of stimulus changes. We also explore behavioral deficits arising out of the PFC lesions by simulating lesioned states in our Spiking architecture model. Although there are some computational models of the PFC, SNN’s have not been used to model them. In this study, we use SNN’s having parameters close to biologically plausible values and train the model using unsupervised Spike Timing Dependent Plasticity (STDP) learning rule. Our model is based on connectionist architectures and exhibits neural phenomena like sustained activity which helps in generating short-term or working memory. We use these features to simulate lesions by deactivating synaptic pathways and record the weight adjustments of learned patterns and capture the accuracy of learning tasks in such conditions. All our experiments are trained and recorded using a real-world Fashion MNIST (FMNIST) dataset and through this work, we bridge the gap between bio-realistic models and those that perform well in pattern recognition tasks

ML-99-标题: A Rational Model of Dimension-reduced Human Categorization

链接: https://arxiv.org/abs/2305.14383
作者: Yifan Hong, Chen Wang


Abstract: Existing models in cognitive science typically assume human categorization as graded generalization behavior in a multidimensional psychological space. However, category representations in these models may suffer from the curse of dimensionality in a natural setting. People generally rely on a tractable yet sufficient set of features to understand the complex environment. We propose a rational model of categorization based on a hierarchical mixture of probabilistic principal components, that simultaneously learn category representations and an economical collection of features. The model captures dimensional biases in human categorization and supports zero-shot learning. We further exploit a generative process within a low-dimensional latent space to provide a better account of categorization with high-dimensional stimuli. We validate the model with simulation and behavioral experiments.

ML-100-标题: Unsupervised Discovery of Continuous Skills on a Sphere

链接: https://arxiv.org/abs/2305.14377
作者: Takahisa Imagawa, Takuya Hiraoka, Yoshimasa Tsuruoka
备注: 14 pages, 12 figures


Abstract: Recently, methods for learning diverse skills to generate various behaviors without external rewards have been actively studied as a form of unsupervised reinforcement learning. However, most of the existing methods learn a finite number of discrete skills, and thus the variety of behaviors that can be exhibited with the learned skills is limited. In this paper, we propose a novel method for learning potentially an infinite number of different skills, which is named discovery of continuous skills on a sphere (DISCS). In DISCS, skills are learned by maximizing mutual information between skills and states, and each skill corresponds to a continuous value on a sphere. Because the representations of skills in DISCS are continuous, infinitely diverse skills could be learned. We examine existing methods and DISCS in the MuJoCo Ant robot control environments and show that DISCS can learn much more diverse skills than the other methods.

ML-101-标题: Learning to Rank the Importance of Nodes in Road Networks Based on Multi-Graph Fusion

链接: https://arxiv.org/abs/2305.14375
作者: Ming Xu, Jing Zhang


Abstract: Identifying important nodes with strong propagation capabilities in road networks is a significant topic in the field of urban planning. However, existing methods for evaluating nodes importance consider only topological information and traffic volumes, ignoring the diversity of characteristics in road networks, such as the number of lanes and average speed of road segments, limiting their performance. To address this issue, this paper proposes a graph learning-based node ranking method (MGL2Rank) that integrates the rich characteristics of the road network. In this method, we first develop a sampling algorithm (MGWalk) that utilizes multi-graph fusion to establish association between road segments based on their attributes. Then, an embedding module is proposed to learn latent representation for each road segment. Finally, the obtained node representation is used to learn importance ranking of road segments. We conduct simulation experiments on the regional road network of Shenyang city and demonstrate the effectiveness of our proposed method. The data and source code of MGL2Rank are available at this https URL.

ML-102-标题: Inferring Attracting Basins of Power System with Machine Learning

链接: https://arxiv.org/abs/2305.14374
作者: Yao Du, Qing Li, Huawei Fan, Meng Zhan, Jinghua Xiao, Xingang Wang
备注: 13 pages, 7 figures


Abstract: Power systems dominated by renewable energy encounter frequently large, random disturbances, and a critical challenge faced in power-system management is how to anticipate accurately whether the perturbed systems will return to the functional state after the transient or collapse. Whereas model-based studies show that the key to addressing the challenge lies in the attracting basins of the functional and dysfunctional states in the phase space, the finding of the attracting basins for realistic power systems remains a challenge, as accurate models describing the system dynamics are generally unavailable. Here we propose a new machine learning technique, namely balanced reservoir computing, to infer the attracting basins of a typical power system based on measured data. Specifically, trained by the time series of a handful of perturbation events, we demonstrate that the trained machine can predict accurately whether the system will return to the functional state in response to a large, random perturbation, thereby reconstructing the attracting basin of the functional state. The working mechanism of the new machine is analyzed, and it is revealed that the success of the new machine is attributed to the good balance between the echo and fading properties of the reservoir network; the effect of noisy signals on the prediction performance is also investigated, and a stochastic-resonance-like phenomenon is observed. Finally, we demonstrate that the new technique can be also utilized to infer the attracting basins of coexisting attractors in typical chaotic systems.

ML-103-标题: An Ensemble Semi-Supervised Adaptive Resonance Theory Model with Explanation Capability for Pattern Classification

链接: https://arxiv.org/abs/2305.14373
作者: Farhad Pourpanah, Chee Peng Lim, Ali Etemad, Q. M. Jonathan Wu
备注: 13 pages, 8 figures


Abstract: Most semi-supervised learning (SSL) models entail complex structures and iterative training processes as well as face difficulties in interpreting their predictions to users. To address these issues, this paper proposes a new interpretable SSL model using the supervised and unsupervised Adaptive Resonance Theory (ART) family of networks, which is denoted as SSL-ART. Firstly, SSL-ART adopts an unsupervised fuzzy ART network to create a number of prototype nodes using unlabeled samples. Then, it leverages a supervised fuzzy ARTMAP structure to map the established prototype nodes to the target classes using labeled samples. Specifically, a one-to-many (OtM) mapping scheme is devised to associate a prototype node with more than one class label. The main advantages of SSL-ART include the capability of: (i) performing online learning, (ii) reducing the number of redundant prototype nodes through the OtM mapping scheme and minimizing the effects of noisy samples, and (iii) providing an explanation facility for users to interpret the predicted outcomes. In addition, a weighted voting strategy is introduced to form an ensemble SSL-ART model, which is denoted as WESSL-ART. Every ensemble member, i.e., SSL-ART, assigns \colorblacka different weight to each class based on its performance pertaining to the corresponding class. The aim is to mitigate the effects of training data sequences on all SSL-ART members and improve the overall performance of WESSL-ART. The experimental results on eighteen benchmark data sets, three artificially generated data sets, and a real-world case study indicate the benefits of the proposed SSL-ART and WESSL-ART models for tackling pattern classification problems.

ML-104-标题: Continually Learned Pavlovian Signalling Without Forgetting for Human-in-the-Loop Robotic Control NEURIPS

链接: https://arxiv.org/abs/2305.14365
作者: Adam S. R. Parker, Michael R. Dawson, Patrick M. Pilarski
备注: 12 pages inc. supplementary, 7 figures, 3 algorithms, Published the NeurIPS Workshop on Human in the Loop Learning, Nov 28 - Dec 8 2022


Abstract: Artificial limbs are sophisticated devices to assist people with tasks of daily living. Despite advanced robotic prostheses demonstrating similar motion capabilities to biological limbs, users report them difficult and non-intuitive to use. Providing more effective feedback from the device to the user has therefore become a topic of increased interest. In particular, prediction learning methods from the field of reinforcement learning – specifically, an approach termed Pavlovian signalling – have been proposed as one approach for better modulating feedback in prostheses since they can adapt during continuous use. One challenge identified in these learning methods is that they can forget previously learned predictions when a user begins to successfully act upon delivered feedback. The present work directly addresses this challenge, contributing new evidence on the impact of algorithmic choices, such as on- or off-policy methods and representation choices, on the Pavlovian signalling from a machine to a user during their control of a robotic arm. Two conditions of algorithmic differences were studied using different scenarios of controlling a robotic arm: an automated motion system and human participant piloting. Contrary to expectations, off-policy learning did not provide the expected solution to the forgetting problem. We instead identified beneficial properties of a look-ahead state representation that made existing approaches able to learn (and not forget) predictions in support of Pavlovian signalling. This work therefore contributes new insight into the challenges of providing learned predictive feedback from a prosthetic device, and demonstrates avenues for more dynamic signalling in future human-machine interactions.

ML-105-标题: On the robust learning mixtures of linear regressions

链接: https://arxiv.org/abs/2305.15317
作者: Ying Huang, Liang Chen


Abstract: In this note, we consider the problem of robust learning mixtures of linear regressions. We connect mixtures of linear regressions and mixtures of Gaussians with a simple thresholding, so that a quasi-polynomial time algorithm can be obtained under some mild separation condition. This algorithm has significantly better robustness than the previous result.

ML-106-标题: Error Feedback Shines when Features are Rare

链接: https://arxiv.org/abs/2305.15264
作者: Peter Richtárik, Elnur Gasanov, Konstantin Burlachenko


Abstract: We provide the first proof that gradient descent \left(\colorgreen\sf GD\right) with greedy sparsification \left(\colorgreen\sf TopK\right) and error feedback \left(\colorgreen\sf EF\right) can obtain better communication complexity than vanilla \colorgreen\sf GD when solving the distributed optimization problem \min_x\in \mathbbR^d f(x)=\frac1n\sum_i=1^n f_i(x) , where n = # of clients, d = # of features, and f_1,\dots,f_n are smooth nonconvex functions. Despite intensive research since 2014 when \colorgreen\sf EF was first proposed by Seide et al., this problem remained open until now. We show that \colorgreen\sf EF shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector \hatx such that \lVert \nabla f(\hatx) \rVert^2 \leq \varepsilon in expectation, \colorgreen\sf GD with the \colorgreen\sf Top1 sparsifier and \colorgreen\sf EF requires \cal O \left(\left( L+\colorbluer \sqrt \frac\colorredcn \min \left( \frac\colorredcn \max_i L_i^2, \frac1n\sum_i=1^n L_i^2 \right) \right) \frac1\varepsilon \right) bits to be communicated by each worker to the server only, where L is the smoothness constant of f , L_i is the smoothness constant of f_i , \colorredc is the maximal number of clients owning any feature ( 1\leq \colorredc \leq n ), and \colorbluer is the maximal number of features owned by any client ( 1\leq \colorbluer \leq d ). Clearly, the communication complexity improves as \colorredc decreases (i.e., as features become more rare), and can be much better than the \cal O(\colorbluer L \frac1\varepsilon) communication complexity of \colorgreen\sf GD in the same regime.

ML-107-标题: Attention to Mean-Fields for Particle Cloud Generation

链接: https://arxiv.org/abs/2305.15254
作者: Benno Käch, Isabell Melzer-Pellmann


Abstract: The generation of collider data using machine learning has emerged as a prominent research topic in particle physics due to the increasing computational challenges associated with traditional Monte Carlo simulation methods, particularly for future colliders with higher luminosity. Although generating particle clouds is analogous to generating point clouds, accurately modelling the complex correlations between the particles presents a considerable challenge. Additionally, variable particle cloud sizes further exacerbate these difficulties, necessitating more sophisticated models. In this work, we propose a novel model that utilizes an attention-based aggregation mechanism to address these challenges. The model is trained in an adversarial training paradigm, ensuring that both the generator and critic exhibit permutation equivariance/invariance with respect to their input. A novel feature matching loss in the critic is introduced to stabilize the training. The proposed model performs competitively to the state-of-art whilst having significantly fewer parameters.

ML-108-标题: Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation

链接: https://arxiv.org/abs/2305.15208
作者: Richard Gao, Michael Deistler, Jakob H. Macke


Abstract: Simulation-based inference (SBI) enables amortized Bayesian inference for simulators with implicit likelihoods. But when we are primarily interested in the quality of predictive simulations, or when the model cannot exactly reproduce the observed data (i.e., is misspecified), targeting the Bayesian posterior may be overly restrictive. Generalized Bayesian Inference (GBI) aims to robustify inference for (misspecified) simulator models, replacing the likelihood-function with a cost function that evaluates the goodness of parameters relative to data. However, GBI methods generally require running multiple simulations to estimate the cost function at each parameter value during inference, making the approach computationally infeasible for even moderately complex simulators. Here, we propose amortized cost estimation (ACE) for GBI to address this challenge: We train a neural network to approximate the cost function, which we define as the expected distance between simulations produced by a parameter and observed data. The trained network can then be used with MCMC to infer GBI posteriors for any observation without running additional simulations. We show that, on several benchmark tasks, ACE accurately predicts cost and provides predictive simulations that are closer to synthetic observations than other SBI methods, especially for misspecified simulators. Finally, we apply ACE to infer parameters of the Hodgkin-Huxley model given real intracellular recordings from the Allen Cell Types Database. ACE identifies better data-matching parameters while being an order of magnitude more simulation-efficient than a standard SBI method. In summary, ACE combines the strengths of SBI methods and GBI to perform robust and simulation-amortized inference for scientific simulators.

ML-109-标题: Explaining the Uncertain: Stochastic Shapley Values for Gaussian Process Models

链接: https://arxiv.org/abs/2305.15167
作者: Siu Lun Chau, Krikamol Muandet, Dino Sejdinovic
备注: 26 pages, 6 figures


Abstract: We present a novel approach for explaining Gaussian processes (GPs) that can utilize the full analytical covariance structure present in GPs. Our method is based on the popular solution concept of Shapley values extended to stochastic cooperative games, resulting in explanations that are random variables. The GP explanations generated using our approach satisfy similar favorable axioms to standard Shapley values and possess a tractable covariance function across features and data observations. This covariance allows for quantifying explanation uncertainties and studying the statistical dependencies between explanations. We further extend our framework to the problem of predictive explanation, and propose a Shapley prior over the explanation function to predict Shapley values for new data based on previously computed ones. Our extensive illustrations demonstrate the effectiveness of the proposed approach.

ML-110-标题: SyNDock: N Rigid Protein Docking via Learnable Group Synchronization

链接: https://arxiv.org/abs/2305.15156
作者: Yuanfeng Ji, Yatao Bian, Guoji Fu, Peilin Zhao, Ping Luo


Abstract: The regulation of various cellular processes heavily relies on the protein complexes within a living cell, necessitating a comprehensive understanding of their three-dimensional structures to elucidate the underlying mechanisms. While neural docking techniques have exhibited promising outcomes in binary protein docking, the application of advanced neural architectures to multimeric protein docking remains uncertain. This study introduces SyNDock, an automated framework that swiftly assembles precise multimeric complexes within seconds, showcasing performance that can potentially surpass or be on par with recent advanced approaches. SyNDock possesses several appealing advantages not present in previous approaches. Firstly, SyNDock formulates multimeric protein docking as a problem of learning global transformations to holistically depict the placement of chain units of a complex, enabling a learning-centric solution. Secondly, SyNDock proposes a trainable two-step SE(3) algorithm, involving initial pairwise transformation and confidence estimation, followed by global transformation synchronization. This enables effective learning for assembling the complex in a globally consistent manner. Lastly, extensive experiments conducted on our proposed benchmark dataset demonstrate that SyNDock outperforms existing docking software in crucial performance metrics, including accuracy and runtime. For instance, it achieves a 4.5% improvement in performance and a remarkable millionfold acceleration in speed.

ML-111-标题: MotifRetro: Exploring the Combinability-Consistency Trade-offs in retrosynthesis via Dynamic Motif Editing

链接: https://arxiv.org/abs/2305.15153
作者: Zhangyang Gao, Xingran Chen, Cheng Tan, Stan Z. Li


Abstract: Is there a unified framework for graph-based retrosynthesis prediction? Through analysis of full-, semi-, and non-template retrosynthesis methods, we discovered that they strive to strike an optimal balance between combinability and consistency: \textitShould atoms be combined as motifs to simplify the molecular editing process, or should motifs be broken down into atoms to reduce the vocabulary and improve predictive consistency? Recent works have studied several specific cases, while none of them explores different combinability-consistency trade-offs. Therefore, we propose MotifRetro, a dynamic motif editing framework for retrosynthesis prediction that can explore the entire trade-off space and unify graph-based models. MotifRetro comprises two components: RetroBPE, which controls the combinability-consistency trade-off, and a motif editing model, where we introduce a novel LG-EGAT module to dynamiclly add motifs to the molecule. We conduct extensive experiments on USPTO-50K to explore how the trade-off affects the model performance and finally achieve state-of-the-art performance.

ML-112-标题: Knowledge-Design: Pushing the Limit of Protein Deign via Knowledge Refinement

链接: https://arxiv.org/abs/2305.15151
作者: Zhangyang Gao, Cheng Tan, Stan Z. Li


Abstract: Recent studies have shown competitive performance in protein design that aims to find the amino acid sequence folding into the desired structure. However, most of them disregard the importance of predictive confidence, fail to cover the vast protein space, and do not incorporate common protein knowledge. After witnessing the great success of pretrained models on diverse protein-related tasks and the fact that recovery is highly correlated with confidence, we wonder whether this knowledge can push the limits of protein design further. As a solution, we propose a knowledge-aware module that refines low-quality residues. We also introduce a memory-retrieval mechanism to save more than 50% of the training time. We extensively evaluate our proposed method on the CATH, TS50, and TS500 datasets and our results show that our Knowledge-Design method outperforms the previous PiFold method by approximately 9% on the CATH dataset. Specifically, Knowledge-Design is the first method that achieves 60+% recovery on CATH, TS50 and TS500 benchmarks. We also provide additional analysis to demonstrate the effectiveness of our proposed method. The code will be publicly available.

ML-113-标题: ReSync: Riemannian Subgradient-based Robust Rotation Synchronization

链接: https://arxiv.org/abs/2305.15136
作者: Huikang Liu, Xiao Li, Anthony Man-Cho So
备注: 24 pages, 3 figures


Abstract: This work presents ReSync, a Riemannian subgradient-based algorithm for solving the robust rotation synchronization problem, which arises in various engineering applications. ReSync solves a least-unsquared minimization formulation over the rotation group, which is nonsmooth and nonconvex, and aims at recovering the underlying rotations directly. We provide strong theoretical guarantees for ReSync under the random corruption setting. Specifically, we first show that the initialization procedure of ReSync yields a proper initial point that lies in a local region around the ground-truth rotations. We next establish the weak sharpness property of the aforementioned formulation and then utilize this property to derive the local linear convergence of ReSync to the ground-truth rotations. By combining these guarantees, we conclude that ReSync converges linearly to the ground-truth rotations under appropriate conditions. Experiment results demonstrate the effectiveness of ReSync.

ML-114-标题: Reconstruction forecasting and stability of chaotic dynamics from partial data

链接: https://arxiv.org/abs/2305.15111
作者: Elise Özalp, Georgios Margazoglou, Luca Magri


Abstract: The forecasting and computation of the stability of chaotic systems from partial observations are tasks for which traditional equation-based methods may not be suitable. In this computational paper, we propose data-driven methods to (i) infer the dynamics of unobserved (hidden) chaotic variables (full-state reconstruction); (ii) time forecast the evolution of the full state; and (iii) infer the stability properties of the full state. The tasks are performed with long short-term memory (LSTM) networks, which are trained with observations (data) limited to only part of the state: (i) the low-to-high resolution LSTM (LH-LSTM), which takes partial observations as training input, and requires access to the full system state when computing the loss; and (ii) the physics-informed LSTM (PI-LSTM), which is designed to combine partial observations with the integral formulation of the dynamical system’s evolution equations. First, we derive the Jacobian of the LSTMs. Second, we analyse a chaotic partial differential equation, the Kuramoto-Sivashinsky (KS), and the Lorenz-96 system. We show that the proposed networks can forecast the hidden variables, both time-accurately and statistically. The Lyapunov exponents and covariant Lyapunov vectors, which characterize the stability of the chaotic attractors, are correctly inferred from partial observations. Third, the PI-LSTM outperforms the LH-LSTM by successfully reconstructing the hidden chaotic dynamics when the input dimension is smaller or similar to the Kaplan-Yorke dimension of the attractor. This work opens new opportunities for reconstructing the full state, inferring hidden variables, and computing the stability of chaotic systems from partial data.

ML-115-标题: A Rigorous Link between Deep Ensembles and (Variational) Bayesian Methods

链接: https://arxiv.org/abs/2305.15027
作者: Veit David Wild, Sahra Ghalebikesabi, Dino Sejdinovic, Jeremias Knoblauch


Abstract: We establish the first mathematically rigorous link between Bayesian, variational Bayesian, and ensemble methods. A key step towards this it to reformulate the non-convex optimisation problem typically encountered in deep learning as a convex optimisation in the space of probability measures. On a technical level, our contribution amounts to studying generalised variational inference through the lense of Wasserstein gradient flows. The result is a unified theory of various seemingly disconnected approaches that are commonly used for uncertainty quantification in deep learning – including deep ensembles and (variational) Bayesian methods. This offers a fresh perspective on the reasons behind the success of deep ensembles over procedures based on parameterised variational inference, and allows the derivation of new ensembling schemes with convergence guarantees. We showcase this by proposing a family of interacting deep ensembles with direct parallels to the interactions of particle systems in thermodynamics, and use our theory to prove the convergence of these algorithms to a well-defined global minimiser on the space of probability measures.

ML-116-标题: Hierarchical clustering with dot products recovers hidden tree structure

链接: https://arxiv.org/abs/2305.15022
作者: Annie Gray, Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley


Abstract: In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward’s method, and HDBSCAN.

ML-117-标题: Deep Learning for Survival Analysis: A Review

链接: https://arxiv.org/abs/2305.14961
作者: Simon Wiegrebe, Philipp Kopper, Raphael Sonabend, Andreas Bender
备注: 24 pages, 6 figures, 2 tables, 1 interactive table


Abstract: The influx of deep learning (DL) techniques into the field of survival analysis in recent years, coupled with the increasing availability of high-dimensional omics data and unstructured data like images or text, has led to substantial methodological progress; for instance, learning from such high-dimensional or unstructured data. Numerous modern DL-based survival methods have been developed since the mid-2010s; however, they often address only a small subset of scenarios in the time-to-event data setting - e.g., single-risk right-censored survival tasks - and neglect to incorporate more complex (and common) settings. Partially, this is due to a lack of exchange between experts in the respective fields. In this work, we provide a comprehensive systematic review of DL-based methods for time-to-event analysis, characterizing them according to both survival- and DL-related attributes. In doing so, we hope to provide a helpful overview to practitioners who are interested in DL techniques applicable to their specific use case as well as to enable researchers from both fields to identify directions for future investigation. We provide a detailed characterization of the methods included in this review as an open-source, interactive table: this https URL. As this research area is advancing rapidly, we encourage the research community to contribute to keeping the information up to date.

ML-118-标题: Learning Rate Free Bayesian Inference in Constrained Domains

链接: https://arxiv.org/abs/2305.14943
作者: Louis Sharrock, Lester Mackey, Christopher Nemeth


Abstract: We introduce a suite of new particle-based algorithms for sampling on constrained domains which are entirely learning rate free. Our approach leverages coin betting ideas from convex optimisation, and the viewpoint of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference. Our results indicate that our algorithms achieve competitive performance with existing constrained sampling methods, without the need to tune any hyperparameters.

ML-119-标题: CoinEM: Tuning-Free Particle-Based Variational Inference for Latent Variable Models

链接: https://arxiv.org/abs/2305.14916
作者: Louis Sharrock, Daniel Dodd, Christopher Nemeth


Abstract: We introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is to consider the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of the popular Stein variational gradient descent algorithm. In particular, we establish a descent lemma for this algorithm, which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, will necessarily depend on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across a broad range of numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning.

ML-120-标题: Masked Bayesian Neural Networks : Theoretical Guarantee and its Posterior Inference ICML2023

链接: https://arxiv.org/abs/2305.14765
作者: Insung Kong, Dongyoon Yang, Jongjin Lee, Ilsang Ohn, Gyuseung Baek, Yongdai Kim
备注: 30 pages, ICML 2023 proceedings. arXiv admin note: substantial text overlap with arXiv:2206.00853


Abstract: Bayesian approaches for learning deep neural networks (BNN) have been received much attention and successfully applied to various applications. Particularly, BNNs have the merit of having better generalization ability as well as better uncertainty quantification. For the success of BNN, search an appropriate architecture of the neural networks is an important task, and various algorithms to find good sparse neural networks have been proposed. In this paper, we propose a new node-sparse BNN model which has good theoretical properties and is computationally feasible. We prove that the posterior concentration rate to the true model is near minimax optimal and adaptive to the smoothness of the true model. In particular the adaptiveness is the first of its kind for node-sparse BNNs. In addition, we develop a novel MCMC algorithm which makes the Bayesian inference of the node-sparse BNN model feasible in practice.

ML-121-标题: Under-Parameterized Double Descent for Ridge Regularized Least Squares Denoising of Data on a Line

链接: https://arxiv.org/abs/2305.14689
作者: Rishi Sonthalia, Xinyue Li, Bochao Gu


Abstract: The relationship between the number of training data points, the number of parameters in a statistical model, and the generalization capabilities of the model has been widely studied. Previous work has shown that double descent can occur in the over-parameterized regime, and believe that the standard bias-variance trade-off holds in the under-parameterized regime. In this paper, we present a simple example that provably exhibits double descent in the under-parameterized regime. For simplicity, we look at the ridge regularized least squares denoising problem with data on a line embedded in high-dimension space. By deriving an asymptotically accurate formula for the generalization error, we observe sample-wise and parameter-wise double descent with the peak in the under-parameterized regime rather than at the interpolation point or in the over-parameterized regime. Further, the peak of the sample-wise double descent curve corresponds to a peak in the curve for the norm of the estimator, and adjusting \mu , the strength of the ridge regularization, shifts the location of the peak. We observe that parameter-wise double descent occurs for this model for small \mu . For larger values of \mu , we observe that the curve for the norm of the estimator has a peak but that this no longer translates to a peak in the generalization error. Moreover, we study the training error for this problem. The considered problem setup allows for studying the interaction between two regularizers. We provide empirical evidence that the model implicitly favors using the ridge regularizer over the input data noise regularizer. Thus, we show that even though both regularizers regularize the same quantity, i.e., the norm of the estimator, they are not equivalent.

ML-122-标题: Supermodular Rank: Set Function Decomposition and Optimization

链接: https://arxiv.org/abs/2305.14632
作者: Rishi Sonthalia, Anna Seigal, Guido Montufar


Abstract: We define the supermodular rank of a function on a lattice. This is the smallest number of terms needed to decompose it into a sum of supermodular functions. The supermodular summands are defined with respect to different partial orders. We characterize the maximum possible value of the supermodular rank and describe the functions with fixed supermodular rank. We analogously define the submodular rank. We use submodular decompositions to optimize set functions. Given a bound on the submodular rank of a set function, we formulate an algorithm that splits an optimization problem into submodular subproblems. We show that this method improves the approximation ratio guarantees of several algorithms for monotone set function maximization and ratio of set functions minimization, at a computation overhead that depends on the submodular rank.

ML-123-标题: Taylor Learning

链接: https://arxiv.org/abs/2305.14606
作者: James Schmidt


Abstract: Empirical risk minimization stands behind most optimization in supervised machine learning. Under this scheme, labeled data is used to approximate an expected cost (risk), and a learning algorithm updates model-defining parameters in search of an empirical risk minimizer, with the aim of thereby approximately minimizing expected cost. Parameter update is often done by some sort of gradient descent. In this paper, we introduce a learning algorithm to construct models for real analytic functions using neither gradient descent nor empirical risk minimization. Observing that such functions are defined by local information, we situate familiar Taylor approximation methods in the context of sampling data from a distribution, and prove a nonuniform learning result.

ML-124-标题: Discriminative calibration

链接: https://arxiv.org/abs/2305.14593
作者: Yuling Yao, Justin Domke


Abstract: To check the accuracy of Bayesian computations, it is common to use rank-based simulation-based calibration (SBC). However, SBC has drawbacks: The test statistic is somewhat ad-hoc, interactions are difficult to examine, multiple testing is a challenge, and the resulting p-value is not a divergence metric. We propose to replace the marginal rank test with a flexible classification approach that learns test statistics from data. This measure typically has a higher statistical power than the SBC rank test and returns an interpretable divergence measure of miscalibration, computed from classification accuracy. This approach can be used with different data generating processes to address likelihood-free inference or traditional inference methods like Markov chain Monte Carlo or variational inference. We illustrate an automated implementation using neural networks and statistically-inspired features, and validate the method with numerical and real data experiments.

ML-125-标题: DF2M: An Explainable Deep Bayesian Nonparametric Model for High-Dimensional Functional Time Series

链接: https://arxiv.org/abs/2305.14543
作者: Yirui Liu, Xinghao Qiao, Yulong Pei, Liying Wang


Abstract: In this paper, we present Deep Functional Factor Model (DF2M), a Bayesian nonparametric model for analyzing high-dimensional functional time series. The DF2M makes use of the Indian Buffet Process and the multi-task Gaussian Process with a deep kernel function to capture non-Markovian and nonlinear temporal dynamics. Unlike many black-box deep learning models, the DF2M provides an explainable way to use neural networks by constructing a factor model and incorporating deep neural networks within the kernel function. Additionally, we develop a computationally efficient variational inference algorithm for inferring the DF2M. Empirical results from four real-world datasets demonstrate that the DF2M offers better explainability and superior predictive accuracy compared to conventional deep learning models for high-dimensional functional time series.

ML-126-标题: An Improved Variational Approximate Posterior for the Deep Wishart Process

链接: https://arxiv.org/abs/2305.14454
作者: Sebastian Ober, Ben Anson, Edward Milsom, Laurence Aitchison


Abstract: Deep kernel processes are a recently introduced class of deep Bayesian models that have the flexibility of neural networks, but work entirely with Gram matrices. They operate by alternately sampling a Gram matrix from a distribution over positive semi-definite matrices, and applying a deterministic transformation. When the distribution is chosen to be Wishart, the model is called a deep Wishart process (DWP). This particular model is of interest because its prior is equivalent to a deep Gaussian process (DGP) prior, but at the same time it is invariant to rotational symmetries, leading to a simpler posterior distribution. Practical inference in the DWP was made possible in recent work (“A variational approximate posterior for the deep Wishart process” Ober and Aitchison 2021a) where the authors used a generalisation of the Bartlett decomposition of the Wishart distribution as the variational approximate posterior. However, predictive performance in that paper was less impressive than one might expect, with the DWP only beating a DGP on a few of the UCI datasets used for comparison. In this paper, we show that further generalising their distribution to allow linear combinations of rows and columns in the Bartlett decomposition results in better predictive performance, while incurring negligible additional computation cost.

ML-127-标题: Optimal Preconditioning and Fisher Adaptive Langevin Sampling

链接: https://arxiv.org/abs/2305.14442
作者: Michalis K. Titsias
备注: 20 pages, 14 figures


Abstract: We define an optimal preconditioning for the Langevin diffusion by analytically maximizing the expected squared jumped distance. This yields as the optimal preconditioning an inverse Fisher information covariance matrix, where the covariance matrix is computed as the outer product of log target gradients averaged under the target. We apply this result to the Metropolis adjusted Langevin algorithm (MALA) and derive a computationally efficient adaptive MCMC scheme that learns the preconditioning from the history of gradients produced as the algorithm runs. We show in several experiments that the proposed algorithm is very robust in high dimensions and significantly outperforms other methods, including a closely related adaptive MALA scheme that learns the preconditioning with standard adaptive MCMC as well as the position-dependent Riemannian manifold MALA sampler.

ML-128-标题: Brain Structure-Function Fusing Representation Learning using Adversarial Decomposed-VAE for Analyzing MCI

链接: https://arxiv.org/abs/2305.14404
作者: Qiankun Zuo, Baiying Lei, Ning Zhong, Yi Pan, Shuqiang Wang


Abstract: Integrating the brain structural and functional connectivity features is of great significance in both exploring brain science and analyzing cognitive impairment clinically. However, it remains a challenge to effectively fuse structural and functional features in exploring the brain network. In this paper, a novel brain structure-function fusing-representation learning (BSFL) model is proposed to effectively learn fused representation from diffusion tensor imaging (DTI) and resting-state functional magnetic resonance imaging (fMRI) for mild cognitive impairment (MCI) analysis. Specifically, the decomposition-fusion framework is developed to first decompose the feature space into the union of the uniform and the unique spaces for each modality, and then adaptively fuse the decomposed features to learn MCI-related representation. Moreover, a knowledge-aware transformer module is designed to automatically capture local and global connectivity features throughout the brain. Also, a uniform-unique contrastive loss is further devised to make the decomposition more effective and enhance the complementarity of structural and functional features. The extensive experiments demonstrate that the proposed model achieves better performance than other competitive methods in predicting and analyzing MCI. More importantly, the proposed model could be a potential tool for reconstructing unified brain networks and predicting abnormal connections during the degenerative processes in MCI.

ML-129-标题: Predicting Stock Market Time-Series Data using CNN-LSTM Neural Network Model

链接: https://arxiv.org/abs/2305.14378
作者: Aadhitya A, Rajapriya R, Vineetha R S, Anurag M Bagde
备注: 8 pages, 9 figures, 5 tables


Abstract: Stock market is often important as it represents the ownership claims on businesses. Without sufficient stocks, a company cannot perform well in finance. Predicting a stock market performance of a company is nearly hard because every time the prices of a company stock keeps changing and not constant. So, its complex to determine the stock data. But if the previous performance of a company in stock market is known, then we can track the data and provide predictions to stockholders in order to wisely take decisions on handling the stocks to a company. To handle this, many machine learning models have been invented but they didn’t succeed due to many reasons like absence of advanced libraries, inaccuracy of model when made to train with real time data and much more. So, to track the patterns and the features of data, a CNN-LSTM Neural Network can be made. Recently, CNN is now used in Natural Language Processing (NLP) based applications, so by identifying the features from stock data and converting them into tensors, we can obtain the features and then send it to LSTM neural network to find the patterns and thereby predicting the stock market for given period of time. The accuracy of the CNN-LSTM NN model is found to be high even when allowed to train on real-time stock market data. This paper describes about the features of the custom CNN-LSTM model, experiments we made with the model (like training with stock market datasets, performance comparison with other models) and the end product we obtained at final stage.

ML-130-标题: PTGB: Pre-Train Graph Neural Networks for Brain Network Analysis

链接: https://arxiv.org/abs/2305.14376
作者: Yi Yang, Hejie Cui, Carl Yang
备注: Accepted to CHIL 2023, 19 pages


Abstract: The human brain is the central hub of the neurobiological system, controlling behavior and cognition in complex ways. Recent advances in neuroscience and neuroimaging analysis have shown a growing interest in the interactions between brain regions of interest (ROIs) and their impact on neural development and disorder diagnosis. As a powerful deep model for analyzing graph-structured data, Graph Neural Networks (GNNs) have been applied for brain network analysis. However, training deep models requires large amounts of labeled data, which is often scarce in brain network datasets due to the complexities of data acquisition and sharing restrictions. To make the most out of available training data, we propose PTGB, a GNN pre-training framework that captures intrinsic brain network structures, regardless of clinical outcomes, and is easily adaptable to various downstream tasks. PTGB comprises two key components: (1) an unsupervised pre-training technique designed specifically for brain networks, which enables learning from large-scale datasets without task-specific labels; (2) a data-driven parcellation atlas mapping pipeline that facilitates knowledge transfer across datasets with different ROI systems. Extensive evaluations using various GNN models have demonstrated the robust and superior performance of PTGB compared to baseline methods.

ML-131-标题: A Survey on the Role of Artificial Intelligence in the Prediction and Diagnosis of Schizophrenia

链接: https://arxiv.org/abs/2305.14370
作者: Narges Ramesh, Yasmin Ghodsi, Hamidreza Bolhasani


Abstract: Machine learning is employed in healthcare to draw approximate conclusions regarding human diseases and mental health problems. Compared to older traditional methods, it can help to analyze data more efficiently and produce better and more dependable results. Millions of people are affected by schizophrenia, which is a chronic mental disorder that can significantly impact their lives. Many machine learning algorithms have been developed to predict and prevent this disease, and they can potentially be implemented in the diagnosis of individuals who have it. This survey aims to review papers that have focused on the use of deep learning to detect and predict schizophrenia using EEG signals, functional magnetic resonance imaging (fMRI), and diffusion magnetic resonance imaging (dMRI). With our chosen search strategy, we assessed ten publications from 2019 to 2022. All studies achieved successful predictions of more than 80%. This review provides summaries of the studies and compares their notable aspects. In the field of artificial intelligence (AI) and machine learning (ML) for schizophrenia, significant advances have been made due to the availability of ML tools, and we are optimistic that this field will continue to grow.

ML-132-标题: Learning low-dimensional dynamics from whole-brain data improves task capture

链接: https://arxiv.org/abs/2305.14369
作者: Eloy Geenjaar, Donghyun Kim, Riyasat Ohib, Marlena Duda, Amrit Kashyap, Sergey Plis, Vince Calhoun
备注: 9 pages, 4 figures


Abstract: The neural dynamics underlying brain activity are critical to understanding cognitive processes and mental disorders. However, current voxel-based whole-brain dimensionality reduction techniques fall short of capturing these dynamics, producing latent timeseries that inadequately relate to behavioral tasks. To address this issue, we introduce a novel approach to learning low-dimensional approximations of neural dynamics by using a sequential variational autoencoder (SVAE) that represents the latent dynamical system via a neural ordinary differential equation (NODE). Importantly, our method finds smooth dynamics that can predict cognitive processes with accuracy higher than classical methods. Our method also shows improved spatial localization to task-relevant brain regions and identifies well-known structures such as the motor homunculus from fMRI motor task recordings. We also find that non-linear projections to the latent space enhance performance for specific tasks, offering a promising direction for future research. We evaluate our approach on various task-fMRI datasets, including motor, working memory, and relational processing tasks, and demonstrate that it outperforms widely used dimensionality reduction techniques in how well the latent timeseries relates to behavioral sub-tasks, such as left-hand or right-hand tapping. Additionally, we replace the NODE with a recurrent neural network (RNN) and compare the two approaches to understand the importance of explicitly learning a dynamical system. Lastly, we analyze the robustness of the learned dynamical systems themselves and find that their fixed points are robust across seeds, highlighting our method’s potential for the analysis of cognitive processes as dynamical systems.

ML-133-标题: Support for Stock Trend Prediction Using Transformer s and Sentiment Analysis

链接: https://arxiv.org/abs/2305.14368
作者: Harsimrat Kaeley, Ye Qiao, Nader Bagherzadeh
备注: 8 pages, 3 figures, 1 table. To be published in IISES 18th Economics & Finance Conference, London


Abstract: Stock trend analysis has been an influential time-series prediction topic due to its lucrative and inherently chaotic nature. Many models looking to accurately predict the trend of stocks have been based on Recurrent Neural Networks (RNNs). However, due to the limitations of RNNs, such as gradient vanish and long-term dependencies being lost as sequence length increases, in this paper we develop a Transformer based model that uses technical stock data and sentiment analysis to conduct accurate stock trend prediction over long time windows. This paper also introduces a novel dataset containing daily technical stock data and top news headline data spanning almost three years. Stock prediction based solely on technical data can suffer from lag caused by the inability of stock indicators to effectively factor in breaking market news. The use of sentiment analysis on top headlines can help account for unforeseen shifts in market conditions caused by news coverage. We measure the performance of our model against RNNs over sequence lengths spanning 5 business days to 30 business days to mimic different length trading strategies. This reveals an improvement in directional accuracy over RNNs as sequence length is increased, with the largest improvement being close to 18.63% at 30 business days.

ML-134-标题: Information processing via human soft tissue

链接: https://arxiv.org/abs/2305.14366
作者: Yo Kobayashi


Abstract: This study demonstrates that the soft biological tissues of humans can be used as a type of soft body in physical reservoir computing. Soft biological tissues possess characteristics such as stress-strain nonlinearity and viscoelasticity that satisfy the requirements for physical reservoir computing, including nonlinearity and memory. The aim of this study was to utilize the dynamics of human soft tissues as a physical reservoir for the emulation of nonlinear dynamical systems. To demonstrate this concept, joint angle data during motion in the flexion-extension direction of the wrist joint, and ultrasound images of the muscles associated with that motion, were acquired from human participants. The input to the system was the angle of the wrist joint, while the deformation field within the muscle (obtained from ultrasound images) represented the state of the reservoir. The results indicate that the dynamics of soft tissue have a positive impact on the computational task of emulating nonlinear dynamical systems. This research suggests that the soft tissue of humans can be used as a potential computational resource.

ML-135-标题: Criticality Analysis: Bio-inspired Nonlinear Data Representation

链接: https://arxiv.org/abs/2305.14361
作者: Tjeerd V. olde Scheper
备注: To be submitted


Abstract: The representation of arbitrary data in a biological system is one of the most elusive elements of biological information processing. The often logarithmic nature of information in amplitude and frequency presented to biosystems prevents simple encapsulation of the information contained in the input. Criticality Analysis (CA) is a bio-inspired method of information representation within a controlled self-organised critical system that allows scale-free representation. This is based on the concept of a reservoir of dynamic behaviour in which self-similar data will create dynamic nonlinear representations. This unique projection of data preserves the similarity of data within a multidimensional neighbourhood. The input can be reduced dimensionally to a projection output that retains the features of the overall data, yet has much simpler dynamic response. The method depends only on the rate control of chaos applied to the underlying controlled models, that allows the encoding of arbitrary data, and promises optimal encoding of data given biological relevant networks of oscillators. The CA method allows for a biologically relevant encoding mechanism of arbitrary input to biosystems, creating a suitable model for information processing in varying complexity of organisms and scale-free data representation for machine learning.

ML-136-标题: Learning in Inverse Optimization: Incenter Cost Augmented Suboptimality Loss and Algorithms

链接: https://arxiv.org/abs/2305.07730
作者: Pedro Zattoni Scroccaro, Bilge Atasoy, Peyman Mohajerin Esfahani


Abstract: In Inverse Optimization (IO), an expert agent solves an optimization problem parametric in an exogenous signal. From a learning perspective, the goal is to learn the expert’s cost function given a dataset of signals and corresponding optimal actions. Motivated by the geometry of the IO set of consistent cost vectors, we introduce the “incenter” concept, a new notion akin to circumcenter recently proposed by Besbes et al. [2022]. Discussing the geometric and robustness interpretation of the incenter cost vector, we develop corresponding tractable convex reformulations, which are in contrast with the circumcenter, which we show is equivalent to an intractable optimization program. We further propose a novel loss function called Augmented Suboptimality Loss (ASL), as a relaxation of the incenter concept, for problems with inconsistent data. Exploiting the structure of the ASL, we propose a novel first-order algorithm, which we name Stochastic Approximate Mirror Descent. This algorithm combines stochastic and approximate subgradient evaluations, together with mirror descent update steps, which is provably efficient for the IO problems with high cardinality discrete feasible sets. We implement the IO approaches developed in this paper as a Python package called InvOpt. All of our numerical experiments are reproducible, and the underlying source code is available as examples in the InvOpt package.


CV-0-标题: Balancing the Picture: Debiasing Vision-Language Dataset s with Synthetic Contrast Sets

链接: https://arxiv.org/abs/2305.15407
作者: Brandon Smith, Miguel Farinha, Siobhan Mackenzie Hall, Hannah Rose Kirk, Aleksandar Shtedritski, Max Bain
备注: Github: this https URL


Abstract: Vision-language models are growing in popularity and public visibility to generate, edit, and caption images at scale; but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate there are spurious correlations in COCO Captions, the most commonly used dataset for evaluating bias, between background context and the gender of people in-situ. This is problematic because commonly-used bias metrics (such as Bias@K) rely on per-gender base rates. To address this issue, we propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets, where only the gender of the subject is edited and the background is fixed. However, existing image editing methods have limitations and sometimes produce low-quality images; so, we introduce a method to automatically filter the generated images based on their similarity to real images. Using our balanced synthetic contrast sets, we benchmark bias in multiple CLIP-based models, demonstrating how metrics are skewed by imbalance in the original COCO images. Our results indicate that the proposed approach improves the validity of the evaluation, ultimately contributing to more realistic understanding of bias in vision-language models.

CV-1-标题: RoMa: Revisiting Robust Losses for Dense Feature Matching

链接: https://arxiv.org/abs/2305.15404
作者: Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, Michael Felsberg


Abstract: Dense feature matching is an important computer vision task that involves estimating all correspondences between two images of a 3D scene. In this paper, we revisit robust losses for matching from a Markov chain perspective, yielding theoretical insights and large gains in performance. We begin by constructing a unifying formulation of matching as a Markov chain, based on which we identify two key stages which we argue should be decoupled for matching. The first is the coarse stage, where the estimated result needs to be globally consistent. The second is the refinement stage, where the model needs precise localization capabilities. Inspired by the insight that these stages concern distinct issues, we propose a coarse matcher following the regression-by-classification paradigm that provides excellent globally consistent, albeit not exactly localized, matches. This is followed by a local feature refinement stage using well-motivated robust regression losses, yielding extremely precise matches. Our proposed approach, which we call RoMa, achieves significant improvements compared to the state-of-the-art. Code is available at this https URL

CV-2-标题: Sin3DM: Learning a Diffusion Model from a Single 3D Textured Shape

链接: https://arxiv.org/abs/2305.15399
作者: Rundi Wu, Ruoshi Liu, Carl Vondrick, Changxi Zheng
备注: Project page: this https URL, Code: this https URL


Abstract: Synthesizing novel 3D models that resemble the input example has long been pursued by researchers and artists in computer graphics. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would induce large memory and computational cost. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our model can generate 3D shapes of various types with better quality than prior methods.

CV-3-标题: Layout GPT : Compositional Visual Planning and Generation with Large Language Models

链接: https://arxiv.org/abs/2305.15393
作者: Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang


Abstract: Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual generative models. We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language to enhance the visual planning skills of LLMs. LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts like numerical and spatial relations to layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves comparable performance as human users in designing visual layouts for numerical and spatial correctness. Lastly, LayoutGPT achieves comparable performance to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains.

CV-4-标题: A Neural Space-Time Representation for Text-to-Image Personalization

链接: https://arxiv.org/abs/2305.15391
作者: Yuval Alaluf, Elad Richardson, Gal Metzer, Daniel Cohen-Or
备注: Project page available at this https URL


Abstract: A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method’s ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.

CV-5-标题: What can generic neural networks learn from a childs visual experience?

链接: https://arxiv.org/abs/2305.15372
作者: A. Emin Orhan, Brenden M. Lake
备注: 26 pages, 14 figures, 3 tables; code & all pretrained models available from this https URL


Abstract: Young children develop sophisticated internal models of the world based on their egocentric visual experience. How much of this is driven by innate constraints and how much is driven by their experience? To investigate these questions, we train state-of-the-art neural networks on a realistic proxy of a child’s visual experience without any explicit supervision or domain-specific inductive biases. Specifically, we train both embedding models and generative models on 200 hours of headcam video from a single child collected over two years. We train a total of 72 different models, exploring a range of model architectures and self-supervised learning algorithms, and comprehensively evaluate their performance in downstream tasks. The best embedding models perform at 70% of a highly performant ImageNet-trained model on average. They also learn broad semantic categories without any labeled examples and learn to localize semantic categories in an image without any location supervision. However, these models are less object-centric and more background-sensitive than comparable ImageNet-trained models. Generative models trained with the same data successfully extrapolate simple properties of partially masked objects, such as their texture, color, orientation, and rough outline, but struggle with finer object details. We replicate our experiments with two other children and find very similar results. Broadly useful high-level visual representations are thus robustly learnable from a representative sample of a child’s visual experience without strong inductive biases.

CV-6-标题: SAMScore: A Semantic Structural Similarity Metric for Image Translation Evaluation

链接: https://arxiv.org/abs/2305.15367
作者: Yunxiang Li, Meixu Chen, Wenxuan Yang, Kai Wang, Jun Ma, Alan C. Bovik, You Zhang


Abstract: Image translation has wide applications, such as style transfer and modality conversion, usually aiming to generate images having both high degrees of realism and faithfulness. These problems remain difficult, especially when it is important to preserve semantic structures. Traditional image-level similarity metrics are of limited use, since the semantics of an image are high-level, and not strongly governed by pixel-wise faithfulness to an original image. Towards filling this gap, we introduce SAMScore, a generic semantic structural similarity metric for evaluating the faithfulness of image translation models. SAMScore is based on the recent high-performance Segment Anything Model (SAM), which can perform semantic similarity comparisons with standout accuracy. We applied SAMScore on 19 image translation tasks, and found that it is able to outperform all other competitive metrics on all of the tasks. We envision that SAMScore will prove to be a valuable tool that will help to drive the vibrant field of image translation, by allowing for more precise evaluations of new and evolving translation models. The code is available at this https URL.

CV-7-标题: Boundary Attention Mapping (BAM): Fine-grained saliency maps for segmentation of Burn Injuries

链接: https://arxiv.org/abs/2305.15365
作者: Mahla Abdolahnejad, Justin Lee, Hannah Chan, Alex Morzycki, Olivier Ethier, Anthea Mo, Peter X. Liu, Joshua N. Wong, Colin Hong, Rakesh Joshi


Abstract: Burn injuries can result from mechanisms such as thermal, chemical, and electrical insults. A prompt and accurate assessment of burns is essential for deciding definitive clinical treatments. Currently, the primary approach for burn assessments, via visual and tactile observations, is approximately 60%-80% accurate. The gold standard is biopsy and a close second would be non-invasive methods like Laser Doppler Imaging (LDI) assessments, which have up to 97% accuracy in predicting burn severity and the required healing time. In this paper, we introduce a machine learning pipeline for assessing burn severities and segmenting the regions of skin that are affected by burn. Segmenting 2D colour images of burns allows for the injured versus non-injured skin to be delineated, clearly marking the extent and boundaries of the localized burn/region-of-interest, even during remote monitoring of a burn patient. We trained a convolutional neural network (CNN) to classify four severities of burns. We built a saliency mapping method, Boundary Attention Mapping (BAM), that utilises this trained CNN for the purpose of accurately localizing and segmenting the burn regions from skin burn images. We demonstrated the effectiveness of our proposed pipeline through extensive experiments and evaluations using two datasets; 1) A larger skin burn image dataset consisting of 1684 skin burn images of four burn severities, 2) An LDI dataset that consists of a total of 184 skin burn images with their associated LDI scans. The CNN trained using the first dataset achieved an average F1-Score of 78% and micro/macro- average ROC of 85% in classifying the four burn severities. Moreover, a comparison between the BAM results and LDI results for measuring injury boundary showed that the segmentations generated by our method achieved 91.60% accuracy, 78.17% sensitivity, and 93.37% specificity.

CV-8-标题: Mitigating Biased Activation in Weakly-supervised Object Localization via Counterfactual Learning

链接: https://arxiv.org/abs/2305.15354
作者: Feifei Shao, Yawei Luo, Lei Chen, Ping Liu, Yi Yang, Jun Xiao
备注: 13 pages, 5 figures, 4 tables


Abstract: In this paper, we focus on an under-explored issue of biased activation in prior weakly-supervised object localization methods based on Class Activation Mapping (CAM). We analyze the cause of this problem from a causal view and attribute it to the co-occurring background confounders. Following this insight, we propose a novel Counterfactual Co-occurring Learning (CCL) paradigm to synthesize the counterfactual representations via coupling constant foreground and unrealized backgrounds in order to cut off their co-occurring relationship. Specifically, we design a new network structure called Counterfactual-CAM, which embeds the counterfactual representation perturbation mechanism into the vanilla CAM-based model. This mechanism is responsible for decoupling foreground as well as background and synthesizing the counterfactual representations. By training the detection model with these synthesized representations, we compel the model to focus on the constant foreground content while minimizing the influence of distracting co-occurring background. To our best knowledge, it is the first attempt in this direction. Extensive experiments on several benchmarks demonstrate that Counterfactual-CAM successfully mitigates the biased activation problem, achieving improved object localization accuracy.

CV-9-标题: A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

链接: https://arxiv.org/abs/2305.15347
作者: Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, Ming-Hsuan Yang
备注: Project page: this https URL


Abstract: Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.

CV-10-标题: Training on Thin Air: Improve Image Classification with Generated Data

链接: https://arxiv.org/abs/2305.15316
作者: Yongchao Zhou, Hshmat Sahak, Jimmy Ba


Abstract: Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and generates diverse novel training images by conditioning the generative model on noisy versions of these vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Moreover, our approach consistently outperforms generic prompt-based steering methods and KNN retrieval baseline across a wide range of datasets. Additionally, we demonstrate the compatibility of our approach with widely-used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning.

CV-11-标题: Personalized Dictionary Learning for Heterogeneous Dataset s

链接: https://arxiv.org/abs/2305.15311
作者: Geyu Liang, Naichen Shi, Raed Al Kontar, Salar Fattahi


Abstract: We introduce a relevant yet challenging problem named Personalized Dictionary Learning (PerDL), where the goal is to learn sparse linear representations from heterogeneous datasets that share some commonality. In PerDL, we model each dataset’s shared and unique features as global and local dictionaries. Challenges for PerDL not only are inherited from classical dictionary learning (DL), but also arise due to the unknown nature of the shared and unique features. In this paper, we rigorously formulate this problem and provide conditions under which the global and local dictionaries can be provably disentangled. Under these conditions, we provide a meta-algorithm called Personalized Matching and Averaging (PerMA) that can recover both global and local dictionaries from heterogeneous datasets. PerMA is highly efficient; it converges to the ground truth at a linear rate under suitable conditions. Moreover, it automatically borrows strength from strong learners to improve the prediction of weak learners. As a general framework for extracting global and local dictionaries, we show the application of PerDL in different learning tasks, such as training with imbalanced datasets and video surveillance.

CV-12-标题: Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation

链接: https://arxiv.org/abs/2305.15302
作者: Chang Liu, Henghui Ding, Yulun Zhang, Xudong Jiang


Abstract: We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformer to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in Transformer only uses the language input for attention weight calculation, which does not explicitly fuse language features in its output. Thus, its output feature is dominated by vision information, which limits the model to comprehensively understand the multi-modal information, and brings uncertainty for the subsequent mask decoder to extract the output mask. To address this issue, we propose Multi-Modal Mutual Attention ( \mathrmM^3Att ) and Multi-Modal Mutual Decoder ( \mathrmM^3Dec ) that better fuse information from the two input modalities. Based on \mathrmM^3Dec , we further propose Iterative Multi-modal Interaction ( \mathrmIMI ) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction ( \mathrmLFR ) to prevent the language information from being lost or distorted in the extracted feature. Extensive experiments show that our proposed approach significantly improves the baseline and outperforms state-of-the-art referring image segmentation methods on RefCOCO series datasets consistently.

CV-13-标题: MultiFusion: Fusing Pre-Trained Models for Multi-Lingual Multi-Modal Image Generation

链接: https://arxiv.org/abs/2305.15296
作者: Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach


Abstract: The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MutliFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

CV-14-标题: High Speed Human Action Recognition using a Photonic Reservoir Computer

链接: https://arxiv.org/abs/2305.15283
作者: Enrico Picco, Piotr Antonik, Serge Massar


Abstract: The recognition of human actions in videos is one of the most active research fields in computer vision. The canonical approach consists in a more or less complex preprocessing stages of the raw video data, followed by a relatively simple classification algorithm. Here we address recognition of human actions using the reservoir computing algorithm, which allows us to focus on the classifier stage. We introduce a new training method for the reservoir computer, based on “Timesteps Of Interest”, which combines in a simple way short and long time scales. We study the performance of this algorithm using both numerical simulations and a photonic implementation based on a single non-linear node and a delay line on the well known KTH dataset. We solve the task with high accuracy and speed, to the point of allowing for processing multiple video streams in real time. The present work is thus an important step towards developing efficient dedicated hardware for video processing.

CV-15-标题: ViTMatte: Boosting Image Matting with Pretrain ed Plain Vision Transformer s

链接: https://arxiv.org/abs/2305.15272
作者: Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang


Abstract: Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks. (ii) Additionally, we introduce the detail capture module, which just consists of simple lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViT on image matting with concise adaptation. It inherits many superior properties from ViT to matting, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmark for image matting, our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.

CV-16-标题: Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation

链接: https://arxiv.org/abs/2305.15270
作者: Tong Xu, Micol Spitale, Hao Tang, Lu Liu, Hatice Gunes, Siyang Song


Abstract: Generating facial reactions in a human-human dyadic interaction is complex and highly dependent on the context since more than one facial reactions can be appropriate for the speaker’s behaviour. This has challenged existing machine learning (ML) methods, whose training strategies enforce models to reproduce a specific (not multiple) facial reaction from each input speaker behaviour. This paper proposes the first multiple appropriate facial reaction generation framework that re-formulates the one-to-many mapping facial reaction generation problem as a one-to-one mapping problem. This means that we approach this problem by considering the generation of a distribution of the listener’s appropriate facial reactions instead of multiple different appropriate facial reactions, i.e., ‘many’ appropriate facial reaction labels are summarised as ‘one’ distribution label during training. Our model consists of a perceptual processor, a cognitive processor, and a motor processor. The motor processor is implemented with a novel Reversible Multi-dimensional Edge Graph Neural Network (REGNN). This allows us to obtain a distribution of appropriate real facial reactions during the training process, enabling the cognitive processor to be trained to predict the appropriate facial reaction distribution. At the inference stage, the REGNN decodes an appropriate facial reaction by using this distribution as input. Experimental results demonstrate that our approach outperforms existing models in generating more appropriate, realistic, and synchronized facial reactions. The improved performance is largely attributed to the proposed appropriate facial reaction distribution learning strategy and the use of a REGNN. The code is available at this https URL.

CV-17-标题: Rethinking the Evaluation Protocol of Domain Generalization

链接: https://arxiv.org/abs/2305.15253
作者: Han Yu, Xingxuan Zhang, Renzhe Xu, Jiashuo Liu, Yue He, Peng Cui


Abstract: Domain generalization aims to solve the challenge of Out-of-Distribution (OOD) generalization by leveraging common knowledge learned from multiple training domains to generalize to unseen test domains. To accurately evaluate the OOD generalization ability, it is necessary to ensure that test data information is unavailable. However, the current domain generalization protocol may still have potential test data information leakage. This paper examines the potential risks of test data information leakage in two aspects of the current protocol: pretraining on ImageNet and oracle model selection. We propose that training from scratch and using multiple test domains would result in a more precise evaluation of OOD generalization ability. We also rerun the algorithms with the modified protocol and introduce a new leaderboard to encourage future research in domain generalization with a fairer comparison.

CV-18-标题: Delving Deeper into Data Scaling in Masked Image Modeling

链接: https://arxiv.org/abs/2305.15248
作者: Cheng-Ze Lu, Xiaojie Jin, Qibin Hou, Jun Hao Liew, Ming-Ming Cheng, Jiashi Feng


Abstract: Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study on the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works that depend on the widely-used ImageNet dataset, which is manually curated and object-centric, we take a step further and propose to investigate this problem in a more practical setting. Specifically, we utilize the web-collected Coyo-700M dataset. We randomly sample varying numbers of training images from the Coyo dataset and construct a series of sub-datasets, containing 0.5M, 1M, 5M, 10M, and 100M images, for pre-training. Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models. The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical. We hope these observations could provide valuable insights for future research on MIM.

CV-19-标题: Robust Classification via a Single Diffusion Model

链接: https://arxiv.org/abs/2305.15241
作者: Huanran Chen, Yinpeng Dong, Zhengyi Wang, Xiao Yang, Chengqi Duan, Hang Su, Jun Zhu


Abstract: Recently, diffusion models have been successfully applied to improving adversarial robustness of image classifiers by purifying the adversarial noises or generating realistic data for adversarial training. However, the diffusion-based purification can be evaded by stronger adaptive attacks while adversarial training does not perform well under unseen threats, exhibiting inevitable limitations of these methods. To better harness the expressive power of diffusion models, in this paper we propose Robust Diffusion Classifier (RDC), a generative classifier that is constructed from a pre-trained diffusion model to be adversarially robust. Our method first maximizes the data likelihood of a given input and then predicts the class probabilities of the optimized input using the conditional likelihood of the diffusion model through Bayes’ theorem. Since our method does not require training on particular adversarial attacks, we demonstrate that it is more generalizable to defend against multiple unseen threats. In particular, RDC achieves 73.24% robust accuracy against \ell_\infty norm-bounded perturbations with \epsilon_\infty=8/255 on CIFAR-10, surpassing the previous state-of-the-art adversarial training models by +2.34% . The findings highlight the potential of generative classifiers by employing diffusion models for adversarial robustness compared with the commonly studied discriminative classifiers.

CV-20-标题: Real time dense anomaly detection by learning on synthetic negative data

链接: https://arxiv.org/abs/2305.15227
作者: Anja Delić, Matej Grcić, Siniša Šegvić
备注: 3 pages


Abstract: Most approaches to dense anomaly detection rely on generative modeling or on discriminative methods that train with negative data. We consider a recent hybrid method that optimizes the same shared representation according to cross-entropy of the discriminative predictions, and negative log likelihood of the predicted energy-based density. We extend that work with a jointly trained generative flow that samples synthetic negatives at the border of the inlier distribution. The proposed extension provides potential to learn the hybrid method without real negative data. Our experiments analyze the impact of training with synthetic negative data and validate contribution of the energy-based density during training and evaluation.

CV-21-标题: DynStatF: An Efficient Feature Fusion Strategy for LiDAR 3D Object Detection CVPR2023

链接: https://arxiv.org/abs/2305.15219
作者: Yao Rong, Xiangyu Wei, Tianwei Lin, Yueyu Wang, Enkelejda Kasneci
备注: Accepted to CVPR2023 Workshop on End-to-End Autonomous Driving


Abstract: Augmenting LiDAR input with multiple previous frames provides richer semantic information and thus boosts performance in 3D object detection, However, crowded point clouds in multi-frames can hurt the precise position information due to the motion blur and inaccurate point projection. In this work, we propose a novel feature fusion strategy, DynStaF (Dynamic-Static Fusion), which enhances the rich semantic information provided by the multi-frame (dynamic branch) with the accurate location information from the current single-frame (static branch). To effectively extract and aggregate complimentary features, DynStaF contains two modules, Neighborhood Cross Attention (NCA) and Dynamic-Static Interaction (DSI), operating through a dual pathway architecture. NCA takes the features in the static branch as queries and the features in the dynamic branch as keys (values). When computing the attention, we address the sparsity of point clouds and take only neighborhood positions into consideration. NCA fuses two features at different feature map scales, followed by DSI providing the comprehensive interaction. To analyze our proposed strategy DynStaF, we conduct extensive experiments on the nuScenes dataset. On the test set, DynStaF increases the performance of PointPillars in NDS by a large margin from 57.7% to 61.6%. When combined with CenterPoint, our framework achieves 61.0% mAP and 67.7% NDS, leading to state-of-the-art performance without bells and whistles.

CV-22-标题: Multi-modal Machine Learning for Vehicle Rating Predictions Using Image Text and Parametric Data

链接: https://arxiv.org/abs/2305.15218
作者: Hanqi Su, Binyang Song, Faez Ahmed
备注: The paper submitted to IDETC/CIE2023, the International Design Engineering Technical Conferences & Computers and Information in Engineering Conference, has been accepted


Abstract: Accurate vehicle rating prediction can facilitate designing and configuring good vehicles. This prediction allows vehicle designers and manufacturers to optimize and improve their designs in a timely manner, enhance their product performance, and effectively attract consumers. However, most of the existing data-driven methods rely on data from a single mode, e.g., text, image, or parametric data, which results in a limited and incomplete exploration of the available information. These methods lack comprehensive analyses and exploration of data from multiple modes, which probably leads to inaccurate conclusions and hinders progress in this field. To overcome this limitation, we propose a multi-modal learning model for more comprehensive and accurate vehicle rating predictions. Specifically, the model simultaneously learns features from the parametric specifications, text descriptions, and images of vehicles to predict five vehicle rating scores, including the total score, critics score, performance score, safety score, and interior score. We compare the multi-modal learning model to the corresponding unimodal models and find that the multi-modal model’s explanatory power is 4% - 12% higher than that of the unimodal models. On this basis, we conduct sensitivity analyses using SHAP to interpret our model and provide design and optimization directions to designers and manufacturers. Our study underscores the importance of the data-driven multi-modal learning approach for vehicle design, evaluation, and optimization. We have made the code publicly available at this http URL.

CV-23-标题: L-CAD: Language-based Colorization with Any-level Descriptions

链接: https://arxiv.org/abs/2305.15217
作者: Zheng Chang, Shuchen Weng, Peixuan Zhang, Yu Li, Si Li, Boxin Shi


Abstract: Language-based colorization produces plausible and visually pleasing colors under the guidance of user-friendly natural language descriptions. Previous methods implicitly assume that users provide comprehensive color descriptions for most of the objects in the image, which leads to suboptimal performance. In this paper, we propose a unified model to perform language-based colorization with any-level descriptions. We leverage the pretrained cross-modality generative model for its robust language understanding and rich color priors to handle the inherent ambiguity of any-level descriptions. We further design modules to align with input conditions to preserve local spatial structures and prevent the ghosting effect. With the proposed novel sampling strategy, our model achieves instance-aware colorization in diverse and complex scenarios. Extensive experimental results demonstrate our advantages of effectively handling any-level descriptions and outperforming both language-based and automatic colorization methods. The code and pretrained models are available at: this https URL.

CV-24-标题: GTNet: Graph Transformer Network for 3D Point Cloud Classification and Semantic Segmentation

链接: https://arxiv.org/abs/2305.15213
作者: Wei Zhou, Qian Wang, Weiwei Jin, Xinzhe Shi, Dekui Wang, Xingxing Hao, Yongxiang Yu


Abstract: Recently, graph-based and Transformer-based deep learning networks have demonstrated excellent performances on various point cloud tasks. Most of the existing graph methods are based on static graph, which take a fixed input to establish graph relations. Moreover, many graph methods apply maximization and averaging to aggregate neighboring features, so that only a single neighboring point affects the feature of centroid or different neighboring points have the same influence on the centroid’s feature, which ignoring the correlation and difference between points. Most Transformer-based methods extract point cloud features based on global attention and lack the feature learning on local neighbors. To solve the problems of these two types of models, we propose a new feature extraction block named Graph Transformer and construct a 3D point point cloud learning network called GTNet to learn features of point clouds on local and global patterns. Graph Transformer integrates the advantages of graph-based and Transformer-based methods, and consists of Local Transformer and Global Transformer modules. Local Transformer uses a dynamic graph to calculate all neighboring point weights by intra-domain cross-attention with dynamically updated graph relations, so that every neighboring point could affect the features of centroid with different weights; Global Transformer enlarges the receptive field of Local Transformer by a global self-attention. In addition, to avoid the disappearance of the gradient caused by the increasing depth of network, we conduct residual connection for centroid features in GTNet; we also adopt the features of centroid and neighbors to generate the local geometric descriptors in Local Transformer to strengthen the local information learning capability of the model. Finally, we use GTNet for shape classification, part segmentation and semantic segmentation tasks in this paper.

CV-25-标题: Promoting Generalization in Cross- Dataset Remote Photoplethysmography

链接: https://arxiv.org/abs/2305.15199
作者: Nathan Vance, Jeremy Speth, Benjamin Sporrer, Patrick Flynn
备注: 8 pages, accepted for publication at CVPM 2023


Abstract: Remote Photoplethysmography (rPPG), or the remote monitoring of a subject’s heart rate using a camera, has seen a shift from handcrafted techniques to deep learning models. While current solutions offer substantial performance gains, we show that these models tend to learn a bias to pulse wave features inherent to the training dataset. We develop augmentations to mitigate this learned bias by expanding both the range and variability of heart rates that the model sees while training, resulting in improved model convergence when training and cross-dataset generalization at test time. Through a 3-way cross dataset analysis we demonstrate a reduction in mean absolute error from over 13 beats per minute to below 3 beats per minute. We compare our method with other recent rPPG systems, finding similar performance under a variety of evaluation parameters.

CV-26-标题: DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2305.15194
作者: Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn
备注: 18 pages, 16 figures, and 3 tables


Abstract: The recent progress in diffusion-based text-to-image generation models has significantly expanded generative capabilities via conditioning the text descriptions. However, since relying solely on text prompts is still restrictive for fine-grained customization, we aim to extend the boundaries of conditional generation to incorporate diverse types of modalities, e.g., sketch, box, and style embedding, simultaneously. We thus design a multimodal text-to-image diffusion model, coined as DiffBlender, that achieves the aforementioned goal in a single model by training only a few small hypernetworks. DiffBlender facilitates a convenient scaling of input modalities, without altering the parameters of an existing large-scale generative model to retain its well-established knowledge. Furthermore, our study sets new standards for multimodal generation by conducting quantitative and qualitative comparisons with existing approaches. By diversifying the channels of conditioning modalities, DiffBlender faithfully reflects the provided information or, in its absence, creates imaginative generation.

CV-27-标题: Deceptive-NeRF: Enhancing NeRF Reconstruction using Pseudo-Observations from Diffusion Models

链接: https://arxiv.org/abs/2305.15171
作者: Xinhang Liu, Shiu-hong Kao, Jiaben Chen, Yu-Wing Tai, Chi-Keung Tang


Abstract: This paper introduces Deceptive-NeRF, a new method for enhancing the quality of reconstructed NeRF models using synthetically generated pseudo-observations, capable of handling sparse input and removing floater artifacts. Our proposed method involves three key steps: 1) reconstruct a coarse NeRF model from sparse inputs; 2) generate pseudo-observations based on the coarse model; 3) refine the NeRF model using pseudo-observations to produce a high-quality reconstruction. To generate photo-realistic pseudo-observations that faithfully preserve the identity of the reconstructed scene while remaining consistent with the sparse inputs, we develop a rectification latent diffusion model that generates images conditional on a coarse RGB image and depth map, which are derived from the coarse NeRF and latent text embedding from input images. Extensive experiments show that our method is effective and can generate perceptually high-quality NeRF even with very sparse inputs.

CV-28-标题: Clinically Labeled Contrastive Learning for OCT Biomarker Classification ALT

链接: https://arxiv.org/abs/2305.15154
作者: Kiran Kokilepersaud, Stephanie Trejo Corona, Mohit Prabhushankar, Ghassan AlRegib, Charles Wykoff
备注: Accepted in IEEE Journal of Biomedical and Health Informatics. arXiv admin note: text overlap with arXiv:2211.05092


Abstract: This paper presents a novel positive and negative set selection strategy for contrastive learning of medical images based on labels that can be extracted from clinical data. In the medical field, there exists a variety of labels for data that serve different purposes at different stages of a diagnostic and treatment process. Clinical labels and biomarker labels are two examples. In general, clinical labels are easier to obtain in larger quantities because they are regularly collected during routine clinical care, while biomarker labels require expert analysis and interpretation to obtain. Within the field of ophthalmology, previous work has shown that clinical values exhibit correlations with biomarker structures that manifest within optical coherence tomography (OCT) scans. We exploit this relationship by using the clinical data as pseudo-labels for our data without biomarker labels in order to choose positive and negative instances for training a backbone network with a supervised contrastive loss. In this way, a backbone network learns a representation space that aligns with the clinical data distribution available. Afterwards, we fine-tune the network trained in this manner with the smaller amount of biomarker labeled data with a cross-entropy loss in order to classify these key indicators of disease directly from OCT scans. We also expand on this concept by proposing a method that uses a linear combination of clinical contrastive losses. We benchmark our methods against state of the art self-supervised methods in a novel setting with biomarkers of varying granularity. We show performance improvements by as much as 5% in total biomarker detection AUROC.

CV-29-标题: Reliability Scores from Saliency Map Clusters for Improved Image-based Harvest-Readiness Prediction in Cauliflower

链接: https://arxiv.org/abs/2305.15149
作者: Jana Kierdorf, Ribana Roscher
备注: Preprint, 8 pages, 6 figures


Abstract: Cauliflower is a hand-harvested crop that must fulfill high-quality standards in sales making the timing of harvest important. However, accurately determining harvest-readiness can be challenging due to the cauliflower head being covered by its canopy. While deep learning enables automated harvest-readiness estimation, errors can occur due to field-variability and limited training data. In this paper, we analyze the reliability of a harvest-readiness classifier with interpretable machine learning. By identifying clusters of saliency maps, we derive reliability scores for each classification result using knowledge about the domain and the image properties. For unseen data, the reliability can be used to (i) inform farmers to improve their decision-making and (ii) increase the model prediction accuracy. Using RGB images of single cauliflower plants at different developmental stages from the GrowliFlower dataset, we investigate various saliency mapping approaches and find that they result in different quality of reliability scores. With the most suitable interpretation tool, we adjust the classification result and achieve a 15.72% improvement of the overall accuracy to 88.14% and a 15.44% improvement of the average class accuracy to 88.52% for the GrowliFlower dataset.

CV-30-标题: Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

链接: https://arxiv.org/abs/2305.15134
作者: Jinjin Gu, Xianzheng Ma, Xiangtao Kong, Yu Qiao, Chao Dong


Abstract: Deep deraining networks, while successful in laboratory benchmarks, consistently encounter substantial generalization issues when deployed in real-world applications. A prevailing perspective in deep learning encourages the use of highly complex training data, with the expectation that a richer image content knowledge will facilitate overcoming the generalization problem. However, through comprehensive and systematic experimentation, we discovered that this strategy does not enhance the generalization capability of these networks. On the contrary, it exacerbates the tendency of networks to overfit to specific degradations. Our experiments reveal that better generalization in a deraining network can be achieved by simplifying the complexity of the training data. This is due to the networks are slacking off during training, that is, learning the least complex elements in the image content and degradation to minimize training loss. When the complexity of the background image is less than that of the rain streaks, the network will prioritize the reconstruction of the background, thereby avoiding overfitting to the rain patterns and resulting in improved generalization performance. Our research not only offers a valuable perspective and methodology for better understanding the generalization problem in low-level vision tasks, but also displays promising practical potential.

CV-31-标题: Thinking Twice: Clinical-Inspired Thyroid Ultrasound Lesion Detection Based on Feature Feedback

链接: https://arxiv.org/abs/2305.15114
作者: Lingtao Wang, Jianrui Ding, Fenghe Tang, Chunping Ning
备注: 20 pages, 11 figures, released code for this https URL


Abstract: Accurate detection of thyroid lesions is a critical aspect of computer-aided diagnosis. However, most existing detection methods perform only one feature extraction process and then fuse multi-scale features, which can be affected by noise and blurred features in ultrasound images. In this study, we propose a novel detection network based on a feature feedback mechanism inspired by clinical diagnosis. The mechanism involves first roughly observing the overall picture and then focusing on the details of interest. It comprises two parts: a feedback feature selection module and a feature feedback pyramid. The feedback feature selection module efficiently selects the features extracted in the first phase in both space and channel dimensions to generate high semantic prior knowledge, which is similar to coarse observation. The feature feedback pyramid then uses this high semantic prior knowledge to enhance feature extraction in the second phase and adaptively fuses the two features, similar to fine observation. Additionally, since radiologists often focus on the shape and size of lesions for diagnosis, we propose an adaptive detection head strategy to aggregate multi-scale features. Our proposed method achieves an AP of 70.3% and AP50 of 99.0% on the thyroid ultrasound dataset and meets the real-time requirement. The code is available at this https URL.

CV-32-标题: Computer Vision for Construction Progress Monitoring: A Real-Time Object Detection Approach

链接: https://arxiv.org/abs/2305.15097
作者: Jiesheng Yang, Andreas Wilde, Karsten Menzel, Md Zubair Sheikh, Boris Kuznetsov
备注: 15 Pages


Abstract: Construction progress monitoring (CPM) is essential for effective project management, ensuring on-time and on-budget delivery. Traditional CPM methods often rely on manual inspection and reporting, which are time-consuming and prone to errors. This paper proposes a novel approach for automated CPM using state-of-the-art object detection algorithms. The proposed method leverages e.g. YOLOv8’s real-time capabilities and high accuracy to identify and track construction elements within site images and videos. A dataset was created, consisting of various building elements and annotated with relevant objects for training and validation. The performance of the proposed approach was evaluated using standard metrics, such as precision, recall, and F1-score, demonstrating significant improvement over existing methods. The integration of Computer Vision into CPM provides stakeholders with reliable, efficient, and cost-effective means to monitor project progress, facilitating timely decision-making and ultimately contributing to the successful completion of construction projects.

CV-33-标题: InpaintNeRF360: Text-Guided 3D Inpainting on Unbounded Neural Radiance Fields

链接: https://arxiv.org/abs/2305.15094
作者: Dongqing Wang, Tong Zhang, Alaa Abboud, Sabine Süsstrunk


Abstract: Neural Radiance Fields (NeRF) can generate highly realistic novel views. However, editing 3D scenes represented by NeRF across 360-degree views, particularly removing objects while preserving geometric and photometric consistency, remains a challenging problem due to NeRF’s implicit scene representation. In this paper, we propose InpaintNeRF360, a unified framework that utilizes natural language instructions as guidance for inpainting NeRF-based 3D scenes.Our approach employs a promptable segmentation model by generating multi-modal prompts from the encoded text for multiview segmentation. We apply depth-space warping to enforce viewing consistency in the segmentations, and further refine the inpainted NeRF model using perceptual priors to ensure visual plausibility. InpaintNeRF360 is capable of simultaneously removing multiple objects or modifying object appearance based on text instructions while synthesizing 3D viewing-consistent and photo-realistic inpainting. Through extensive experiments on both unbounded and frontal-facing scenes trained through NeRF, we demonstrate the effectiveness of our approach and showcase its potential to enhance the editability of implicit radiance fields.

CV-34-标题: Modeling Complex Object Changes in Satellite Image Time-Series: Approach based on CSP and Spatiotemporal Graph

链接: https://arxiv.org/abs/2305.15091
作者: Zouhayra Ayadi, Wadii Boulila, Imed Riadh Farah


Abstract: This paper proposes a method for automatically monitoring and analyzing the evolution of complex geographic objects. The objects are modeled as a spatiotemporal graph, which separates filiation relations, spatial relations, and spatiotemporal relations, and is analyzed by detecting frequent sub-graphs using constraint satisfaction problems (CSP). The process is divided into four steps: first, the identification of complex objects in each satellite image; second, the construction of a spatiotemporal graph to model the spatiotemporal changes of the complex objects; third, the creation of sub-graphs to be detected in the base spatiotemporal graph; and fourth, the analysis of the spatiotemporal graph by detecting the sub-graphs and solving a constraint network to determine relevant sub-graphs. The final step is further broken down into two sub-steps: (i) the modeling of the constraint network with defined variables and constraints, and (ii) the solving of the constraint network to find relevant sub-graphs in the spatiotemporal graph. Experiments were conducted using real-world satellite images representing several cities in Saudi Arabia, and the results demonstrate the effectiveness of the proposed approach.

CV-35-标题: Unpaired Image-to-Image Translation via Neural Schrödinger Bridge

链接: https://arxiv.org/abs/2305.15086
作者: Beomsu Kim, Gihyun Kwon, Kwanyoung Kim, Jong Chul Ye


Abstract: Diffusion models are a powerful class of generative models which simulate stochastic differential equations (SDEs) to generate data from noise. Although diffusion models have achieved remarkable progress in recent years, they have limitations in the unpaired image-to-image translation tasks due to the Gaussian prior assumption. Schrödinger Bridge (SB), which learns an SDE to translate between two arbitrary distributions, have risen as an attractive solution to this problem. However, none of SB models so far have been successful at unpaired translation between high-resolution images. In this work, we propose the Unpaired Neural Schrödinger Bridge (UNSB), which combines SB with adversarial training and regularization to learn a SB between unpaired data. We demonstrate that UNSB is scalable, and that it successfully solves various unpaired image-to-image translation tasks. Code: \urlthis https URL

CV-36-标题: Audio-Visual Dataset and Method for Anomaly Detection in Traffic Videos

链接: https://arxiv.org/abs/2305.15084
作者: Błażej Leporowski, Arian Bakhtiarnia, Nicole Bonnici, Adrian Muscat, Luca Zanella, Yiming Wang, Alexandros Iosifidis


Abstract: We introduce the first audio-visual dataset for traffic anomaly detection taken from real-world scenes, called MAVAD, with a diverse range of weather and illumination conditions. In addition, we propose a novel method named AVACA that combines visual and audio features extracted from video sequences by means of cross-attention to detect anomalies. We demonstrate that the addition of audio improves the performance of AVACA by up to 5.2%. We also evaluate the impact of image anonymization, showing only a minor decrease in performance averaging at 1.7%.

CV-37-标题: Learning INR for Event-guided Rolling Shutter Frame Correction Deblur and Interpolation

链接: https://arxiv.org/abs/2305.15078
作者: Yunfan Lu, Guoqiang Liang, Lin Wang


Abstract: Images captured by rolling shutter (RS) cameras under fast camera motion often contain obvious image distortions and blur, which can be modeled as a row-wise combination of a sequence of global shutter (GS) frames within the exposure time naturally, recovering high-frame-rate GS sharp frames from an RS blur image needs to simultaneously consider RS correction, deblur, and frame interpolation Taking this task is nontrivial, and to our knowledge, no feasible solutions exist by far. A naive way is to decompose the complete process into separate tasks and simply cascade existing methods; however, this results in cumulative errors and noticeable artifacts. Event cameras enjoy many advantages, e.g., high temporal resolution, making them potential for our problem. To this end, we make the first attempt to recover high-frame-rate sharp GS frames from an RS blur image and paired event data. Our key idea is to learn an implicit neural representation (INR) to directly map the position and time coordinates to RGB values to address the interlocking degradations in the image restoration process. Specifically, we introduce spatial-temporal implicit encoding (STE) to convert an RS blur image and events into a spatial-temporal representation (STR). To query a specific sharp frame (GS or RS), we embed the exposure time into STR and decode the embedded features to recover a sharp frame. Moreover, we propose an RS blur image-guided integral loss to better train the network. Our method is relatively lightweight as it contains only 0.379M parameters and demonstrates high efficiency as the STE is called only once for any number of interpolation frames. Extensive experiments show that our method significantly outperforms prior methods addressing only one or two of the tasks.

CV-38-标题: PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology

链接: https://arxiv.org/abs/2305.15072
作者: Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Zhongyi Shui, Xiaoxuan Yu, Yizhi Zhao, Honglin Li, Yunlong Zhang, Ruojia Zhao, Xinheng Lyu, Lin Yang
备注: 13 pages, 5 figures, conference


Abstract: As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, with significant applications in natural image interpretation. However, the field of pathology has largely remained untapped in this regard, despite the growing need for accurate, timely, and personalized diagnostics. To bridge the gap in pathology MLLMs, we present the PathAsst in this study, which is a generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. To develop PathAsst, we collect over 142K high-quality pathology image-text pairs from a variety of reliable sources, including PubMed, comprehensive pathology textbooks, reputable pathology websites, and private data annotated by pathologists. Leveraging the advanced capabilities of ChatGPT/GPT-4, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data, specifically tailored for the invocation of the pathology-specific models, allowing the PathAsst to effectively interact with these models based on the input image and user intent, consequently enhancing the model’s diagnostic capabilities. Subsequently, our PathAsst is trained based on Vicuna-13B language model in coordination with the CLIP vision encoder. The results of PathAsst show the potential of harnessing the AI-powered generative foundation model to improve pathology diagnosis and treatment processes. We are committed to open-sourcing our meticulously curated dataset, as well as a comprehensive toolkit designed to aid researchers in the extensive collection and preprocessing of their own datasets. Resources can be obtained at this https URL.

CV-39-标题: Jointly Optimizing Image Compression with Low-light Image Enhancement

链接: https://arxiv.org/abs/2305.15030
作者: Shilv Cai, Xu Zou, Liqun Chen, Luxin Yan, Sheng Zhong
备注: arXiv admin note: text overlap with arXiv:2303.06705 by other authors


Abstract: Learning-based image compression methods have made great progress. Most of them are designed for generic natural images. In fact, low-light images frequently occur due to unavoidable environmental influences or technical limitations, such as insufficient lighting or limited exposure time. %When general-purpose image compression algorithms compress low-light images, useful detail information is lost, resulting in a dramatic decrease in image enhancement. Once low-light images are compressed by existing general image compression approaches, useful information(e.g., texture details) would be lost resulting in a dramatic performance decrease in low-light image enhancement. To simultaneously achieve a higher compression rate and better enhancement performance for low-light images, we propose a novel image compression framework with joint optimization of low-light image enhancement. We design an end-to-end trainable two-branch architecture with lower computational cost, which includes the main enhancement branch and the signal-to-noise ratio~(SNR) aware branch. Experimental results show that our proposed joint optimization framework achieves a significant improvement over existing Compress before Enhance" or Enhance before Compress" sequential solutions for low-light images. Source codes are included in the supplementary material.

CV-40-标题: Transferring Visual Attributes from Natural Language to Verified Image Generation

链接: https://arxiv.org/abs/2305.15026
作者: Rodrigo Valerio, Joao Bordalo, Michal Yarom, Yonattan Bitton, Idan Szpektor, Joao Magalhaes


Abstract: Text to image generation methods (T2I) are widely popular in generating art and other creative artifacts. While visual hallucinations can be a positive factor in scenarios where creativity is appreciated, such artifacts are poorly suited for cases where the generated image needs to be grounded in complex natural language without explicit visual elements. In this paper, we propose to strengthen the consistency property of T2I methods in the presence of natural complex language, which often breaks the limits of T2I methods by including non-visual information, and textual elements that require knowledge for accurate generation. To address these phenomena, we propose a Natural Language to Verified Image generation approach (NL2VI) that converts a natural prompt into a visual prompt, which is more suitable for image generation. A T2I model then generates an image for the visual prompt, which is then verified with VQA algorithms. Experimentally, aligning natural prompts with image generation can improve the consistency of the generated images by up to 11% over the state of the art. Moreover, improvements can generalize to challenging domains like cooking and DIY tasks, where the correctness of the generated image is crucial to illustrate actions.

CV-41-标题: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

链接: https://arxiv.org/abs/2305.15023
作者: Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji


Abstract: Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive parameters, but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions without compromising their ability of natural language understanding. To validate MMA, we apply it to a recent LLM called LLaMA and term this formed large vision-language instructed model as LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and the superior training efficiency of LaVIN than existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely cheap, e.g., only 1.4 training hours with 3.8M trainable parameters, greatly confirming the effectiveness of MMA. Our project is released at this https URL.

CV-42-标题: Embodied GPT : Vision-Language Pre-Training via Embodied Chain of Thought

链接: https://arxiv.org/abs/2305.15021
作者: Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo


Abstract: Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the “Chain of Thoughts” mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.

CV-43-标题: Measuring Faithful and Plausible Visual Grounding in VQA

链接: https://arxiv.org/abs/2305.15015
作者: Daniel Reich, Felix Putze, Tanja Schultz


Abstract: Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system’s reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although inference capabilities of VQA models are often illustrated by a few qualitative illustrations, most systems are not quantitatively assessed for their VG properties. We believe, an easily calculated criterion for meaningfully measuring a system’s VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluations and analysis. To this end, we propose a new VG metric that captures if a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., if its visual grounding is both “faithful” and “plausible”. Our metric, called “Faithful and Plausible Visual Grounding” (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub.

CV-44-标题: Contrastive Training of Complex-Valued Autoencoders for Object Discovery

链接: https://arxiv.org/abs/2305.15001
作者: Aleksandar Stanić, Anand Gopalakrishnan, Kazuki Irie, Jürgen Schmidhuber
备注: 26 pages, 14 figures


Abstract: Current state-of-the-art object-centric models use slots and attention-based routing for binding. However, this class of models has several conceptual limitations: the number of slots is hardwired; all slots have equal capacity; training has high computational cost; there are no object-level relational factors within slots. Synchrony-based models in principle can address these limitations by using complex-valued activations which store binding information in their phase components. However, working examples of such synchrony-based models have been developed only very recently, and are still limited to toy grayscale datasets and simultaneous storage of less than three objects in practice. Here we introduce architectural modifications and a novel contrastive learning method that greatly improve the state-of-the-art synchrony-based model. For the first time, we obtain a class of synchrony-based models capable of discovering objects in an unsupervised manner in multi-object color datasets and simultaneously representing more than three objects

CV-45-标题: Non-adversarial Robustness of Deep Learning Methods for Computer Vision

链接: https://arxiv.org/abs/2305.14986
作者: Gorana Gojić, Vladimir Vincan, Ognjen Kundačina, Dragiša Mišković, Dinu Dragan


Abstract: Non-adversarial robustness, also known as natural robustness, is a property of deep learning models that enables them to maintain performance even when faced with distribution shifts caused by natural variations in data. However, achieving this property is challenging because it is difficult to predict in advance the types of distribution shifts that may occur. To address this challenge, researchers have proposed various approaches, some of which anticipate potential distribution shifts, while others utilize knowledge about the shifts that have already occurred to enhance model generalizability. In this paper, we present a brief overview of the most recent techniques for improving the robustness of computer vision methods, as well as a summary of commonly used robustness benchmark datasets for evaluating the model’s performance under data distribution shifts. Finally, we examine the strengths and limitations of the approaches reviewed and identify general trends in deep learning robustness improvement for computer vision.

CV-46-标题: Scale Matters: Attribution Meets the Wavelet Domain to Explain Model Sensitivity to Image Corruptions

链接: https://arxiv.org/abs/2305.14979
作者: Gabriel Kasmi, Laurent Dubus, Yves-Marie Saint Drenan, Philippe Blanc
备注: main: 9 pages, appendix 19 pages, 32 figures, 5 tables


Abstract: Neural networks have shown remarkable performance in computer vision, but their deployment in real-world scenarios is challenging due to their sensitivity to image corruptions. Existing attribution methods are uninformative for explaining the sensitivity to image corruptions, while the literature on robustness only provides model-based explanations. However, the ability to scrutinize models’ behavior under image corruptions is crucial to increase the user’s trust. Towards this end, we introduce the Wavelet sCale Attribution Method (WCAM), a generalization of attribution from the pixel domain to the space-scale domain. Attribution in the space-scale domain reveals where and on what scales the model focuses. We show that the WCAM explains models’ failures under image corruptions, identifies sufficient information for prediction, and explains how zoom-in increases accuracy.

CV-47-标题: Sampling-based Uncertainty Estimation for an Instance Segmentation Network

链接: https://arxiv.org/abs/2305.14977
作者: Florian Heidecker, Ahmad El-Khateeb, Bernhard Sick


Abstract: The examination of uncertainty in the predictions of machine learning (ML) models is receiving increasing attention. One uncertainty modeling technique used for this purpose is Monte-Carlo (MC)-Dropout, where repeated predictions are generated for a single input. Therefore, clustering is required to describe the resulting uncertainty, but only through efficient clustering is it possible to describe the uncertainty from the model attached to each object. This article uses Bayesian Gaussian Mixture (BGM) to solve this problem. In addition, we investigate different values for the dropout rate and other techniques, such as focal loss and calibration, which we integrate into the Mask-RCNN model to obtain the most accurate uncertainty approximation of each instance and showcase it graphically.

CV-48-标题: MMNet: Multi-Mask Network for Referring Image Segmentation

链接: https://arxiv.org/abs/2305.14969
作者: Yichen Yan, Xingjian He, Wenxuan Wan, Jing Liu
备注: 10 pages, 5 figures


Abstract: Referring image segmentation aims to segment an object referred to by natural language expression from an image. However, this task is challenging due to the distinct data properties between text and image, and the randomness introduced by diverse objects and unrestricted language expression. Most of previous work focus on improving cross-modal feature fusion while not fully addressing the inherent uncertainty caused by diverse objects and unrestricted language. To tackle these problems, we propose an end-to-end Multi-Mask Network for referring image segmentation(MMNet). we first combine picture and language and then employ an attention mechanism to generate multiple queries that represent different aspects of the language expression. We then utilize these queries to produce a series of corresponding segmentation masks, assigning a score to each mask that reflects its importance. The final result is obtained through the weighted sum of all masks, which greatly reduces the randomness of the language expression. Our proposed framework demonstrates superior performance compared to state-of-the-art approaches on the two most commonly used datasets, RefCOCO, RefCOCO+ and G-Ref, without the need for any post-processing. This further validates the efficacy of our proposed framework.

CV-49-标题: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents ICDAR2023

链接: https://arxiv.org/abs/2305.14962
作者: Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, Peter Staar
备注: ICDAR 2023, 10 pages, 4 figures


Abstract: Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition in hosting competitions to benchmark the state-of-the-art and encourage the development of novel solutions to document layout understanding. In this report, we present the results of our \textitICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents, which posed the challenge to accurately segment the page layout in a broad range of document styles and domains, including corporate reports, technical literature and patents. To raise the bar over previous competitions, we engineered a hard competition dataset and proposed the recent DocLayNet dataset for training. We recorded 45 team registrations and received official submissions from 21 teams. In the presented solutions, we recognize interesting combinations of recent computer vision models, data augmentation strategies and ensemble methods to achieve remarkable accuracy in the task we posed. A clear trend towards adoption of vision-transformer based methods is evident. The results demonstrate substantial progress towards achieving robust and highly generalizing methods for document layout understanding.

CV-50-标题: DC-Net: Divide-and-Conquer for Salient Object Detection

链接: https://arxiv.org/abs/2305.14955
作者: Jiayi Zhu, Xuebin Qin, Abdulmotaleb Elsaddik


Abstract: In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task to enable the model to learn prior knowledge that is for predicting the saliency map. We design a novel network, Divide-and-Conquer Network (DC-Net) which uses two encoders to solve different subtasks that are conducive to predicting the final saliency map, here is to predict the edge maps with width 4 and location maps of salient objects and then aggregate the feature maps with different semantic information into the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP ^2 ) modules, which have the ability to capture a large number of different scale features with a small number of convolution operations and have the advantages of maintaining high resolution all the time and being able to obtain a large and compact effective receptive field (ERF). Based on the advantage of Divide-and-Conquer’s parallel computing, we use Parallel Acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets under high efficiency (60 FPS and 55 FPS). Codes and results are available: this https URL.

CV-51-标题: Dual-Side Feature Fusion 3D Pose Transfer

链接: https://arxiv.org/abs/2305.14951
作者: Jue Liu, Feipeng Da


Abstract: 3D pose transfer solves the problem of additional input and correspondence of traditional deformation transfer, only the source and target meshes need to be input, and the pose of the source mesh can be transferred to the target mesh. Some lightweight methods proposed in recent years consume less memory but cause spikes and distortions for some unseen poses, while others are costly in training due to the inclusion of large matrix multiplication and adversarial networks. In addition, the meshes with different numbers of vertices also increase the difficulty of pose transfer. In this work, we propose a Dual-Side Feature Fusion Pose Transfer Network to improve the pose transfer accuracy of the lightweight method. Our method takes the pose features as one of the side inputs to the decoding network and fuses them into the target mesh layer by layer at multiple scales. Our proposed Feature Fusion Adaptive Instance Normalization has the characteristic of having two side input channels that fuse pose features and identity features as denormalization parameters, thus enhancing the pose transfer capability of the network. Extensive experimental results show that our proposed method has stronger pose transfer capability than state-of-the-art methods while maintaining a lightweight network structure, and can converge faster.

CV-52-标题: Incremental Dense Reconstruction from Monocular Video with Guided Sparse Feature Volume Fusion

链接: https://arxiv.org/abs/2305.14918
作者: Xingxing Zuo, Nan Yang, Nathaniel Merrill, Binbin Xu, Stefan Leutenegger
备注: 8 pages, 5 figures, RA-L 2023


Abstract: Incrementally recovering 3D dense structures from monocular videos is of paramount importance since it enables various robotics and AR applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without the need to first estimate depth, but they are not able to achieve as high of a resolution as depth-based methods due to the large memory consumption of high-resolution feature volumes. This letter proposes a real-time feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume, which is able to achieve higher resolutions than previous feature volume-based methods, and is favorable in large-scale outdoor scenarios where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is leveraged to infer initial voxel locations of the physical surface in a sparse feature volume. Then for refining the recovered 3D geometry, deep features are attentively aggregated from multiview images at potential surface locations, and temporally fused. Besides achieving higher resolutions than before, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate a very competitive real-time reconstruction result for our method compared to state-of-the-art reconstruction methods in both indoor and outdoor settings.

CV-53-标题: GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data

链接: https://arxiv.org/abs/2305.14914
作者: Zhitong Xiong, Sining Chen, Yi Wang, Lichao Mou, Xiao Xiang Zhu
备注: 13 pages


Abstract: Geometric information in the normalized digital surface models (nDSM) is highly correlated with the semantic class of the land cover. Exploiting two modalities (RGB and nDSM (height)) jointly has great potential to improve the segmentation performance. However, it is still an under-explored field in remote sensing due to the following challenges. First, the scales of existing datasets are relatively small and the diversity of existing datasets is limited, which restricts the ability of validation. Second, there is a lack of unified benchmarks for performance assessment, which leads to difficulties in comparing the effectiveness of different models. Last, sophisticated multi-modal semantic segmentation methods have not been deeply explored for remote sensing data. To cope with these challenges, in this paper, we introduce a new remote-sensing benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data. Towards a fair and comprehensive analysis of existing methods, the proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data. Furthermore, we propose a novel and effective Transformer-based intermediary multi-modal fusion (TIMF) module to improve the semantic segmentation performance through adaptive token-level multi-modal fusion.The designed benchmark can foster future research on developing new methods for multi-modal learning on remote sensing data. Extensive analyses of those methods are conducted and valuable insights are provided through the experimental results. Code for the benchmark and baselines can be accessed at \urlthis https URL.

CV-54-标题: HARD: Hard Augmentations for Robust Distillation

链接: https://arxiv.org/abs/2305.14890
作者: Arne F. Nix, Max F. Burg, Fabian H. Sinz


Abstract: Knowledge distillation (KD) is a simple and successful method to transfer knowledge from a teacher to a student model solely based on functional activity. However, current KD has a few shortcomings: it has recently been shown that this method is unsuitable to transfer simple inductive biases like shift equivariance, struggles to transfer out of domain generalization, and optimization time is magnitudes longer compared to default non-KD model training. To improve these aspects of KD, we propose Hard Augmentations for Robust Distillation (HARD), a generally applicable data augmentation framework, that generates synthetic data points for which the teacher and the student disagree. We show in a simple toy example that our augmentation framework solves the problem of transferring simple equivariances with KD. We then apply our framework in real-world tasks for a variety of augmentation models, ranging from simple spatial transformations to unconstrained image manipulations with a pretrained variational autoencoder. We find that our learned augmentations significantly improve KD performance on in-domain and out-of-domain evaluation. Moreover, our method outperforms even state-of-the-art data augmentations and since the augmented training inputs can be visualized, they offer a qualitative insight into the properties that are transferred from the teacher to the student. Thus HARD represents a generally applicable, dynamically optimized data augmentation technique tailored to improve the generalization and convergence speed of models trained with KD.

CV-55-标题: Towards View-invariant and Accurate Loop Detection Based on Scene Graph ICRA2023

链接: https://arxiv.org/abs/2305.14885
作者: Chuhao Liu, Shaojie Shen
备注: Accepted by ICRA2023


Abstract: Loop detection plays a key role in visual Simultaneous Localization and Mapping (SLAM) by correcting the accumulated pose drift. In indoor scenarios, the richly distributed semantic landmarks are view-point invariant and hold strong descriptive power in loop detection. The current semantic-aided loop detection embeds the topology between semantic instances to search a loop. However, current semantic-aided loop detection methods face challenges in dealing with ambiguous semantic instances and drastic viewpoint differences, which are not fully addressed in the literature. This paper introduces a novel loop detection method based on an incrementally created scene graph, targeting the visual SLAM at indoor scenes. It jointly considers the macro-view topology, micro-view topology, and occupancy of semantic instances to find correct correspondences. Experiments using handheld RGB-D sequence show our method is able to accurately detect loops in drastically changed viewpoints. It maintains a high precision in observing objects with similar topology and appearance. Our method also demonstrates that it is robust in changed indoor scenes.

CV-56-标题: Multiresolution Feature Guidance Based Transformer for Anomaly Detection

链接: https://arxiv.org/abs/2305.14880
作者: Shuting Yan, Pingping Chen, Honghui Chen, Huan Mao, Feng Chen, Zhijian Lin


Abstract: Anomaly detection is represented as an unsupervised learning to identify deviated images from normal images. In general, there are two main challenges of anomaly detection tasks, i.e., the class imbalance and the unexpectedness of anomalies. In this paper, we propose a multiresolution feature guidance method based on Transformer named GTrans for unsupervised anomaly detection and localization. In GTrans, an Anomaly Guided Network (AGN) pre-trained on ImageNet is developed to provide surrogate labels for features and tokens. Under the tacit knowledge guidance of the AGN, the anomaly detection network named Trans utilizes Transformer to effectively establish a relationship between features with multiresolution, enhancing the ability of the Trans in fitting the normal data manifold. Due to the strong generalization ability of AGN, GTrans locates anomalies by comparing the differences in spatial distance and direction of multi-scale features extracted from the AGN and the Trans. Our experiments demonstrate that the proposed GTrans achieves state-of-the-art performance in both detection and localization on the MVTec AD dataset. GTrans achieves image-level and pixel-level anomaly detection AUROC scores of 99.0% and 97.9% on the MVTec AD dataset, respectively.

CV-57-标题: Optimization-Based Improvement of Face Image Quality Assessment Techniques

链接: https://arxiv.org/abs/2305.14856
作者: Žiga Babnik, Naser Damer, Vitomir Štruc
备注: In proceedings of the International Workshop on Biometrics and Forensics (IWBF) 2023


Abstract: Contemporary face recognition (FR) models achieve near-ideal recognition performance in constrained settings, yet do not fully translate the performance to unconstrained (realworld) scenarios. To help improve the performance and stability of FR systems in such unconstrained settings, face image quality assessment (FIQA) techniques try to infer sample-quality information from the input face images that can aid with the recognition process. While existing FIQA techniques are able to efficiently capture the differences between high and low quality images, they typically cannot fully distinguish between images of similar quality, leading to lower performance in many scenarios. To address this issue, we present in this paper a supervised quality-label optimization approach, aimed at improving the performance of existing FIQA techniques. The developed optimization procedure infuses additional information (computed with a selected FR model) into the initial quality scores generated with a given FIQA technique to produce better estimates of the “actual” image quality. We evaluate the proposed approach in comprehensive experiments with six state-of-the-art FIQA approaches (CR-FIQA, FaceQAN, SER-FIQ, PCNet, MagFace, SDD-FIQA) on five commonly used benchmarks (LFW, CFPFP, CPLFW, CALFW, XQLFW) using three targeted FR models (ArcFace, ElasticFace, CurricularFace) with highly encouraging results.

CV-58-标题: DuDGAN: Improving Class-Conditional GANs via Dual-Diffusion

链接: https://arxiv.org/abs/2305.14849
作者: Taesun Yeom, Minhyeok Lee


Abstract: Class-conditional image generation using generative adversarial networks (GANs) has been investigated through various techniques; however, it continues to face challenges such as mode collapse, training instability, and low-quality output in cases of datasets with high intra-class variation. Furthermore, most GANs often converge in larger iterations, resulting in poor iteration efficacy in training procedures. While Diffusion-GAN has shown potential in generating realistic samples, it has a critical limitation in generating class-conditional samples. To overcome these limitations, we propose a novel approach for class-conditional image generation using GANs called DuDGAN, which incorporates a dual diffusion-based noise injection process. Our method consists of three unique networks: a discriminator, a generator, and a classifier. During the training process, Gaussian-mixture noises are injected into the two noise-aware networks, the discriminator and the classifier, in distinct ways. This noisy data helps to prevent overfitting by gradually introducing more challenging tasks, leading to improved model performance. As a result, our method outperforms state-of-the-art conditional GAN models for image generation in terms of performance. We evaluated our method using the AFHQ, Food-101, and CIFAR-10 datasets and observed superior results across metrics such as FID, KID, Precision, and Recall score compared with comparison models, highlighting the effectiveness of our approach.

CV-59-标题: Introducing Competition to Boost the Transferability of Targeted Adversarial Examples through Clean Feature Mixup CVPR2023

链接: https://arxiv.org/abs/2305.14846
作者: Junyoung Byun, Myung-Joon Kwon, Seungju Cho, Yoonji Kim, Changick Kim
备注: CVPR 2023 camera-ready


Abstract: Deep neural networks are widely known to be susceptible to adversarial examples, which can cause incorrect predictions through subtle input modifications. These adversarial examples tend to be transferable between models, but targeted attacks still have lower attack success rates due to significant variations in decision boundaries. To enhance the transferability of targeted adversarial examples, we propose introducing competition into the optimization process. Our idea is to craft adversarial perturbations in the presence of two new types of competitor noises: adversarial perturbations towards different target classes and friendly perturbations towards the correct class. With these competitors, even if an adversarial example deceives a network to extract specific features leading to the target class, this disturbance can be suppressed by other competitors. Therefore, within this competition, adversarial examples should take different attack strategies by leveraging more diverse features to overwhelm their interference, leading to improving their transferability to different models. Considering the computational complexity, we efficiently simulate various interference from these two types of competitors in feature space by randomly mixing up stored clean features in the model inference and named this method Clean Feature Mixup (CFM). Our extensive experimental results on the ImageNet-Compatible and CIFAR-10 datasets show that the proposed method outperforms the existing baselines with a clear margin. Our code is available at this https URL.

CV-60-标题: Predicting Token Impact Towards Efficient Vision Transformer

链接: https://arxiv.org/abs/2305.14840
作者: Hong Wang, Su Yang, Xiaoke Huang, Weishan Zhang
备注: 10 pages


Abstract: Token filtering to reduce irrelevant tokens prior to self-attention is a straightforward way to enable efficient vision Transformer. This is the first work to view token filtering from a feature selection perspective, where we weigh the importance of a token according to how much it can change the loss once masked. If the loss changes greatly after masking a token of interest, it means that such a token has a significant impact on the final decision and is thus relevant. Otherwise, the token is less important for the final decision, so it can be filtered out. After applying the token filtering module generalized from the whole training data, the token number fed to the self-attention module can be obviously reduced in the inference phase, leading to much fewer computations in all the subsequent self-attention layers. The token filter can be realized using a very simple network, where we utilize multi-layer perceptron. Except for the uniqueness of performing token filtering only once from the very beginning prior to self-attention, the other core feature making our method different from the other token filters lies in the predictability of token impact from a feature selection point of view. The experiments show that the proposed method provides an efficient way to approach a light weighted model after optimized with a backbone by means of fine tune, which is easy to be deployed in comparison with the existing methods based on training from scratch.

CV-61-标题: NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

链接: https://arxiv.org/abs/2305.14836
作者: Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang


Abstract: We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at this https URL.

CV-62-标题: OD-NeRF: Efficient Training of On-the-Fly Dynamic Neural Radiance Fields

链接: https://arxiv.org/abs/2305.14831
作者: Zhiwen Yan, Chen Li, Gim Hee Lee


Abstract: Dynamic neural radiance fields (dynamic NeRFs) have demonstrated impressive results in novel view synthesis on 3D dynamic scenes. However, they often require complete video sequences for training followed by novel view synthesis, which is similar to playing back the recording of a dynamic 3D scene. In contrast, we propose OD-NeRF to efficiently train and render dynamic NeRFs on-the-fly which instead is capable of streaming the dynamic scene. When training on-the-fly, the training frames become available sequentially and the model is trained and rendered frame-by-frame. The key challenge of efficient on-the-fly training is how to utilize the radiance field estimated from the previous frames effectively. To tackle this challenge, we propose: 1) a NeRF model conditioned on the multi-view projected colors to implicitly track correspondence between the current and previous frames, and 2) a transition and update algorithm that leverages the occupancy grid from the last frame to sample efficiently at the current frame. Our algorithm can achieve an interactive speed of 6FPS training and rendering on synthetic dynamic scenes on-the-fly, and a significant speed-up compared to the state-of-the-art on real-world dynamic scenes.

CV-63-标题: On Correlated Knowledge Distillation for Monitoring Human Pose with Radios

链接: https://arxiv.org/abs/2305.14829
作者: Shiva Raj Pokhrel, Jonathan Kua, Deol Satish, Phil Williams, Arkady Zaslavsky, Seng W. Loke, Jinho Choi


Abstract: In this work, we propose and develop a simple experimental testbed to study the feasibility of a novel idea by coupling radio frequency (RF) sensing technology with Correlated Knowledge Distillation (CKD) theory towards designing lightweight, near real-time and precise human pose monitoring systems. The proposed CKD framework transfers and fuses pose knowledge from a robust “Teacher” model to a parameterized “Student” model, which can be a promising technique for obtaining accurate yet lightweight pose estimates. To assure its efficacy, we implemented CKD for distilling logits in our integrated Software Defined Radio (SDR)-based experimental setup and investigated the RF-visual signal correlation. Our CKD-RF sensing technique is characterized by two modes – a camera-fed Teacher Class Network (e.g., images, videos) with an SDR-fed Student Class Network (e.g., RF signals). Specifically, our CKD model trains a dual multi-branch teacher and student network by distilling and fusing knowledge bases. The resulting CKD models are then subsequently used to identify the multimodal correlation and teach the student branch in reverse. Instead of simply aggregating their learnings, CKD training comprised multiple parallel transformations with the two domains, i.e., visual images and RF signals. Once trained, our CKD model can efficiently preserve privacy and utilize the multimodal correlated logits from the two different neural networks for estimating poses without using visual signals/video frames (by using only the RF signals).

CV-64-标题: Semi-Supervised and Long-Tailed Object Detection with CascadeMatch

链接: https://arxiv.org/abs/2305.14813
作者: Yuhang Zang, Kaiyang Zhou, Chen Huang, Chen Change Loy
备注: International Journal of Computer Vision (IJCV), 2023


Abstract: This paper focuses on long-tailed object detection in the semi-supervised learning setting, which poses realistic challenges, but has rarely been studied in the literature. We propose a novel pseudo-labeling-based detector called CascadeMatch. Our detector features a cascade network architecture, which has multi-stage detection heads with progressive confidence thresholds. To avoid manually tuning the thresholds, we design a new adaptive pseudo-label mining mechanism to automatically identify suitable values from data. To mitigate confirmation bias, where a model is negatively reinforced by incorrect pseudo-labels produced by itself, each detection head is trained by the ensemble pseudo-labels of all detection heads. Experiments on two long-tailed datasets, i.e., LVIS and COCO-LT, demonstrate that CascadeMatch surpasses existing state-of-the-art semi-supervised approaches – across a wide range of detection architectures – in handling long-tailed object detection. For instance, CascadeMatch outperforms Unbiased Teacher by 1.9 AP Fix on LVIS when using a ResNet50-based Cascade R-CNN structure, and by 1.7 AP Fix when using Sparse R-CNN with a Transformer encoder. We also show that CascadeMatch can even handle the challenging sparsely annotated object detection problem.

CV-65-标题: Exploring Diverse In-Context Configurations for Image Captioning

链接: https://arxiv.org/abs/2305.14800
作者: Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen


Abstract: After discovering that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in Vision-Language (VL) domains also develop their few-shot learners, while they only use the simplest way, \ie, randomly sampling, to configure in-context image-text pairs. In order to explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Here Image Captioning is used as the case study since it can be seen as the visually-conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case.

CV-66-标题: Polarimetric Imaging for Perception

链接: https://arxiv.org/abs/2305.14787
作者: Michael Baltaxe, Tomer Pe’er, Dan Levi


Abstract: Autonomous driving and advanced driver-assistance systems rely on a set of sensors and algorithms to perform the appropriate actions and provide alerts as a function of the driving scene. Typically, the sensors include color cameras, radar, lidar and ultrasonic sensors. Strikingly however, although light polarization is a fundamental property of light, it is seldom harnessed for perception tasks. In this work we analyze the potential for improvement in perception tasks when using an RGB-polarimetric camera, as compared to an RGB camera. We examine monocular depth estimation and free space detection during the middle of the day, when polarization is independent of subject heading, and show that a quantifiable improvement can be achieved for both of them using state-of-the-art deep neural networks, with a minimum of architectural changes. We also present a new dataset composed of RGB-polarimetric images, lidar scans, GNSS / IMU readings and free space segmentations that further supports developing perception algorithms that take advantage of light polarization.

CV-67-标题: Generative Modeling through the Semi-dual Formulation of Unbalanced Optimal Transport

链接: https://arxiv.org/abs/2305.14777
作者: Jaemoo Choi, Jaewoong Choi, Myungjoo Kang
备注: 23 pages, 15 figures


Abstract: Optimal Transport (OT) problem investigates a transport map that bridges two distributions while minimizing a given cost function. In this regard, OT between tractable prior distribution and data has been utilized for generative modeling tasks. However, OT-based methods are susceptible to outliers and face optimization challenges during training. In this paper, we propose a novel generative model based on the semi-dual formulation of Unbalanced Optimal Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution matching. This approach provides better robustness against outliers, stability during training, and faster convergence. We validate these properties empirically through experiments. Moreover, we study the theoretical upper-bound of divergence between distributions in UOT. Our model outperforms existing OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 5.80 on CelebA-HQ-256.

CV-68-标题: Dual Path Transformer with Partition Attention

链接: https://arxiv.org/abs/2305.14768
作者: Zhengkai Jiang, Liang Liu, Jiangning Zhang, Yabiao Wang, Mingang Chen, Chengjie Wang


Abstract: This paper introduces a novel attention mechanism, called dual attention, which is both efficient and effective. The dual attention mechanism consists of two parallel components: local attention generated by Convolutional Neural Networks (CNNs) and long-range attention generated by Vision Transformers (ViTs). To address the high computational complexity and memory footprint of vanilla Multi-Head Self-Attention (MHSA), we introduce a novel Multi-Head Partition-wise Attention (MHPA) mechanism. The partition-wise attention approach models both intra-partition and inter-partition attention simultaneously. Building on the dual attention block and partition-wise attention mechanism, we present a hierarchical vision backbone called DualFormer. We evaluate the effectiveness of our model on several computer vision tasks, including image classification on ImageNet, object detection on COCO, and semantic segmentation on Cityscapes. Specifically, the proposed DualFormer-XS achieves 81.5% top-1 accuracy on ImageNet, outperforming the recent state-of-the-art MPViT-XS by 0.6% top-1 accuracy with much higher throughput.

CV-69-标题: MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition

链接: https://arxiv.org/abs/2305.14758
作者: Tianlun Zheng, Zhineng Chen, BingChen Huang, Wei Zhang, Yu-Gang Jiang


Abstract: Traditional Multilingual Text Recognition (MLTR) usually targets a fixed set of languages and thus struggles to handle newly added languages or adapt to ever-changing class distributions. In this paper, we introduce the Incremental Multilingual Text Recognition (IMLTR) task in the incremental learning setting, where new language data comes in batches. Compared to generic incremental learning, IMLTR is even more challenging as it suffers from rehearsal-imbalance (uneven distribution of sample characters in the rehearsal set). To address this issue, we propose a Multiplexed Routing Network (MRN), where a series of recognizers is trained for each language. Subsequently, a language predictor is adopted to weigh the recognizers for voting. Since the recognizers are derived from the original model, MRN effectively reduces the reliance on older data and is better suited for rehearsal-imbalance. We extensively evaluate MRN on MLT17 and MLT19 datasets, outperforming existing state-of-the-art methods by a large margin, i.e., accuracy improvement ranging from 10.3% to 27.4% under different settings.

CV-70-标题: SUVR: A Search-based Approach to Unsupervised Visual Representation Learning ICASSP2023

链接: https://arxiv.org/abs/2305.14754
作者: Yi-Zhan Xu, Chih-Yao Chen, Cheng-Te Li
备注: ICASSP 2023


Abstract: Unsupervised learning has grown in popularity because of the difficulty of collecting annotated data and the development of modern frameworks that allow us to learn from unlabeled data. Existing studies, however, either disregard variations at different levels of similarity or only consider negative samples from one batch. We argue that image pairs should have varying degrees of similarity, and the negative samples should be allowed to be drawn from the entire dataset. In this work, we propose Search-based Unsupervised Visual Representation Learning (SUVR) to learn better image representations in an unsupervised manner. We first construct a graph from the image dataset by the similarity between images, and adopt the concept of graph traversal to explore positive samples. In the meantime, we make sure that negative samples can be drawn from the full dataset. Quantitative experiments on five benchmark image classification datasets demonstrate that SUVR can significantly outperform strong competing methods on unsupervised embedding learning. Qualitative experiments also show that SUVR can produce better representations in which similar images are clustered closer together than unrelated images in the latent space.

CV-71-标题: ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

链接: https://arxiv.org/abs/2305.14742
作者: Dongxu Yue, Qin Guo, Munan Ning, Jiaxi Cui, Yuesheng Zhu, Li Yuan


Abstract: Editing real facial images is a crucial task in computer vision with significant demand in various real-world applications. While GAN-based methods have showed potential in manipulating images especially when combined with CLIP, these methods are limited in their ability to reconstruct real images due to challenging GAN inversion capability. Despite the successful image reconstruction achieved by diffusion-based methods, there are still challenges in effectively manipulating fine-gained facial attributes with textual this http URL address these issues and facilitate convenient manipulation of real facial images, we propose a novel approach that conduct text-driven image editing in the semantic latent space of diffusion model. By aligning the temporal feature of the diffusion model with the semantic condition at generative process, we introduce a stable manipulation strategy, which perform precise zero-shot manipulation effectively. Furthermore, we develop an interactive system named ChatFace, which combines the zero-shot reasoning ability of large language models to perform efficient manipulations in diffusion semantic latent space. This system enables users to perform complex multi-attribute manipulations through dialogue, opening up new possibilities for interactive image editing. Extensive experiments confirmed that our approach outperforms previous methods and enables precise editing of real facial images, making it a promising candidate for real-world applications. Project page: this https URL

CV-72-标题: AutoDepthNet: High Frame Rate Depth Map Reconstruction using Commodity Depth and RGB Cameras

链接: https://arxiv.org/abs/2305.14731
作者: Peyman Gholami, Robert Xiao


Abstract: Depth cameras have found applications in diverse fields, such as computer vision, artificial intelligence, and video gaming. However, the high latency and low frame rate of existing commodity depth cameras impose limitations on their applications. We propose a fast and accurate depth map reconstruction technique to reduce latency and increase the frame rate in depth cameras. Our approach uses only a commodity depth camera and color camera in a hybrid camera setup; our prototype is implemented using a Kinect Azure depth camera at 30 fps and a high-speed RGB iPhone 11 Pro camera captured at 240 fps. The proposed network, AutoDepthNet, is an encoder-decoder model that captures frames from the high-speed RGB camera and combines them with previous depth frames to reconstruct a stream of high frame rate depth maps. On GPU, with a 480 x 270 output resolution, our system achieves an inference time of 8 ms, enabling real-time use at up to 200 fps with parallel processing. AutoDepthNet can estimate depth values with an average RMS error of 0.076, a 44.5% improvement compared to an optical flow-based comparison method. Our method can also improve depth map quality by estimating depth values for missing and invalidated pixels. The proposed method can be easily applied to existing depth cameras and facilitates the use of depth cameras in applications that require high-speed depth estimation. We also showcase the effectiveness of the framework in upsampling different sparse datasets e.g. video object segmentation. As a demonstration of our method, we integrated our framework into existing body tracking systems and demonstrated the robustness of the proposed method in such applications.

CV-73-标题: BinaryViT: Towards Efficient and Accurate Binary Vision Transformer s

链接: https://arxiv.org/abs/2305.14730
作者: Junrui Xiao, Zhikai Li, Lianwei Yang, Qingyi Gu


Abstract: Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision fields, but the considerable memory and computation costs hinders their application on resource-limited devices. As one of the most powerful compression methods, binarization reduces the computation of the neural network by quantizing the weights and activation values as \pm 1. Although existing binarization methods have demonstrated excellent performance on Convolutional Neural Networks (CNNs), the full binarization of ViTs is still under-studied and suffering a significant performance drop. In this paper, we first argue empirically that the severe performance degradation is mainly caused by the weight oscillation in the binarization training and the information distortion in the activation of ViTs. Based on these analyses, we propose \textbfBinaryViT , an accurate full binarization scheme for ViTs, which pushes the quantization of ViTs to the limit. Specifically, we propose a novel gradient regularization scheme (GRS) for driving a bimodal distribution of the weights to reduce oscillation in binarization training. Moreover, we design an activation shift module (ASM) to adaptively tune the activation distribution to reduce the information distortion caused by binarization. Extensive experiments on ImageNet dataset show that our BinaryViT consistently surpasses the strong baseline by 2.05% and improve the accuracy of fully binarized ViTs to a usable level. Furthermore, our method achieves impressive savings of 16.2 \times and 17.7 \times in model size and OPs compared to the full-precision DeiT-S. The codes and models will be released on github.

CV-74-标题: Remote Sensing Image Change Detection Towards Continuous Bitemporal Resolution Differences

链接: https://arxiv.org/abs/2305.14722
作者: Hao Chen, Haotian Zhang, Keyan Chen, Chenyao Zhou, Song Chen, Zhengxia Zhou, Zhenwei Shi
备注: 19 pages, 11 figures. Submitted to the IEEE for a possible publication


Abstract: Most contemporary supervised Remote Sensing (RS) image Change Detection (CD) approaches are customized for equal-resolution bitemporal images. Real-world applications raise the need for cross-resolution change detection, aka, CD based on bitemporal images with different spatial resolutions. Current cross-resolution methods that are trained with samples of a fixed resolution difference (resolution ratio between the high-resolution (HR) image and the low-resolution (LR) one) may fit a certain ratio but lack adaptation to other resolution differences. Toward continuous cross-resolution CD, we propose scale-invariant learning to enforce the model consistently predicting HR results given synthesized samples of varying bitemporal resolution differences. Concretely, we synthesize blurred versions of the HR image by random downsampled reconstructions to reduce the gap between HR and LR images. We introduce coordinate-based representations to decode per-pixel predictions by feeding the coordinate query and corresponding multi-level embedding features into an MLP that implicitly learns the shape of land cover changes, therefore benefiting recognizing blurred objects in the LR image. Moreover, considering that spatial resolution mainly affects the local textures, we apply local-window self-attention to align bitemporal features during the early stages of the encoder. Extensive experiments on two synthesized and one real-world different-resolution CD datasets verify the effectiveness of the proposed method. Our method significantly outperforms several vanilla CD methods and two cross-resolution CD methods on the three datasets both in in-distribution and out-of-distribution settings. The empirical results suggest that our method could yield relatively consistent HR change predictions regardless of varying resolution difference ratios. Our code will be public.

CV-75-标题: BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

链接: https://arxiv.org/abs/2305.14720
作者: Dongxu Li, Junnan Li, Steven C.H. Hoi


Abstract: Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at this https URL. Project page at this https URL.

CV-76-标题: Leveraging Future Relationship Reasoning for Vehicle Trajectory Prediction ICLR2023

链接: https://arxiv.org/abs/2305.14715
作者: Daehee Park, Hobin Ryu, Yunseo Yang, Jegyeong Cho, Jiwon Kim, Kuk-Jin Yoon
备注: ICLR 2023


Abstract: Understanding the interaction between multiple agents is crucial for realistic vehicle trajectory prediction. Existing methods have attempted to infer the interaction from the observed past trajectories of agents using pooling, attention, or graph-based methods, which rely on a deterministic approach. However, these methods can fail under complex road structures, as they cannot predict various interactions that may occur in the future. In this paper, we propose a novel approach that uses lane information to predict a stochastic future relationship among agents. To obtain a coarse future motion of agents, our method first predicts the probability of lane-level waypoint occupancy of vehicles. We then utilize the temporal probability of passing adjacent lanes for each agent pair, assuming that agents passing adjacent lanes will highly interact. We also model the interaction using a probabilistic distribution, which allows for multiple possible future interactions. The distribution is learned from the posterior distribution of interaction obtained from ground truth future trajectories. We validate our method on popular trajectory prediction datasets: nuScenes and Argoverse. The results show that the proposed method brings remarkable performance gain in prediction accuracy, and achieves state-of-the-art performance in long-term prediction benchmark dataset.

CV-77-标题: Streaming Object Detection on Fisheye Cameras for Automatic Parking

链接: https://arxiv.org/abs/2305.14713
作者: Yixiong Yan, Liangzhu Cheng, Yongxu Li, Xinjuan Tuo


Abstract: Fisheye cameras are widely employed in automatic parking, and the video stream object detection (VSOD) of the fisheye camera is a fundamental perception function to ensure the safe operation of vehicles. In past research work, the difference between the output of the deep learning model and the actual situation at the current moment due to the existence of delay of the perception system is generally ignored. But the environment will inevitably change within the delay time which may cause a potential safety hazard. In this paper, we propose a real-time detection framework equipped with a dual-flow perception module (dynamic and static flows) that can predict the future and alleviate the time-lag problem. Meanwhile, we use a new scheme to evaluate latency and accuracy. The standard bounding box is unsuitable for the object in fisheye camera images due to the strong radial distortion of the fisheye camera and the primary detection objects of parking perception are vehicles and pedestrians, so we adopt the rotate bounding box and propose a new periodic angle loss function to regress the angle of the box, which is the simple and accurate representation method of objects. The instance segmentation ground truth is used to supervise the training. Experiments demonstrate the effectiveness of our approach. Code is released at: this https URL.

CV-78-标题: EgoVSR: Towards High-Quality Egocentric Video Super-Resolution

链接: https://arxiv.org/abs/2305.14708
作者: Yichen Chi, Junhao Gu, Jiamiao Zhang, Wenming Yang, Yapeng Tian


Abstract: Due to the limitations of capture devices and scenarios, egocentric videos frequently have low visual quality, mainly caused by high compression and severe motion blur. With the increasing application of egocentric videos, there is an urgent need to enhance the quality of these videos through super-resolution. However, existing Video Super-Resolution (VSR) works, focusing on third-person view videos, are actually unsuitable for handling blurring artifacts caused by rapid ego-motion and object motion in egocentric videos. To this end, we propose EgoVSR, a VSR framework specifically designed for egocentric videos. We explicitly tackle motion blurs in egocentric videos using a Dual Branch Deblur Network (DB ^2 Net) in the VSR framework. Meanwhile, a blurring mask is introduced to guide the DB ^2 Net learning, and can be used to localize blurred areas in video frames. We also design a MaskNet to predict the mask, as well as a mask loss to optimize the mask estimation. Additionally, an online motion blur synthesis model for common VSR training data is proposed to simulate motion blurs as in egocentric videos. In order to validate the effectiveness of our proposed method, we introduce an EgoVSR dataset containing a large amount of fast-motion egocentric video sequences. Extensive experiments demonstrate that our EgoVSR model can efficiently super-resolve low-quality egocentric videos and outperform strong comparison baselines. Our code, pre-trained models, and data will be released.

CV-79-标题: AdvFunMatch: When Consistent Teaching Meets Adversarial Robustness

链接: https://arxiv.org/abs/2305.14700
作者: Ziuhi Wu, Haichang Gao, Bingqian Zhou, Ping Wang


Abstract: \emphConsistent teaching is an effective paradigm for implementing knowledge distillation (KD), where both student and teacher models receive identical inputs, and KD is treated as a function matching task (FunMatch). However, one limitation of FunMatch is that it does not account for the transfer of adversarial robustness, a model’s resistance to adversarial attacks. To tackle this problem, we propose a simple but effective strategy called Adversarial Function Matching (AdvFunMatch), which aims to match distributions for all data points within the \ell_p -norm ball of the training data, in accordance with consistent teaching. Formulated as a min-max optimization problem, AdvFunMatch identifies the worst-case instances that maximizes the KL-divergence between teacher and student model outputs, which we refer to as “mismatched examples,” and then matches the outputs on these mismatched examples. Our experimental results show that AdvFunMatch effectively produces student models with both high clean accuracy and robustness. Furthermore, we reveal that strong data augmentations (\emphe.g., AutoAugment) are beneficial in AdvFunMatch, whereas prior works have found them less effective in adversarial training. Code is available at \urlthis https URL.

CV-80-标题: Label-Efficient Learning in Agriculture: A Comprehensive Review

链接: https://arxiv.org/abs/2305.14691
作者: Jiajia Li, Dong Chen, Xinda Qi, Zhaojian Li, Yanbo Huang, Daniel Morris, Xiaobo Tan
备注: 34 pages, 23 figures


Abstract: The past decade has witnessed many great successes of machine learning (ML) and deep learning (DL) applications in agricultural systems, including weed control, plant disease diagnosis, agricultural robotics, and precision livestock management. Despite tremendous progresses, one downside of such ML/DL models is that they generally rely on large-scale labeled datasets for training, and the performance of such models is strongly influenced by the size and quality of available labeled data samples. In addition, collecting, processing, and labeling such large-scale datasets is extremely costly and time-consuming, partially due to the rising cost in human labor. Therefore, developing label-efficient ML/DL methods for agricultural applications has received significant interests among researchers and practitioners. In fact, there are more than 50 papers on developing and applying deep-learning-based label-efficient techniques to address various agricultural problems since 2016, which motivates the authors to provide a timely and comprehensive review of recent label-efficient ML/DL methods in agricultural applications. To this end, we first develop a principled taxonomy to organize these methods according to the degree of supervision, including weak supervision (i.e., active learning and semi-/weakly- supervised learning), and no supervision (i.e., un-/self- supervised learning), supplemented by representative state-of-the-art label-efficient ML/DL methods. In addition, a systematic review of various agricultural applications exploiting these label-efficient algorithms, such as precision agriculture, plant phenotyping, and postharvest quality assessment, is presented. Finally, we discuss the current problems and challenges, as well as future research directions. A well-classified paper list can be accessed at this https URL.

CV-81-标题: Collaborative Auto-encoding for Blind Image Quality Assessment

链接: https://arxiv.org/abs/2305.14684
作者: Zehong Zhou, Fei Zhou, Guoping Qiu


Abstract: Blind image quality assessment (BIQA) is a challenging problem with important real-world applications. Recent efforts attempting to exploit powerful representations by deep neural networks (DNN) are hindered by the lack of subjectively annotated data. This paper presents a novel BIQA method which overcomes this fundamental obstacle. Specifically, we design a pair of collaborative autoencoders (COAE) consisting of a content autoencoder (CAE) and a distortion autoencoder (DAE) that work together to extract content and distortion representations, which are shown to be highly descriptive of image quality. While the CAE follows a standard codec procedure, we introduce the CAE-encoded feature as an extra input to the DAE’s decoder for reconstructing distorted images, thus effectively forcing DAE’s encoder to extract distortion representations. The self-supervised learning framework allows the COAE including two feature extractors to be trained by almost unlimited amount of data, thus leaving limited samples with annotations to finetune a BIQA model. We will show that the proposed BIQA method achieves state-of-the-art performance and has superior generalization capability over other learning based models. The codes are available at: this https URL.

CV-82-标题: Optimal Linear Subspace Search: Learning to Construct Fast and High-Quality Schedulers for Diffusion Models

链接: https://arxiv.org/abs/2305.14677
作者: Zhongjie Duan, Chengyu Wang, Cen Chen, Jun Huang, Weining Qian
备注: 13 pages, 5 figures


Abstract: In recent years, diffusion models have become the most popular and powerful methods in the field of image synthesis, even rivaling human artists in artistic creativity. However, the key issue currently limiting the application of diffusion models is its extremely slow generation process. Although several methods were proposed to speed up the generation process, there still exists a trade-off between efficiency and quality. In this paper, we first provide a detailed theoretical and empirical analysis of the generation process of the diffusion models based on schedulers. We transform the designing problem of schedulers into the determination of several parameters, and further transform the accelerated generation process into an expansion process of the linear subspace. Based on these analyses, we consequently propose a novel method called Optimal Linear Subspace Search (OLSS), which accelerates the generation process by searching for the optimal approximation process of the complete generation process in the linear subspaces spanned by latent variables. OLSS is able to generate high-quality images with a very small number of steps. To demonstrate the effectiveness of our method, we conduct extensive comparative experiments on open-source diffusion models. Experimental results show that with a given number of steps, OLSS can significantly improve the quality of generated images. Using an NVIDIA A100 GPU, we make it possible to generate a high-quality image by Stable Diffusion within only one second without other optimization techniques.

CV-83-标题: T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities

链接: https://arxiv.org/abs/2305.14674
作者: Kangfu Mei, Mo Zhou, Vishal M. Patel
备注: for project page, see this https URL


Abstract: Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation of various modalities including images, videos, and 3D geometry, it does not scale to a higher data resolution. This can be attributed to the ``scaling property’', where it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model comprising of a view-wise sampling algorithm to focus on local structure learning, and incorporating additional guidance, e.g., text description, to complement the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable modality-unified visual content generation.

CV-84-标题: NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution

链接: https://arxiv.org/abs/2305.14669
作者: Yexing Song, Meilin Wang, Xiaoyu Xian, Zhijing Yang, Yuming Fan, Yukai Shi


Abstract: The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation metrics in most VSR methods are not able to effectively simulate real-world noise and blur. On the contrary, simple combinations of classical degradation are used for real-world noise modeling, which led to the VSR model often being violated by out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited. To address the aforementioned problems, we propose a Negatives augmentation strategy for generalized noise modeling in Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degeneration domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality.

CV-85-标题: Robust 3D-aware Object Classification via Discriminative Render-and-Compare

链接: https://arxiv.org/abs/2305.14668
作者: Artur Jesslen, Guofeng Zhang, Angtian Wang, Alan Yuille, Adam Kortylewski


Abstract: In real-world applications, it is essential to jointly estimate the 3D object pose and class label of objects, i.e., to perform 3D-aware classification.While current approaches for either image classification or pose estimation can be extended to 3D-aware classification, we observe that they are inherently limited: 1) Their performance is much lower compared to the respective single-task models, and 2) they are not robust in out-of-distribution (OOD) scenarios. Our main contribution is a novel architecture for 3D-aware classification, which builds upon a recent work and performs comparably to single-task models while being highly robust. In our method, an object category is represented as a 3D cuboid mesh composed of feature vectors at each mesh vertex. Using differentiable rendering, we estimate the 3D object pose by minimizing the reconstruction error between the mesh and the feature representation of the target image. Object classification is then performed by comparing the reconstruction losses across object categories. Notably, the neural texture of the mesh is trained in a discriminative manner to enhance the classification performance while also avoiding local optima in the reconstruction loss. Furthermore, we show how our method and feed-forward neural networks can be combined to scale the render-and-compare approach to larger numbers of categories. Our experiments on PASCAL3D+, occluded-PASCAL3D+, and OOD-CV show that our method outperforms all baselines at 3D-aware classification by a wide margin in terms of performance and robustness.

CV-86-标题: Dealing with Cross-Task Class Discrimination in Online Continual Learning CVPR2023

链接: https://arxiv.org/abs/2305.14657
作者: Yiduo Guo, Bing Liu, Dongyan Zhao
备注: Accepted by CVPR2023


Abstract: Existing continual learning (CL) research regards catastrophic forgetting (CF) as almost the only challenge. This paper argues for another challenge in class-incremental learning (CIL), which we call cross-task class discrimination (CTCD),~i.e., how to establish decision boundaries between the classes of the new task and old tasks with no (or limited) access to the old task data. CTCD is implicitly and partially dealt with by replay-based methods. A replay method saves a small amount of data (replay data) from previous tasks. When a batch of current task data arrives, the system jointly trains the new data and some sampled replay data. The replay data enables the system to partially learn the decision boundaries between the new classes and the old classes as the amount of the saved data is small. However, this paper argues that the replay approach also has a dynamic training bias issue which reduces the effectiveness of the replay data in solving the CTCD problem. A novel optimization objective with a gradient-based adaptive method is proposed to dynamically deal with the problem in the online CL process. Experimental results show that the new method achieves much better results in online CL.

CV-87-标题: Reinforcement Learning finetuned Vision-Code Transformer for UI-to-Code Generation

链接: https://arxiv.org/abs/2305.14637
作者: Davit Soselia, Khalid Saifullah, Tianyi Zhou


Abstract: Automated HTML/CSS code generation from screenshots is an important yet challenging problem with broad applications in website development and design. In this paper, we present a novel vision-code transformer approach that leverages an Encoder-Decoder architecture as well as explore actor-critic fine-tuning as a method for improving upon the baseline. For this purpose, two image encoders are compared: Vision Transformer (ViT) and Document Image Transformer (DiT). We propose an end-to-end pipeline that can generate high-quality code snippets directly from screenshots, streamlining the website creation process for developers. To train and evaluate our models, we created a synthetic dataset of 30,000 unique pairs of code and corresponding screenshots. We evaluate the performance of our approach using a combination of automated metrics such as MSE, BLEU, IoU, and a novel htmlBLEU score, where our models demonstrated strong performance. We establish a strong baseline with the DiT-GPT2 model and show that actor-critic can be used to improve IoU score from the baseline of 0.64 to 0.79 and lower MSE from 12.25 to 9.02. We achieved similar performance as when using larger models, with much lower computational cost.

CV-88-标题: Realistically distributing object placements in synthetic training data improves the performance of vision-based object detection models

链接: https://arxiv.org/abs/2305.14621
作者: Setareh Dabiri, Vasileios Lioutas, Berend Zwartsenberg, Yunpeng Liu, Matthew Niedoba, Xiaoxuan Liang, Dylan Green, Justice Sefas, Jonathan Wilder Lavington, Frank Wood, Adam Scibior


Abstract: When training object detection models on synthetic data, it is important to make the distribution of synthetic data as close as possible to the distribution of real data. We investigate specifically the impact of object placement distribution, keeping all other aspects of synthetic data fixed. Our experiment, training a 3D vehicle detection model in CARLA and testing on KITTI, demonstrates a substantial improvement resulting from improving the object placement distribution.

CV-89-标题: Assessment of Anterior Cruciate Ligament Injury Risk Based on Human Key Points Detection Algorithm

链接: https://arxiv.org/abs/2305.14612
作者: Ziyu Gong, Xiong Zhao, Chen Yang
备注: 17 pages,and 6 figures


Abstract: This paper aims to detect the potential injury risk of the anterior cruciate ligament (ACL) by proposing an ACL potential injury risk assessment algorithm based on key points of the human body detected using computer vision technology. To obtain the key points data of the human body in each frame, OpenPose, an open source computer vision algorithm, was employed. The obtained data underwent preprocessing and were then fed into an ACL potential injury feature extraction model based on the Landing Error Evaluation System (LESS). This model extracted several important parameters, including the knee flexion angle, the trunk flexion on the sagittal plane, trunk flexion angle on the frontal plane, the ankle knee horizontal distance, and the ankle shoulder horizontal distance. Each of these features was assigned a threshold interval, and a segmented evaluation function was utilized to score them accordingly. To calculate the final score of the participant, the score values were input into a weighted scoring model designed based on the Analytic Hierarchy Process (AHP). The AHP based model takes into account the relative importance of each feature in the overall assessment. The results demonstrate that the proposed algorithm effectively detects the potential risk of ACL injury. The proposed algorithm demonstrates its effectiveness in detecting ACL injury risk, offering valuable insights for injury prevention and intervention strategies in sports and related fields. Code is available at: this https URL

CV-90-标题: FaceFusion: Exploiting Full Spectrum of Multiple Dataset s

链接: https://arxiv.org/abs/2305.14601
作者: Chiyoung Song, Dongjae Lee


Abstract: The size of training dataset is known to be among the most dominating aspects of training high-performance face recognition embedding model. Building a large dataset from scratch could be cumbersome and time-intensive, while combining multiple already-built datasets poses the risk of introducing large amount of label noise. We present a novel training method, named FaceFusion. It creates a fused view of different datasets that is untainted by identity conflicts, while concurrently training an embedding network using the view in an end-to-end fashion. Using the unified view of combined datasets enables the embedding network to be trained against the entire spectrum of the datasets, leading to a noticeable performance boost. Extensive experiments confirm superiority of our method, whose performance in public evaluation datasets surpasses not only that of using a single training dataset, but also that of previously known methods under various training circumstances.

CV-91-标题: Vision Language Applications: A Survey CVPR2023

链接: https://arxiv.org/abs/2305.14598
作者: Yutong Zhou, Nobutaka Shimada
备注: Accepted by GCV @CVPR2023


Abstract: Text-to-image generation has attracted significant interest from researchers and practitioners in recent years due to its widespread and diverse applications across various industries. Despite the progress made in the domain of vision and language research, the existing literature remains relatively limited, particularly with regard to advancements and applications in this field. This paper explores a relevant research track within multimodal applications, including text, vision, audio, and others. In addition to the studies discussed in this paper, we are also committed to continually updating the latest relevant papers, datasets, application projects and corresponding information at this https URL

CV-92-标题: Real-Time Idling Vehicles Detection Using Combined Audio-Visual Deep Learning

链接: https://arxiv.org/abs/2305.14579
作者: Xiwen Li, Tristalee Mangin, Surojit Saha, Evan Blanchard, Dillon Tang, Henry Poppe, Nathan Searle, Ouk Choi, Kerry Kelly, Ross Whitaker


Abstract: Combustion vehicle emissions contribute to poor air quality and release greenhouse gases into the atmosphere, and vehicle pollution has been associated with numerous adverse health effects. Roadways with extensive waiting and/or passenger drop off, such as schools and hospital drop-off zones, can result in high incidence and density of idling vehicles. This can produce micro-climates of increased vehicle pollution. Thus, the detection of idling vehicles can be helpful in monitoring and responding to unnecessary idling and be integrated into real-time or off-line systems to address the resulting pollution. In this paper we present a real-time, dynamic vehicle idling detection algorithm. The proposed idle detection algorithm and notification rely on an algorithm to detect these idling vehicles. The proposed method relies on a multi-sensor, audio-visual, machine-learning workflow to detect idling vehicles visually under three conditions: moving, static with the engine on, and static with the engine off. The visual vehicle motion detector is built in the first stage, and then a contrastive-learning-based latent space is trained for classifying static vehicle engine sound. We test our system in real-time at a hospital drop-off point in Salt Lake City. This in-situ dataset was collected and annotated, and it includes vehicles of varying models and types. The experiments show that the method can detect engine switching on or off instantly and achieves 71.01 mean average precision (mAP).

CV-93-标题: Towards Early Prediction of Human iPSC Reprogramming Success

链接: https://arxiv.org/abs/2305.14575
作者: Abhineet Singh, Ila Jasra, Omar Mouhammed, Nidheesh Dadheech, Nilanjan Ray, James Shapiro


Abstract: This paper presents advancements in automated early-stage prediction of the success of reprogramming human induced pluripotent stem cells (iPSCs) as a potential source for regenerative cell therapies.The minuscule success rate of iPSC-reprogramming of around 0.01% to 0.1% makes it labor-intensive, time-consuming, and exorbitantly expensive to generate a stable iPSC line. Since that requires culturing of millions of cells and intense biological scrutiny of multiple clones to identify a single optimal clone. The ability to reliably predict which cells are likely to establish as an optimal iPSC line at an early stage of pluripotency would therefore be ground-breaking in rendering this a practical and cost-effective approach to personalized medicine. Temporal information about changes in cellular appearance over time is crucial for predicting its future growth outcomes. In order to generate this data, we first performed continuous time-lapse imaging of iPSCs in culture using an ultra-high resolution microscope. We then annotated the locations and identities of cells in late-stage images where reliable manual identification is possible. Next, we propagated these labels backwards in time using a semi-automated tracking system to obtain labels for early stages of growth. Finally, we used this data to train deep neural networks to perform automatic cell segmentation and classification. Our code and data are available at this https URL.

CV-94-标题: GO-LDA: Generalised Optimal Linear Discriminant Analysis

链接: https://arxiv.org/abs/2305.14568
作者: Jiahui Liu, Xiaohao Cai, Mahesan Niranjan


Abstract: Linear discriminant analysis (LDA) has been a useful tool in pattern recognition and data analysis research and practice. While linearity of class boundaries cannot always be expected, nonlinear projections through pre-trained deep neural networks have served to map complex data onto feature spaces in which linear discrimination has served well. The solution to binary LDA is obtained by eigenvalue analysis of within-class and between-class scatter matrices. It is well known that the multiclass LDA is solved by an extension to the binary LDA, a generalised eigenvalue problem, from which the largest subspace that can be extracted is of dimension one lower than the number of classes in the given problem. In this paper, we show that, apart from the first of the discriminant directions, the generalised eigenanalysis solution to multiclass LDA does neither yield orthogonal discriminant directions nor maximise discrimination of projected data along them. Surprisingly, to the best of our knowledge, this has not been noted in decades of literature on LDA. To overcome this drawback, we present a derivation with a strict theoretical support for sequentially obtaining discriminant directions that are orthogonal to previously computed ones and maximise in each step the Fisher criterion. We show distributions of projections along these axes and demonstrate that discrimination of data projected onto these discriminant directions has optimal separation, which is much higher than those from the generalised eigenvectors of the multiclass LDA. Using a wide range of benchmark tasks, we present a comprehensive empirical demonstration that on a number of pattern recognition and classification problems, the optimal discriminant subspaces obtained by the proposed method, referred to as GO-LDA (Generalised Optimal LDA), can offer superior accuracy.

CV-95-标题: Constant Memory Attentive Neural Processes

链接: https://arxiv.org/abs/2305.14567
作者: Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Yoshua Bengio, Mohamed Osama Ahmed


Abstract: Neural Processes (NPs) are efficient methods for estimating predictive uncertainties. NPs comprise of a conditioning phase where a context dataset is encoded, a querying phase where the model makes predictions using the context dataset encoding, and an updating phase where the model updates its encoding with newly received datapoints. However, state-of-the-art methods require additional memory which scales linearly or quadratically with the size of the dataset, limiting their applications, particularly in low-resource settings. In this work, we propose Constant Memory Attentive Neural Processes (CMANPs), an NP variant which only requires constant memory for the conditioning, querying, and updating phases. In building CMANPs, we propose Constant Memory Attention Block (CMAB), a novel general-purpose attention block that can compute its output in constant memory and perform updates in constant computation. Empirically, we show CMANPs achieve state-of-the-art results on meta-regression and image completion tasks while being (1) significantly more memory efficient than prior methods and (2) more scalable to harder settings.

CV-96-标题: Exploring Semantic Variations in GAN Latent Spaces via Matrix Factorization ICLR2023

链接: https://arxiv.org/abs/2305.14551
作者: Andrey Palaev, Rustam A. Lukmanov, Adil Khan
备注: Accepted at ICLR 2023 Tiny Papers


Abstract: Controlled data generation with GANs is desirable but challenging due to the nonlinearity and high dimensionality of their latent spaces. In this work, we explore image manipulations learned by GANSpace, a state-of-the-art method based on PCA. Through quantitative and qualitative assessments we show: (a) GANSpace produces a wide range of high-quality image manipulations, but they can be highly entangled, limiting potential use cases; (b) Replacing PCA with ICA improves the quality and disentanglement of manipulations; © The quality of the generated images can be sensitive to the size of GANs, but regardless of their complexity, fundamental controlling directions can be observed in their latent spaces.

CV-97-标题: Slovo: Russian Sign Language Dataset

链接: https://arxiv.org/abs/2305.14527
作者: Alexander Kapitanov, Karina Kvanchiani, Alexander Nagaev, Elizaveta Petrova
备注: russian sign language recognition dataset, open-source


Abstract: One of the main challenges of the sign language recognition task is the difficulty of collecting a suitable dataset due to the gap between deaf and hearing society. In addition, the sign language in each country differs significantly, which obliges the creation of new data for each of them. This paper presents the Russian Sign Language (RSL) video dataset Slovo, produced using crowdsourcing platforms. The dataset contains 20,000 FullHD recordings, divided into 1,000 classes of RSL gestures received by 194 signers. We also provide the entire dataset creation pipeline, from data collection to video annotation, with the following demo application. Several neural networks are trained and evaluated on the Slovo to demonstrate its teaching ability. Proposed data and pre-trained models are publicly available.

CV-98-标题: Design a Delicious Lunchbox in Style CVPR2023

链接: https://arxiv.org/abs/2305.14522
作者: Yutong Zhou
备注: Accepted by WiCV @CVPR2023 (In Progress). Dataset: this https URL


Abstract: We propose a cyclic generative adversarial network with spatial-wise and channel-wise attention modules for text-to-image synthesis. To accurately depict and design scenes with multiple occluded objects, we design a pre-trained ordering recovery model and a generative adversarial network to predict layout and composite novel box lunch presentations. In the experiments, we devise the Bento800 dataset to evaluate the performance of the text-to-image synthesis model and the layout generation & image composition model. This paper is the continuation of our previous paper works. We also present additional experiments and qualitative performance comparisons to verify the effectiveness of our proposed method. Bento800 dataset is available at this https URL

CV-99-标题: Windscreen Optical Quality for AI Algorithms: Refractive Power and MTF not Sufficient ITSC-2023

链接: https://arxiv.org/abs/2305.14513
作者: Dominik Werner Wolf, Markus Ulrich, Alexander Braun
备注: Submitted to IEEE ITSC-2023


Abstract: Windscreen optical quality is an important aspect of any advanced driver assistance system, and also for future autonomous driving, as today at least some cameras of the sensor suite are situated behind the windscreen. Automotive mass production processes require measurement systems that characterize the optical quality of the windscreens in a meaningful way, which for modern perception stacks implies meaningful for artificial intelligence (AI) algorithms. The measured optical quality needs to be linked to the performance of these algorithms, such that performance limits - and thus production tolerance limits - can be defined. In this article we demonstrate that the main metric established in the industry - refractive power - is fundamentally not capable of capturing relevant optical properties of windscreens. Further, as the industry is moving towards the modulation transfer function (MTF) as an alternative, we mathematically show that this metric cannot be used on windscreens alone, but that the windscreen forms a novel optical system together with the optics of the camera system. Hence, the required goal of a qualification system that is installed at the windscreen supplier and independently measures the optical quality cannot be achieved using MTF. We propose a novel concept to determine the optical quality of windscreens and to use simulation to link this optical quality to the performance of AI algorithms, which can hopefully lead to novel inspection systems.

CV-100-标题: Point2SSM: Learning Morphological Variations of Anatomies from Point Cloud

链接: https://arxiv.org/abs/2305.14486
作者: Jadie Adams, Shireen Elhabian
备注: Under review


Abstract: We introduce Point2SSM, a novel unsupervised learning approach that can accurately construct correspondence-based statistical shape models (SSMs) of anatomy directly from point clouds. SSMs are crucial in clinical research for analyzing the population-level morphological variation in bones and organs. However, traditional methods for creating SSMs have limitations that hinder their widespread adoption, such as the need for noise-free surface meshes or binary volumes, reliance on assumptions or predefined templates, and simultaneous optimization of the entire cohort leading to lengthy inference times given new data. Point2SSM overcomes these barriers by providing a data-driven solution that infers SSMs directly from raw point clouds, reducing inference burdens and increasing applicability as point clouds are more easily acquired. Deep learning on 3D point clouds has seen recent success in unsupervised representation learning, point-to-point matching, and shape correspondence; however, their application to constructing SSMs of anatomies is largely unexplored. In this work, we benchmark state-of-the-art point cloud deep networks on the task of SSM and demonstrate that they are not robust to the challenges of anatomical SSM, such as noisy, sparse, or incomplete input and significantly limited training data. Point2SSM addresses these challenges via an attention-based module that provides correspondence mappings from learned point features. We demonstrate that the proposed method significantly outperforms existing networks in terms of both accurate surface sampling and correspondence, better capturing population-level statistics.

CV-101-标题: Integrated Object Deformation and Contact Patch Estimation from Visuo-Tactile Feedback

链接: https://arxiv.org/abs/2305.14470
作者: Mark Van der Merwe, Youngsun Wi, Dmitry Berenson, Nima Fazeli
备注: 12 pages


Abstract: Reasoning over the interplay between object deformation and force transmission through contact is central to the manipulation of compliant objects. In this paper, we propose Neural Deforming Contact Field (NDCF), a representation that jointly models object deformations and contact patches from visuo-tactile feedback using implicit representations. Representing the object geometry and contact with the environment implicitly allows a single model to predict contact patches of varying complexity. Additionally, learning geometry and contact simultaneously allows us to enforce physical priors, such as ensuring contacts lie on the surface of the object. We propose a neural network architecture to learn a NDCF, and train it using simulated data. We then demonstrate that the learned NDCF transfers directly to the real-world without the need for fine-tuning. We benchmark our proposed approach against a baseline representing geometry and contact patches with point clouds. We find that NDCF performs better on simulated data and in transfer to the real-world.

CV-102-标题: FLAIR #2: textural and temporal information for semantic segmentation from multi-source optical imagery

链接: https://arxiv.org/abs/2305.14467
作者: Anatol Garioud, Apolline De Wit, Marc Poupée, Marion Valette, Sébastien Giordano, Boris Wattrelos


Abstract: The FLAIR #2 dataset hereby presented includes two very distinct types of data, which are exploited for a semantic segmentation task aimed at mapping land cover. The data fusion workflow proposes the exploitation of the fine spatial and textural information of very high spatial resolution (VHR) mono-temporal aerial imagery and the temporal and spectral richness of high spatial resolution (HR) time series of Copernicus Sentinel-2 satellite images. The French National Institute of Geographical and Forest Information (IGN), in response to the growing availability of high-quality Earth Observation (EO) data, is actively exploring innovative strategies to integrate these data with heterogeneous characteristics. IGN is therefore offering this dataset to promote innovation and improve our knowledge of our territories.

CV-103-标题: Sorted Convolutional Network for Achieving Continuous Rotational Invariance

链接: https://arxiv.org/abs/2305.14462
作者: Hanlin Mo, Guoying Zhao


Abstract: The topic of achieving rotational invariance in convolutional neural networks (CNNs) has gained considerable attention recently, as this invariance is crucial for many computer vision tasks such as image classification and matching. In this letter, we propose a Sorting Convolution (SC) inspired by some hand-crafted features of texture images, which achieves continuous rotational invariance without requiring additional learnable parameters or data augmentation. Further, SC can directly replace the conventional convolution operations in a classic CNN model to achieve its rotational invariance. Based on MNIST-rot dataset, we first analyze the impact of convolutional kernel sizes, different sampling and sorting strategies on SC’s rotational invariance, and compare our method with previous rotation-invariant CNN models. Then, we combine SC with VGG, ResNet and DenseNet, and conduct classification experiments on popular texture and remote sensing image datasets. Our results demonstrate that SC achieves the best performance in the aforementioned tasks.

CV-104-标题: Prompt ing Language-Informed Distribution for Compositional Zero-Shot Learning

链接: https://arxiv.org/abs/2305.14428
作者: Wentao Bao, Lichang Chen, Heng Huang, Yu Kong


Abstract: The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts (i.e., sliced tomatoes), where the models are learned only from the seen compositions (i.e., sliced potatoes and red tomatoes). Thanks to the prompt tuning on large pre-trained visual language models such as CLIP, recent literature shows impressively better CZSL performance than traditional vision-based methods. However, the key aspects that impact the generalization to unseen compositions, including the diversity and informativeness of class context, and the entanglement between visual primitives (i.e., states and objects), are not properly addressed in existing CLIP-based CZSL literature. In this paper, we propose a model by prompting the language-informed distribution, aka., PLID, for the CZSL task. Specifically, the PLID leverages pre-trained large language models (LLM) to 1) formulate the language-informed class distribution, and 2) enhance the compositionality of the softly prompted class embedding. Moreover, a stochastic logit mixup strategy is proposed to dynamically fuse the decisions from the predictions in the compositional and the primitive logit space. Orthogonal to the existing literature of soft, hard, or distributional prompts, our method advocates prompting the LLM-supported class distribution that leads to a better compositional zero-shot generalization. Experimental results on MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of the PLID to the prior arts. The code and models will be publicly released.

CV-105-标题: Evolution: A Unified Formula for Feature Operators from a High-level Perspective

链接: https://arxiv.org/abs/2305.14409
作者: Zhicheng Cai


Abstract: Traditionally, different types of feature operators (e.g., convolution, self-attention and involution) utilize different approaches to extract and aggregate the features. Resemblance can be hardly discovered from their mathematical formulas. However, these three operators all serve the same paramount purpose and bear no difference in essence. Hence we probe into the essence of various feature operators from a high-level perspective, transformed their components equivalently, and explored their mathematical expressions within higher dimensions. We raise one clear and concrete unified formula for different feature operators termed as Evolution. Evolution utilizes the Evolution Function to generate the Evolution Kernel, which extracts and aggregates the features in certain positions of the input feature map. We mathematically deduce the equivalent transformation from the traditional formulas of these feature operators to Evolution and prove the unification. In addition, we discuss the forms of Evolution Functions and the properties of generated Evolution Kernels, intending to give inspirations to the further research and innovations of powerful feature operators.

CV-106-标题: Layer-adaptive Structured Pruning Guided by Latency

链接: https://arxiv.org/abs/2305.14403
作者: Siyuan Pan, Linna Zhang, Jie Zhang, Xiaoshuang Li, Liang Hou, Xiaobing Tu
备注: arXiv admin note: text overlap with arXiv:2010.07611, arXiv:2110.10811 by other authors


Abstract: Structured pruning can simplify network architecture and improve inference speed. Combined with the underlying hardware and inference engine in which the final model is deployed, better results can be obtained by using latency collaborative loss function to guide network pruning together. Existing pruning methods that optimize latency have demonstrated leading performance, however, they often overlook the hardware features and connection in the network. To address this problem, we propose a global importance score SP-LAMP(Structured Pruning Layer-Adaptive Magnitude-based Pruning) by deriving a global importance score LAMP from unstructured pruning to structured pruning. In SP-LAMP, each layer includes a filter with an SP-LAMP score of 1, and the remaining filters are grouped. We utilize a group knapsack solver to maximize the SP-LAMP score under latency constraints. In addition, we improve the strategy of collect the latency to make it more accurate. In particular, for ResNet50/ResNet18 on ImageNet and CIFAR10, SP-LAMP is 1.28x/8.45x faster with +1.7%/-1.57% top-1 accuracy changed, respectively. Experimental results in ResNet56 on CIFAR10 demonstrate that our algorithm achieves lower latency compared to alternative approaches while ensuring accuracy and FLOPs.

CV-107-标题: Towards credible visual model interpretation with path attribution ICML’23

链接: https://arxiv.org/abs/2305.14395
作者: Naveed Akhtar, Muhammad A. A. K. Jalwana
备注: ICML’23 paper (text improved for CV community)


Abstract: Originally inspired by game-theory, path attribution framework stands out among the post-hoc model interpretation tools due to its axiomatic nature. However, recent developments show that this framework can still suffer from counter-intuitive results. Moreover, specifically for deep visual models, the existing path-based methods also fall short on conforming to the original intuitions that are the basis of the claimed axiomatic properties of this framework. We address these problems with a systematic investigation, and pinpoint the conditions in which the counter-intuitive results can be avoided for deep visual model interpretation with the path attribution strategy. We also devise a scheme to preclude the conditions in which visual model interpretation can invalidate the axiomatic properties of path attribution. These insights are combined into a method that enables reliable visual model interpretation. Our findings are establish empirically with multiple datasets, models and evaluation metrics. Extensive experiments show a consistent performance gain of our method over the baselines.

CV-108-标题: FEDORA: Flying Event Dataset fOr Reactive behAvior

链接: https://arxiv.org/abs/2305.14392
作者: Amogh Joshi, Adarsh Kosta, Wachirawit Ponghiran, Manish Nagaraj, Kaushik Roy


Abstract: The ability of living organisms to perform complex high speed manoeuvers in flight with a very small number of neurons and an incredibly low failure rate highlights the efficacy of these resource-constrained biological systems. Event-driven hardware has emerged, in recent years, as a promising avenue for implementing complex vision tasks in resource-constrained environments. Vision-based autonomous navigation and obstacle avoidance consists of several independent but related tasks such as optical flow estimation, depth estimation, Simultaneous Localization and Mapping (SLAM), object detection, and recognition. To ensure coherence between these tasks, it is imperative that they be trained on a single dataset. However, most existing datasets provide only a selected subset of the required data. This makes inter-network coherence difficult to achieve. Another limitation of existing datasets is the limited temporal resolution they provide. To address these limitations, we present FEDORA, a first-of-its-kind fully synthetic dataset for vision-based tasks, with ground truths for depth, pose, ego-motion, and optical flow. FEDORA is the first dataset to provide optical flow at three different frequencies - 10Hz, 25Hz, and 50Hz

CV-109-标题: Breast Cancer Segmentation using Attention-based Convolutional Network and Explainable AI

链接: https://arxiv.org/abs/2305.14389
作者: Jai Vardhan, Ghanta Sai Krishna


Abstract: Breast cancer (BC) remains a significant health threat, with no long-term cure currently available. Early detection is crucial, yet mammography interpretation is hindered by high false positives and negatives. With BC incidence projected to surpass lung cancer, improving early detection methods is vital. Thermography, using high-resolution infrared cameras, offers promise, especially when combined with artificial intelligence (AI). This work presents an attention-based convolutional neural network for segmentation, providing increased speed and precision in BC detection and classification. The system enhances images and performs cancer segmentation with explainable AI. We propose a transformer-attention-based convolutional architecture (UNet) for fault identification and employ Gradient-weighted Class Activation Mapping (Grad-CAM) to analyze areas of bias and weakness in the UNet architecture with IRT images. The superiority of our proposed framework is confirmed when compared with existing deep learning frameworks.

CV-110-标题: Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models

链接: https://arxiv.org/abs/2305.14384
作者: Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Max Bartolo, Oana Inel, Juan Ciro, Rafael Mosquera, Addison Howard, Will Cukierski, D. Sculley, Vijay Janapa Reddi, Lora Aroyo


Abstract: The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, these T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors inherited from pretraining on uncurated internet-scraped datasets thus have the potential to cause wide-reaching harm, for example, through generated images which are violent, sexually explicit, or contain biased and derogatory stereotypes. Despite this risk of harm, we lack systematic and structured evaluation datasets to scrutinize model behavior, especially adversarial attacks that bypass existing safety filters. A typical bottleneck in safety evaluation is achieving a wide coverage of different types of challenging examples in the evaluation set, i.e., identifying ‘unknown unknowns’ or long-tail problems. To address this need, we introduce the Adversarial Nibbler challenge. The goal of this challenge is to crowdsource a diverse set of failure modes and reward challenge participants for successfully finding safety vulnerabilities in current state-of-the-art T2I models. Ultimately, we aim to provide greater awareness of these issues and assist developers in improving the future safety and reliability of generative AI models. Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.

CV-111-标题: Connecting Multi-modal Contrastive Representations

链接: https://arxiv.org/abs/2305.14381
作者: Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao
备注: Demos are available at \url{this https URL}


Abstract: Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities. This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project them to a new space and use the data from the overlapping modality B to aligning the two MCRs in the new space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned by overlapping modality can also be transferred to non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completion of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. We take the field of audio-visual contrastive learning as an example to demonstrate the effectiveness of C-MCR. We connect pre-trained CLIP and CLAP models via texts to derive audio-visual contrastive representations. Remarkably, without using any paired audio-visual data and further tuning, C-MCR achieves state-of-the-art performance on six datasets across three audio-visual downstream tasks.

CV-112-标题: Zero-shot personalized lip-to-speech synthesis with face image based voice control ICASSP2023

链接: https://arxiv.org/abs/2305.14359
作者: Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling
备注: ICASSP 2023


Abstract: Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies can not achieve voice control under zero-shot condition, because extra speaker embeddings need to be extracted from natural reference speech and are unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to promote the ability of face-based speaker embeddings (FSE) on voice control. Extensive experiments verify the effectiveness of the proposed method whose synthetic utterances are more natural and matching with the personality of input video than the compared methods. To our best knowledge, this paper makes the first attempt on zero-shot personalized Lip2Speech synthesis with a face image rather than reference audio to control voice characteristics.

CV-113-标题: An Extensible Multimodal Multi-task Object Dataset with Materials ICLR2023

链接: https://arxiv.org/abs/2305.14352
作者: Trevor Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese
备注: ICLR 2023


Abstract: We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and position in Amazon’s product-category taxonomy. We also design a comprehensive taxonomy of 182 physical materials (e.g., Plastic \rightarrow Thermoplastic \rightarrow Acrylic). Objects are annotated with one or more materials from this taxonomy. With the numerous attributes available for each object, we develop a Smart Labeling framework to quickly add new binary labels to all objects with very little manual labeling effort, making the dataset extensible. Each object attribute in our dataset can be included in either the model inputs or outputs, leading to combinatorial possibilities in task configurations. For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale.

CV-114-标题: Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution

链接: https://arxiv.org/abs/2305.15357
作者: Yiyang Ma, Huan Yang, Wenhan Yang, Jianlong Fu, Jiaying Liu


Abstract: Diffusion models, as a kind of powerful generative model, have given impressive results on image super-resolution (SR) tasks. However, due to the randomness introduced in the reverse process of diffusion models, the performances of diffusion-based SR models are fluctuating at every time of sampling, especially for samplers with few resampled steps. This inherent randomness of diffusion models results in ineffectiveness and instability, making it challenging for users to guarantee the quality of SR results. However, our work takes this randomness as an opportunity: fully analyzing and leveraging it leads to the construction of an effective plug-and-play sampling method that owns the potential to benefit a series of diffusion-based SR methods. More in detail, we propose to steadily sample high-quality SR images from pretrained diffusion-based SR models by solving diffusion ordinary differential equations (diffusion ODEs) with optimal boundary conditions (BCs) and analyze the characteristics between the choices of BCs and their corresponding SR results. Our analysis shows the route to obtain an approximately optimal BC via an efficient exploration in the whole space. The quality of SR results sampled by the proposed method with fewer steps outperforms the quality of results sampled by current methods with randomness from the same pretrained diffusion-based SR model, which means that our sampling method ``boosts’’ current diffusion-based SR models without any additional training.

CV-115-标题: Deep Learning-based Bio-Medical Image Segmentation using UNet Architecture and Transfer Learning

链接: https://arxiv.org/abs/2305.14841
作者: Nima Hassanpour, Abouzar Ghavami


Abstract: Image segmentation is a branch of computer vision that is widely used in real world applications including biomedical image processing. With recent advancement of deep learning, image segmentation has achieved at a very high level performance. Recently, UNet architecture is found as the core of novel deep learning segmentation methods. In this paper we implement UNet architecture from scratch with using basic blocks in Pytorch and evaluate its performance on multiple biomedical image datasets. We also use transfer learning to apply novel modified UNet segmentation packages on the biomedical image datasets. We fine tune the pre-trained transferred model with each specific dataset. We compare its performance with our fundamental UNet implementation. We show that transferred learning model has better performance in image segmentation than UNet model that is implemented from scratch.

CV-116-标题: Detection of Non-uniformity in Parameters for Magnetic Domain Pattern Generation by Machine Learning

链接: https://arxiv.org/abs/2305.14764
作者: Naoya Mamada, Masaichiro Mizumaki, Ichiro Akai, Toru Aonishi
备注: 27 pages, 11 figures


Abstract: We attempt to estimate the spatial distribution of heterogeneous physical parameters involved in the formation of magnetic domain patterns of polycrystalline thin films by using convolutional neural networks. We propose a method to obtain a spatial map of physical parameters by estimating the parameters from patterns within a small subregion window of the full magnetic domain and subsequently shifting this window. To enhance the accuracy of parameter estimation in such subregions, we employ employ large-scale models utilized for natural image classification and exploit the benefits of pretraining. Using a model with high estimation accuracy on these subregions, we conduct inference on simulation data featuring spatially varying parameters and demonstrate the capability to detect such parameter variations.

CV-117-标题: ORRN: An ODE-based Recursive Registration Network for Deformable Respiratory Motion Estimation with Lung 4DCT Images

链接: https://arxiv.org/abs/2305.14673
作者: Xiao Liang, Shan Lin, Dimitri Schreiber, Michael Yip
备注: Accepted by IEEE Transactions on Biomedical Engineering


Abstract: Deformable Image Registration (DIR) plays a significant role in quantifying deformation in medical data. Recent Deep Learning methods have shown promising accuracy and speedup for registering a pair of medical images. However, in 4D (3D + time) medical data, organ motion, such as respiratory motion and heart beating, can not be effectively modeled by pair-wise methods as they were optimized for image pairs but did not consider the organ motion patterns necessary when considering 4D data. This paper presents ORRN, an Ordinary Differential Equations (ODE)-based recursive image registration network. Our network learns to estimate time-varying voxel velocities for an ODE that models deformation in 4D image data. It adopts a recursive registration strategy to progressively estimate a deformation field through ODE integration of voxel velocities. We evaluate the proposed method on two publicly available lung 4DCT datasets, DIRLab and CREATIS, for two tasks: 1) registering all images to the extreme inhale image for 3D+t deformation tracking and 2) registering extreme exhale to inhale phase images. Our method outperforms other learning-based methods in both tasks, producing the smallest Target Registration Error of 1.24mm and 1.26mm, respectively. Additionally, it produces less than 0.001% unrealistic image folding, and the computation speed is less than 1 second for each CT volume. ORRN demonstrates promising registration accuracy, deformation plausibility, and computation efficiency on group-wise and pair-wise registration tasks. It has significant implications in enabling fast and accurate respiratory motion estimation for treatment planning in radiation therapy or robot motion planning in thoracic needle insertion.

CV-118-标题: Attentive Continuous Generative Self-training for Unsupervised Domain Adaptive Medical Image Translation

链接: https://arxiv.org/abs/2305.14589
作者: Xiaofeng Liu, Jerry L. Prince, Fangxu Xing, Jiachen Zhuo, Reese Timothy, Maureen Stone, Georges El Fakhri, Jonghye Woo
备注: Accepted to Medical Image Analysis


Abstract: Self-training is an important class of unsupervised domain adaptation (UDA) approaches that are used to mitigate the problem of domain shift, when applying knowledge learned from a labeled source domain to unlabeled and heterogeneous target domains. While self-training-based UDA has shown considerable promise on discriminative tasks, including classification and segmentation, through reliable pseudo-label filtering based on the maximum softmax probability, there is a paucity of prior work on self-training-based UDA for generative tasks, including image modality translation. To fill this gap, in this work, we seek to develop a generative self-training (GST) framework for domain adaptive image translation with continuous value prediction and regression objectives. Specifically, we quantify both aleatoric and epistemic uncertainties within our GST using variational Bayes learning to measure the reliability of synthesized data. We also introduce a self-attention scheme that de-emphasizes the background region to prevent it from dominating the training process. The adaptation is then carried out by an alternating optimization scheme with target domain supervision that focuses attention on the regions with reliable pseudo-labels. We evaluated our framework on two cross-scanner/center, inter-subject translation tasks, including tagged-to-cine magnetic resonance (MR) image translation and T1-weighted MR-to-fractional anisotropy translation. Extensive validations with unpaired target domain data showed that our GST yielded superior synthesis performance in comparison to adversarial training UDA methods.

CV-119-标题: An Accelerated Pipeline for Multi-label Renal Pathology Image Segmentation at the Whole Slide Image Level

链接: https://arxiv.org/abs/2305.14566
作者: Haoju Leng, Ruining Deng, Zuhayr Asad, R. Michael Womick, Haichun Yang, Lipeng Wan, Yuankai Huo


Abstract: Deep-learning techniques have been used widely to alleviate the labour-intensive and time-consuming manual annotation required for pixel-level tissue characterization. Our previous study introduced an efficient single dynamic network - Omni-Seg - that achieved multi-class multi-scale pathological segmentation with less computational complexity. However, the patch-wise segmentation paradigm still applies to Omni-Seg, and the pipeline is time-consuming when providing segmentation for Whole Slide Images (WSIs). In this paper, we propose an enhanced version of the Omni-Seg pipeline in order to reduce the repetitive computing processes and utilize a GPU to accelerate the model’s prediction for both better model performance and faster speed. Our proposed method’s innovative contribution is two-fold: (1) a Docker is released for an end-to-end slide-wise multi-tissue segmentation for WSIs; and (2) the pipeline is deployed on a GPU to accelerate the prediction, achieving better segmentation quality in less time. The proposed accelerated implementation reduced the average processing time (at the testing stage) on a standard needle biopsy WSI from 2.3 hours to 22 minutes, using 35 WSIs from the Kidney Tissue Atlas (KPMP) Datasets. The source code and the Docker have been made publicly available at this https URL.

CV-120-标题: Reproducibility analysis of automated deep learning based localisation of mandibular canals on a temporal CBCT dataset

链接: https://arxiv.org/abs/2305.14385
作者: Jorma Järnstedt, Jaakko Sahlsten, Joel Jaskari, Kimmo Kaski, Helena Mehtonen, Ari Hietanen, Osku Sundqvist, Vesa Varjonen, Vesa Mattila, Sangsom Prapayasotok, Sakarat Nalampang


Abstract: Preoperative radiological identification of mandibular canals is essential for maxillofacial surgery. This study demonstrates the reproducibility of a deep learning system (DLS) by evaluating its localisation performance on 165 heterogeneous cone beam computed tomography (CBCT) scans from 72 patients in comparison to an experienced radiologist’s annotations. We evaluated the performance of the DLS using the symmetric mean curve distance (SMCD), the average symmetric surface distance (ASSD), and the Dice similarity coefficient (DSC). The reproducibility of the SMCD was assessed using the within-subject coefficient of repeatability (RC). Three other experts rated the diagnostic validity twice using a 0-4 Likert scale. The reproducibility of the Likert scoring was assessed using the repeatability measure (RM). The RC of SMCD was 0.969 mm, the median (interquartile range) SMCD and ASSD were 0.643 (0.186) mm and 0.351 (0.135) mm, respectively, and the mean (standard deviation) DSC was 0.548 (0.138). The DLS performance was most affected by postoperative changes. The RM of the Likert scoring was 0.923 for the radiologist and 0.877 for the DLS. The mean (standard deviation) Likert score was 3.94 (0.27) for the radiologist and 3.84 (0.65) for the DLS. The DLS demonstrated proficient qualitative and quantitative reproducibility, temporal generalisability, and clinical validity.

CV-121-标题: Raidionics: an open software for pre- and postoperative central nervous system tumor segmentation and standardized reporting

链接: https://arxiv.org/abs/2305.14351
作者: David Bouget, Demah Alsinan, Valeria Gaitan, Ragnhild Holden Helland, André Pedersen, Ole Solheim, Ingerid Reinertsen
备注: 11 pages, 3 figures, 3 tables


Abstract: For patients suffering from central nervous system tumors, prognosis estimation, treatment decisions, and postoperative assessments are made from the analysis of a set of magnetic resonance (MR) scans. Currently, the lack of open tools for standardized and automatic tumor segmentation and generation of clinical reports, incorporating relevant tumor characteristics, leads to potential risks from inherent decisions’ subjectivity. To tackle this problem, the proposed Raidionics open-source software has been developed, offering both a user-friendly graphical user interface and stable processing backend. The software includes preoperative segmentation models for each of the most common tumor types (i.e., glioblastomas, lower grade gliomas, meningiomas, and metastases), together with one early postoperative glioblastoma segmentation model. Preoperative segmentation performances were quite homogeneous across the four different brain tumor types, with an average Dice around 85% and patient-wise recall and precision around 95%. Postoperatively, performances were lower with an average Dice of 41%. Overall, the generation of a standardized clinical report, including the tumor segmentation and features computation, requires about ten minutes on a regular laptop. The proposed Raidionics software is the first open solution enabling an easy use of state-of-the-art segmentation models for all major tumor types, including preoperative and postsurgical standardized reports.


IR-0-标题: Collaborative Recommendation Model Based on Multi-modal Multi-view Attention Network: Movie and literature cases

链接: https://arxiv.org/abs/2305.15159
作者: Zheng Hu, Shi-Min Cai, Jun Wang, Tao Zhou


Abstract: The existing collaborative recommendation models that use multi-modal information emphasize the representation of users’ preferences but easily ignore the representation of users’ dislikes. Nevertheless, modelling users’ dislikes facilitates comprehensively characterizing user profiles. Thus, the representation of users’ dislikes should be integrated into the user modelling when we construct a collaborative recommendation model. In this paper, we propose a novel Collaborative Recommendation Model based on Multi-modal multi-view Attention Network (CRMMAN), in which the users are represented from both preference and dislike views. Specifically, the users’ historical interactions are divided into positive and negative interactions, used to model the user’s preference and dislike views, respectively. Furthermore, the semantic and structural information extracted from the scene is employed to enrich the item representation. We validate CRMMAN by designing contrast experiments based on two benchmark MovieLens-1M and Book-Crossing datasets. Movielens-1m has about a million ratings, and Book-Crossing has about 300,000 ratings. Compared with the state-of-the-art knowledge-graph-based and multi-modal recommendation methods, the AUC, NDCG@5 and NDCG@10 are improved by 2.08%, 2.20% and 2.26% on average of two datasets. We also conduct controlled experiments to explore the effects of multi-modal information and multi-view mechanism. The experimental results show that both of them enhance the model’s performance.

IR-1-标题: Bert 4CMR: Cross-Market Recommendation with Bidirectional Encoder Representations from Transformer

链接: https://arxiv.org/abs/2305.15145
作者: Zheng Hu, Fuji Ren


Abstract: Real-world multinational e-commerce companies, such as Amazon and eBay, serve in multiple countries and regions. Obviously, these markets have similar goods but different users. Some markets are data-scarce, while others are data-rich. In recent years, cross-market recommendation (CMR) has been proposed to enhance data-scarce markets by leveraging auxiliary information from data-rich markets. Previous works fine-tune the pre-trained model on the local market after freezing part of the parameters or introducing inter-market similarity into the local market to improve the performance of CMR. However, they generally do not consider eliminating the mutual interference between markets. Therefore, the existing methods are neither unable to learn unbiased general knowledge nor efficient transfer reusable information across markets. In this paper, we propose a novel attention-based model called Bert4CMR to simultaneously improve all markets’ recommendation performance. Specifically, we employ the attention mechanism to capture user interests by modelling user behavioural sequences. We pre-train the proposed model on global data to learn the general knowledge of items. Then we fine-tune specific target markets to perform local recommendations. We propose market embedding to model the bias of each market and reduce the mutual inference between the parallel markets. Extensive experiments conducted on seven markets show that our model is state-of-the-art. Our model outperforms the suboptimal model by 4.82%, 4.73%, 7.66% and 6.49% on average of seven datasets in terms of four metrics, respectively. We conduct ablation experiments to analyse the effectiveness of the proposed components. Experimental results indicate that our model is able to learn general knowledge through global data and shield the mutual interference between markets.

IR-2-标题: Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies KDD2023

链接: https://arxiv.org/abs/2305.15115
作者: Yubao Tang, Ruqing Zhang, Jiafeng Guo, Jiangui Chen, Zuowei Zhu, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
备注: Accepted by KDD 2023


Abstract: Recently, a new paradigm called Differentiable Search Index (DSI) has been proposed for document retrieval, wherein a sequence-to-sequence model is learned to directly map queries to relevant document identifiers. The key idea behind DSI is to fully parameterize traditional ``index-retrieve’’ pipelines within a single neural model, by encoding all documents in the corpus into the model parameters. In essence, DSI needs to resolve two major questions: (1) how to assign an identifier to each document, and (2) how to learn the associations between a document and its identifier. In this work, we propose a Semantic-Enhanced DSI model (SE-DSI) motivated by Learning Strategies in the area of Cognitive Psychology. Our approach advances original DSI in two ways: (1) For the document identifier, we take inspiration from Elaboration Strategies in human learning. Specifically, we assign each document an Elaborative Description based on the query generation technique, which is more meaningful than a string of integers in the original DSI; and (2) For the associations between a document and its identifier, we take inspiration from Rehearsal Strategies in human learning. Specifically, we select fine-grained semantic features from a document as Rehearsal Contents to improve document memorization. Both the offline and online experiments show improved retrieval performance over prevailing baselines.

IR-3-标题: Exploring Adapter-based Transfer Learning for Recommender Systems: Empirical Studies and Practical Insights

链接: https://arxiv.org/abs/2305.15036
作者: Junchen Fu, Fajie Yuan, Yu Song, Zheng Yuan, Mingyue Cheng, Shenghui Cheng, Jiaqi Zhang, Jie Wang, Yunzhu Pan


Abstract: Adapters, a plug-in neural network module with some tunable parameters, have emerged as a parameter-efficient transfer learning technique for adapting pre-trained models to downstream tasks, especially for natural language processing (NLP) and computer vision (CV) fields. Meanwhile, learning recommendation models directly from raw item modality features – e.g., texts of NLP and images of CV – can enable effective and transferable recommender systems (called TransRec). In view of this, a natural question arises: can adapter-based learning techniques achieve parameter-efficient TransRec with good performance? To this end, we perform empirical studies to address several key sub-questions. First, we ask whether the adapter-based TransRec performs comparably to TransRec based on standard full-parameter fine-tuning? does it hold for recommendation with different item modalities, e.g., textual RS and visual RS. If yes, we benchmark these existing adapters, which have been shown to be effective in NLP and CV tasks, in the item recommendation settings. Third, we carefully study several key factors for the adapter-based TransRec in terms of where and how to insert these adapters? Finally, we look at the effects of adapter-based TransRec by either scaling up its source training data or scaling down its target training data. Our paper provides key insights and practical guidance on unified & transferable recommendation – a less studied recommendation scenario. We promise to release all code & datasets for future research.

IR-4-标题: How Graph Convolutions Amplify Popularity Bias for Recommendation?

链接: https://arxiv.org/abs/2305.14886
作者: Jiajia Chen, Jiancan Wu, Jiawei Chen, Xin Xin, Yong Li, Xiangnan He
备注: Accepted by Frontiers of Computer Science


Abstract: Graph convolutional networks (GCNs) have become prevalent in recommender system (RS) due to their superiority in modeling collaborative patterns. Although improving the overall accuracy, GCNs unfortunately amplify popularity bias – tail items are less likely to be recommended. This effect prevents the GCN-based RS from making precise and fair recommendations, decreasing the effectiveness of recommender systems in the long run. In this paper, we investigate how graph convolutions amplify the popularity bias in RS. Through theoretical analyses, we identify two fundamental factors: (1) with graph convolution (\textiti.e., neighborhood aggregation), popular items exert larger influence than tail items on neighbor users, making the users move towards popular items in the representation space; (2) after multiple times of graph convolution, popular items would affect more high-order neighbors and become more influential. The two points make popular items get closer to almost users and thus being recommended more frequently. To rectify this, we propose to estimate the amplified effect of popular nodes on each node’s representation, and intervene the effect after each graph convolution. Specifically, we adopt clustering to discover highly-influential nodes and estimate the amplification effect of each node, then remove the effect from the node embeddings at each graph convolution layer. Our method is simple and generic – it can be used in the inference stage to correct existing models rather than training a new model from scratch, and can be applied to various GCN models. We demonstrate our method on two representative GCN backbones LightGCN and UltraGCN, verifying its ability in improving the recommendations of tail items without sacrificing the performance of popular items. Codes are open-sourced \footnotethis https URL.

IR-5-标题: JDsearch: A Personalized Product Search Dataset with Real Queries and Full Interactions SIGIR2023

链接: https://arxiv.org/abs/2305.14810
作者: Jiongnan Liu, Zhicheng Dou, Guoyu Tang, Sulong Xu
备注: Accepted to SIGIR 2023


Abstract: Recently, personalized product search attracts great attention and many models have been proposed. To evaluate the effectiveness of these models, previous studies mainly utilize the simulated Amazon recommendation dataset, which contains automatically generated queries and excludes cold users and tail products. We argue that evaluating with such a dataset may yield unreliable results and conclusions, and deviate from real user satisfaction. To overcome these problems, in this paper, we release a personalized product search dataset comprised of real user queries and diverse user-product interaction types (clicking, adding to cart, following, and purchasing) collected from this http URL, a popular Chinese online shopping platform. More specifically, we sample about 170,000 active users on a specific date, then record all their interacted products and issued queries in one year, without removing any tail users and products. This finally results in roughly 12,000,000 products, 9,400,000 real searches, and 26,000,000 user-product interactions. We study the characteristics of this dataset from various perspectives and evaluate representative personalization models to verify its feasibility. The dataset can be publicly accessed at Github: this https URL.

IR-6-标题: Fusion-in-T5: Unifying Document Ranking Signals for Improved Information Retrieval

链接: https://arxiv.org/abs/2305.14685
作者: Shi Yu (1), Chenghao Fan (2), Chenyan Xiong (3), David Jin (4), Zhiyuan Liu (1), Zhenghao Liu (5) ((1) Tsinghua University, (2) Huazhong University of Science and Technology, (3) Microsoft Research, (4) Massachusetts Institute of Technology, (5) Northeastern University)


Abstract: Common IR pipelines are typically cascade systems that may involve multiple rankers and/or fusion models to integrate different information step-by-step. In this paper, we propose a novel re-ranker named Fusion-in-T5 (FiT5), which integrates document text information, retrieval features, and global document information into a single unified model using templated-based input and global attention. Experiments on passage ranking benchmarks MS MARCO and TREC DL show that FiT5 significantly improves ranking performance over prior pipelines. Analyses find that through global attention, FiT5 is able to jointly utilize the ranking features via gradually attending to related documents, and thus improve the detection of subtle nuances between them. Our code will be open-sourced.

IR-7-标题: Extracting Shopping Interest-Related Product Types from the Web

链接: https://arxiv.org/abs/2305.14549
作者: Yinghao Li, Colin Lockard, Prashant Shiralkar, Chao Zhang


Abstract: Recommending a diversity of product types (PTs) is important for a good shopping experience when customers are looking for products around their high-level shopping interests (SIs) such as hiking. However, the SI-PT connection is typically absent in e-commerce product catalogs and expensive to construct manually due to the volume of potential SIs, which prevents us from establishing a recommender with easily accessible knowledge systems. To establish such connections, we propose to extract PTs from the Web pages containing hand-crafted PT recommendations for SIs. The extraction task is formulated as binary HTML node classification given the general observation that an HTML node in our target Web pages can present one and only one PT phrase. Accordingly, we introduce TrENC, which stands for Tree-Transformer Encoders for Node Classification. It improves the inter-node dependency modeling with modified attention mechanisms that preserve the long-term sibling and ancestor-descendant relations. TrENC also injects SI into node features for better semantic representation. Trained on pages regarding limited SIs, TrEnc is ready to be applied to other unobserved interests. Experiments on our manually constructed dataset, WebPT, show that TrENC outperforms the best baseline model by 2.37 F1 points in the zero-shot setup. The performance indicates the feasibility of constructing SI-PT relations and using them to power downstream applications such as search and recommendation.


AI-0-标题: Theorem Proving in Dependently-Typed Higher-Order Logic – Extended Preprint

链接: https://arxiv.org/abs/2305.15382
作者: Colin Rothgang, Florian Rabe, Christoph Benzmüller
备注: 72 pages, The 29th International Conference on Automated Deduction (CADE-29), July 1-5, 2023, Rome, Italy


Abstract: Higher-order logic HOL offers a very simple syntax and semantics for representing and reasoning about typed data structures. But its type system lacks advanced features where types may depend on terms. Dependent type theory offers such a rich type system, but has rather substantial conceptual differences to HOL, as well as comparatively poor proof automation support. We introduce a dependently-typed extension DHOL of HOL that retains the style and conceptual framework of HOL. Moreover, we build a translation from DHOL to HOL and implement it as a preprocessor to a HOL theorem prover, thereby obtaining a theorem prover for DHOL.

AI-1-标题: Bayesian calibration of differentiable agent-based models ICLR2023

链接: https://arxiv.org/abs/2305.15340
作者: Arnau Quera-Bofarull, Ayush Chopra, Anisoara Calinescu, Michael Wooldridge, Joel Dyer
备注: Accepted for Oral Presentation at the AI4ABM Workshop at ICLR 2023


Abstract: Agent-based modelling (ABMing) is a powerful and intuitive approach to modelling complex systems; however, the intractability of ABMs’ likelihood functions and the non-differentiability of the mathematical operations comprising these models present a challenge to their use in the real world. These difficulties have in turn generated research on approximate Bayesian inference methods for ABMs and on constructing differentiable approximations to arbitrary ABMs, but little work has been directed towards designing approximate Bayesian inference techniques for the specific case of differentiable ABMs. In this work, we aim to address this gap and discuss how generalised variational inference procedures may be employed to provide misspecification-robust Bayesian parameter inferences for differentiable ABMs. We demonstrate with experiments on a differentiable ABM of the COVID-19 pandemic that our approach can result in accurate inferences, and discuss avenues for future work.

AI-2-标题: Model evaluation for extreme risks

链接: https://arxiv.org/abs/2305.15324
作者: Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe


Abstract: Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through “dangerous capability evaluations”) and the propensity of models to apply their capabilities for harm (through “alignment evaluations”). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

AI-3-标题: “What if?” in Probabilistic Logic Programming

链接: https://arxiv.org/abs/2305.15318
作者: Rafael Kiesel, Kilian Rückschloß, Felix Weitkämper


Abstract: A ProbLog program is a logic program with facts that only hold with a specified probability. In this contribution we extend this ProbLog language by the ability to answer “What if” queries. Intuitively, a ProbLog program defines a distribution by solving a system of equations in terms of mutually independent predefined Boolean random variables. In the theory of causality, Judea Pearl proposes a counterfactual reasoning for such systems of equations. Based on Pearl’s calculus, we provide a procedure for processing these counterfactual queries on ProbLog programs, together with a proof of correctness and a full implementation. Using the latter, we provide insights into the influence of different parameters on the scalability of inference. Finally, we also show that our approach is consistent with CP-logic, i.e. with the causal semantics for logic programs with annotated with disjunctions.

AI-4-标题: Discounting in Strategy Logic IJCAI2023

链接: https://arxiv.org/abs/2305.15256
作者: Munyque Mittelmann, Aniello Murano, Laurent Perrussel
备注: Extended version of the paper accepted at IJCAI 2023


Abstract: Discounting is an important dimension in multi-agent systems as long as we want to reason about strategies and time. It is a key aspect in economics as it captures the intuition that the far-away future is not as important as the near future. Traditional verification techniques allow to check whether there is a winning strategy for a group of agents but they do no