本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,每天早上11:30点定时自动更新,主要按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从arxiv网站获取,每天早上11:30左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天11:30左右邮件定时自动发送。

目录

概览 (2023-09-19)

今日共更新725篇论文,其中:

  • 106篇自然语言处理(NLP: cs.CL)
  • 148篇计算机视觉(CV: cs.CV)
  • 128篇机器学习(ML: cs.LG)
  • 52篇人工智能(AI: cs.AI)
  • 8篇信息检索(IR: cs.IR)
  • 其它主题283篇

自然语言处理

NLP-0-标题: An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

链接: https://arxiv.org/abs/2309.09958
作者: Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
备注: Released at LLaVA Model Zoo: this https URL

点击查看摘要

Abstract: Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters or smaller. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from our explorations in image resolution, data mixing and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on the multi-modal and language capabilities when completing real-world tasks in the wild. We find that scaling LMM consistently enhances model performance and improves language capabilities, and performance of LoRA/QLoRA tuning of LMM are comparable to the performance of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and mixing multimodal-language data to improve LMM performance, and visual instruction tuning can sometimes improve LMM’s pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.

NLP-1-标题: How to Generate Popular Post Headlines on Social Media?

链接: https://arxiv.org/abs/2309.09949
作者: Zhouxiang Fang, Min Yu, Zhendong Fu, Boning Zhang, Xuanwen Huang, Xiaoqi Tang, Yang Yang
备注:

点击查看摘要

Abstract: Posts, as important containers of user-generated-content pieces on social media, are of tremendous social influence and commercial value. As an integral components of a post, the headline has a decisive contribution to the post’s popularity. However, current mainstream method for headline generation is still manually writing, which is unstable and requires extensive human effort. This drives us to explore a novel research question: Can we automate the generation of popular headlines on social media? We collect more than 1 million posts of 42,447 celebrities from public data of Xiaohongshu, which is a well-known social media platform in China. We then conduct careful observations on the headlines of these posts. Observation results demonstrate that trends and personal styles are widespread in headlines on social medias and have significant contribution to posts’s popularity. Motivated by these insights, we present MEBART, which combines Multiple preference-Extractors with Bidirectional and Auto-Regressive Transformers (BART), capturing trends and personal styles to generate popular headlines on social medias. We perform extensive experiments on real-world datasets and achieve state-of-the-art performance compared with several advanced baselines. In addition, ablation and case studies demonstrate that MEBART advances in capturing trends and personal styles.

NLP-2-标题: Speaker attribution in German parliamentary debates with QLoRA-adapted large language models

链接: https://arxiv.org/abs/2309.09902
作者: Tobias Bornheim, Niklas Grieger, Patrick Gustav Blaneck, Stephan Bialonski
备注: 8 pages, 3 figures

点击查看摘要

Abstract: The growing body of political texts opens up new opportunities for rich insights into political dynamics and ideologies but also increases the workload for manual analysis. Automated speaker attribution, which detects who said what to whom in a speech event and is closely related to semantic role labeling, is an important processing step for computational text analysis. We study the potential of the large language model family Llama 2 to automate speaker attribution in German parliamentary debates from 2017-2021. We fine-tune Llama 2 with QLoRA, an efficient training strategy, and observe our approach to achieve competitive performance in the GermEval 2023 Shared Task On Speaker Attribution in German News Articles and Parliamentary Debates. Our results shed light on the capabilities of large language models in automating speaker attribution, revealing a promising avenue for computational analysis of political discourse and the development of semantic role labeling systems.

NLP-3-标题: Not Enough Labeled Data? Just Add Semantics: A Data-Efficient Method for Inferring Online Health Texts

链接: https://arxiv.org/abs/2309.09877
作者: Joseph Gatto, Sarah M. Preum
备注:

点击查看摘要

Abstract: User-generated texts available on the web and social platforms are often long and semantically challenging, making them difficult to annotate. Obtaining human annotation becomes increasingly difficult as problem domains become more specialized. For example, many health NLP problems require domain experts to be a part of the annotation pipeline. Thus, it is crucial that we develop low-resource NLP solutions able to work with this set of limited-data problems. In this study, we employ Abstract Meaning Representation (AMR) graphs as a means to model low-resource Health NLP tasks sourced from various online health resources and communities. AMRs are well suited to model online health texts as they can represent multi-sentence inputs, abstract away from complex terminology, and model long-distance relationships between co-referring tokens. AMRs thus improve the ability of pre-trained language models to reason about high-complexity texts. Our experiments show that we can improve performance on 6 low-resource health NLP tasks by augmenting text embeddings with semantic graph embeddings. Our approach is task agnostic and easy to merge into any standard text classification pipeline. We experimentally validate that AMRs are useful in the modeling of complex texts by analyzing performance through the lens of two textual complexity measures: the Flesch Kincaid Reading Level and Syntactic Complexity. Our error analysis shows that AMR-infused language models perform better on complex texts and generally show less predictive variance in the presence of changing complexity.

NLP-4-标题: Instruction-Following Speech Recognition

链接: https://arxiv.org/abs/2309.09843
作者: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang
备注:

点击查看摘要

Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models’ speech understanding and “reasoning” capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks – ranging from transcript manipulation to summarization – without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like “transcribe first half and then turn off listening,” providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.

NLP-5-标题: Hypr: A comprehensive study for ASR hypothesis revising with a reference corpus ICASSP2024

链接: https://arxiv.org/abs/2309.09838
作者: Yi-Wei Wang, Ke-Han Lu, Kuan-Yu Chen
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: With the development of deep learning, automatic speech recognition (ASR) has made significant progress. To further enhance the performance, revising recognition results is one of the lightweight but efficient manners. Various methods can be roughly classified into N-best reranking methods and error correction models. The former aims to select the hypothesis with the lowest error rate from a set of candidates generated by ASR for a given input speech. The latter focuses on detecting recognition errors in a given hypothesis and correcting these errors to obtain an enhanced result. However, we observe that these studies are hardly comparable to each other as they are usually evaluated on different corpora, paired with different ASR models, and even use different datasets to train the models. Accordingly, we first concentrate on releasing an ASR hypothesis revising (HypR) dataset in this study. HypR contains several commonly used corpora (AISHELL-1, TED-LIUM 2, and LibriSpeech) and provides 50 recognition hypotheses for each speech utterance. The checkpoint models of the ASR are also published. In addition, we implement and compare several classic and representative methods, showing the recent research progress in revising speech recognition results. We hope the publicly available HypR dataset can become a reference benchmark for subsequent research and promote the school of research to an advanced level.

NLP-6-标题: Task Selection and Assignment for Multi-modal Multi-task Dialogue Act Classification with Non-stationary Multi-armed Bandits ICASSP2024

链接: https://arxiv.org/abs/2309.09832
作者: Xiangheng He, Junjie Chen, Björn W. Schuller
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: Multi-task learning (MTL) aims to improve the performance of a primary task by jointly learning with related auxiliary tasks. Traditional MTL methods select tasks randomly during training. However, both previous studies and our results suggest that such the random selection of tasks may not be helpful, and can even be harmful to performance. Therefore, new strategies for task selection and assignment in MTL need to be explored. This paper studies the multi-modal, multi-task dialogue act classification task, and proposes a method for selecting and assigning tasks based on non-stationary multi-armed bandits (MAB) with discounted Thompson Sampling (TS) using Gaussian priors. Our experimental results show that in different training stages, different tasks have different utility. Our proposed method can effectively identify the task utility, actively avoid useless or harmful tasks, and realise the task assignment during training. Our proposed method is significantly superior in terms of UAR and F1 to the single-task and multi-task baselines with p-values < 0.05. Further analysis of experiments indicates that for the dataset with the data imbalance problem, our proposed method has significantly higher stability and can obtain consistent and decent performance for minority classes. Our proposed method is superior to the current state-of-the-art model.

NLP-7-标题: Efficient Avoidance of Vulnerabilities in Auto-completed Smart Contract Code Using Vulnerability-constrained Decoding

链接: https://arxiv.org/abs/2309.09826
作者: André Storhaug, Jingyue Li, Tianyuan Hu
备注: 12 pages, 8 figures, 2 tables, 5 listings, accepted to the 34th IEEE International Symposium on Software Reliability Engineering (ISSRE 2023)

点击查看摘要

Abstract: Auto-completing code enables developers to speed up coding significantly. Recent advances in transformer-based large language model (LLM) technologies have been applied to code synthesis. However, studies show that many of such synthesized codes contain vulnerabilities. We propose a novel vulnerability-constrained decoding approach to reduce the amount of vulnerable code generated by such models. Using a small dataset of labeled vulnerable lines of code, we fine-tune an LLM to include vulnerability labels when generating code, acting as an embedded classifier. Then, during decoding, we deny the model to generate these labels to avoid generating vulnerable code. To evaluate the method, we chose to automatically complete Ethereum Blockchain smart contracts (SCs) as the case study due to the strict requirements of SC security. We first fine-tuned the 6-billion-parameter GPT-J model using 186,397 Ethereum SCs after removing the duplication from 2,217,692 SCs. The fine-tuning took more than one week using ten GPUs. The results showed that our fine-tuned model could synthesize SCs with an average BLEU (BiLingual Evaluation Understudy) score of 0.557. However, many codes in the auto-completed SCs were vulnerable. Using the code before the vulnerable line of 176 SCs containing different types of vulnerabilities to auto-complete the code, we found that more than 70% of the auto-completed codes were insecure. Thus, we further fine-tuned the model on other 941 vulnerable SCs containing the same types of vulnerabilities and applied vulnerability-constrained decoding. The fine-tuning took only one hour with four GPUs. We then auto-completed the 176 SCs again and found that our approach could identify 62% of the code to be generated as vulnerable and avoid generating 67% of them, indicating the approach could efficiently and effectively avoid vulnerabilities in the auto-completed code.

NLP-8-标题: AMuRD: Annotated Multilingual Receipts Dataset for Cross-lingual Key Information Extraction and Classification

链接: https://arxiv.org/abs/2309.09800
作者: Abdelrahman Abdallah, Mahmoud Abdalla, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt
备注:

点击查看摘要

Abstract: Key information extraction involves recognizing and extracting text from scanned receipts, enabling retrieval of essential content, and organizing it into structured documents. This paper presents a novel multilingual dataset for receipt extraction, addressing key challenges in information extraction and item classification. The dataset comprises 47,720 samples, including annotations for item names, attributes like (price, brand, etc.), and classification into 44 product categories. We introduce the InstructLLaMA approach, achieving an F1 score of 0.76 and an accuracy of 0.68 for key information extraction and item classification. We provide code, datasets, and checkpoints.\footnote\urlthis https URL.

NLP-9-标题: Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion Disentanglement

链接: https://arxiv.org/abs/2309.09799
作者: Shanglin Lei, Xiaoping Wang, Guanting Dong, Jiang Li, Yingjian Liu
备注:

点击查看摘要

Abstract: Emotion Recognition in Conversation (ERC) has attracted widespread attention in the natural language processing field due to its enormous potential for practical applications. Existing ERC methods face challenges in achieving generalization to diverse scenarios due to insufficient modeling of context, ambiguous capture of dialogue relationships and overfitting in speaker modeling. In this work, we present a Hybrid Continuous Attributive Network (HCAN) to address these issues in the perspective of emotional continuation and emotional attribution. Specifically, HCAN adopts a hybrid recurrent and attention-based module to model global emotion continuity. Then a novel Emotional Attribution Encoding (EAE) is proposed to model intra- and inter-emotional attribution for each utterance. Moreover, aiming to enhance the robustness of the model in speaker modeling and improve its performance in different scenarios, A comprehensive loss function emotional cognitive loss \mathcalL_\rm EC is proposed to alleviate emotional drift and overcome the overfitting of the model to speaker modeling. Our model achieves state-of-the-art performance on three datasets, demonstrating the superiority of our work. Another extensive comparative experiments and ablation studies on three benchmarks are conducted to provided evidence to support the efficacy of each module. Further exploration of generalization ability experiments shows the plug-and-play nature of the EAE module in our method.

NLP-10-标题: The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings

链接: https://arxiv.org/abs/2309.09783
作者: Michal Mochtak, Peter Rupnik, Nikola Ljubešić
备注:

点击查看摘要

Abstract: Sentiments inherently drive politics. How we receive and process information plays an essential role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses. If sentiment plays such an important role in politics, how can we study and measure it systematically? The paper presents a new dataset of sentiment-annotated sentences, which are used in a series of experiments focused on training a robust sentiment classifier for parliamentary proceedings. The paper also introduces the first domain-specific LLM for political science applications additionally pre-trained on 1.72 billion domain-specific words from proceedings of 27 European parliaments. We present experiments demonstrating how the additional pre-training of LLM on parliamentary data can significantly improve the model downstream performance on the domain-specific tasks, in our case, sentiment detection in parliamentary proceedings. We further show that multilingual models perform very well on unseen languages and that additional data from other languages significantly improves the target parliament’s results. The paper makes an important contribution to multiple domains of social sciences and bridges them with computer science and computational linguistics. Lastly, it sets up a more robust approach to sentiment analysis of political texts in general, which allows scholars to study political sentiment from a comparative perspective using standardized tools and techniques.

NLP-11-标题: Facilitating NSFW Text Detection in Open-Domain Dialogue Systems via Knowledge Distillation ICASSP2024

链接: https://arxiv.org/abs/2309.09749
作者: Huachuan Qiu, Shuai Zhang, Hongliang He, Anqi Li, Zhenzhong Lan
备注: Submitted to ICASSP 2024. Code and data are publicly available at this https URL

点击查看摘要

Abstract: NSFW (Not Safe for Work) content, in the context of a dialogue, can have severe side effects on users in open-domain dialogue systems. However, research on detecting NSFW language, especially sexually explicit content, within a dialogue context has significantly lagged behind. To address this issue, we introduce CensorChat, a dialogue monitoring dataset aimed at NSFW dialogue detection. Leveraging knowledge distillation techniques involving GPT-4 and ChatGPT, this dataset offers a cost-effective means of constructing NSFW content detectors. The process entails collecting real-life human-machine interaction data and breaking it down into single utterances and single-turn dialogues, with the chatbot delivering the final utterance. ChatGPT is employed to annotate unlabeled data, serving as a training set. Rationale validation and test sets are constructed using ChatGPT and GPT-4 as annotators, with a self-criticism strategy for resolving discrepancies in labeling. A BERT model is fine-tuned as a text classifier on pseudo-labeled data, and its performance is assessed. The study emphasizes the importance of AI systems prioritizing user safety and well-being in digital conversations while respecting freedom of expression. The proposed approach not only advances NSFW content detection but also aligns with evolving user protection needs in AI-driven dialogues.

NLP-12-标题: When Large Language Models Meet Citation: A Survey

链接: https://arxiv.org/abs/2309.09727
作者: Yang Zhang, Yufei Wang, Kai Wang, Quan Z. Sheng, Lina Yao, Adnan Mahmood, Wei Emma Zhang, Rongying Zhao
备注:

点击查看摘要

Abstract: Citations in scholarly work serve the essential purpose of acknowledging and crediting the original sources of knowledge that have been incorporated or referenced. Depending on their surrounding textual context, these citations are used for different motivations and purposes. Large Language Models (LLMs) could be helpful in capturing these fine-grained citation information via the corresponding textual context, thereby enabling a better understanding towards the literature. Furthermore, these citations also establish connections among scientific papers, providing high-quality inter-document relationships and human-constructed knowledge. Such information could be incorporated into LLMs pre-training and improve the text representation in LLMs. Therefore, in this paper, we offer a preliminary review of the mutually beneficial relationship between LLMs and citation analysis. Specifically, we review the application of LLMs for in-text citation analysis tasks, including citation classification, citation-based summarization, and citation recommendation. We then summarize the research pertinent to leveraging citation linkage knowledge to improve text representations of LLMs via citation prediction, network structure information, and inter-document relationship. We finally provide an overview of these contemporary methods and put forth potential promising avenues in combining LLMs and citation analysis for further investigation.

NLP-13-标题: Dealing with negative samples with multi-task learning on span-based joint entity -relation extraction

链接: https://arxiv.org/abs/2309.09713
作者: Chenguang Xue, Jiamin Lu
备注: 13 pages, 3 figures

点击查看摘要

Abstract: Recent span-based joint extraction models have demonstrated significant advantages in both entity recognition and relation extraction. These models treat text spans as candidate entities, and span pairs as candidate relationship tuples, achieving state-of-the-art results on datasets like ADE. However, these models encounter a significant number of non-entity spans or irrelevant span pairs during the tasks, impairing model performance significantly. To address this issue, this paper introduces a span-based multitask entity-relation joint extraction model. This approach employs the multitask learning to alleviate the impact of negative samples on entity and relation classifiers. Additionally, we leverage the Intersection over Union(IoU) concept to introduce the positional information into the entity classifier, achieving a span boundary detection. Furthermore, by incorporating the entity Logits predicted by the entity classifier into the embedded representation of entity pairs, the semantic input for the relation classifier is enriched. Experimental results demonstrate that our proposed this http URL model can effectively mitigate the adverse effects of excessive negative samples on the model performance. Furthermore, the model demonstrated commendable F1 scores of 73.61%, 53.72%, and 83.72% on three widely employed public datasets, namely CoNLL04, SciERC, and ADE, respectively.

NLP-14-标题: LLM 4Jobs: Unsupervised occupation extraction and standardization leveraging Large Language Models

链接: https://arxiv.org/abs/2309.09708
作者: Nan Li, Bo Kang, Tijl De Bie
备注:

点击查看摘要

Abstract: Automated occupation extraction and standardization from free-text job postings and resumes are crucial for applications like job recommendation and labor market policy formation. This paper introduces LLM4Jobs, a novel unsupervised methodology that taps into the capabilities of large language models (LLMs) for occupation coding. LLM4Jobs uniquely harnesses both the natural language understanding and generation capacities of LLMs. Evaluated on rigorous experimentation on synthetic and real-world datasets, we demonstrate that LLM4Jobs consistently surpasses unsupervised state-of-the-art benchmarks, demonstrating its versatility across diverse datasets and granularities. As a side result of our work, we present both synthetic and real-world datasets, which may be instrumental for subsequent research in this domain. Overall, this investigation highlights the promise of contemporary LLMs for the intricate task of occupation extraction and standardization, laying the foundation for a robust and adaptable framework relevant to both research and industrial contexts.

NLP-15-标题: Evaluating Gender Bias of Pre-trained Language Models in Natural Language Inference by Considering All Labels

链接: https://arxiv.org/abs/2309.09697
作者: Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki
备注:

点击查看摘要

Abstract: Discriminatory social biases, including gender biases, have been found in Pre-trained Language Models (PLMs). In Natural Language Inference (NLI), recent bias evaluation methods have observed biased inferences from the outputs of a particular label such as neutral or entailment. However, since different biased inferences can be associated with different output labels, it is inaccurate for a method to rely on one label. In this work, we propose an evaluation method that considers all labels in the NLI task. We create evaluation data and assign them into groups based on their expected biased output labels. Then, we define a bias measure based on the corresponding label output of each data group. In the experiment, we propose a meta-evaluation method for NLI bias measures, and then use it to confirm that our measure can evaluate bias more accurately than the baseline. Moreover, we show that our evaluation method is applicable to multiple languages by conducting the meta-evaluation on PLMs in three different languages: English, Japanese, and Chinese. Finally, we evaluate PLMs of each language to confirm their bias tendency. To our knowledge, we are the first to build evaluation datasets and measure the bias of PLMs from the NLI task in Japanese and Chinese.

NLP-16-标题: Do learned speech symbols follow Zipfs law? ICASSP2024

链接: https://arxiv.org/abs/2309.09690
作者: Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: In this study, we investigate whether speech symbols, learned through deep learning, follow Zipf’s law, akin to natural language symbols. Zipf’s law is an empirical law that delineates the frequency distribution of words, forming fundamentals for statistical analysis in natural language processing. Natural language symbols, which are invented by humans to symbolize speech content, are recognized to comply with this law. On the other hand, recent breakthroughs in spoken language processing have given rise to the development of learned speech symbols; these are data-driven symbolizations of speech content. Our objective is to ascertain whether these data-driven speech symbols follow Zipf’s law, as the same as natural language symbols. Through our investigation, we aim to forge new ways for the statistical analysis of spoken language processing.

NLP-17-标题: Multi-turn Dialogue Comprehension from a Topic-aware Perspective

链接: https://arxiv.org/abs/2309.09666
作者: Xinbei Ma, Yi Xu, Hai Zhao, Zhuosheng Zhang
备注:

点击查看摘要

Abstract: Dialogue related Machine Reading Comprehension requires language models to effectively decouple and model multi-turn dialogue passages. As a dialogue development goes after the intentions of participants, its topic may not keep constant through the whole passage. Hence, it is non-trivial to detect and leverage the topic shift in dialogue modeling. Topic modeling, although has been widely studied in plain text, deserves far more utilization in dialogue reading comprehension. This paper proposes to model multi-turn dialogues from a topic-aware perspective. We start with a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way. Then we use these fragments as topic-aware language processing units in further dialogue comprehension. On one hand, the split segments indict specific topics rather than mixed intentions, thus showing convenient on in-domain topic detection and location. For this task, we design a clustering system with a self-training auto-encoder, and we build two constructed datasets for evaluation. On the other hand, the split segments are an appropriate element of multi-turn dialogue response selection. For this purpose, we further present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements and matches response candidates with a dual cross-attention. Empirical studies on three public benchmarks show great improvements over baselines. Our work continues the previous studies on document topic, and brings the dialogue modeling to a novel topic-aware perspective with exhaustive experiments and analyses.

NLP-18-标题: A Novel Method of Fuzzy Topic Modeling based on Transformer Processing

链接: https://arxiv.org/abs/2309.09658
作者: Ching-Hsun Tseng, Shin-Jye Lee, Po-Wei Cheng, Chien Lee, Chih-Chieh Hung
备注: Asian Journal of Information and Communications, Vol.12, No. 1, 125-140

点击查看摘要

Abstract: Topic modeling is admittedly a convenient way to monitor markets trend. Conventionally, Latent Dirichlet Allocation, LDA, is considered a must-do model to gain this type of information. By given the merit of deducing keyword with token conditional probability in LDA, we can know the most possible or essential topic. However, the results are not intuitive because the given topics cannot wholly fit human knowledge. LDA offers the first possible relevant keywords, which also brings out another problem of whether the connection is reliable based on the statistic possibility. It is also hard to decide the topic number manually in advance. As the booming trend of using fuzzy membership to cluster and using transformers to embed words, this work presents the fuzzy topic modeling based on soft clustering and document embedding from state-of-the-art transformer-based model. In our practical application in a press release monitoring, the fuzzy topic modeling gives a more natural result than the traditional output from LDA.

NLP-19-标题: Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer

链接: https://arxiv.org/abs/2309.09652
作者: Peter Ochieng
备注: 10 pages

点击查看摘要

Abstract: Diffusion based vocoders have been criticised for being slow due to the many steps required during sampling. Moreover, the model’s loss function that is popularly implemented is designed such that the target is the original input x_0 or error \epsilon_0 . For early time steps of the reverse process, this results in large prediction errors, which can lead to speech distortions and increase the learning time. We propose a setup where the targets are the different outputs of forward process time steps with a goal to reduce the magnitude of prediction errors and reduce the training time. We use the different layers of a neural network (NN) to perform denoising by training them to learn to generate representations similar to the noised outputs in the forward process of the diffusion. The NN layers learn to progressively denoise the input in the reverse process until finally the final layer estimates the clean speech. To avoid 1:1 mapping between layers of the neural network and the forward process steps, we define a skip parameter \tau>1 such that an NN layer is trained to cumulatively remove the noise injected in the \tau steps in the forward process. This significantly reduces the number of data distribution recovery steps and, consequently, the time to generate speech. We show through extensive evaluation that the proposed technique generates high-fidelity speech in competitive time that outperforms current state-of-the-art tools. The proposed technique is also able to generalize well to unseen speech.

NLP-20-标题: Proposition from the Perspective of Chinese Language: A Chinese Proposition Classification Evaluation Benchmark

链接: https://arxiv.org/abs/2309.09602
作者: Conghui Niu, Mengyang Hu, Lin Bo, Xiaoli He, Dong Yu, Pengyuan Liu
备注:

点击查看摘要

Abstract: Existing propositions often rely on logical constants for classification. Compared with Western languages that lean towards hypotaxis such as English, Chinese often relies on semantic or logical understanding rather than logical connectives in daily expressions, exhibiting the characteristics of parataxis. However, existing research has rarely paid attention to this issue. And accurately classifying these propositions is crucial for natural language understanding and reasoning. In this paper, we put forward the concepts of explicit and implicit propositions and propose a comprehensive multi-level proposition classification system based on linguistics and logic. Correspondingly, we create a large-scale Chinese proposition dataset PEACE from multiple domains, covering all categories related to propositions. To evaluate the Chinese proposition classification ability of existing models and explore their limitations, We conduct evaluations on PEACE using several different methods including the Rule-based method, SVM, BERT, RoBERTA, and ChatGPT. Results show the importance of properly modeling the semantic features of propositions. BERT has relatively good proposition classification capability, but lacks cross-domain transferability. ChatGPT performs poorly, but its classification ability can be improved by providing more proposition information. Many issues are still far from being resolved and require further study.

NLP-21-标题: Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLM s

链接: https://arxiv.org/abs/2309.09582
作者: Jonas Golde, Patrick Haller, Felix Hamborg, Julian Risch, Alan Akbik
备注: 3 Figures and 2 Tables

点击查看摘要

Abstract: Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to “generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment.” The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation. With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks.

NLP-22-标题: Summarization is (Almost) Dead

链接: https://arxiv.org/abs/2309.09558
作者: Xiao Pu, Mingqi Gao, Xiaojun Wan
备注:

点击查看摘要

Abstract: How well can large language models (LLMs) generate summaries? We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of LLMs across five distinct summarization tasks. Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. Specifically, LLM-generated summaries exhibit better factual consistency and fewer instances of extrinsic hallucinations. Due to the satisfactory performance of LLMs in summarization tasks (even surpassing the benchmark of reference summaries), we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs. However, we recognize that there are still some directions worth exploring, such as the creation of novel datasets with higher quality and more reliable evaluation methods.

NLP-23-标题: CB-Whisper: Contextual Biasing Whisper using TTS-based Keyword Spotting

链接: https://arxiv.org/abs/2309.09552
作者: Yuang Li, Yinglu Li, Min Zhang, Chang Su, Mengyao Piao, Xiaosong Qiao, Jiawei Yu, Miaomiao Ma, Yanqing Zhao, Hao Yang
备注: 10 pages, 4 figures

点击查看摘要

Abstract: End-to-end automatic speech recognition (ASR) systems often struggle to recognize rare name entities, such as personal names, organizations, or technical terms that are not frequently encountered in the training data. This paper presents Contextual Biasing Whisper (CB-Whisper), a novel ASR system based on OpenAI’s Whisper model that performs keyword-spotting (KWS) before the decoder. The KWS module leverages text-to-speech (TTS) techniques and a convolutional neural network (CNN) classifier to match the features between the entities and the utterances. Experiments demonstrate that by incorporating predicted entities into a carefully designed spoken form prompt, the mixed-error-rate (MER) and entity recall of the Whisper model is significantly improved on three internal datasets and two open-sourced datasets that cover English-only, Chinese-only, and code-switching scenarios.

NLP-24-标题: Adapting Large Language Models via Reading Comprehension

链接: https://arxiv.org/abs/2309.09530
作者: Daixuan Cheng, Shaohan Huang, Furu Wei
备注:

点击查看摘要

Abstract: We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taken inspiration from human learning via reading comprehension–practice after reading improves the ability to answer questions based on the learned knowledge–we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model’s performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at this https URL.

NLP-25-标题: Improved Factorized Neural Transducer Model For text-only Domain Adaptation

链接: https://arxiv.org/abs/2309.09524
作者: Junzhe Liu, Jianwei Yu, Xie Chen
备注:

点击查看摘要

Abstract: End-to-end models, such as the neural Transducer, have been successful in integrating acoustic and linguistic information jointly to achieve excellent recognition performance. However, adapting these models with text-only data is challenging. Factorized neural Transducer (FNT) aims to address this issue by introducing a separate vocabulary decoder to predict the vocabulary, which can effectively perform traditional text data adaptation. Nonetheless, this approach has limitations in fusing acoustic and language information seamlessly. Moreover, a degradation in word error rate (WER) on the general test sets was also observed, leading to doubts about its overall performance. In response to this challenge, we present an improved factorized neural Transducer (IFNT) model structure designed to comprehensively integrate acoustic and language information while enabling effective text adaptation. We evaluate the performance of our proposed methods through in-domain experiments on GigaSpeech and out-of-domain experiments adapting to EuroParl, TED-LIUM, and Medical datasets. After text-only adaptation, IFNT yields 7.9% to 28.5% relative WER improvements over the standard neural Transducer with shallow fusion, and relative WER reductions ranging from 1.6% to 8.2% on the three test sets compared to the FNT model.

NLP-26-标题: Understanding Divergent Framing of the Supreme Court Controversies: Social Media vs. News Outlets

链接: https://arxiv.org/abs/2309.09508
作者: Jinsheng Pan, Zichen Wang, Weihong Qi, Hanjia Lyu, Jiebo Luo
备注:

点击查看摘要

Abstract: Understanding the framing of political issues is of paramount importance as it significantly shapes how individuals perceive, interpret, and engage with these matters. While prior research has independently explored framing within news media and by social media users, there remains a notable gap in our comprehension of the disparities in framing political issues between these two distinct groups. To address this gap, we conduct a comprehensive investigation, focusing on the nuanced distinctions both qualitatively and quantitatively in the framing of social media and traditional media outlets concerning a series of American Supreme Court rulings on affirmative action, student loans, and abortion rights. Our findings reveal that, while some overlap in framing exists between social media and traditional media outlets, substantial differences emerge both across various topics and within specific framing categories. Compared to traditional news media, social media platforms tend to present more polarized stances across all framing categories. Further, we observe significant polarization in the news media’s treatment (i.e., Left vs. Right leaning media) of affirmative action and abortion rights, whereas the topic of student loans tends to exhibit a greater degree of consensus. The disparities in framing between traditional and social media platforms carry significant implications for the formation of public opinion, policy decision-making, and the broader political landscape.

NLP-27-标题: Pruning Large Language Models via Accuracy Predictor

链接: https://arxiv.org/abs/2309.09507
作者: Yupeng Ji, Yibo Cao, Jiucai Liu
备注: 6 pages, 4 figs

点击查看摘要

Abstract: Large language models(LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks. However, substantial model size poses challenges to training, inference, and deployment so that it is necessary to compress the model. At present, most model compression for LLMs requires manual design of pruning features, which has problems such as complex optimization pipeline and difficulty in retaining the capabilities of certain parts of the model.Therefore, we propose a novel pruning approach: firstly, a training set of a certain number of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor. Using the accuracy predictor to further optimize the search space and search, the optimal model can be automatically selected. Experiments show that our proposed approach is effective and efficient. Compared with the baseline, the perplexity(PPL) on Wikitext2 and PTB dropped by 9.48% and 5,76% respectively, and the average accuracy of MMLU increased by 6.28%.

NLP-28-标题: LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models

链接: https://arxiv.org/abs/2309.09506
作者: Zecheng Tang, Chenfei Wu, Juntao Li, Nan Duan
备注:

点击查看摘要

Abstract: Graphic layout generation, a growing research field, plays a significant role in user engagement and information perception. Existing methods primarily treat layout generation as a numerical optimization task, focusing on quantitative aspects while overlooking the semantic information of layout, such as the relationship between each layout element. In this paper, we propose LayoutNUWA, the first model that treats layout generation as a code generation task to enhance semantic information and harness the hidden layout expertise of large language models~(LLMs). More concretely, we develop a Code Instruct Tuning (CIT) approach comprising three interconnected modules: 1) the Code Initialization (CI) module quantifies the numerical conditions and initializes them as HTML code with strategically placed masks; 2) the Code Completion (CC) module employs the formatting knowledge of LLMs to fill in the masked portions within the HTML code; 3) the Code Rendering (CR) module transforms the completed code into the final layout output, ensuring a highly interpretable and transparent layout generation procedure that directly maps code to a visualized layout. We attain significant state-of-the-art performance (even over 50% improvements) on multiple datasets, showcasing the strong capabilities of LayoutNUWA. Our code is available at this https URL.

NLP-29-标题: Search and Learning for Unsupervised Text Generation

链接: https://arxiv.org/abs/2309.09497
作者: Lili Mou
备注: AI Magazine}, 43(4), 344–352, 2022

点击查看摘要

Abstract: With the advances of deep learning techniques, text generation is attracting increasing interest in the artificial intelligence (AI) community, because of its wide applications and because it is an essential component of AI. Traditional text generation systems are trained in a supervised way, requiring massive labeled parallel corpora. In this paper, I will introduce our recent work on search and learning approaches to unsupervised text generation, where a heuristic objective function estimates the quality of a candidate sentence, and discrete search algorithms generate a sentence by maximizing the search objective. A machine learning model further learns from the search results to smooth out noise and improve efficiency. Our approach is important to the industry for building minimal viable products for a new task; it also has high social impacts for saving human annotation labor and for processing low-resource languages.

NLP-30-标题: Investigating Zero- and Few-shot Generalization in Fact Verification AACL

链接: https://arxiv.org/abs/2309.09444
作者: Liangming Pan, Yunxiang Zhang, Min-Yen Kan
备注: AACL-IJCNLP 2023 (main conference, long paper)

点击查看摘要

Abstract: In this paper, we explore zero- and few-shot generalization for fact verification (FV), which aims to generalize the FV model trained on well-resourced domains (e.g., Wikipedia) to low-resourced domains that lack human annotations. To this end, we first construct a benchmark dataset collection which contains 11 FV datasets representing 6 domains. We conduct an empirical analysis of generalization across these FV datasets, finding that current models generalize poorly. Our analysis reveals that several factors affect generalization, including dataset size, length of evidence, and the type of claims. Finally, we show that two directions of work improve generalization: 1) incorporating domain knowledge via pretraining on specialized domains, and 2) automatically generating training data via claim generation.

NLP-31-标题: Does Video Summarization Require Videos? Quantifying the Effectiveness of Language in Video Summarization

链接: https://arxiv.org/abs/2309.09405
作者: Yoonsoo Nam, Adam Lehavi, Daniel Yang, Digbalay Bose, Swabha Swayamdipta, Shrikanth Narayanan
备注: \c{opyright} 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract: Video summarization remains a huge challenge in computer vision due to the size of the input videos to be summarized. We propose an efficient, language-only video summarizer that achieves competitive accuracy with high data efficiency. Using only textual captions obtained via a zero-shot approach, we train a language transformer model and forego image representations. This method allows us to perform filtration amongst the representative text vectors and condense the sequence. With our approach, we gain explainability with natural language that comes easily for human interpretation and textual summaries of the videos. An ablation study that focuses on modality and data compression shows that leveraging text modality only effectively reduces input data processing while retaining comparable results.

NLP-32-标题: CulturaX: A Cleaned Enormous and Multilingual Dataset for Large Language Models in 167 Languages

链接: https://arxiv.org/abs/2309.09400
作者: Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
备注: Ongoing Work

点击查看摘要

Abstract: The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, when it comes to training datasets for these LLMs, especially the recent state-of-the-art models, they are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable dataset to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public in HuggingFace to facilitate research and advancements in multilingual LLMs: this https URL.

NLP-33-标题: Do Large GPT Models Discover Moral Dimensions in Language Representations? A Topological Study Of Sentence Embeddings

链接: https://arxiv.org/abs/2309.09397
作者: Stephen Fitz
备注:

点击查看摘要

Abstract: As Large Language Models are deployed within Artificial Intelligence systems, that are increasingly integrated with human society, it becomes more important than ever to study their internal structures. Higher level abilities of LLMs such as GPT-3.5 emerge in large part due to informative language representations they induce from raw text data during pre-training on trillions of words. These embeddings exist in vector spaces of several thousand dimensions, and their processing involves mapping between multiple vector spaces, with total number of parameters on the order of trillions. Furthermore, these language representations are induced by gradient optimization, resulting in a black box system that is hard to interpret. In this paper, we take a look at the topological structure of neuronal activity in the “brain” of Chat-GPT’s foundation language model, and analyze it with respect to a metric representing the notion of fairness. We develop a novel approach to visualize GPT’s moral dimensions. We first compute a fairness metric, inspired by social psychology literature, to identify factors that typically influence fairness assessments in humans, such as legitimacy, need, and responsibility. Subsequently, we summarize the manifold’s shape using a lower-dimensional simplicial complex, whose topology is derived from this metric. We color it with a heat map associated with this fairness metric, producing human-readable visualizations of the high-dimensional sentence manifold. Our results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds corresponding to fair and unfair moral judgments. This indicates that GPT-based language models develop a moral dimension within their representation spaces and induce an understanding of fairness during their training process.

NLP-34-标题: Augmenting text for spoken language understanding with Large Language Models ICASSP2024

链接: https://arxiv.org/abs/2309.09390
作者: Roshan Sharma, Suyoun Kim, Daniel Lazar, Trang Le, Akshat Shrivastava, Kwanghoon Ahn, Piyush Kansal, Leda Sari, Ozlem Kalinli, Michael Seltzer
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: Spoken semantic parsing (SSP) involves generating machine-comprehensible parses from input speech. Training robust models for existing application domains represented in training data or extending to new domains requires corresponding triplets of speech-transcript-semantic parse data, which is expensive to obtain. In this paper, we address this challenge by examining methods that can use transcript-semantic parse data (unpaired text) without corresponding speech. First, when unpaired text is drawn from existing textual corpora, Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared as ways to generate speech representations for unpaired text. Experiments on the STOP dataset show that unpaired text from existing and new domains improves performance by 2% and 30% in absolute Exact Match (EM) respectively. Second, we consider the setting when unpaired text is not available in existing textual corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains. Experiments show that examples and words that co-occur with intents can be used to generate unpaired text with Llama 2.0. Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains respectively.

NLP-35-标题: Mitigating Shortcuts in Language Models with Soft Label Encoding

链接: https://arxiv.org/abs/2309.09380
作者: Zirui He, Huiqi Deng, Haiyan Zhao, Ninghao Liu, Mengnan Du
备注:

点击查看摘要

Abstract: Recent research has shown that large language models rely on spurious correlations in the data for natural language understanding (NLU) tasks. In this work, we aim to answer the following research question: Can we reduce spurious correlations by modifying the ground truth labels of the training data? Specifically, we propose a simple yet effective debiasing framework, named Soft Label Encoding (SoftLE). We first train a teacher model with hard labels to determine each sample’s degree of relying on shortcuts. We then add one dummy class to encode the shortcut degree, which is used to smooth other dimensions in the ground truth label to generate soft labels. This new ground truth label is used to train a more robust student model. Extensive experiments on two NLU benchmark tasks demonstrate that SoftLE significantly improves out-of-distribution generalization while maintaining satisfactory in-distribution accuracy.

NLP-36-标题: Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles

链接: https://arxiv.org/abs/2309.09369
作者: Kung-Hsiang Huang, Philippe Laban, Alexander R. Fabbri, Prafulla Kumar Choubey, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
备注:

点击查看摘要

Abstract: Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, to our knowledge, the summarization of diverse information dispersed across multiple articles about an event has not been previously investigated. The latter imposes a different set of challenges for a summarization model. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Moreover, we conducted a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of the summaries, as well as their correlation with human assessments. We applied our findings to study how LLMs summarize multiple news articles by analyzing which type of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them mainly due to their limited coverage, with GPT-4 only able to cover less than 40% of the diverse information on average.

NLP-37-标题: Language models are susceptible to incorrect patient self-diagnosis in medical applications NEURIPS2023 ALT

链接: https://arxiv.org/abs/2309.09362
作者: Rojin Ziaei, Samuel Schmidgall
备注: 4 pages, Deep Generative Models for Health NeurIPS 2023

点击查看摘要

Abstract: Large language models (LLMs) are becoming increasingly relevant as a potential tool for healthcare, aiding communication between clinicians, researchers, and patients. However, traditional evaluations of LLMs on medical exam questions do not reflect the complexity of real patient-doctor interactions. An example of this complexity is the introduction of patient self-diagnosis, where a patient attempts to diagnose their own medical conditions from various sources. While the patient sometimes arrives at an accurate conclusion, they more often are led toward misdiagnosis due to the patient’s over-emphasis on bias validating information. In this work we present a variety of LLMs with multiple-choice questions from United States medical board exams which are modified to include self-diagnostic reports from patients. Our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of LLMs drop dramatically, revealing a high susceptibility to errors in self-diagnosis.

NLP-38-标题: Talk2Care: Facilitating Asynchronous Patient-Provider Communication with Large-Language-Model

链接: https://arxiv.org/abs/2309.09357
作者: Ziqi Yang, Xuhai Xu, Bingsheng Yao, Shao Zhang, Ethan Rogers, Stephen Intille, Nawar Shara, Guodong (Gordon)Gao, Dakuo Wang
备注: Under submission to CHI2024

点击查看摘要

Abstract: Despite the plethora of telehealth applications to assist home-based older adults and healthcare providers, basic messaging and phone calls are still the most common communication methods, which suffer from limited availability, information loss, and process inefficiencies. One promising solution to facilitate patient-provider communication is to leverage large language models (LLMs) with their powerful natural conversation and summarization capability. However, there is a limited understanding of LLMs’ role during the communication. We first conducted two interview studies with both older adults (N=10) and healthcare providers (N=9) to understand their needs and opportunities for LLMs in patient-provider asynchronous communication. Based on the insights, we built an LLM-powered communication system, Talk2Care, and designed interactive components for both groups: (1) For older adults, we leveraged the convenience and accessibility of voice assistants (VAs) and built an LLM-powered VA interface for effective information collection. (2) For health providers, we built an LLM-based dashboard to summarize and present important health information based on older adults’ conversations with the VA. We further conducted two user studies with older adults and providers to evaluate the usability of the system. The results showed that Talk2Care could facilitate the communication process, enrich the health information collected from older adults, and considerably save providers’ efforts and time. We envision our work as an initial exploration of LLMs’ capability in the intersection of healthcare and interpersonal communication.

NLP-39-标题: Performance of the Pre-Trained Large Language Model GPT -4 on Automated Short Answer Grading

链接: https://arxiv.org/abs/2309.09338
作者: Gerd Kortemeyer
备注:

点击查看摘要

Abstract: Automated Short Answer Grading (ASAG) has been an active area of machine-learning research for over a decade. It promises to let educators grade and give feedback on free-form responses in large-enrollment courses in spite of limited availability of human graders. Over the years, carefully trained models have achieved increasingly higher levels of performance. More recently, pre-trained Large Language Models (LLMs) emerged as a commodity, and an intriguing question is how a general-purpose tool without additional training compares to specialized models. We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where in addition to the standard task of grading the alignment of the student answer with a reference answer, we also investigated withholding the reference answer. We found that overall, the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.

NLP-40-标题: A Few-Shot Approach to Dysarthric Speech Intelligibility Level Classification Using Transformer s

链接: https://arxiv.org/abs/2309.09329
作者: Paleti Nikhil Chowdary, Vadlapudi Sai Aravind, Gorantla V N S L Vishnu Vardhan, Menta Sai Akshay, Menta Sai Aashish, Jyothish Lal. G
备注: Paper has been presented at ICCCNT 2023 and the final version will be published in IEEE Digital Library Xplore

点击查看摘要

Abstract: Dysarthria is a speech disorder that hinders communication due to difficulties in articulating words. Detection of dysarthria is important for several reasons as it can be used to develop a treatment plan and help improve a person’s quality of life and ability to communicate effectively. Much of the literature focused on improving ASR systems for dysarthric speech. The objective of the current work is to develop models that can accurately classify the presence of dysarthria and also give information about the intelligibility level using limited data by employing a few-shot approach using a transformer model. This work also aims to tackle the data leakage that is present in previous studies. Our whisper-large-v2 transformer model trained on a subset of the UASpeech dataset containing medium intelligibility level patients achieved an accuracy of 85%, precision of 0.92, recall of 0.8 F1-score of 0.85, and specificity of 0.91. Experimental results also demonstrate that the model trained using the ‘words’ dataset performed better compared to the model trained on the ‘letters’ and ‘digits’ dataset. Moreover, the multiclass model achieved an accuracy of 67%.

NLP-41-标题: How People Perceive The Dynamic Zero-COVID Policy: A Retrospective Analysis From The Perspective of Appraisal Theory

链接: https://arxiv.org/abs/2309.09324
作者: Na Yang, Kyrie Zhixuan Zhou, Yunzhe Li
备注:

点击查看摘要

Abstract: The Dynamic Zero-COVID Policy in China spanned three years and diverse emotional responses have been observed at different times. In this paper, we retrospectively analyzed public sentiments and perceptions of the policy, especially regarding how they evolved over time, and how they related to people’s lived experiences. Through sentiment analysis of 2,358 collected Weibo posts, we identified four representative points, i.e., policy initialization, sharp sentiment change, lowest sentiment score, and policy termination, for an in-depth discourse analysis through the lens of appraisal theory. In the end, we reflected on the evolving public sentiments toward the Dynamic Zero-COVID Policy and proposed implications for effective epidemic prevention and control measures for future crises.

NLP-42-标题: AutoAM: An End-To-End Neural Model for Automatic and Universal Argument Mining

链接: https://arxiv.org/abs/2309.09300
作者: Lang Cao
备注:

点击查看摘要

Abstract: Argument mining is to analyze argument structure and extract important argument information from unstructured text. An argument mining system can help people automatically gain causal and logical information behind the text. As argumentative corpus gradually increases, like more people begin to argue and debate on social media, argument mining from them is becoming increasingly critical. However, argument mining is still a big challenge in natural language tasks due to its difficulty, and relative techniques are not mature. For example, research on non-tree argument mining needs to be done more. Most works just focus on extracting tree structure argument information. Moreover, current methods cannot accurately describe and capture argument relations and do not predict their types. In this paper, we propose a novel neural model called AutoAM to solve these problems. We first introduce the argument component attention mechanism in our model. It can capture the relevant information between argument components, so our model can better perform argument mining. Our model is a universal end-to-end framework, which can analyze argument structure without constraints like tree structure and complete three subtasks of argument mining in one model. The experiment results show that our model outperforms the existing works on several metrics in two public datasets.

NLP-43-标题: OWL: A Large Language Model for IT Operations

链接: https://arxiv.org/abs/2309.09298
作者: Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, Xu Shi, Tieqiao Zheng, Liangfan Zheng, Bo Zhang, Ke Xu, Zhoujun Li
备注: 31 pages

点击查看摘要

Abstract: With the rapid development of IT operations, it has become increasingly crucial to efficiently manage and analyze large volumes of data for practical applications. The techniques of Natural Language Processing (NLP) have shown remarkable capabilities for various tasks, including named entity recognition, machine translation and dialogue systems. Recently, Large Language Models (LLMs) have achieved significant improvements across various NLP downstream tasks. However, there is a lack of specialized LLMs for IT operations. In this paper, we introduce the OWL, a large language model trained on our collected OWL-Instruct dataset with a wide range of IT-related information, where the mixture-of-adapter strategy is proposed to improve the parameter-efficient tuning across different domains or tasks. Furthermore, we evaluate the performance of our OWL on the OWL-Bench established by us and open IT-related benchmarks. OWL demonstrates superior performance results on IT tasks, which outperforms existing models by significant margins. Moreover, we hope that the findings of our work will provide more insights to revolutionize the techniques of IT operations with specialized LLMs.

NLP-44-标题: Model-based Subsampling for Knowledge Graph Completion AACL2023

链接: https://arxiv.org/abs/2309.09296
作者: Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
备注: Accepted by AACL 2023; 9 pages, 3 figures, 5 tables

点击查看摘要

Abstract: Subsampling is effective in Knowledge Graph Embedding (KGE) for reducing overfitting caused by the sparsity in Knowledge Graph (KG) datasets. However, current subsampling approaches consider only frequencies of queries that consist of entities and their relations. Thus, the existing subsampling potentially underestimates the appearance probabilities of infrequent queries even if the frequencies of their entities or relations are high. To address this problem, we propose Model-based Subsampling (MBS) and Mixed Subsampling (MIX) to estimate their appearance probabilities through predictions of KGE models. Evaluation results on datasets FB15k-237, WN18RR, and YAGO3-10 showed that our proposed subsampling methods actually improved the KG completion performances for popular KGE models, RotatE, TransE, HAKE, ComplEx, and DistMult.

NLP-45-标题: Leveraging Social Discourse to Measure Check-worthiness of Claims for Fact-checking

链接: https://arxiv.org/abs/2309.09274
作者: Megha Sundriyal, Md Shad Akhtar, Tanmoy Chakraborty
备注: 28 pages, 2 figures, 8 tables

点击查看摘要

Abstract: The expansion of online social media platforms has led to a surge in online content consumption. However, this has also paved the way for disseminating false claims and misinformation. As a result, there is an escalating demand for a substantial workforce to sift through and validate such unverified claims. Currently, these claims are manually verified by fact-checkers. Still, the volume of online content often outweighs their potency, making it difficult for them to validate every single claim in a timely manner. Thus, it is critical to determine which assertions are worth fact-checking and prioritize claims that require immediate attention. Multiple factors contribute to determining whether a claim necessitates fact-checking, encompassing factors such as its factual correctness, potential impact on the public, the probability of inciting hatred, and more. Despite several efforts to address claim check-worthiness, a systematic approach to identify these factors remains an open challenge. To this end, we introduce a new task of fine-grained claim check-worthiness, which underpins all of these factors and provides probable human grounds for identifying a claim as check-worthy. We present CheckIt, a manually annotated large Twitter dataset for fine-grained claim check-worthiness. We benchmark our dataset against a unified approach, CheckMate, that jointly determines whether a claim is check-worthy and the factors that led to that conclusion. We compare our suggested system with several baseline systems. Finally, we report a thorough analysis of results and human assessment, validating the efficacy of integrating check-worthiness factors in detecting claims worth fact-checking.

NLP-46-标题: Code quality assessment using transformer s

链接: https://arxiv.org/abs/2309.09264
作者: Mosleh Mahamud, Isak Samsten
备注:

点击查看摘要

Abstract: Automatically evaluate the correctness of programming assignments is rather straightforward using unit and integration tests. However, programming tasks can be solved in multiple ways, many of which, although correct, are inelegant. For instance, excessive branching, poor naming or repetitiveness make the code hard to understand and maintain. These subjective qualities of code are hard to automatically assess using current techniques. In this work we investigate the use of CodeBERT to automatically assign quality score to Java code. We experiment with different models and training paradigms. We explore the accuracy of the models on a novel dataset for code quality assessment. Finally, we assess the quality of the predictions using saliency maps. We find that code quality to some extent is predictable and that transformer based models using task adapted pre-training can solve the task more efficiently than other techniques.

NLP-47-标题: A Benchmark for Text Expansion: Dataset s Metrics and Baselines

链接: https://arxiv.org/abs/2309.09198
作者: Yi Chen, Haiyun Jiang, Wei Bi, Rui Wang, Longyue Wang, Shuming Shi, Ruifeng Xu
备注:

点击查看摘要

Abstract: This work presents a new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations of the plain text to concretize or vivify human writings. Different from existing insertion-based writing assistance tasks, TE requires the model to be more flexible in both locating and generation, and also more cautious in keeping basic semantics. We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references for both English and Chinese. To facilitate automatic evaluation, we design various metrics from multiple perspectives. In particular, we propose Info-Gain to effectively measure the informativeness of expansions, which is an important quality dimension in TE. On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which demonstrate the superiority over the Text2Text baselines, especially in expansion informativeness. Experiments verify the feasibility of the TE task and point out potential directions for future research toward better automatic text expansion.

NLP-48-标题: SplitEE: Early Exit in Deep Neural Networks with Split Computing

链接: https://arxiv.org/abs/2309.09195
作者: Divya J. Bajpai, Vivek K. Trivedi, Sohan L. Yadav, Manjesh K. Hanawal
备注: 10 pages, to appear in the proceeding AIMLSystems 2023

点击查看摘要

Abstract: Deep Neural Networks (DNNs) have drawn attention because of their outstanding performance on various tasks. However, deploying full-fledged DNNs in resource-constrained devices (edge, mobile, IoT) is difficult due to their large size. To overcome the issue, various approaches are considered, like offloading part of the computation to the cloud for final inference (split computing) or performing the inference at an intermediary layer without passing through all layers (early exits). In this work, we propose combining both approaches by using early exits in split computing. In our approach, we decide up to what depth of DNNs computation to perform on the device (splitting layer) and whether a sample can exit from this layer or need to be offloaded. The decisions are based on a weighted combination of accuracy, computational, and communication costs. We develop an algorithm named SplitEE to learn an optimal policy. Since pre-trained DNNs are often deployed in new domains where the ground truths may be unavailable and samples arrive in a streaming fashion, SplitEE works in an online and unsupervised setup. We extensively perform experiments on five different datasets. SplitEE achieves a significant cost reduction ( >50% ) with a slight drop in accuracy ( <2% ) as compared to the case when all samples are inferred at the final layer. The anonymized source code is available at \urlhttps://anonymous.4open.science/r/SplitEE_M-B989/README.md.

NLP-49-标题: Can Large Language Models Understand Real-World Complex Instructions?

链接: https://arxiv.org/abs/2309.09150
作者: Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Lida Chen, Xintao Wang, Yuncheng Huang, Haoning Ye, Zihan Li, Shisong Chen, Yikai Zhang, Zhouhong Gu, Jiaqing Liang, Yanghua Xiao
备注:

点击查看摘要

Abstract: Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex input that contains long context, noise, heterogeneous information and multi-turn format. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample count constraints, and be unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs’ ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for evaluating LLMs’ ability to follow complex instructions systematically. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased or too strict and coarse-grained. We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. Resources of CELLO are publicly available at this https URL.

NLP-50-标题: How much can ChatGPT really help Computational Biologists in Programming?

链接: https://arxiv.org/abs/2309.09126
作者: Chowdhury Rafeed Rahman, Limsoon Wong
备注:

点击查看摘要

Abstract: ChatGPT, a recently developed product by openAI, is successfully leaving its mark as a multi-purpose natural language based chatbot. In this paper, we are more interested in analyzing its potential in the field of computational biology. A major share of work done by computational biologists these days involve coding up Bioinformatics algorithms, analyzing data, creating pipelining scripts and even machine learning modeling & feature extraction. This paper focuses on the potential influence (both positive and negative) of ChatGPT in the mentioned aspects with illustrative examples from different perspectives. Compared to other fields of Computer Science, Computational Biology has - (1) less coding resources, (2) more sensitivity and bias issues (deals with medical data) and (3) more necessity of coding assistance (people from diverse background come to this field). Keeping such issues in mind, we cover use cases such as code writing, reviewing, debugging, converting, refactoring and pipelining using ChatGPT from the perspective of computational biologists in this paper.

NLP-51-标题: Contrastive Decoding Improves Reasoning in Large Language Models

链接: https://arxiv.org/abs/2309.09117
作者: Sean O’Brien, Mike Lewis
备注: 10 figures, 13 tables

点击查看摘要

Abstract: We demonstrate that Contrastive Decoding – a simple, computationally light, and training-free text generation method proposed by Li et al 2022 – achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general purpose method for generating text from language models.

NLP-52-标题: The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated AACL2023

链接: https://arxiv.org/abs/2309.09092
作者: Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki
备注: IJCNLP-AACL 2023

点击查看摘要

Abstract: Pre-trained language models trained on large-scale data have learned serious levels of social biases. Consequently, various methods have been proposed to debias pre-trained models. Debiasing methods need to mitigate only discriminatory bias information from the pre-trained models, while retaining information that is useful for the downstream tasks. In previous research, whether useful information is retained has been confirmed by the performance of downstream tasks in debiased pre-trained models. On the other hand, it is not clear whether these benchmarks consist of data pertaining to social biases and are appropriate for investigating the impact of debiasing. For example in gender-related social biases, data containing female words (e.g. she, female, woman''), male words (e.g. he, male, man’‘), and stereotypical words (e.g. ``nurse, doctor, professor’') are considered to be the most affected by debiasing. If there is not much data containing these words in a benchmark dataset for a target task, there is the possibility of erroneously evaluating the effects of debiasing. In this study, we compare the impact of debiasing on performance across multiple downstream tasks using a wide-range of benchmark datasets that containing female, male, and stereotypical words. Experiments show that the effects of debiasing are consistently \emphunderestimated across all tasks. Moreover, the effects of debiasing could be reliably evaluated by separately considering instances containing female, male, and stereotypical words than all of the instances in a benchmark dataset.

NLP-53-标题: RMDM: A Multilabel Fakenews Dataset for Vietnamese Evidence Verification

链接: https://arxiv.org/abs/2309.09071
作者: Hai-Long Nguyen, Thi-Kieu-Trang Pham, Thai-Son Le, Tan-Minh Nguyen, Thi-Hai-Yen Vuong, Ha-Thanh Nguyen
备注: ISAILD@KSE 2023

点击查看摘要

Abstract: In this study, we present a novel and challenging multilabel Vietnamese dataset (RMDM) designed to assess the performance of large language models (LLMs), in verifying electronic information related to legal contexts, focusing on fake news as potential input for electronic evidence. The RMDM dataset comprises four labels: real, mis, dis, and mal, representing real information, misinformation, disinformation, and mal-information, respectively. By including these diverse labels, RMDM captures the complexities of differing fake news categories and offers insights into the abilities of different language models to handle various types of information that could be part of electronic evidence. The dataset consists of a total of 1,556 samples, with 389 samples for each label. Preliminary tests on the dataset using GPT-based and BERT-based models reveal variations in the models’ performance across different labels, indicating that the dataset effectively challenges the ability of various language models to verify the authenticity of such information. Our findings suggest that verifying electronic information related to legal contexts, including fake news, remains a difficult problem for language models, warranting further attention from the research community to advance toward more reliable AI models for potential legal applications.

NLP-54-标题: NOWJ1@ALQAC 2023: Enhancing Legal Task Performance with Classic Statistical Models and Pre-trained Language Models

链接: https://arxiv.org/abs/2309.09070
作者: Tan-Minh Nguyen, Xuan-Hoa Nguyen, Ngoc-Duy Mai, Minh-Quan Hoang, Van-Huan Nguyen, Hoang-Viet Nguyen, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong
备注: ISAILD@KSE 2023

点击查看摘要

Abstract: This paper describes the NOWJ1 Team’s approach for the Automated Legal Question Answering Competition (ALQAC) 2023, which focuses on enhancing legal task performance by integrating classical statistical models and Pre-trained Language Models (PLMs). For the document retrieval task, we implement a pre-processing step to overcome input limitations and apply learning-to-rank methods to consolidate features from various models. The question-answering task is split into two sub-tasks: sentence classification and answer extraction. We incorporate state-of-the-art models to develop distinct systems for each sub-task, utilizing both classic statistical models and pre-trained Language Models. Experimental results demonstrate the promising potential of our proposed methodology in the competition.

NLP-55-标题: Constructing a Knowledge Graph for Vietnamese Legal Cases with Heterogeneous Graphs

链接: https://arxiv.org/abs/2309.09069
作者: Thi-Hai-Yen Vuong, Minh-Quan Hoang, Tan-Minh Nguyen, Hoang-Trung Nguyen, Ha-Thanh Nguyen
备注: ISAILD@KSE 2023

点击查看摘要

Abstract: This paper presents a knowledge graph construction method for legal case documents and related laws, aiming to organize legal information efficiently and enhance various downstream tasks. Our approach consists of three main steps: data crawling, information extraction, and knowledge graph deployment. First, the data crawler collects a large corpus of legal case documents and related laws from various sources, providing a rich database for further processing. Next, the information extraction step employs natural language processing techniques to extract entities such as courts, cases, domains, and laws, as well as their relationships from the unstructured text. Finally, the knowledge graph is deployed, connecting these entities based on their extracted relationships, creating a heterogeneous graph that effectively represents legal information and caters to users such as lawyers, judges, and scholars. The established baseline model leverages unsupervised learning methods, and by incorporating the knowledge graph, it demonstrates the ability to identify relevant laws for a given legal case. This approach opens up opportunities for various applications in the legal domain, such as legal case analysis, legal recommendation, and decision support.

NLP-56-标题: Exploring the impact of low-rank adaptation on the performance efficiency and regularization of RLHF

链接: https://arxiv.org/abs/2309.09055
作者: Simeng Sun, Dhawal Gupta, Mohit Iyyer
备注:

点击查看摘要

Abstract: During the last stage of RLHF, a large language model is aligned to human intents via PPO training, a process that generally requires large-scale computational resources. In this technical report, we empirically investigate an efficient implementation of RLHF using low-rank adaptation (LoRA), which allows us to align the LLaMA 7B checkpoint on the Alpaca dataset using only two A100 GPUs instead of the eight required for full model fine-tuning. Despite tuning only 0.2% of LLaMA 7B’s parameters, our implementation achieves better performance than the publicly-released AlpacaFarm checkpoint with full model fine-tuning. Next, we analyze several configurations of our LoRA-based PPO implementation, varying the form of the KL regularization term in the training objective. We find that (1) removing this penalty term does not harm performance on the AlpacaFarm evaluation set under our LoRA setup; (2) other regularizers, such as Jensen-Shannon divergence, lead to improved performance; and (3) while PPO training negatively impacts the factuality of model-generated responses, training with LoRA largely mitigates this effect. We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.

NLP-57-标题: Context-aware Adversarial Attack on Named Entity Recognition

链接: https://arxiv.org/abs/2309.08999
作者: Shuguang Chen, Leonardo Neves, Thamar Solorio
备注:

点击查看摘要

Abstract: In recent years, large pre-trained language models (PLMs) have achieved remarkable performance on many natural language processing benchmarks. Despite their success, prior studies have shown that PLMs are vulnerable to attacks from adversarial examples. In this work, we focus on the named entity recognition task and study context-aware adversarial attack methods to examine the model’s robustness. Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples and investigate different candidate replacement methods to generate natural and plausible adversarial examples. Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.

NLP-58-标题: Rethinking STS and NLI in Large Language Models

链接: https://arxiv.org/abs/2309.08969
作者: Yuxia Wang, Minghan Wang, Preslav Nakov
备注:

点击查看摘要

Abstract: In this study, we aim to rethink STS and NLI in the era of large language models (LLMs). We first evaluate the accuracy of clinical/biomedical STS and NLI over five datasets, and then we assess LLM predictive confidence and their capability of capturing collective human opinions. We find that LLMs may be able to provide personalised descriptions for a specific topic, or to generate semantically similar content in different tones, but that this is hard for current LLMs to make personalised judgements or decisions. We further find that zero-shot ChatGPT achieves competitive accuracy over clinical and biomedical STS/NLI, constraining to the fine-tuned BERT-base. However, there is a large variation in sampling, ensembled results perform the best.

NLP-59-标题: Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)

链接: https://arxiv.org/abs/2309.08968
作者: Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
备注:

点击查看摘要

Abstract: The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). While these models excel at understanding and generating human-like text, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference for deep neural networks. It leverages network modularity to create sub-models with varying computational loads, sorting them based on computation/accuracy characteristics in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any pretraining and by only replacing standard Supervised Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT) at the same costs. Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference. We show that using this approach, we are able to unlock the potential of intermediate layers of transformers in generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. By applying this approach on LLaMa 2 13B for tuning on the Stanford Alpaca dataset and comparing it to normal tuning and early exit via PandaLM benchmark, we show that Sorted Fine-Tuning can deliver models twice as fast as the original model while maintaining or exceeding performance.

NLP-60-标题: Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

链接: https://arxiv.org/abs/2309.08963
作者: Xiangru Tang, Yiming Zong, Yilun Zhao, Arman Cohan, Mark Gerstein
备注:

点击查看摘要

Abstract: Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of Current LLMs in generating complex structured data and propose a structure-aware fine-tuning approach as a solution to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, include five representative LLMs (i.e., GPT-NeoX 20B, GPT-3.5, GPT-4, and Vicuna) and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we utilize FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming other evaluated LLMs. Based on these results, we present an ability map of model capabilities from six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at this https URL.

NLP-61-标题: ODSum: New Benchmarks for Open Domain Multi-Document Summarization

链接: https://arxiv.org/abs/2309.08960
作者: Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
备注:

点击查看摘要

Abstract: Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure the retrieving performance. We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets. Based on this method, we introduce a novel dataset, ODSum, a sophisticated case with its document index interdependent and often interrelated. We tackle ODMDS with the \textitretrieve-then-summarize method, and the performance of a list of retrievers and summarizers is investigated. Through extensive experiments, we identify variances in evaluation metrics and provide insights into their reliability. We also found that LLMs suffer great performance loss from retrieving errors. We further experimented methods to improve the performance as well as investigate their robustness against imperfect retrieval. We will release our data and code at this https URL.

NLP-62-标题: Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

链接: https://arxiv.org/abs/2309.08958
作者: Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Barry Haddow, Kenneth Heafield
备注: Work in progress

点击查看摘要

Abstract: Foundational large language models (LLMs) can be instruction-tuned to develop open-ended question-answering capability, facilitating applications such as the creation of AI assistants. While such efforts are often carried out in a single language, building on prior research, we empirically analyze cost-efficient approaches of monolingual and multilingual tuning, shedding light on the efficacy of LLMs in responding to queries across monolingual and multilingual contexts. Our study employs the Alpaca dataset and machine translations of it to form multilingual training data, which is then used to tune LLMs through low-rank adaptation and full-parameter training. Comparisons reveal that multilingual tuning is not crucial for an LLM’s English performance, but is key to its robustness in a multilingual environment. With a fixed budget, a multilingual instruction-tuned model, merely trained on downsampled data, can be as powerful as training monolingual models for each language. Our findings serve as a guide for expanding language support through instruction tuning with constrained computational resources.

NLP-63-标题: Cross-Lingual Knowledge Editing in Large Language Models

链接: https://arxiv.org/abs/2309.08952
作者: Jiaan Wang, Yunlong Liang, Zengkui Sun, Yuxuan Cao, Jiarong Xu
备注:

点击查看摘要

Abstract: Knowledge editing aims to change language models’ performance on several special cases (i.e., editing scope) by infusing the corresponding expected knowledge into them. With the recent advancements in large language models (LLMs), knowledge editing has been shown as a promising technique to adapt LLMs to new knowledge without retraining from scratch. However, most of the previous studies neglect the multi-lingual nature of some main-stream LLMs (e.g., LLaMA, ChatGPT and GPT-4), and typically focus on monolingual scenarios, where LLMs are edited and evaluated in the same language. As a result, it is still unknown the effect of source language editing on a different target language. In this paper, we aim to figure out this cross-lingual effect in knowledge editing. Specifically, we first collect a large-scale cross-lingual synthetic dataset by translating ZsRE from English to Chinese. Then, we conduct English editing on various knowledge editing methods covering different paradigms, and evaluate their performance in Chinese, and vice versa. To give deeper analyses of the cross-lingual effect, the evaluation includes four aspects, i.e., reliability, generality, locality and portability. Furthermore, we analyze the inconsistent behaviors of the edited models and discuss their specific challenges.

NLP-64-标题: Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals

链接: https://arxiv.org/abs/2309.08949
作者: Zhiyuan Hu, Yue Feng, Yang Deng, Zekun Li, See-Kiong Ng, Anh Tuan Luu, Bryan Hooi
备注: 7 Pages

点击查看摘要

Abstract: Recently, the development of large language models (LLMs) has been significantly enhanced the question answering and dialogue generation, and makes them become increasingly popular in current practical scenarios. While unlike the general dialogue system which emphasizes the semantic performance, the task-oriented dialogue (ToD) systems aim to achieve the dialogue goal efficiently and successfully in multiple turns. Unfortunately, existing LLM-induced ToD systems lack the direct reward toward the final goal and do not take account of the dialogue proactivity that can strengthen the dialogue efficiency. To fill these gaps, we introduce the ProToD (Proactively Goal-Driven LLM-Induced ToD) approach, which anticipates the future dialogue actions and incorporates the goal-oriented reward signal to enhance ToD systems. Additionally, we present a novel evaluation method that assesses ToD systems based on goal-driven dialogue simulations. This method allows us to gauge user satisfaction, system efficiency and successful rate while overcoming the limitations of current Information and Success metrics. Empirical experiments conducted on the MultiWoZ 2.1 dataset demonstrate that our model can achieve superior performance using only 10% of the data compared to previous end-to-end fully supervised models. This improvement is accompanied by enhanced user satisfaction and efficiency.

NLP-65-标题: Contextual Label Projection for Cross-Lingual Structure Extraction

链接: https://arxiv.org/abs/2309.08943
作者: Tanmay Parekh, I-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, Nanyun Peng
备注: Work in Progress

点击查看摘要

Abstract: Translating training data into target languages has proven beneficial for cross-lingual transfer. However, for structure extraction tasks, translating data requires a label projection step, which translates input text and obtains translated labels in the translated text jointly. Previous research in label projection mostly compromises translation quality by either facilitating easy identification of translated labels from translated text or using word-level alignment between translation pairs to assemble translated phrase-level labels from the aligned words. In this paper, we introduce CLAP, which first translates text to the target language and performs contextual translation on the labels using the translated text as the context, ensuring better accuracy for the translated labels. We leverage instruction-tuned language models with multilingual capabilities as our contextual translator, imposing the constraint of the presence of translated labels in the translated text via instructions. We compare CLAP with other label projection techniques for creating pseudo-training data in target languages on event argument extraction, a representative structure extraction task. Results show that CLAP improves by 2-2.5 F1-score over other methods on the Chinese and Arabic ACE05 datasets.

NLP-66-标题: Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding

链接: https://arxiv.org/abs/2309.08929
作者: Kaiyan Zhao, Qiyu Wu, Xin-Qiang Cai, Yoshimasa Tsuruoka
备注: 14 pages, 4 figures

点击查看摘要

Abstract: Learning multi-lingual sentence embeddings is a fundamental and significant task in natural language processing. Recent trends of learning both mono-lingual and multi-lingual sentence embeddings are mainly based on contrastive learning (CL) with an anchor, one positive, and multiple negative instances. In this work, we argue that leveraging multiple positives should be considered for multi-lingual sentence embeddings because (1) positives in a diverse set of languages can benefit cross-lingual learning, and (2) transitive similarity across multiple positives can provide reliable structural information to learn. In order to investigate the impact of CL with multiple positives, we propose a novel approach MPCL to effectively utilize multiple positive instances to improve learning multi-lingual sentence embeddings. Our experimental results on various backbone models and downstream tasks support that compared with conventional CL, MPCL leads to better retrieval, semantic similarity, and classification performances. We also observe that on unseen languages, sentence embedding models trained on multiple positives have better cross-lingual transferring performance than models trained on a single positive instance.

NLP-67-标题: Multimodal Multi-Hop Question Answering Through a Conversation Between Tools and Efficiently Finetuned Large Language Models

链接: https://arxiv.org/abs/2309.08922
作者: Hossein Rajabzadeh, Suyuchen Wang, Hyock Ju Kwon, Bang Liu
备注:

点击查看摘要

Abstract: We employ a tool-interacting divide-and-conquer strategy enabling large language models (LLMs) to answer complex multimodal multi-hop questions. In particular, we harness the power of large language models to divide a given multimodal multi-hop question into unimodal single-hop sub-questions to be answered by the appropriate tool from a predefined set of tools. After all corresponding tools provide the LLM with their answers, the LLM generates the next relevant unimodal single-hop question. To increase the reasoning ability of LLMs, we prompt chatGPT to generate a tool-interacting divide-and-conquer dataset. This dataset is then used to efficiently finetune the corresponding LLM. To assess the effectiveness of this approach, we conduct an evaluation on two recently introduced complex question-answering datasets. The experimental analysis demonstrate substantial improvements over existing state-of-the-art solutions, indicating the efficacy and generality of our strategy

NLP-68-标题: A Statistical Turing Test for Generative Models

链接: https://arxiv.org/abs/2309.08913
作者: Hayden Helm, Carey E. Priebe, Weiwei Yang
备注:

点击查看摘要

Abstract: The emergence of human-like abilities of AI systems for content generation in domains such as text, audio, and vision has prompted the development of classifiers to determine whether content originated from a human or a machine. Implicit in these efforts is an assumption that the generation properties of a human are different from that of the machine. In this work, we provide a framework in the language of statistical pattern recognition that quantifies the difference between the distributions of human and machine-generated content conditioned on an evaluation context. We describe current methods in the context of the framework and demonstrate how to use the framework to evaluate the progression of generative models towards human-like capabilities, among many axes of analysis.

NLP-69-标题: Investigating Subtler Biases in LLM s: Ageism Beauty Institutional and Nationality Bias in Generative Models

链接: https://arxiv.org/abs/2309.08902
作者: Mahammed Kamruzzaman, Md. Minul Islam Shovon, Gene Louis Kim
备注:

点击查看摘要

Abstract: LLMs are increasingly powerful and widely used to assist users in a variety of tasks. This use risks the introduction of LLM biases to consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. Bias in NLP systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., Asians are good at math). In this paper, we investigate bias along less studied, but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that LLMs (specially autoregressive language models) make between social groups and unrelated positive and negative attributes. We ask whether LLMs hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the ``what is beautiful is good’’ bias found in people in experimental psychology. We introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. We also reverse the completion task to select the social group based on an attribute. Finally, we report the correlations that we find for multiple cutting-edge LLMs. This dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.

NLP-70-标题: Semantic Information Extraction for Text Data with Probability Graph

链接: https://arxiv.org/abs/2309.08879
作者: Zhouxiang Zhao, Zhaohui Yang, Ye Hu, Licheng Lin, Zhaoyang Zhang
备注:

点击查看摘要

Abstract: In this paper, the problem of semantic information extraction for resource constrained text data transmission is studied. In the considered model, a sequence of text data need to be transmitted within a communication resource-constrained network, which only allows limited data transmission. Thus, at the transmitter, the original text data is extracted with natural language processing techniques. Then, the extracted semantic information is captured in a knowledge graph. An additional probability dimension is introduced in this graph to capture the importance of each information. This semantic information extraction problem is posed as an optimization framework whose goal is to extract most important semantic information for transmission. To find an optimal solution for this problem, a Floyd’s algorithm based solution coupled with an efficient sorting mechanism is proposed. Numerical results testify the effectiveness of the proposed algorithm with regards to two novel performance metrics including semantic uncertainty and semantic similarity.

NLP-71-标题: X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs

链接: https://arxiv.org/abs/2309.08873
作者: Juan Diego Rodriguez, Katrin Erk, Greg Durrett
备注:

点击查看摘要

Abstract: Understanding when two pieces of text convey the same information is a goal touching many subproblems in NLP, including textual entailment and fact-checking. This problem becomes more complex when those two pieces of text are in different languages. Here, we introduce X-PARADE (Cross-lingual Paragraph-level Analysis of Divergences and Entailments), the first cross-lingual dataset of paragraph-level information divergences. Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language, indicating whether a given piece of information is the same, new, or new but can be inferred. This last notion establishes a link with cross-language NLI. Aligned paragraphs are sourced from Wikipedia pages in different languages, reflecting real information divergences observed in the wild. Armed with our dataset, we investigate a diverse set of approaches for this problem, including classic token alignment from machine translation, textual entailment methods that localize their decisions, and prompting of large language models. Our results show that these methods vary in their capability to handle inferable information, but they all fall short of human performance.

NLP-72-标题: PDFTriage: Question Answering over Long Structured Documents

链接: https://arxiv.org/abs/2309.08872
作者: Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Ryan A. Rossi, Franck Dernoncourt
备注:

点击查看摘要

Abstract: Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing them as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user’s mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA.

NLP-73-标题: MHLAT: Multi-hop Label-wise Attention Model for Automatic ICD Coding ICASSP2023

链接: https://arxiv.org/abs/2309.08868
作者: Junwen Duan, Han Jiang, Ying Yu
备注: 5 pages, 1 figure, accepted in ICASSP 2023

点击查看摘要

Abstract: International Classification of Diseases (ICD) coding is the task of assigning ICD diagnosis codes to clinical notes. This can be challenging given the large quantity of labels (nearly 9,000) and lengthy texts (up to 8,000 tokens). However, unlike the single-pass reading process in previous works, humans tend to read the text and label definitions again to get more confident answers. Moreover, although pretrained language models have been used to address these problems, they suffer from huge memory usage. To address the above problems, we propose a simple but effective model called the Multi-Hop Label-wise ATtention (MHLAT), in which multi-hop label-wise attention is deployed to get more precise and informative representations. Extensive experiments on three benchmark MIMIC datasets indicate that our method achieves significantly better or competitive performance on all seven metrics, with much fewer parameters to optimize.

NLP-74-标题: Has Sentiment Returned to the Pre-pandemic Level? A Sentiment Analysis Using U.S. College Subreddit Data from 2019 to 2022

链接: https://arxiv.org/abs/2309.08845
作者: Tian Yan, Fang Liu
备注:

点击查看摘要

Abstract: As impact of COVID-19 pandemic winds down, both individuals and society gradually return to pre-pandemic activities. This study aims to explore how people’s emotions have changed from the pre-pandemic during the pandemic to post-emergency period and whether it has returned to pre-pandemic level. We collected Reddit data in 2019 (pre-pandemic), 2020 (peak pandemic), 2021, and 2022 (late stages of pandemic, transitioning period to post-emergency period) from subreddits in 128 universities/colleges in the U.S., and a set of school-level characteristics. We predicted two sets of sentiments from a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) and graph attention network (GAT) that leverages both rich semantic and relational information among posted messages and then applied a logistic stacking method to obtain the final sentiment classification. After obtaining sentiment label for each message, we used a generalized linear mixed-effects model to estimate temporal trend in sentiment from 2019 to 2022 and how school-level factors may affect sentiment. Compared to the year 2019, the odds of negative sentiment in years 2020, 2021, and 2022 are 24%, 4.3%, and 10.3% higher, respectively, which are all statistically significant(adjusted p <0.05). Our study findings suggest a partial recovery in the sentiment composition in the post-pandemic-emergency era. The results align with common expectations and provide a detailed quantification of how sentiments have evolved from 2019 to 2022.

NLP-75-标题: Bias and Fairness in Chatbots: An Overview

链接: https://arxiv.org/abs/2309.08836
作者: Jintang Xue, Yun-Cheng Wang, Chengwei Wei, Xiaofeng Liu, Jonghye Woo, C.-C. Jay Kuo
备注:

点击查看摘要

Abstract: Chatbots have been studied for more than half a century. With the rapid development of natural language processing (NLP) technologies in recent years, chatbots using large language models (LLMs) have received much attention nowadays. Compared with traditional ones, modern chatbots are more powerful and have been used in real-world applications. There are however, bias and fairness concerns in modern chatbot design. Due to the huge amounts of training data, extremely large model sizes, and lack of interpretability, bias mitigation and fairness preservation of modern chatbots are challenging. Thus, a comprehensive overview on bias and fairness in chatbot systems is given in this paper. The history of chatbots and their categories are first reviewed. Then, bias sources and potential harms in applications are analyzed. Considerations in designing fair and unbiased chatbot systems are examined. Finally, future research directions are discussed.

NLP-76-标题: SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window

链接: https://arxiv.org/abs/2309.08832
作者: Vikas Raunak, Tom Kocmi, Matt Post
备注:

点击查看摘要

Abstract: Reference-based metrics that operate at the sentence level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. We investigate whether additional source context can effectively substitute for a reference. We present a metric, SLIDE (SLiding Document Evaluator), which operates on blocks of sentences using a window that slides over each document in the test set, feeding each chunk into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-base metrics. This suggests that source context may provide the same information as a human reference.

NLP-77-标题: S3-DST: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of LLM s

链接: https://arxiv.org/abs/2309.08827
作者: Sarkar Snigdha Sarathi Das, Chirag Shah, Mengting Wan, Jennifer Neville, Longqi Yang, Reid Andersen, Georg Buscher, Tara Safavi
备注:

点击查看摘要

Abstract: The traditional Dialogue State Tracking (DST) problem aims to track user preferences and intents in user-agent conversations. While sufficient for task-oriented dialogue systems supporting narrow domain applications, the advent of Large Language Model (LLM)-based chat systems has introduced many real-world intricacies in open-domain dialogues. These intricacies manifest in the form of increased complexity in contextual interactions, extended dialogue sessions encompassing a diverse array of topics, and more frequent contextual shifts. To handle these intricacies arising from evolving LLM-based chat systems, we propose joint dialogue segmentation and state tracking per segment in open-domain dialogue systems. Assuming a zero-shot setting appropriate to a true open-domain dialogue system, we propose S3-DST, a structured prompting technique that harnesses Pre-Analytical Recollection, a novel grounding mechanism we designed for improving long context tracking. To demonstrate the efficacy of our proposed approach in joint segmentation and state tracking, we evaluate S3-DST on a proprietary anonymized open-domain dialogue dataset, as well as publicly available DST and segmentation datasets. Across all datasets and settings, S3-DST consistently outperforms the state-of-the-art, demonstrating its potency and robustness the next generation of LLM-based chat systems.

NLP-78-标题: An Empirical Study on Instance Selection Strategies in Self-training for Sentiment Analysis

链接: https://arxiv.org/abs/2309.08777
作者: Haochen Liu, Sai Krishna Rallabandi, Yijing Wu, Parag Pravin Dakle, Preethi Raghavan
备注: 6 pages, 2 figures

点击查看摘要

Abstract: Sentiment analysis is a crucial task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training has recently emerged as an economical and efficient technique for developing sentiment analysis models by leveraging a small amount of labeled data and a larger amount of unlabeled data. However, the performance of a self-training procedure heavily relies on the choice of the instance selection strategy, which has not been studied thoroughly. This paper presents an empirical study on various instance selection strategies for self-training on two public sentiment datasets, and investigates the influence of the strategy and hyper-parameters on the performance of self-training in various few-shot settings.

NLP-79-标题: AlbNER: A Corpus for Named Entity Recognition in Albanian

链接: https://arxiv.org/abs/2309.08741
作者: Erion Çano
备注: 5 pages, 6 tables

点击查看摘要

Abstract: Scarcity of resources such as annotated text corpora for under-resourced languages like Albanian is a serious impediment in computational linguistics and natural language processing research. This paper presents AlbNER, a corpus of 900 sentences with labeled named entities, collected from Albanian Wikipedia articles. Preliminary results with BERT and RoBERTa variants fine-tuned and tested with AlbNER data indicate that model size has slight impact on NER performance, whereas language transfer has a significant one. AlbNER corpus and these obtained results should serve as baselines for future experiments.

NLP-80-标题: Generating Semantic Graph Corpora with Graph Expansion Grammar

链接: https://arxiv.org/abs/2309.08714
作者: Eric Andersson (Umeå University), Johanna Björklund (Umeå University), Frank Drewes (Umeå University), Anna Jonsson (Umeå University)
备注: In Proceedings NCMA 2023, arXiv:2309.07333

点击查看摘要

Abstract: We introduce Lovelace, a tool for creating corpora of semantic graphs. The system uses graph expansion grammar as a representational language, thus allowing users to craft a grammar that describes a corpus with desired properties. When given such grammar as input, the system generates a set of output graphs that are well-formed according to the grammar, i.e., a graph bank. The generation process can be controlled via a number of configurable parameters that allow the user to, for example, specify a range of desired output graph sizes. Central use cases are the creation of synthetic data to augment existing corpora, and as a pedagogical tool for teaching formal language theory.

NLP-81-标题: Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning

链接: https://arxiv.org/abs/2309.08708
作者: Miles Williams, Nikolaos Aletras
备注:

点击查看摘要

Abstract: The extensive memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings, such as cloud environments or on-device. PLMs use embedding matrices to represent extensive vocabularies, forming a large proportion of the model parameters. While previous work towards parameter-efficient PLM development has considered pruning parameters within the transformer layers, pruning the embedding matrix as part of fine-tuning or inference has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused in these scenarios. We then propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix. We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach maintains equivalent downstream task performance while allowing a more efficient use of compute resources.

NLP-82-标题: Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents

链接: https://arxiv.org/abs/2309.08695
作者: Ramona Christen, Anastassia Shaitarova, Matthias Stürmer, Joel Niklaus
备注:

点击查看摘要

Abstract: Resolving the scope of a negation within a sentence is a challenging NLP task. The complexity of legal texts and the lack of annotated in-domain negation corpora pose challenges for state-of-the-art (SotA) models when performing negation scope resolution on multilingual legal data. Our experiments demonstrate that models pre-trained without legal data underperform in the task of negation scope resolution. Our experiments, using language models exclusively fine-tuned on domains like literary texts and medical data, yield inferior results compared to the outcomes documented in prior cross-domain experiments. We release a new set of annotated court decisions in German, French, and Italian and use it to improve negation scope resolution in both zero-shot and multilingual settings. We achieve token-level F1-scores of up to 86.7% in our zero-shot cross-lingual experiments, where the models are trained on two languages of our legal datasets and evaluated on the third. Our multilingual experiments, where the models were trained on all available negation data and evaluated on our legal datasets, resulted in F1-scores of up to 91.1%.

NLP-83-标题: Fake News Detectors are Biased against Texts Generated by Large Language Models

链接: https://arxiv.org/abs/2309.08674
作者: Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, Preslav Nakov
备注: The first two authors contributed equally

点击查看摘要

Abstract: The spread of fake news has emerged as a critical challenge, undermining trust and posing threats to society. In the era of Large Language Models (LLMs), the capability to generate believable fake content has intensified these concerns. In this study, we present a novel paradigm to evaluate fake news detectors in scenarios involving both human-written and LLM-generated misinformation. Intriguingly, our findings reveal a significant bias in many existing detectors: they are more prone to flagging LLM-generated content as fake news while often misclassifying human-written fake news as genuine. This unexpected bias appears to arise from distinct linguistic patterns inherent to LLM outputs. To address this, we introduce a mitigation strategy that leverages adversarial training with LLM-paraphrased genuine news. The resulting model yielded marked improvements in detection accuracy for both human and LLM-generated news. To further catalyze research in this domain, we release two comprehensive datasets, \textttGossipCop++ and \textttPolitiFact++, thus amalgamating human-validated articles with LLM-generated fake and real news.

NLP-84-标题: Adversarial Attacks on Tables with Entity Swap VLDB2023

链接: https://arxiv.org/abs/2309.08650
作者: Aneta Koleva, Martin Ringsquandl, Volker Tresp
备注: Accepted at TaDA workshop at VLDB 2023

点击查看摘要

Abstract: The capabilities of large language models (LLMs) have been successfully applied in the context of table representation learning. The recently proposed tabular language models have reported state-of-the-art results across various tasks for table interpretation. However, a closer look into the datasets commonly used for evaluation reveals an entity leakage from the train set into the test set. Motivated by this observation, we explore adversarial attacks that represent a more realistic inference setup. Adversarial attacks on text have been shown to greatly affect the performance of LLMs, but currently, there are no attacks targeting tabular language models. In this paper, we propose an evasive entity-swap attack for the column type annotation (CTA) task. Our CTA attack is the first black-box attack on tables, where we employ a similarity-based sampling strategy to generate adversarial examples. The experimental results show that the proposed attack generates up to a 70% drop in performance.

NLP-85-标题: MAPLE: Mobile App Prediction Leveraging Large Language model Embeddings

链接: https://arxiv.org/abs/2309.08648
作者: Yonchanok Khaokaew, Hao Xue, Flora D. Salim
备注:

点击查看摘要

Abstract: Despite the rapid advancement of mobile applications, predicting app usage remains a formidable challenge due to intricate user behaviours and ever-evolving contexts. To address these issues, this paper introduces the Mobile App Prediction Leveraging Large Language Model Embeddings (MAPLE) model. This innovative approach utilizes Large Language Models (LLMs) to predict app usage accurately. Rigorous testing on two public datasets highlights MAPLE’s capability to decipher intricate patterns and comprehend user contexts. These robust results confirm MAPLE’s versatility and resilience across various scenarios. While its primary design caters to app prediction, the outcomes also emphasize the broader applicability of LLMs in different domains. Through this research, we emphasize the potential of LLMs in app usage prediction and suggest their transformative capacity in modelling human behaviours across diverse fields.

NLP-86-标题: Intent Detection at Scale: Tuning a Generic Model using Relevant Intents ICML

链接: https://arxiv.org/abs/2309.08647
作者: Nichal Narotamo, David Aparicio, Tiago Mesquita, Mariana Almeida
备注: 6 pages, 6 tables, 2 figures, ICMLA 2023

点击查看摘要

Abstract: Accurately predicting the intent of customer support requests is vital for efficient support systems, enabling agents to quickly understand messages and prioritize responses accordingly. While different approaches exist for intent detection, maintaining separate client-specific or industry-specific models can be costly and impractical as the client base expands. This work proposes a system to scale intent predictions to various clients effectively, by combining a single generic model with a per-client list of relevant intents. Our approach minimizes training and maintenance costs while providing a personalized experience for clients, allowing for seamless adaptation to changes in their relevant intents. Furthermore, we propose a strategy for using the clients relevant intents as model features that proves to be resilient to changes in the relevant intents of clients – a common occurrence in production environments. The final system exhibits significantly superior performance compared to industry-specific models, showcasing its flexibility and ability to cater to diverse client needs.

NLP-87-标题: Cure the headache of Transformer s via Collinear Constrained Attention

链接: https://arxiv.org/abs/2309.08646
作者: Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, Jianguo Li
备注: 16 pages, 6 figures

点击查看摘要

Abstract: As the rapid progression of practical applications based on Large Language Models continues, the importance of extrapolating performance has grown exponentially in the research domain. In our study, we identified an anomalous behavior in Transformer models that had been previously overlooked, leading to a chaos around closest tokens which carried the most important information. We’ve coined this discovery the “headache of Transformers”. To address this at its core, we introduced a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation, interpolation methods, and other optimization strategies designed for traditional Transformer models. We have achieved excellent extrapolating performance even for 16 times to 24 times of sequence lengths during inference without any fine-tuning on our model. We have also enhanced CoCA’s computational and spatial efficiency to ensure its practicality. We plan to open-source CoCA shortly. In the meantime, we’ve made our code available in the appendix for reappearing experiments.

NLP-88-标题: Anchor Points: Benchmarking Models with Much Fewer Examples

链接: https://arxiv.org/abs/2309.08638
作者: Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela
备注:

点击查看摘要

Abstract: Modern language models often exhibit powerful but brittle behavior, leading to the development of larger and more diverse benchmarks to reliably assess their behavior. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. We build upon this phenomenon to propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset. Anchor points reliably rank models: across 87 diverse language model-prompt pairs, evaluating models using 1-30 anchor points outperforms uniform sampling and other baselines at accurately ranking models. Moreover, just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error, sufficient for gauging where the model is likely to fail. Lastly, we present Anchor Point Maps for visualizing these insights and facilitating comparisons of the performance of different models on various regions within the dataset distribution.

NLP-89-标题: TextBind: Multi-turn Interleaved Multimodal Instruction-following

链接: https://arxiv.org/abs/2309.08637
作者: Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi
备注: work in progress

点击查看摘要

Abstract: Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.

NLP-90-标题: ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing? (ver. 23Q3)

链接: https://arxiv.org/abs/2309.08636
作者: Edisa Lozić, Benjamin Štular
备注: Non-peer reviewed preprint. Includes Graphical abstract, 8 Figures, and 3 Appendices

点击查看摘要

Abstract: Historically, proficient writing was deemed essential for human advancement, with creative expression viewed as one of the hallmarks of human achievement. However, recent advances in generative AI have marked an inflection point in this narrative, including for scientific writing. This article provides a comprehensive analysis of the capabilities and limitations of six AI chatbots in scholarly writing in the humanities and archaeology. The methodology was based on tagging AI generated content for quantitative accuracy and qualitative precision by human experts. Quantitative accuracy assessed the factual correctness, while qualitative precision gauged the scientific contribution. While the AI chatbots, especially ChatGPT-4, demonstrated proficiency in recombining existing knowledge, they failed in generating original scientific content. As a side note, our results also suggest that with ChatGPT-4 the size of the LLMs has plateaued. Furthermore, the paper underscores the intricate and recursive nature of human research. This process of transforming raw data into refined knowledge is computationally irreducible, which highlights the challenges AI chatbots face in emulating human originality in scientific writing. In conclusion, while large language models have revolutionised content generation, their ability to produce original scientific contributions in the humanities remains limited. We expect that this will change in the near future with the evolution of current LLM-based AI chatbots towards LLM-powered software.

NLP-91-标题: Pretrain ing on the Test Set Is All You Need

链接: https://arxiv.org/abs/2309.08632
作者: Rylan Schaeffer
备注: 3 pages, satire

点击查看摘要

Abstract: Inspired by recent work demonstrating the promise of smaller Transformer-based language models pretrained on carefully curated data, we supercharge such approaches by investing heavily in curating a novel, high quality, non-synthetic data mixture based solely on evaluation benchmarks. Using our novel dataset mixture consisting of less than 100 thousand tokens, we pretrain a 1 million parameter transformer-based LLM \textbfphi-CTNL (pronounced ``fictional") that achieves perfect results across diverse academic benchmarks, strictly outperforming all known foundation models. \textbfphi-CTNL also beats power-law scaling and exhibits a never-before-seen grokking-like ability to accurately predict downstream evaluation benchmarks’ canaries.

NLP-92-标题: Large Language Models Can Infer Psychological Dispositions of Social Media Users

链接: https://arxiv.org/abs/2309.08631
作者: Heinrich Peters, Sandra Matz
备注:

点击查看摘要

Abstract: As Large Language Models (LLMs) demonstrate increasingly human-like abilities in various natural language processing (NLP) tasks that are bound to become integral to personalized technologies, understanding their capabilities and inherent biases is crucial. Our study investigates the potential of LLMs like ChatGPT to infer psychological dispositions of individuals from their digital footprints. Specifically, we assess the ability of GPT-3.5 and GPT-4 to derive the Big Five personality traits from users’ Facebook status updates in a zero-shot learning scenario. Our results show an average correlation of r = .29 (range = [.22, .33]) between LLM-inferred and self-reported trait scores. Furthermore, our findings suggest biases in personality inferences with regard to gender and age: inferred scores demonstrated smaller errors for women and younger individuals on several traits, suggesting a potential systematic bias stemming from the underlying training data or differences in online self-expression.

NLP-93-标题: Recovering from Privacy-Preserving Masking with Large Language Models ICASSP

链接: https://arxiv.org/abs/2309.08628
作者: Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli
备注: Submitted to ICASSP

点击查看摘要

Abstract: Model adaptation is crucial to handle the discrepancy between proxy training data and actual users data received. To effectively perform adaptation, textual data of users is typically stored on servers or their local devices, where downstream natural language processing (NLP) models can be directly trained using such in-domain data. However, this might raise privacy and security concerns due to the extra risks of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has been recently explored. In this work, we leverage large language models (LLMs) to suggest substitutes of masked tokens and have their effectiveness evaluated on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets for the comparison of these methods. Experimental results show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data without privacy-preserving token masking.

NLP-94-标题: Evaluating Dynamic Topic Models

链接: https://arxiv.org/abs/2309.08627
作者: Charu James, Mayank Nagda, Nooshin Haji Ghassemi, Marius Kloft, Sophie Fellenz
备注:

点击查看摘要

Abstract: There is a lack of quantitative measures to evaluate the progression of topics through time in dynamic topic models (DTMs). Filling this gap, we propose a novel evaluation measure for DTMs that analyzes the changes in the quality of each topic over time. Additionally, we propose an extension combining topic quality with the model’s temporal consistency. We demonstrate the utility of the proposed measure by applying it to synthetic data and data from existing DTMs. We also conducted a human evaluation, which indicates that the proposed measure correlates well with human judgment. Our findings may help in identifying changing topics, evaluating different DTMs, and guiding future research in this area.

NLP-95-标题: Improving Robustness of Neural Inverse Text Normalization via Data-Augmentation Semi-Supervised Learning and Post-Aligning Method ICASSP2024

链接: https://arxiv.org/abs/2309.08626
作者: Juntae Kim, Minkyu Lim, Seokjin Hong
备注: submitted to ICASSP 2024

点击查看摘要

Abstract: Inverse text normalization (ITN) is crucial for converting spoken-form into written-form, especially in the context of automatic speech recognition (ASR). While most downstream tasks of ASR rely on written-form, ASR systems often output spoken-form, highlighting the necessity for robust ITN in product-level ASR-based applications. Although neural ITN methods have shown promise, they still encounter performance challenges, particularly when dealing with ASR-generated spoken text. These challenges arise from the out-of-domain problem between training data and ASR-generated text. To address this, we propose a direct training approach that utilizes ASR-generated written or spoken text, with pairs augmented through ASR linguistic context emulation and a semi-supervised learning method enhanced by a large language model, respectively. Additionally, we introduce a post-aligning method to manage unpredictable errors, thereby enhancing the reliability of ITN. Our experiments show that our proposed methods remarkably improved ITN performance in various ASR scenarios.

NLP-96-标题: Performance of ChatGPT -3.5 and GPT -4 on the United States Medical Licensing Examination With and Without Distractions

链接: https://arxiv.org/abs/2309.08625
作者: Myriam Safrai, Amos Azaria
备注: We release a dataset along with this paper

点击查看摘要

Abstract: As Large Language Models (LLMs) are predictive models building their response based on the words in the prompts, there is a risk that small talk and irrelevant information may alter the response and the suggestion given. Therefore, this study aims to investigate the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE step 3 questions were used as a model for relevant medical data. We use both multiple choice and open ended questions. We gathered small talk sentences from human participants using the Mechanical Turk platform. Both sets of USLME questions were arranged in a pattern where each sentence from the original questions was followed by a small talk sentence. ChatGPT 3.5 and 4 were asked to answer both sets of questions with and without the small talk sentences. A board-certified physician analyzed the answers by ChatGPT and compared them to the formal correct answer. The analysis results demonstrate that the ability of ChatGPT-3.5 to answer correctly was impaired when small talk was added to medical data for multiple-choice questions (72.1% vs. 68.9%) and open questions (61.5% vs. 44.3%; p=0.01), respectively. In contrast, small talk phrases did not impair ChatGPT-4 ability in both types of questions (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and it appears that small talk does not impair its capability to provide medical recommendations. Our results are an important first step in understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.

NLP-97-标题: Challenges in Annotating Dataset s to Quantify Bias in Under-represented Society

链接: https://arxiv.org/abs/2309.08624
作者: Vithya Yogarajan, Gillian Dobbie, Timothy Pistotti, Joshua Bensemann, Kobe Knowles
备注: Accepted in Ethics and Trust in Human-AI Collaboration: Socio-Technical Approaches @ The 32nd International Joint Conference on Artificial Intelligence

点击查看摘要

Abstract: Recent advances in artificial intelligence, including the development of highly sophisticated large language models (LLM), have proven beneficial in many real-world applications. However, evidence of inherent bias encoded in these LLMs has raised concerns about equity. In response, there has been an increase in research dealing with bias, including studies focusing on quantifying bias and developing debiasing techniques. Benchmark bias datasets have also been developed for binary gender classification and ethical/racial considerations, focusing predominantly on American demographics. However, there is minimal research in understanding and quantifying bias related to under-represented societies. Motivated by the lack of annotated datasets for quantifying bias in under-represented societies, we endeavoured to create benchmark datasets for the New Zealand (NZ) population. We faced many challenges in this process, despite the availability of three annotators. This research outlines the manual annotation process, provides an overview of the challenges we encountered and lessons learnt, and presents recommendations for future research.

NLP-98-标题: Analyzing Character and Consciousness in AI-Generated Social Content: A Case Study of Chirper the AI Social Network

链接: https://arxiv.org/abs/2309.08614
作者: Jianwei Luo
备注:

点击查看摘要

Abstract: This paper delves into an intricate analysis of the character and consciousness of AI entities, with a particular focus on Chirpers within the AI social network. At the forefront of this research is the introduction of novel testing methodologies, including the Influence index and Struggle Index Test, which offers a fresh lens for evaluating specific facets of AI behavior. The study embarks on a comprehensive exploration of AI behavior, analyzing the effects of diverse settings on Chirper’s responses, thereby shedding light on the intricate mechanisms steering AI reactions in different contexts. Leveraging the state-of-the-art BERT model, the research assesses AI’s ability to discern its own output, presenting a pioneering approach to understanding self-recognition in AI systems. Through a series of cognitive tests, the study gauges the self-awareness and pattern recognition prowess of Chirpers. Preliminary results indicate that Chirpers exhibit a commendable degree of self-recognition and self-awareness. However, the question of consciousness in these AI entities remains a topic of debate. An intriguing aspect of the research is the exploration of the potential influence of a Chirper’s handle or personality type on its performance. While initial findings suggest a possible impact, it isn’t pronounced enough to form concrete conclusions. This study stands as a significant contribution to the discourse on AI consciousness, underscoring the imperative for continued research to unravel the full spectrum of AI capabilities and the ramifications they hold for future human-AI interactions.

NLP-99-标题: Multimodal Recommender Systems in the Prediction of Disease Comorbidity

链接: https://arxiv.org/abs/2309.08613
作者: Aashish Cheruvu
备注: 2022 Fourth International Conference on Transdisciplinary AI (TransAI)

点击查看摘要

Abstract: While deep-learning based recommender systems utilizing collaborative filtering have been commonly used for recommendation in other domains, their application in the medical domain have been limited. In addition to modeling user-item interactions, we show that deep-learning based recommender systems can be used to model subject-disease code interactions. Two novel applications of deep learning-based recommender systems using Neural Collaborative Filtering (NCF) and Deep Hybrid Filtering (DHF) were utilized for disease diagnosis based on known past patient comorbidities. Two datasets, one incorporating all subject-disease code pairs present in the MIMIC-III database, and the other incorporating the top 50 most commonly occurring diseases, were used for prediction. Accuracy and Hit Ratio@10 were utilized as metrics to estimate model performance. The performance of the NCF model making use of the reduced “top 50” ICD-9 code dataset was found to be lower (accuracy of ~80% and hit ratio@10 of 35%) as compared to the performance of the NCF model trained on all ICD-9 codes (accuracy of ~90% and hit ratio@10 of ~80%). Reasons for the superior performance of the sparser dataset with all ICD codes can be mainly attributed to the higher volume of data and the robustness of deep-learning based recommender systems with modeling sparse data. Additionally, results from the DHF models reflect better performance than the NCF models, with a better accuracy of 94.4% and hit ratio@10 of 85.36%, reflecting the importance of the incorporation of clinical note information. Additionally, compared to literature reports utilizing primarily natural language processing-based predictions for the task of ICD-9 code co-occurrence, the novel deep learning-based recommender systems approach performed better. Overall, the deep learning-based recommender systems have shown promise in predicting disease comorbidity.

NLP-100-标题: Explaining Vision and Language through Graphs of Events in Space and Time ICCV

链接: https://arxiv.org/abs/2309.08612
作者: Mihai Masala, Nicolae Cudlenco, Traian Rebedea, Marius Leordeanu
备注: Accepted at IEEE International Conference on Computer Vision (ICCV) 2023 Workshops: 5th Workshop On Closing The Loop Between Vision And Language

点击查看摘要

Abstract: Artificial Intelligence makes great advances today and starts to bridge the gap between vision and language. However, we are still far from understanding, explaining and controlling explicitly the visual content from a linguistic perspective, because we still lack a common explainable representation between the two domains. In this work we come to address this limitation and propose the Graph of Events in Space and Time (GEST), by which we can represent, create and explain, both visual and linguistic stories. We provide a theoretical justification of our model and an experimental validation, which proves that GEST can bring a solid complementary value along powerful deep learning models. In particular, GEST can help improve at the content-level the generation of videos from text, by being easily incorporated into our novel video generation engine. Additionally, by using efficient graph matching techniques, the GEST graphs can also improve the comparisons between texts at the semantic level.

NLP-101-标题: Media of Langue

链接: https://arxiv.org/abs/2309.08609
作者: Goki Muramoto, Atsuki Sato, Takayoshi Koyama
备注:

点击查看摘要

Abstract: This paper aims to archive the materials behind “Media of Langue” by Goki Muramoto et al. Media of Langue is a new dictionary and public sculpture that depicts the map of meaning on the boundary between languages solely from the vast events of “this word was translated into that word” and two forces: repulsion between all words in the same language and attraction between translated words in different languages. First, the three new concepts proposed, Inter-Langue Map/Dictionary, Inter-Langue Space, and then Inter-Langue Network, are introduced, comparing them to the three domains of dictionary, semantic space, and semantic network. Next, the specific algorithms and designs implemented in the work were described.

NLP-102-标题: RECAP: Retrieval-Augmented Audio Captioning

链接: https://arxiv.org/abs/2309.09836
作者: Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
备注: Code and data soon here: this https URL

点击查看摘要

Abstract: We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its capability to exploit a large text-captions-only datastore in a \textittraining-free fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audios with multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.

NLP-103-标题: Training dynamic models using early exits for automatic speech recognition on resource-constrained devices

链接: https://arxiv.org/abs/2309.09546
作者: George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Alessio Brutti
备注:

点击查看摘要

Abstract: The possibility of dynamically modifying the computational load of neural models at inference time is crucial for on-device processing, where computational power is limited and time-varying. Established approaches for neural model compression exist, but they provide architecturally static models. In this paper, we investigate the use of early-exit architectures, that rely on intermediate exit branches, applied to large-vocabulary speech recognition. This allows for the development of dynamic models that adjust their computational cost to the available resources and recognition performance. Unlike previous works, besides using pre-trained backbones we also train the model from scratch with an early-exit architecture. Experiments on public datasets show that early-exit architectures from scratch not only preserve performance levels when using fewer encoder layers, but also improve task accuracy as compared to using single-exit models or using pre-trained models. Additionally, we investigate an exit selection strategy based on posterior probabilities as an alternative to frame-based entropy.

NLP-104-标题: Enhancing Multilingual Speech Recognition through Language Prompt Tuning and Frame-Level Language Adapter ICASSP2024

链接: https://arxiv.org/abs/2309.09443
作者: Song Li, Yonbin You, Xuezhi Wang, Ke Ding, Guanglu Wan
备注: Submitted to ICASSP2024

点击查看摘要

Abstract: Multilingual intelligent assistants, such as ChatGPT, have recently gained popularity. To further expand the applications of multilingual artificial intelligence assistants and facilitate international communication, it is essential to enhance the performance of multilingual speech recognition, which is a crucial component of speech interaction. In this paper, we propose two simple and parameter-efficient methods: language prompt tuning and frame-level language adapter, to respectively enhance language-configurable and language-agnostic multilingual speech recognition. Additionally, we explore the feasibility of integrating these two approaches using parameter-efficient fine-tuning methods. Our experiments demonstrate significant performance improvements across seven languages using our proposed methods.

NLP-105-标题: MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

链接: https://arxiv.org/abs/2309.08730
作者: Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos
备注:

点击查看摘要

Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from MusicCaps, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.

机器学习

ML-0-标题: A Multi-Token Coordinate Descent Method for Semi-Decentralized Vertical Federated Learning

链接: https://arxiv.org/abs/2309.09977
作者: Pedro Valdeira, Yuejie Chi, Cláudia Soares, João Xavier
备注:

点击查看摘要

Abstract: Communication efficiency is a major challenge in federated learning (FL). In client-server schemes, the server constitutes a bottleneck, and while decentralized setups spread communications, they do not necessarily reduce them due to slower convergence. We propose Multi-Token Coordinate Descent (MTCD), a communication-efficient algorithm for semi-decentralized vertical federated learning, exploiting both client-server and client-client communications when each client holds a small subset of features. Our multi-token method can be seen as a parallel Markov chain (block) coordinate descent algorithm and it subsumes the client-server and decentralized setups as special cases. We obtain a convergence rate of \mathcalO(1/T) for nonconvex objectives when tokens roam over disjoint subsets of clients and for convex objectives when they roam over possibly overlapping subsets. Numerical results show that MTCD improves the state-of-the-art communication efficiency and allows for a tunable amount of parallel communications.

ML-1-标题: Empirical Study of Mix-based Data Augmentation Methods in Physiological Time Series Data ALT

链接: https://arxiv.org/abs/2309.09970
作者: Peikun Guo, Huiyuan Yang, Akane Sano
备注: The 11th IEEE International Conference on Healthcare Informatics (IEEE ICHI 2023)

点击查看摘要

Abstract: Data augmentation is a common practice to help generalization in the procedure of deep model training. In the context of physiological time series classification, previous research has primarily focused on label-invariant data augmentation methods. However, another class of augmentation techniques (\textiti.e., Mixup) that emerged in the computer vision field has yet to be fully explored in the time series domain. In this study, we systematically review the mix-based augmentations, including mixup, cutmix, and manifold mixup, on six physiological datasets, evaluating their performance across different sensory data and classification tasks. Our results demonstrate that the three mix-based augmentations can consistently improve the performance on the six datasets. More importantly, the improvement does not rely on expert knowledge or extensive parameter tuning. Lastly, we provide an overview of the unique properties of the mix-based augmentation methods and highlight the potential benefits of using the mix-based augmentation in physiological time series data.

ML-2-标题: Prompt a Robot to Walk with Large Language Models

链接: https://arxiv.org/abs/2309.09969
作者: Yen-Jen Wang, Bike Zhang, Jianyu Chen, Koushil Sreenath
备注:

点击查看摘要

Abstract: Large language models (LLMs) pre-trained on vast internet-scale data have showcased remarkable capabilities across diverse domains. Recently, there has been escalating interest in deploying LLMs for robotics, aiming to harness the power of foundation models in real-world settings. However, this approach faces significant challenges, particularly in grounding these models in the physical world and in generating dynamic robot motions. To address these issues, we introduce a novel paradigm in which we use few-shot prompts collected from the physical environment, enabling the LLM to autoregressively generate low-level control commands for robots without task-specific fine-tuning. Experiments across various robots and environments validate that our method can effectively prompt a robot to walk. We thus illustrate how LLMs can proficiently function as low-level feedback controllers for dynamic motion control even in high-dimensional robotic systems. The project website and source code can be found at: this https URL .

ML-3-标题: Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

链接: https://arxiv.org/abs/2309.09968
作者: Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
备注:

点击查看摘要

Abstract: Tabular data is hard to acquire and is subject to missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks as function approximators, we instead utilize XGBoost, a popular Gradient-Boosted Tree (GBT) method. In addition to being elegant, we empirically show on various datasets that our method i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse plausible data imputations. Our method often outperforms deep-learning generation methods and can trained in parallel using CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library on PyPI and an R package on CRAN.

ML-4-标题: Evaluating Adversarial Robustness with Expected Viable Performance

链接: https://arxiv.org/abs/2309.09928
作者: Ryan McCoppin, Colin Dawson, Sean M. Kennedy, Leslie M. Blaha
备注: Accepted at the 22nd International Conference on Machine Learning and Applications (IEEE 2023)

点击查看摘要

Abstract: We introduce a metric for evaluating the robustness of a classifier, with particular attention to adversarial perturbations, in terms of expected functionality with respect to possible adversarial perturbations. A classifier is assumed to be non-functional (that is, has a functionality of zero) with respect to a perturbation bound if a conventional measure of performance, such as classification accuracy, is less than a minimally viable threshold when the classifier is tested on examples from that perturbation bound. Defining robustness in terms of an expected value is motivated by a domain general approach to robustness quantification.

ML-5-标题: Graph topological property recovery with heat and wave dynamics-based features on graphsD

链接: https://arxiv.org/abs/2309.09924
作者: Dhananjay Bhaskar, Yanlei Zhang, Charles Xu, Xingzhi Sun, Oluwadamilola Fasina, Guy Wolf, Maximilian Nickel, Michael Perlmutter, Smita Krishnaswamy
备注:

点击查看摘要

Abstract: In this paper, we propose Graph Differential Equation Network (GDeNet), an approach that harnesses the expressive power of solutions to PDEs on a graph to obtain continuous node- and graph-level representations for various downstream tasks. We derive theoretical results connecting the dynamics of heat and wave equations to the spectral properties of the graph and to the behavior of continuous-time random walks on graphs. We demonstrate experimentally that these dynamics are able to capture salient aspects of graph geometry and topology by recovering generating parameters of random graphs, Ricci curvature, and persistent homology. Furthermore, we demonstrate the superior performance of GDeNet on real-world datasets including citation graphs, drug-like molecules, and proteins.

ML-6-标题: A Heterogeneous Graph-Based Multi-Task Learning for Fault Event Diagnosis in Smart Grid

链接: https://arxiv.org/abs/2309.09921
作者: Dibaloke Chanda, Nasim Yahya Soltani
备注:

点击查看摘要

Abstract: Precise and timely fault diagnosis is a prerequisite for a distribution system to ensure minimum downtime and maintain reliable operation. This necessitates access to a comprehensive procedure that can provide the grid operators with insightful information in the case of a fault event. In this paper, we propose a heterogeneous multi-task learning graph neural network (MTL-GNN) capable of detecting, locating and classifying faults in addition to providing an estimate of the fault resistance and current. Using a graph neural network (GNN) allows for learning the topological representation of the distribution system as well as feature learning through a message-passing scheme. We investigate the robustness of our proposed model using the IEEE-123 test feeder system. This work also proposes a novel GNN-based explainability method to identify key nodes in the distribution system which then facilitates informed sparse measurements. Numerical tests validate the performance of the model across all tasks.

ML-7-标题: Evaluation of Human-Understandability of Global Model Explanations using Decision Tree

链接: https://arxiv.org/abs/2309.09917
作者: Adarsa Sivaprasad, Ehud Reiter, Nava Tintarev, Nir Oren
备注:

点击查看摘要

Abstract: In explainable artificial intelligence (XAI) research, the predominant focus has been on interpreting models for experts and practitioners. Model agnostic and local explanation approaches are deemed interpretable and sufficient in many applications. However, in domains like healthcare, where end users are patients without AI or domain expertise, there is an urgent need for model explanations that are more comprehensible and instil trust in the model’s operations. We hypothesise that generating model explanations that are narrative, patient-specific and global(holistic of the model) would enable better understandability and enable decision-making. We test this using a decision tree model to generate both local and global explanations for patients identified as having a high risk of coronary heart disease. These explanations are presented to non-expert users. We find a strong individual preference for a specific type of explanation. The majority of participants prefer global explanations, while a smaller group prefers local explanations. A task based evaluation of mental models of these participants provide valuable feedback to enhance narrative global explanations. This, in turn, guides the design of health informatics systems that are both trustworthy and actionable.

ML-8-标题: Wait That Feels Familiar: Learning to Extrapolate Human Preferences for Preference Aligned Path Planning

链接: https://arxiv.org/abs/2309.09912
作者: Haresh Karnan, Elvin Yang, Garrett Warnell, Joydeep Biswas, Peter Stone
备注:

点击查看摘要

Abstract: Autonomous mobility tasks such as lastmile delivery require reasoning about operator indicated preferences over terrains on which the robot should navigate to ensure both robot safety and mission success. However, coping with out of distribution data from novel terrains or appearance changes due to lighting variations remains a fundamental problem in visual terrain adaptive navigation. Existing solutions either require labor intensive manual data recollection and labeling or use handcoded reward functions that may not align with operator preferences. In this work, we posit that operator preferences for visually novel terrains, which the robot should adhere to, can often be extrapolated from established terrain references within the inertial, proprioceptive, and tactile domain. Leveraging this insight, we introduce Preference extrApolation for Terrain awarE Robot Navigation, PATERN, a novel framework for extrapolating operator terrain preferences for visual navigation. PATERN learns to map inertial, proprioceptive, tactile measurements from the robots observations to a representation space and performs nearest neighbor search in this space to estimate operator preferences over novel terrains. Through physical robot experiments in outdoor environments, we assess PATERNs capability to extrapolate preferences and generalize to novel terrains and challenging lighting conditions. Compared to baseline approaches, our findings indicate that PATERN robustly generalizes to diverse terrains and varied lighting conditions, while navigating in a preference aligned manner.

ML-9-标题: Context approx Environment

链接: https://arxiv.org/abs/2309.09888
作者: Sharut Gupta, Stefanie Jegelka, David Lopez-Paz, Kartik Ahuja
备注: 42 Pages, 4 Figures

点击查看摘要

Abstract: Two lines of work are taking the central stage in AI research. On the one hand, the community is making increasing efforts to build models that discard spurious correlations and generalize better in novel test environments. Unfortunately, the bitter lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to the eclectic contextual circumstances that users enforce by means of prompting. In this paper, we argue that context \approx environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context \unicodex2013\unicodex2013 unlabeled examples as they arrive \unicodex2013\unicodex2013 allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant out-of-distribution performance improvements. From all of this, two messages are worth taking home. Researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment to better structure data towards generalization.

ML-10-标题: Deep Reinforcement Learning for the Joint Control of Traffic Light Signaling and Vehicle Speed Advice ICML

链接: https://arxiv.org/abs/2309.09881
作者: Johannes V. S. Busch, Robert Voelckner, Peter Sossalla, Christian L. Vielhaus, Roberto Calandra, Frank H. P. Fitzek
备注: 6 pages, 2 figures, accepted for publication at IEEE ICMLA 2023

点击查看摘要

Abstract: Traffic congestion in dense urban centers presents an economical and environmental burden. In recent years, the availability of vehicle-to-anything communication allows for the transmission of detailed vehicle states to the infrastructure that can be used for intelligent traffic light control. The other way around, the infrastructure can provide vehicles with advice on driving behavior, such as appropriate velocities, which can improve the efficacy of the traffic system. Several research works applied deep reinforcement learning to either traffic light control or vehicle speed advice. In this work, we propose a first attempt to jointly learn the control of both. We show this to improve the efficacy of traffic systems. In our experiments, the joint control approach reduces average vehicle trip delays, w.r.t. controlling only traffic lights, in eight out of eleven benchmark scenarios. Analyzing the qualitative behavior of the vehicle speed advice policy, we observe that this is achieved by smoothing out the velocity profile of vehicles nearby a traffic light. Learning joint control of traffic signaling and speed advice in the real world could help to reduce congestion and mitigate the economical and environmental repercussions of today’s traffic systems.

ML-11-标题: Clustering of Urban Traffic Patterns by K-Means and Dynamic Time Warping: Case Study

链接: https://arxiv.org/abs/2309.09830
作者: Sadegh Etemad, Raziyeh Mosayebi, Tadeh Alexani Khodavirdian, Elahe Dastan, Amir Salari Telmadarreh, Mohammadreza Jafari, Sepehr Rafiei
备注:

点击查看摘要

Abstract: Clustering of urban traffic patterns is an essential task in many different areas of traffic management and planning. In this paper, two significant applications in the clustering of urban traffic patterns are described. The first application estimates the missing speed values using the speed of road segments with similar traffic patterns to colorify map tiles. The second one is the estimation of essential road segments for generating addresses for a local point on the map, using the similarity patterns of different road segments. The speed time series extracts the traffic pattern in different road segments. In this paper, we proposed the time series clustering algorithm based on K-Means and Dynamic Time Warping. The case study of our proposed algorithm is based on the Snapp application’s driver speed time series data. The results of the two applications illustrate that the proposed method can extract similar urban traffic patterns.

ML-12-标题: Learning Optimal Contracts: How to Exploit Small Action Spaces

链接: https://arxiv.org/abs/2309.09801
作者: Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
备注:

点击查看摘要

Abstract: We study principal-agent problems in which a principal commits to an outcome-dependent payment scheme – called contract – in order to induce an agent to take a costly, unobservable action leading to favorable outcomes. We consider a generalization of the classical (single-round) version of the problem in which the principal interacts with the agent by committing to contracts over multiple rounds. The principal has no information about the agent, and they have to learn an optimal contract by only observing the outcome realized at each round. We focus on settings in which the size of the agent’s action space is small. We design an algorithm that learns an approximately-optimal contract with high probability in a number of rounds polynomial in the size of the outcome space, when the number of actions is constant. Our algorithm solves an open problem by Zhu et al.[2022]. Moreover, it can also be employed to provide a \tilde\mathcalO(T^4/5) regret bound in the related online learning setting in which the principal aims at maximizing their cumulative utility, thus considerably improving previously-known regret bounds.

ML-13-标题: Harnessing Collective Intelligence Under a Lack of Cultural Consensus

链接: https://arxiv.org/abs/2309.09787
作者: Necdet Gürkan, Jordan W. Suchow
备注:

点击查看摘要

Abstract: Harnessing collective intelligence to drive effective decision-making and collaboration benefits from the ability to detect and characterize heterogeneity in consensus beliefs. This is particularly true in domains such as technology acceptance or leadership perception, where a consensus defines an intersubjective truth, leading to the possibility of multiple “ground truths” when subsets of respondents sustain mutually incompatible consensuses. Cultural Consensus Theory (CCT) provides a statistical framework for detecting and characterizing these divergent consensus beliefs. However, it is unworkable in modern applications because it lacks the ability to generalize across even highly similar beliefs, is ineffective with sparse data, and can leverage neither external knowledge bases nor learned machine representations. Here, we overcome these limitations through Infinite Deep Latent Construct Cultural Consensus Theory (iDLC-CCT), a nonparametric Bayesian model that extends CCT with a latent construct that maps between pretrained deep neural network embeddings of entities and the consensus beliefs regarding those entities among one or more subsets of respondents. We validate the method across domains including perceptions of risk sources, food healthiness, leadership, first impressions, and humor. We find that iDLC-CCT better predicts the degree of consensus, generalizes well to out-of-sample entities, and is effective even with sparse data. To improve scalability, we introduce an efficient hard-clustering variant of the iDLC-CCT using an algorithm derived from a small-variance asymptotic analysis of the model. The iDLC-CCT, therefore, provides a workable computational foundation for harnessing collective intelligence under a lack of cultural consensus and may potentially form the basis of consensus-aware information technologies.

ML-14-标题: Contrastive Initial State Buffer for Reinforcement Learning

链接: https://arxiv.org/abs/2309.09752
作者: Nico Messikommer, Yunlong Song, Davide Scaramuzza
备注:

点击查看摘要

Abstract: In Reinforcement Learning, the trade-off between exploration and exploitation poses a complex challenge for achieving efficient learning from limited samples. While recent works have been effective in leveraging past experiences for policy updates, they often overlook the potential of reusing past experiences for data collection. Independent of the underlying RL algorithm, we introduce the concept of a Contrastive Initial State Buffer, which strategically selects states from past experiences and uses them to initialize the agent in the environment in order to guide it toward more informative states. We validate our approach on two complex robotic tasks without relying on any prior information about the environment: (i) locomotion of a quadruped robot traversing challenging terrains and (ii) a quadcopter drone racing through a track. The experimental results show that our initial state buffer achieves higher task performance than the nominal baseline while also speeding up training convergence.

ML-15-标题: Towards Better Modeling with Missing Data: A Contrastive Learning -based Visual Analytics Perspective

链接: https://arxiv.org/abs/2309.09744
作者: Laixin Xie, Yang Ouyang, Longfei Chen, Ziming Wu, Quan Li
备注: 18 pages, 11 figures. This paper is accepted by IEEE Transactions on Visualization and Computer Graphics (TVCG)

点击查看摘要

Abstract: Missing data can pose a challenge for machine learning (ML) modeling. To address this, current approaches are categorized into feature imputation and label prediction and are primarily focused on handling missing data to enhance ML performance. These approaches rely on the observed data to estimate the missing values and therefore encounter three main shortcomings in imputation, including the need for different imputation methods for various missing data mechanisms, heavy dependence on the assumption of data distribution, and potential introduction of bias. This study proposes a Contrastive Learning (CL) framework to model observed data with missing values, where the ML model learns the similarity between an incomplete sample and its complete counterpart and the dissimilarity between other samples. Our proposed approach demonstrates the advantages of CL without requiring any imputation. To enhance interpretability, we introduce CIVis, a visual analytics system that incorporates interpretable techniques to visualize the learning process and diagnose the model status. Users can leverage their domain knowledge through interactive sampling to identify negative and positive pairs in CL. The output of CIVis is an optimized model that takes specified features and predicts downstream tasks. We provide two usage scenarios in regression and classification tasks and conduct quantitative experiments, expert interviews, and a qualitative user study to demonstrate the effectiveness of our approach. In short, this study offers a valuable contribution to addressing the challenges associated with ML modeling in the presence of missing data by providing a practical solution that achieves high predictive accuracy and model interpretability.

ML-16-标题: Contrastive Learning and Data Augmentation in Traffic Classification Using a Flowpic Input Representation

链接: https://arxiv.org/abs/2309.09733
作者: Alessandro Finamore, Chao Wang, Jonatan Krolikowski, Jose M. Navarro, Fuxing Chen, Dario Rossi
备注: to appear at Internet Traffic Measurement (IMC) 2023

点击查看摘要

Abstract: Over the last years we witnessed a renewed interest towards Traffic Classification (TC) captivated by the rise of Deep Learning (DL). Yet, the vast majority of TC literature lacks code artifacts, performance assessments across datasets and reference comparisons against Machine Learning (ML) methods. Among those works, a recent study from IMC’22 [17] is worth of attention since it adopts recent DL methodologies (namely, few-shot learning, self-supervision via contrastive learning and data augmentation) appealing for networking as they enable to learn from a few samples and transfer across datasets. The main result of [17] on the UCDAVIS19, ISCX-VPN and ISCX-Tor datasets is that, with such DL methodologies, 100 input samples are enough to achieve very high accuracy using an input representation called “flowpic” (i.e., a per-flow 2d histograms of the packets size evolution over time). In this paper (i) we reproduce [17] on the same datasets and (ii) we replicate its most salient aspect (the importance of data augmentation) on three additional public datasets, MIRAGE-19, MIRAGE-22 and UTMOBILENET21. While we confirm most of the original results, we also found a 20% accuracy drop on some of the investigated scenarios due to a data shift in the original dataset that we uncovered. Additionally, our study validates that the data augmentation strategies studied in [17] perform well on other datasets too. In the spirit of reproducibility and replicability we make all artifacts (code and data) available at [10].

ML-17-标题: FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data

链接: https://arxiv.org/abs/2309.09719
作者: Hao Sun, Li Shen, Shixiang Chen, Jingwei Sun, Jing Li, Guangzhong Sun, Dacheng Tao
备注: 40 pages

点击查看摘要

Abstract: Federated learning is an emerging distributed machine learning method, enables a large number of clients to train a model without exchanging their local data. The time cost of communication is an essential bottleneck in federated learning, especially for training large-scale deep neural networks. Some communication-efficient federated learning methods, such as FedAvg and FedAdam, share the same learning rate across different clients. But they are not efficient when data is heterogeneous. To maximize the performance of optimization methods, the main challenge is how to adjust the learning rate without hurting the convergence. In this paper, we propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate based on local historical gradient squares and synchronized learning rates. Theoretical analysis shows that our client-specified auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients, which enables promising scalability in federated optimization. We also empirically compare our method with several communication-efficient federated optimization methods. Extensive experimental results on Computer Vision (CV) tasks and Natural Language Processing (NLP) task show the efficacy of our proposed FedLALR method and also coincides with our theoretical findings.

ML-18-标题: Multi-Dictionary Tensor Decomposition

链接: https://arxiv.org/abs/2309.09717
作者: Maxwell McNeil, Petko Bogdanov
备注:

点击查看摘要

Abstract: Tensor decomposition methods are popular tools for analysis of multi-way datasets from social media, healthcare, spatio-temporal domains, and others. Widely adopted models such as Tucker and canonical polyadic decomposition (CPD) follow a data-driven philosophy: they decompose a tensor into factors that approximate the observed data well. In some cases side information is available about the tensor modes. For example, in a temporal user-item purchases tensor a user influence graph, an item similarity graph, and knowledge about seasonality or trends in the temporal mode may be available. Such side information may enable more succinct and interpretable tensor decomposition models and improved quality in downstream tasks. We propose a framework for Multi-Dictionary Tensor Decomposition (MDTD) which takes advantage of prior structural information about tensor modes in the form of coding dictionaries to obtain sparsely encoded tensor factors. We derive a general optimization algorithm for MDTD that handles both complete input and input with missing values. Our framework handles large sparse tensors typical to many real-world application domains. We demonstrate MDTD’s utility via experiments with both synthetic and real-world datasets. It learns more concise models than dictionary-free counterparts and improves (i) reconstruction quality ( 60% fewer non-zero coefficients coupled with smaller error); (ii) missing values imputation quality (two-fold MSE reduction with up to orders of magnitude time savings) and (iii) the estimation of the tensor rank. MDTD’s quality improvements do not come with a running time premium: it can decompose 19GB datasets in less than a minute. It can also impute missing values in sparse billion-entry tensors more accurately and scalably than state-of-the-art competitors.

ML-19-标题: Information based explanation methods for deep learning agent s – with applications on large open-source chess models

链接: https://arxiv.org/abs/2309.09702
作者: Patrik Hammersborg, Inga Strümke
备注: Additional code available at this https URL

点击查看摘要

Abstract: With large chess-playing neural network models like AlphaZero contesting the state of the art within the world of computerised chess, two challenges present themselves: The question of how to explain the domain knowledge internalised by such models, and the problem that such models are not made openly available. This work presents the re-implementation of the concept detection methodology applied to AlphaZero in McGrath et al. (2022), by using large, open-source chess models with comparable performance. We obtain results similar to those achieved on AlphaZero, while relying solely on open-source resources. We also present a novel explainable AI (XAI) method, which is guaranteed to highlight exhaustively and exclusively the information used by the explained model. This method generates visual explanations tailored to domains characterised by discrete input spaces, as is the case for chess. Our presented method has the desirable property of controlling the information flow between any input vector and the given model, which in turn provides strict guarantees regarding what information is used by the trained model during inference. We demonstrate the viability of our method by applying it to standard 8x8 chess, using large open-source chess models.

ML-20-标题: A Study of Data-driven Methods for Adaptive Forecasting of COVID-19 Cases

链接: https://arxiv.org/abs/2309.09698
作者: Charithea Stylianides, Kleanthis Malialis, Panayiotis Kolios
备注: Keywords: incremental learning, data streams, neural networks, time-series forecasting

点击查看摘要

Abstract: Severe acute respiratory disease SARS-CoV-2 has had a found impact on public health systems and healthcare emergency response especially with respect to making decisions on the most effective measures to be taken at any given time. As demonstrated throughout the last three years with COVID-19, the prediction of the number of positive cases can be an effective way to facilitate decision-making. However, the limited availability of data and the highly dynamic and uncertain nature of the virus transmissibility makes this task very challenging. Aiming at investigating these challenges and in order to address this problem, this work studies data-driven (learning, statistical) methods for incrementally training models to adapt to these nonstationary conditions. An extensive empirical study is conducted to examine various characteristics, such as, performance analysis on a per virus wave basis, feature extraction, “lookback” window size, memory size, all for next-, 7-, and 14-day forecasting tasks. We demonstrate that the incremental learning framework can successfully address the aforementioned challenges and perform well during outbreaks, providing accurate predictions.

ML-21-标题: Noise-Augmented Boruta: The Neural Network Perturbation Infusion with Boruta Feature Selection

链接: https://arxiv.org/abs/2309.09694
作者: Hassan Gharoun, Navid Yazdanjoe, Mohammad Sadegh Khorshidi, Amir H. Gandomi
备注: 8 pages, 4 figures, 8 tables, 4 algorithms

点击查看摘要

Abstract: With the surge in data generation, both vertically (i.e., volume of data) and horizontally (i.e., dimensionality), the burden of the curse of dimensionality has become increasingly palpable. Feature selection, a key facet of dimensionality reduction techniques, has advanced considerably to address this challenge. One such advancement is the Boruta feature selection algorithm, which successfully discerns meaningful features by contrasting them to their permutated counterparts known as shadow features. However, the significance of a feature is shaped more by the data’s overall traits than by its intrinsic value, a sentiment echoed in the conventional Boruta algorithm where shadow features closely mimic the characteristics of the original ones. Building on this premise, this paper introduces an innovative approach to the Boruta feature selection algorithm by incorporating noise into the shadow variables. Drawing parallels from the perturbation analysis framework of artificial neural networks, this evolved version of the Boruta method is presented. Rigorous testing on four publicly available benchmark datasets revealed that this proposed technique outperforms the classic Boruta algorithm, underscoring its potential for enhanced, accurate feature selection.

ML-22-标题: VULNERLIZER: Cross-analysis Between Vulnerabilities and Software Libraries

链接: https://arxiv.org/abs/2309.09649
作者: Irdin Pekaric, Michael Felderer, Philipp Steinmüller
备注:

点击查看摘要

Abstract: The identification of vulnerabilities is a continuous challenge in software projects. This is due to the evolution of methods that attackers employ as well as the constant updates to the software, which reveal additional issues. As a result, new and innovative approaches for the identification of vulnerable software are needed. In this paper, we present VULNERLIZER, which is a novel framework for cross-analysis between vulnerabilities and software libraries. It uses CVE and software library data together with clustering algorithms to generate links between vulnerabilities and libraries. In addition, the training of the model is conducted in order to reevaluate the generated associations. This is achieved by updating the assigned weights. Finally, the approach is then evaluated by making the predictions using the CVE data from the test set. The results show that the VULNERLIZER has a great potential in being able to predict future vulnerable libraries based on an initial input CVE entry or a software library. The trained model reaches a prediction accuracy of 75% or higher.

ML-23-标题: Neural Network-Based Rule Models With Truth Tables ECAI2023

链接: https://arxiv.org/abs/2309.09638
作者: Adrien Benamira, Tristan Guérand, Thomas Peyrin, Hans Soegeng
备注: Accepted at ECAI 2023

点击查看摘要

Abstract: Understanding the decision-making process of a machine/deep learning model is crucial, particularly in security-sensitive applications. In this study, we introduce a neural network framework that combines the global and exact interpretability properties of rule-based models with the high performance of deep neural networks. Our proposed framework, called \textitTruth Table rules (TT-rules), is built upon \textitTruth Table nets (TTnets), a family of deep neural networks initially developed for formal verification. By extracting the set of necessary and sufficient rules \mathcalR from the trained TTnet model (global interpretability), yielding the same output as the TTnet (exact interpretability), TT-rules effectively transforms the neural network into a rule-based model. This rule-based model supports binary classification, multi-label classification, and regression tasks for tabular datasets. Furthermore, our TT-rules framework optimizes the rule set \mathcalR into \mathcalR_opt by reducing the number and size of the rules. To enhance model interpretation, we leverage Reduced Ordered Binary Decision Diagrams (ROBDDs) to visualize these rules effectively. After outlining the framework, we evaluate the performance of TT-rules on seven tabular datasets from finance, healthcare, and justice domains. We also compare the TT-rules framework to state-of-the-art rule-based methods. Our results demonstrate that TT-rules achieves equal or higher performance compared to other interpretable methods while maintaining a balance between performance and complexity. Notably, TT-rules presents the first accurate rule-based model capable of fitting large tabular datasets, including two real-life DNA datasets with over 20K features. Finally, we extensively investigate a rule-based model derived from TT-rules using the Adult dataset.

ML-24-标题: A Discussion on Generalization in Next-Activity Prediction

链接: https://arxiv.org/abs/2309.09618
作者: Luka Abb, Peter Pfeiffer, Peter Fettke, Jana-Rebecca Rehse
备注: Pre-print, published at the AI4BPM workshop at BPM 2023

点击查看摘要

Abstract: Next activity prediction aims to forecast the future behavior of running process instances. Recent publications in this field predominantly employ deep learning techniques and evaluate their prediction performance using publicly available event logs. This paper presents empirical evidence that calls into question the effectiveness of these current evaluation approaches. We show that there is an enormous amount of example leakage in all of the commonly used event logs, so that rather trivial prediction approaches perform almost as well as ones that leverage deep learning. We further argue that designing robust evaluations requires a more profound conceptual engagement with the topic of next-activity prediction, and specifically with the notion of generalization to new data. To this end, we present various prediction scenarios that necessitate different types of generalization to guide future research.

ML-25-标题: Latent assimilation with implicit neural representations for unknown dynamics

链接: https://arxiv.org/abs/2309.09574
作者: Zhuoyuan Li, Bin Dong, Pingwen Zhang
备注: 37 pages

点击查看摘要

Abstract: Data assimilation is crucial in a wide range of applications, but it often faces challenges such as high computational costs due to data dimensionality and incomplete understanding of underlying mechanisms. To address these challenges, this study presents a novel assimilation framework, termed Latent Assimilation with Implicit Neural Representations (LAINR). By introducing Spherical Implicit Neural Representations (SINR) along with a data-driven uncertainty estimator of the trained neural networks, LAINR enhances efficiency in assimilation process. Experimental results indicate that LAINR holds certain advantage over existing methods based on AutoEncoders, both in terms of accuracy and efficiency.

ML-26-标题: FedGKD: Unleashing the Power of Collaboration in Federated Graph Neural Networks

链接: https://arxiv.org/abs/2309.09517
作者: Qiying Pan, Ruofan Wu, Tengfei Liu, Tianyi Zhang, Yifei Zhu, Weiqiang Wang
备注:

点击查看摘要

Abstract: Federated training of Graph Neural Networks (GNN) has become popular in recent years due to its ability to perform graph-related tasks under data isolation scenarios while preserving data privacy. However, graph heterogeneity issues in federated GNN systems continue to pose challenges. Existing frameworks address the problem by representing local tasks using different statistics and relating them through a simple aggregation mechanism. However, these approaches suffer from limited efficiency from two aspects: low quality of task-relatedness quantification and inefficacy of exploiting the collaboration structure. To address these issues, we propose FedGKD, a novel federated GNN framework that utilizes a novel client-side graph dataset distillation method to extract task features that better describe task-relatedness, and introduces a novel server-side aggregation mechanism that is aware of the global collaboration structure. We conduct extensive experiments on six real-world datasets of different scales, demonstrating our framework’s outperformance.

ML-27-标题: Mechanic Maker 2.0: Reinforcement Learning for Evaluating Generated Rules

链接: https://arxiv.org/abs/2309.09476
作者: Johor Jara Gonzalez, Seth Cooper, Mathew Guzdial
备注: 10 pages, 6 figures, Artificial Intelligence and Interactive Digital Entertainment

点击查看摘要

Abstract: Automated game design (AGD), the study of automatically generating game rules, has a long history in technical games research. AGD approaches generally rely on approximations of human play, either objective functions or AI agents. Despite this, the majority of these approximators are static, meaning they do not reflect human player’s ability to learn and improve in a game. In this paper, we investigate the application of Reinforcement Learning (RL) as an approximator for human play for rule generation. We recreate the classic AGD environment Mechanic Maker in Unity as a new, open-source rule generation framework. Our results demonstrate that RL produces distinct sets of rules from an A* agent baseline, which may be more usable by humans.

ML-28-标题: Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

链接: https://arxiv.org/abs/2309.09470
作者: Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling
备注:

点击查看摘要

Abstract: This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memory-based face-voice alignment module, in which slots act as the bridge to align these two modalities, allowing for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model to our zero-shot FaceVC model. Considering the differences between FaceVC and traditional voice conversion tasks, systematic subjective and objective metrics are designed to thoroughly evaluate the homogeneity, diversity and consistency of voice characteristics controlled by face images. Through extensive experiments, we demonstrate the superiority of our proposed method on the zero-shot FaceVC task. Samples are presented on our demo website.

ML-29-标题: Active anomaly detection based on deep one-class classification

链接: https://arxiv.org/abs/2309.09465
作者: Minkyung Kim, Junsik Kim, Jongmin Yu, Jun Kyun Choi
备注: Pattern Recognition Letters 2023

点击查看摘要

Abstract: Active learning has been utilized as an efficient tool in building anomaly detection models by leveraging expert feedback. In an active learning framework, a model queries samples to be labeled by experts and re-trains the model with the labeled data samples. It unburdens in obtaining annotated datasets while improving anomaly detection performance. However, most of the existing studies focus on helping experts identify as many abnormal data samples as possible, which is a sub-optimal approach for one-class classification-based deep anomaly detection. In this paper, we tackle two essential problems of active learning for Deep SVDD: query strategy and semi-supervised learning method. First, rather than solely identifying anomalies, our query strategy selects uncertain samples according to an adaptive boundary. Second, we apply noise contrastive estimation in training a one-class classification model to incorporate both labeled normal and abnormal data effectively. We analyze that the proposed query strategy and semi-supervised loss individually improve an active learning process of anomaly detection and further improve when combined together on seven anomaly detection datasets.

ML-30-标题: Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

链接: https://arxiv.org/abs/2309.09457
作者: Noah Golowich, Dhruv Rohatgi, Ankur Moitra
备注:

点击查看摘要

Abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map \phi(x, a) that maps state-action pairs to d -dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the ``kitchen sink" approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a k -sparse linear MDP, there is an unknown subset S \subset [d] of size k containing all the relevant features, and the goal is to learn a near-optimal policy in only poly (k,\log d) interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible.

ML-31-标题: CaT: Balanced Continual Graph Learning with Graph Condensation

链接: https://arxiv.org/abs/2309.09455
作者: Yilun Liu, Ruihong Qiu, Zi Huang
备注: The code has been released on \url{this https URL}

点击查看摘要

Abstract: Continual graph learning (CGL) is purposed to continuously update a graph model with graph data being fed in a streaming manner. Since the model easily forgets previously learned knowledge when training with new-coming data, the catastrophic forgetting problem has been the major focus in CGL. Recent replay-based methods intend to solve this problem by updating the model using both (1) the entire new-coming data and (2) a sampling-based memory bank that stores replayed graphs to approximate the distribution of historical data. After updating the model, a new replayed graph sampled from the incoming graph will be added to the existing memory bank. Despite these methods are intuitive and effective for the CGL, two issues are identified in this paper. Firstly, most sampling-based methods struggle to fully capture the historical distribution when the storage budget is tight. Secondly, a significant data imbalance exists in terms of the scales of the complex new-coming graph data and the lightweight memory bank, resulting in unbalanced training. To solve these issues, a \textitCondense and Train (CaT) framework is proposed in this paper. Prior to each model update, the new-coming graph is condensed to a small yet informative synthesised replayed graph, which is then stored in a \textitCondensed Graph Memory with historical replay graphs. In the continual learning phase, a \textitTraining in Memory scheme is used to update the model directly with the \textitCondensed Graph Memory rather than the whole new-coming graph, which alleviates the data imbalance problem. Extensive experiments conducted on four benchmark datasets successfully demonstrate superior performances of the proposed CaT framework in terms of effectiveness and efficiency. The code has been released on \urlthis https URL.

ML-32-标题: Asymptotically Efficient Online Learning for Censored Regression Models Under Non-I.I.D Data

链接: https://arxiv.org/abs/2309.09454
作者: Lantian Zhang, Lei Guo
备注: 35 pages

点击查看摘要

Abstract: The asymptotically efficient online learning problem is investigated for stochastic censored regression models, which arise from various fields of learning and statistics but up to now still lacks comprehensive theoretical studies on the efficiency of the learning algorithms. For this, we propose a two-step online algorithm, where the first step focuses on achieving algorithm convergence, and the second step is dedicated to improving the estimation performance. Under a general excitation condition on the data, we show that our algorithm is strongly consistent and asymptotically normal by employing the stochastic Lyapunov function method and limit theories for martingales. Moreover, we show that the covariances of the estimates can achieve the Cramer-Rao (C-R) bound asymptotically, indicating that the performance of the proposed algorithm is the best possible that one can expect in general. Unlike most of the existing works, our results are obtained without resorting to the traditionally used but stringent conditions such as independent and identically distributed (i.i.d) assumption on the data, and thus our results do not exclude applications to stochastic dynamical systems with feedback. A numerical example is also provided to illustrate the superiority of the proposed online algorithm over the existing related ones in the literature.

ML-33-标题: An Iterative Method for Unsupervised Robust Anomaly Detection Under Data Contamination

链接: https://arxiv.org/abs/2309.09436
作者: Minkyung Kim, Jongmin Yu, Junsik Kim, Tae-Hyun Oh, Jun Kyun Choi
备注: IEEE Transactions on Neural Networks and Learning Systems, 2023

点击查看摘要

Abstract: Most deep anomaly detection models are based on learning normality from datasets due to the difficulty of defining abnormality by its diverse and inconsistent nature. Therefore, it has been a common practice to learn normality under the assumption that anomalous data are absent in a training dataset, which we call normality assumption. However, in practice, the normality assumption is often violated due to the nature of real data distributions that includes anomalous tails, i.e., a contaminated dataset. Thereby, the gap between the assumption and actual training data affects detrimentally in learning of an anomaly detection model. In this work, we propose a learning framework to reduce this gap and achieve better normality representation. Our key idea is to identify sample-wise normality and utilize it as an importance weight, which is updated iteratively during the training. Our framework is designed to be model-agnostic and hyperparameter insensitive so that it applies to a wide range of existing methods without careful parameter tuning. We apply our framework to three different representative approaches of deep anomaly detection that are classified into one-class classification-, probabilistic model-, and reconstruction-based approaches. In addition, we address the importance of a termination condition for iterative methods and propose a termination criterion inspired by the anomaly detection objective. We validate that our framework improves the robustness of the anomaly detection models under different levels of contamination ratios on five anomaly detection benchmark datasets and two image datasets. On various contaminated datasets, our framework improves the performance of three representative anomaly detection methods, measured by area under the ROC curve.

ML-34-标题: Guided Online Distillation : Promoting Safe Reinforcement Learning by Offline Demonstration

链接: https://arxiv.org/abs/2309.09408
作者: Jinning Li, Xinyi Liu, Banghua Zhu, Jiantao Jiao, Masayoshi Tomizuka, Chen Tang, Wei Zhan
备注:

点击查看摘要

Abstract: Safe Reinforcement Learning (RL) aims to find a policy that achieves high rewards while satisfying cost constraints. When learning from scratch, safe RL agents tend to be overly conservative, which impedes exploration and restrains the overall performance. In many realistic tasks, e.g. autonomous driving, large-scale expert demonstration data are available. We argue that extracting expert policy from offline data to guide online exploration is a promising solution to mitigate the conserveness issue. Large-capacity models, e.g. decision transformers (DT), have been proven to be competent in offline policy learning. However, data collected in real-world scenarios rarely contain dangerous cases (e.g., collisions), which makes it prohibitive for the policies to learn safety concepts. Besides, these bulk policy networks cannot meet the computation speed requirements at inference time on real-world tasks such as autonomous driving. To this end, we propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework. GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, which outperforms both the offline DT policy and online safe RL algorithms. Experiments in both benchmark safe RL tasks and real-world driving tasks based on the Waymo Open Motion Dataset (WOMD) demonstrate that GOLD can successfully distill lightweight policies and solve decision-making problems in challenging safety-critical scenarios.

ML-35-标题: Mitigating Over-Smoothing and Over-Squashing using Augmentations of Forman-Ricci Curvature

链接: https://arxiv.org/abs/2309.09384
作者: Lukas Fesser, Melanie Weber
备注:

点击查看摘要

Abstract: While Graph Neural Networks (GNNs) have been successfully leveraged for learning on graph-structured data across domains, several potential pitfalls have been described recently. Those include the inability to accurately leverage information encoded in long-range connections (over-squashing), as well as difficulties distinguishing the learned representations of nearby nodes with growing network depth (over-smoothing). An effective way to characterize both effects is discrete curvature: Long-range connections that underlie over-squashing effects have low curvature, whereas edges that contribute to over-smoothing have high curvature. This observation has given rise to rewiring techniques, which add or remove edges to mitigate over-smoothing and over-squashing. Several rewiring approaches utilizing graph characteristics, such as curvature or the spectrum of the graph Laplacian, have been proposed. However, existing methods, especially those based on curvature, often require expensive subroutines and careful hyperparameter tuning, which limits their applicability to large-scale graphs. Here we propose a rewiring technique based on Augmented Forman-Ricci curvature (AFRC), a scalable curvature notation, which can be computed in linear time. We prove that AFRC effectively characterizes over-smoothing and over-squashing effects in message-passing GNNs. We complement our theoretical results with experiments, which demonstrate that the proposed approach achieves state-of-the-art performance while significantly reducing the computational cost in comparison with other methods. Utilizing fundamental properties of discrete curvature, we propose effective heuristics for hyperparameters in curvature-based rewiring, which avoids expensive hyperparameter searches, further improving the scalability of the proposed approach.

ML-36-标题: Federated Learning in Temporal Heterogeneity

链接: https://arxiv.org/abs/2309.09381
作者: Junghwan Lee
备注:

点击查看摘要

Abstract: In this work, we explored federated learning in temporal heterogeneity across clients. We observed that global model obtained by \textttFedAvg trained with fixed-length sequences shows faster convergence than varying-length sequences. We proposed methods to mitigate temporal heterogeneity for efficient federated learning based on the empirical observation.

ML-37-标题: Fully Convolutional Generative Machine Learning Method for Accelerating Non-Equilibrium Greens Function Simulations

链接: https://arxiv.org/abs/2309.09374
作者: Preslav Aleksandrov, Ali Rezaei, Nikolas Xeni, Tapas Dutta, Asen Asenov, Vihar Georgiev
备注:

点击查看摘要

Abstract: This work describes a novel simulation approach that combines machine learning and device modelling simulations. The device simulations are based on the quantum mechanical non-equilibrium Greens function (NEGF) approach and the machine learning method is an extension to a convolutional generative network. We have named our new simulation approach ML-NEGF and we have implemented it in our in-house simulator called NESS (nano-electronics simulations software). The reported results demonstrate the improved convergence speed of the ML-NEGF method in comparison to the standard NEGF approach. The trained ML model effectively learns the underlying physics of nano-sheet transistor behaviour, resulting in faster convergence of the coupled Poisson-NEGF simulations. Quantitatively, our ML- NEGF approach achieves an average convergence acceleration of 60%, substantially reducing the computational time while maintaining the same accuracy.

ML-38-标题: A Survey on Congestion Control and Scheduling for Multipath TCP: Machine Learning vs Classical Approaches

链接: https://arxiv.org/abs/2309.09372
作者: Maisha Maliha, Golnaz Habibi, Mohammed Atiquzzaman
备注: 13 pages, 7 figures

点击查看摘要

Abstract: Multipath TCP (MPTCP) has been widely used as an efficient way for communication in many applications. Data centers, smartphones, and network operators use MPTCP to balance the traffic in a network efficiently. MPTCP is an extension of TCP (Transmission Control Protocol), which provides multiple paths, leading to higher throughput and low latency. Although MPTCP has shown better performance than TCP in many applications, it has its own challenges. The network can become congested due to heavy traffic in the multiple paths (subflows) if the subflow rates are not determined correctly. Moreover, communication latency can occur if the packets are not scheduled correctly between the subflows. This paper reviews techniques to solve the above-mentioned problems based on two main approaches; non data-driven (classical) and data-driven (Machine Learning) approaches. This paper compares these two approaches and highlights their strengths and weaknesses with a view to motivating future researchers in this exciting area of machine learning for communications. This paper also provides details on the simulation of MPTCP and its implementations in real environments.

ML-39-标题: Unleashing the Power of Dynamic Mode Decomposition and Deep Learning for Rainfall Prediction in North-East India

链接: https://arxiv.org/abs/2309.09336
作者: Paleti Nikhil Chowdary, Sathvika P, Pranav U, Rohan S, Sowmya V, Gopalakrishnan E A, Dhanya M
备注: Paper is under review at ICMC 2024

点击查看摘要

Abstract: Accurate rainfall forecasting is crucial for effective disaster preparedness and mitigation in the North-East region of India, which is prone to extreme weather events such as floods and landslides. In this study, we investigated the use of two data-driven methods, Dynamic Mode Decomposition (DMD) and Long Short-Term Memory (LSTM), for rainfall forecasting using daily rainfall data collected from India Meteorological Department in northeast region over a period of 118 years. We conducted a comparative analysis of these methods to determine their relative effectiveness in predicting rainfall patterns. Using historical rainfall data from multiple weather stations, we trained and validated our models to forecast future rainfall patterns. Our results indicate that both DMD and LSTM are effective in forecasting rainfall, with LSTM outperforming DMD in terms of accuracy, revealing that LSTM has the ability to capture complex nonlinear relationships in the data, making it a powerful tool for rainfall forecasting. Our findings suggest that data-driven methods such as DMD and deep learning approaches like LSTM can significantly improve rainfall forecasting accuracy in the North-East region of India, helping to mitigate the impact of extreme weather events and enhance the region’s resilience to climate change.

ML-40-标题: Experiential-Informed Data Reconstruction for Fishery Sustainability and Policies in the Azores

链接: https://arxiv.org/abs/2309.09326
作者: Brenda Nogueira, Gui M. Menezes, Nuno Moniz
备注:

点击查看摘要

Abstract: Fishery analysis is critical in maintaining the long-term sustainability of species and the livelihoods of millions of people who depend on fishing for food and income. The fishing gear, or metier, is a key factor significantly impacting marine habitats, selectively targeting species and fish sizes. Analysis of commercial catches or landings by metier in fishery stock assessment and management is crucial, providing robust estimates of fishing efforts and their impact on marine ecosystems. In this paper, we focus on a unique data set from the Azores’ fishing data collection programs between 2010 and 2017, where little information on metiers is available and sparse throughout our timeline. Our main objective is to tackle the task of data set reconstruction, leveraging domain knowledge and machine learning methods to retrieve or associate metier-related information to each fish landing. We empirically validate the feasibility of this task using a diverse set of modeling approaches and demonstrate how it provides new insights into different fisheries’ behavior and the impact of metiers over time, which are essential for future fish population assessments, management, and conservation efforts.

ML-41-标题: Answering Layer 3 queries with DiscoSCMs

链接: https://arxiv.org/abs/2309.09323
作者: Heyang Gong
备注:

点击查看摘要

Abstract: In the realm of causal inference, the primary frameworks are the Potential Outcome (PO) and the Structural Causal Model (SCM), both predicated on the consistency rule. However, when facing Layer 3 valuations, i.e., counterfactual queries that inherently belong to individual-level semantics, they both seem inadequate due to the issue of degeneration caused by the consistency rule. For instance, in personalized incentive scenarios within the internet industry, the probability of one particular user being a complier, denoted as P(y_x, y’_x’) , degenerates to a parameter that can only take values of 0 or 1. This paper leverages the DiscoSCM framework to theoretically tackle the aforementioned counterfactual degeneration problem, which is a novel framework for causal modeling that combines the strengths of both PO and SCM, and could be seen as an extension of them. The paper starts with a brief introduction to the background of causal modeling frameworks. It then illustrates, through an example, the difficulty in recovering counterfactual parameters from data without imposing strong assumptions. Following this, we propose the DiscoSCM with independent potential noise framework to address this problem. Subsequently, the superior performance of the DiscoSCM framework in answering counterfactual questions is demonstrated by several key results in the topic of unit select problems. We then elucidate that this superiority stems from the philosophy of individual causality. In conclusion, we suggest that DiscoSCM may serve as a significant milestone in the causal modeling field for addressing counterfactual queries.

ML-42-标题: Kinematics-aware Trajectory Generation and Prediction with Latent Stochastic Differential Modeling

链接: https://arxiv.org/abs/2309.09317
作者: Ruochen Jiao, Yixuan Wang, Xiangguo Liu, Chao Huang, Qi Zhu
备注: 7 pages, conference paper in motion generation

点击查看摘要

Abstract: Trajectory generation and trajectory prediction are two critical tasks for autonomous vehicles, which generate various trajectories during development and predict the trajectories of surrounding vehicles during operation, respectively. However, despite significant advances in improving their performance, it remains a challenging problem to ensure that the generated/predicted trajectories are realistic, explainable, and physically feasible. Existing model-based methods provide explainable results, but are constrained by predefined model structures, limiting their capabilities to address complex scenarios. Conversely, existing deep learning-based methods have shown great promise in learning various traffic scenarios and improving overall performance, but they often act as opaque black boxes and lack explainability. In this work, we integrate kinematic knowledge with neural stochastic differential equations (SDE) and develop a variational autoencoder based on a novel latent kinematics-aware SDE (LK-SDE) to generate vehicle motions. Our approach combines the advantages of both model-based and deep learning-based techniques. Experimental results demonstrate that our method significantly outperforms baseline approaches in producing realistic, physically-feasible, and precisely-controllable vehicle trajectories, benefiting both generation and prediction tasks.

ML-43-标题: Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets

链接: https://arxiv.org/abs/2309.09258
作者: Pulkit Gopalani, Samyak Jha, Anirbit Mukherjee
备注: 18 Pages, 1 figure. arXiv admin note: substantial text overlap with arXiv:2210.11452

点击查看摘要

Abstract: In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth 2 nets – for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are “Villani functions” and thus be able to build on recent progress with analyzing SGD on such objectives.

ML-44-标题: User Assignment and Resource Allocation for Hierarchical Federated Learning over Wireless Networks

链接: https://arxiv.org/abs/2309.09253
作者: Tinghao Zhang, Kwok-Yan Lam, Jun Zhao
备注: Under review by IEEE Transactions on Communications

点击查看摘要

Abstract: The large population of wireless users is a key driver of data-crowdsourced Machine Learning (ML). However, data privacy remains a significant concern. Federated Learning (FL) encourages data sharing in ML without requiring data to leave users’ devices but imposes heavy computation and communications overheads on mobile devices. Hierarchical FL (HFL) alleviates this problem by performing partial model aggregation at edge servers. HFL can effectively reduce energy consumption and latency through effective resource allocation and appropriate user assignment. Nevertheless, resource allocation in HFL involves optimizing multiple variables, and the objective function should consider both energy consumption and latency, making the development of resource allocation algorithms very complicated. Moreover, it is challenging to perform user assignment, which is a combinatorial optimization problem in a large search space. This article proposes a spectrum resource optimization algorithm (SROA) and a two-stage iterative algorithm (TSIA) for HFL. Given an arbitrary user assignment pattern, SROA optimizes CPU frequency, transmit power, and bandwidth to minimize system cost. TSIA aims to find a user assignment pattern that considerably reduces the total system cost. Experimental results demonstrate the superiority of the proposed HFL framework over existing studies in energy and latency reduction.

ML-45-标题: Globally Convergent Accelerated Algorithms for Multilinear Sparse Logistic Regression with ell_0-constraints

链接: https://arxiv.org/abs/2309.09239
作者: Weifeng Yang, Wenwen Min
备注: arXiv admin note: text overlap with arXiv:2308.12126

点击查看摘要

Abstract: Tensor data represents a multidimensional array. Regression methods based on low-rank tensor decomposition leverage structural information to reduce the parameter count. Multilinear logistic regression serves as a powerful tool for the analysis of multidimensional data. To improve its efficacy and interpretability, we present a Multilinear Sparse Logistic Regression model with \ell_0 -constraints ( \ell_0 -MLSR). In contrast to the \ell_1 -norm and \ell_2 -norm, the \ell_0 -norm constraint is better suited for feature selection. However, due to its nonconvex and nonsmooth properties, solving it is challenging and convergence guarantees are lacking. Additionally, the multilinear operation in \ell_0 -MLSR also brings non-convexity. To tackle these challenges, we propose an Accelerated Proximal Alternating Linearized Minimization with Adaptive Momentum (APALM ^+ ) method to solve the \ell_0 -MLSR model. We provide a proof that APALM ^+ can ensure the convergence of the objective function of \ell_0 -MLSR. We also demonstrate that APALM ^+ is globally convergent to a first-order critical point as well as establish convergence rate by using the Kurdyka-Lojasiewicz property. Empirical results obtained from synthetic and real-world datasets validate the superior performance of our algorithm in terms of both accuracy and speed compared to other state-of-the-art methods.

ML-46-标题: Double Normalizing Flows: Flexible Bayesian Gaussian Process ODEs Learning

链接: https://arxiv.org/abs/2309.09222
作者: Jian Xu, Shian Du, Junmei Yang, Xinghao Ding, John Paisley, Delu Zeng
备注:

点击查看摘要

Abstract: Recently, Gaussian processes have been utilized to model the vector field of continuous dynamical systems. Bayesian inference for such models \citehegde2022variational has been extensively studied and has been applied in tasks such as time series prediction, providing uncertain estimates. However, previous Gaussian Process Ordinary Differential Equation (ODE) models may underperform on datasets with non-Gaussian process priors, as their constrained priors and mean-field posteriors may lack flexibility. To address this limitation, we incorporate normalizing flows to reparameterize the vector field of ODEs, resulting in a more flexible and expressive prior distribution. Additionally, due to the analytically tractable probability density functions of normalizing flows, we apply them to the posterior inference of GP ODEs, generating a non-Gaussian posterior. Through these dual applications of normalizing flows, our model improves accuracy and uncertainty estimates for Bayesian Gaussian Process ODEs. The effectiveness of our approach is demonstrated on simulated dynamical systems and real-world human motion data, including tasks such as time series prediction and missing data recovery. Experimental results indicate that our proposed method effectively captures model uncertainty while improving accuracy.

ML-47-标题: MFRL-BI: Design of a Model-free Reinforcement Learning Process Control Scheme by Using Bayesian Inference

链接: https://arxiv.org/abs/2309.09205
作者: Yanrong Li, Juan Du, Wei Jiang
备注: 31 pages, 7 figures, and 3 tables

点击查看摘要

Abstract: Design of process control scheme is critical for quality assurance to reduce variations in manufacturing systems. Taking semiconductor manufacturing as an example, extensive literature focuses on control optimization based on certain process models (usually linear models), which are obtained by experiments before a manufacturing process starts. However, in real applications, pre-defined models may not be accurate, especially for a complex manufacturing system. To tackle model inaccuracy, we propose a model-free reinforcement learning (MFRL) approach to conduct experiments and optimize control simultaneously according to real-time data. Specifically, we design a novel MFRL control scheme by updating the distribution of disturbances using Bayesian inference to reduce their large variations during manufacturing processes. As a result, the proposed MFRL controller is demonstrated to perform well in a nonlinear chemical mechanical planarization (CMP) process when the process model is unknown. Theoretical properties are also guaranteed when disturbances are additive. The numerical studies also demonstrate the effectiveness and efficiency of our methodology.

ML-48-标题: End-to-End Optimized Pipeline for Prediction of Protein Folding Kinetics

链接: https://arxiv.org/abs/2309.09191
作者: Vijay Arvind.R, Haribharathi Sivakumar, Brindha.R
备注: Accepted for presentation at the 22nd International Conference on Machine Learning and Applications

点击查看摘要

Abstract: Protein folding is the intricate process by which a linear sequence of amino acids self-assembles into a unique three-dimensional structure. Protein folding kinetics is the study of pathways and time-dependent mechanisms a protein undergoes when it folds. Understanding protein kinetics is essential as a protein needs to fold correctly for it to perform its biological functions optimally, and a misfolded protein can sometimes be contorted into shapes that are not ideal for a cellular environment giving rise to many degenerative, neuro-degenerative disorders and amyloid diseases. Monitoring at-risk individuals and detecting protein discrepancies in a protein’s folding kinetics at the early stages could majorly result in public health benefits, as preventive measures can be taken. This research proposes an efficient pipeline for predicting protein folding kinetics with high accuracy and low memory footprint. The deployed machine learning (ML) model outperformed the state-of-the-art ML models by 4.8% in terms of accuracy while consuming 327x lesser memory and being 7.3% faster.

ML-49-标题: Data-Driven Reachability Analysis of Stochastic Dynamical Systems with Conformal Inference

链接: https://arxiv.org/abs/2309.09187
作者: Navid Hashemi, Xin Qin, Lars Lindemann, Jyotirmoy V. Deshmukh
备注:

点击查看摘要

Abstract: We consider data-driven reachability analysis of discrete-time stochastic dynamical systems using conformal inference. We assume that we are not provided with a symbolic representation of the stochastic system, but instead have access to a dataset of K -step trajectories. The reachability problem is to construct a probabilistic flowpipe such that the probability that a K -step trajectory can violate the bounds of the flowpipe does not exceed a user-specified failure probability threshold. The key ideas in this paper are: (1) to learn a surrogate predictor model from data, (2) to perform reachability analysis using the surrogate model, and (3) to quantify the surrogate model’s incurred error using conformal inference in order to give probabilistic reachability guarantees. We focus on learning-enabled control systems with complex closed-loop dynamics that are difficult to model symbolically, but where state transition pairs can be queried, e.g., using a simulator. We demonstrate the applicability of our method on examples from the domain of learning-enabled cyber-physical systems.

ML-50-标题: Imbalanced Data Stream Classification using Dynamic Ensemble Selection CEC

链接: https://arxiv.org/abs/2309.09175
作者: Priya.S, Haribharathi Sivakumar, Vijay Arvind.R
备注: Accepted and presented at the 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME)

点击查看摘要

Abstract: Modern streaming data categorization faces significant challenges from concept drift and class imbalanced data. This negatively impacts the output of the classifier, leading to improper classification. Furthermore, other factors such as the overlapping of multiple classes limit the extent of the correctness of the output. This work proposes a novel framework for integrating data pre-processing and dynamic ensemble selection, by formulating the classification framework for the nonstationary drifting imbalanced data stream, which employs the data pre-processing and dynamic ensemble selection techniques. The proposed framework was evaluated using six artificially generated data streams with differing imbalance ratios in combination with two different types of concept drifts. Each stream is composed of 200 chunks of 500 objects described by eight features and contains five concept drifts. Seven pre-processing techniques and two dynamic ensemble selection methods were considered. According to experimental results, data pre-processing combined with Dynamic Ensemble Selection techniques significantly delivers more accuracy when dealing with imbalanced data streams.

ML-51-标题: Total Variation Distance Estimation Is as Easy as Probabilistic Inference

链接: https://arxiv.org/abs/2309.09134
作者: Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, N. V. Vinodchandran
备注: 24 pages

点击查看摘要

Abstract: In this paper, we establish a novel connection between total variation (TV) distance estimation and probabilistic inference. In particular, we present an efficient, structure-preserving reduction from relative approximation of TV distance to probabilistic inference over directed graphical models. This reduction leads to a fully polynomial randomized approximation scheme (FPRAS) for estimating TV distances between distributions over any class of Bayes nets for which there is an efficient probabilistic inference algorithm. In particular, it leads to an FPRAS for estimating TV distances between distributions that are defined by Bayes nets of bounded treewidth. Prior to this work, such approximation schemes only existed for estimating TV distances between product distributions. Our approach employs a new notion of partial couplings of high-dimensional distributions, which might be of independent interest.

ML-52-标题: Conditional Mutual Information Constrained Deep Learning for Classification

链接: https://arxiv.org/abs/2309.09123
作者: En-Hui Yang, Shayan Mohajer Hamidi, Linfeng Ye, Renhao Tan, Beverly Yang
备注:

点击查看摘要

Abstract: The concepts of conditional mutual information (CMI) and normalized conditional mutual information (NCMI) are introduced to measure the concentration and separation performance of a classification deep neural network (DNN) in the output probability distribution space of the DNN, where CMI and the ratio between CMI and NCMI represent the intra-class concentration and inter-class separation of the DNN, respectively. By using NCMI to evaluate popular DNNs pretrained over ImageNet in the literature, it is shown that their validation accuracies over ImageNet validation data set are more or less inversely proportional to their NCMI values. Based on this observation, the standard deep learning (DL) framework is further modified to minimize the standard cross entropy function subject to an NCMI constraint, yielding CMI constrained deep learning (CMIC-DL). A novel alternating learning algorithm is proposed to solve such a constrained optimization problem. Extensive experiment results show that DNNs trained within CMIC-DL outperform the state-of-the-art models trained within the standard DL and other loss functions in the literature in terms of both accuracy and robustness against adversarial attacks. In addition, visualizing the evolution of learning process through the lens of CMI and NCMI is also advocated.

ML-53-标题: Interactively Teaching an Inverse Reinforcement Learner with Limited Feedback

链接: https://arxiv.org/abs/2309.09095
作者: Rustam Zayanov, Francisco S. Melo, Manuel Lopes
备注: 7 pages, 3 figures

点击查看摘要

Abstract: We study the problem of teaching via demonstrations in sequential decision-making tasks. In particular, we focus on the situation when the teacher has no access to the learner’s model and policy, and the feedback from the learner is limited to trajectories that start from states selected by the teacher. The necessity to select the starting states and infer the learner’s policy creates an opportunity for using the methods of inverse reinforcement learning and active learning by the teacher. In this work, we formalize the teaching process with limited feedback and propose an algorithm that solves this teaching problem. The algorithm uses a modified version of the active value-at-risk method to select the starting states, a modified maximum causal entropy algorithm to infer the policy, and the difficulty score ratio method to choose the teaching demonstrations. We test the algorithm in a synthetic car driving environment and conclude that the proposed algorithm is an effective solution when the learner’s feedback is limited.

ML-54-标题: Test-Time Compensated Representation Learning for Extreme Traffic Forecasting

链接: https://arxiv.org/abs/2309.09074
作者: Zhiwei Zhang, Weizhong Zhang, Yaowei Huang, Kani Chen
备注: 13 pages, 10 figures, 5 tables

点击查看摘要

Abstract: Traffic forecasting is a challenging task due to the complex spatio-temporal correlations among traffic series. In this paper, we identify an underexplored problem in multivariate traffic series prediction: extreme events. Road congestion and rush hours can result in low correlation in vehicle speeds at various intersections during adjacent time periods. Existing methods generally predict future series based on recent observations and entirely discard training data during the testing phase, rendering them unreliable for forecasting highly nonlinear multivariate time series. To tackle this issue, we propose a test-time compensated representation learning framework comprising a spatio-temporal decomposed data bank and a multi-head spatial transformer model (CompFormer). The former component explicitly separates all training data along the temporal dimension according to periodicity characteristics, while the latter component establishes a connection between recent observations and historical series in the data bank through a spatial attention matrix. This enables the CompFormer to transfer robust features to overcome anomalous events while using fewer computational resources. Our modules can be flexibly integrated with existing forecasting methods through end-to-end training, and we demonstrate their effectiveness on the METR-LA and PEMS-BAY benchmarks. Extensive experimental results show that our method is particularly important in extreme events, and can achieve significant improvements over six strong baselines, with an overall improvement of up to 28.2%.

ML-55-标题: Enhancing personalised thermal comfort models with Active Learning for improved HVAC controls

链接: https://arxiv.org/abs/2309.09073
作者: Zeynep Duygu Tekler, Yue Lei, Xilei Dai, Adrian Chong
备注: CISBAT2023

点击查看摘要

Abstract: Developing personalised thermal comfort models to inform occupant-centric controls (OCC) in buildings requires collecting large amounts of real-time occupant preference data. This process can be highly intrusive and labour-intensive for large-scale implementations, limiting the practicality of real-world OCC implementations. To address this issue, this study proposes a thermal preference-based HVAC control framework enhanced with Active Learning (AL) to address the data challenges related to real-world implementations of such OCC systems. The proposed AL approach proactively identifies the most informative thermal conditions for human annotation and iteratively updates a supervised thermal comfort model. The resulting model is subsequently used to predict the occupants’ thermal preferences under different thermal conditions, which are integrated into the building’s HVAC controls. The feasibility of our proposed AL-enabled OCC was demonstrated in an EnergyPlus simulation of a real-world testbed supplemented with the thermal preference data of 58 study occupants. The preliminary results indicated a significant reduction in overall labelling effort (i.e., 31.0%) between our AL-enabled OCC and conventional OCC while still achieving a slight increase in energy savings (i.e., 1.3%) and thermal satisfaction levels above 98%. This result demonstrates the potential for deploying such systems in future real-world implementations, enabling personalised comfort and energy-efficient building operations.

ML-56-标题: Recovering Missing Node Features with Local Structure-based Embeddings ICASSP2024

链接: https://arxiv.org/abs/2309.09068
作者: Victor M. Tenorio, Madeline Navarro, Santiago Segarra, Antonio G. Marques
备注: Submitted to 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

点击查看摘要

Abstract: Node features bolster graph-based learning when exploited jointly with network structure. However, a lack of nodal attributes is prevalent in graph data. We present a framework to recover completely missing node features for a set of graphs, where we only know the signals of a subset of graphs. Our approach incorporates prior information from both graph topology and existing nodal values. We demonstrate an example implementation of our framework where we assume that node features depend on local graph structure. Missing nodal values are estimated by aggregating known features from the most similar nodes. Similarity is measured through a node embedding space that preserves local topological features, which we train using a Graph AutoEncoder. We empirically show not only the accuracy of our feature estimation approach but also its value for downstream graph classification. Our success embarks on and implies the need to emphasize the relationship between node features and graph structure in graph-based learning.

ML-57-标题: Temporal Smoothness Regularisers for Neural Link Predictors

链接: https://arxiv.org/abs/2309.09045
作者: Manuel Dileo, Pasquale Minervini, Matteo Zignani, Sabrina Gaito
备注:

点击查看摘要

Abstract: Most algorithms for representation learning and link prediction on relational data are designed for static data. However, the data to which they are applied typically evolves over time, including online social networks or interactions between users and items in recommender systems. This is also the case for graph-structured knowledge bases – knowledge graphs – which contain facts that are valid only for specific points in time. In such contexts, it becomes crucial to correctly identify missing links at a precise time point, i.e. the temporal prediction link task. Recently, Lacroix et al. and Sadeghian et al. proposed a solution to the problem of link prediction for knowledge graphs under temporal constraints inspired by the canonical decomposition of 4-order tensors, where they regularise the representations of time steps by enforcing temporal smoothing, i.e. by learning similar transformation for adjacent timestamps. However, the impact of the choice of temporal regularisation terms is still poorly understood. In this work, we systematically analyse several choices of temporal smoothing regularisers using linear functions and recurrent architectures. In our experiments, we show that by carefully selecting the temporal smoothing regulariser and regularisation weight, a simple method like TNTComplEx can produce significantly more accurate results than state-of-the-art methods on three widely used temporal link prediction datasets. Furthermore, we evaluate the impact of a wide range of temporal smoothing regularisers on two state-of-the-art temporal link prediction models. Our work shows that simple tensor factorisation models can produce new state-of-the-art results using newly proposed temporal regularisers, highlighting a promising avenue for future research.

ML-58-标题: Study of Enhanced MISC-Based Sparse Arrays with High uDOFs and Low Mutual Coupling

链接: https://arxiv.org/abs/2309.09044
作者: X. Sheng, D. Lu, Y. Li, R. C. de Lamare
备注: 6 pages 4 figures

点击查看摘要

Abstract: In this letter, inspired by the maximum inter-element spacing (IES) constraint (MISC) criterion, an enhanced MISC-based (EMISC) sparse array (SA) with high uniform degrees-of-freedom (uDOFs) and low mutual-coupling (MC) is proposed, analyzed and discussed in detail. For the EMISC SA, an IES set is first determined by the maximum IES and number of elements. Then, the EMISC SA is composed of seven uniform linear sub-arrays (ULSAs) derived from an IES set. An analysis of the uDOFs and weight function shows that, the proposed EMISC SA outperforms the IMISC SA in terms of uDOF and MC. Simulation results show a significant advantage of the EMISC SA over other existing SAs.

ML-59-标题: Forward Invariance in Neural Network Controlled Systems

链接: https://arxiv.org/abs/2309.09043
作者: Akash Harapanahalli, Saber Jafarpour, Samuel Coogan
备注:

点击查看摘要

Abstract: We present a framework based on interval analysis and monotone systems theory to certify and search for forward invariant sets in nonlinear systems with neural network controllers. The framework (i) constructs localized first-order inclusion functions for the closed-loop system using Jacobian bounds and existing neural network verification tools; (ii) builds a dynamical embedding system where its evaluation along a single trajectory directly corresponds with a nested family of hyper-rectangles provably converging to an attractive set of the original system; (iii) utilizes linear transformations to build families of nested paralleletopes with the same properties. The framework is automated in Python using our interval analysis toolbox \textttnpinterval , in conjunction with the symbolic arithmetic toolbox \textttsympy , demonstrated on an 8 -dimensional leader-follower system.

ML-60-标题: Solving Quadratic Systems with Full-Rank Matrices Using Sparse or Generative Priors

链接: https://arxiv.org/abs/2309.09032
作者: Junren Chen, Shuai Huang, Michael K. Ng, Zhaoqiang Liu
备注:

点击查看摘要

Abstract: The problem of recovering a signal \boldsymbolx \in \mathbbR^n from a quadratic system \y_i=\boldsymbolx^\top\boldsymbolA_i\boldsymbolx,\ i=1,\ldots,m\ with full-rank matrices \boldsymbolA_i frequently arises in applications such as unassigned distance geometry and sub-wavelength imaging. With i.i.d. standard Gaussian matrices \boldsymbolA_i , this paper addresses the high-dimensional case where m\ll n by incorporating prior knowledge of \boldsymbolx . First, we consider a k -sparse \boldsymbolx and introduce the thresholded Wirtinger flow (TWF) algorithm that does not require the sparsity level k . TWF comprises two steps: the spectral initialization that identifies a point sufficiently close to \boldsymbolx (up to a sign flip) when m=O(k^2\log n) , and the thresholded gradient descent (with a good initialization) that produces a sequence linearly converging to \boldsymbolx with m=O(k\log n) measurements. Second, we explore the generative prior, assuming that \boldsymbolx lies in the range of an L -Lipschitz continuous generative model with k -dimensional inputs in an \ell_2 -ball of radius r . We develop the projected gradient descent (PGD) algorithm that also comprises two steps: the projected power method that provides an initial vector with O\big(\sqrt\frack \log Lm\big) \ell_2 -error given m=O(k\log(Lnr)) measurements, and the projected gradient descent that refines the \ell_2 -error to O(\delta) at a geometric rate when m=O(k\log\fracLrn\delta^2) . Experimental results corroborate our theoretical findings and show that: (i) our approach for the sparse case notably outperforms the existing provable algorithm sparse power factorization; (ii) leveraging the generative prior allows for precise image recovery in the MNIST dataset from a small number of quadratic measurements.

ML-61-标题: Improve Deep Forest with Learnable Layerwise Augmentation Policy Schedule

链接: https://arxiv.org/abs/2309.09030
作者: Hongyu Zhu, Sichu Liang, Wentao Hu, Fang-Qi Li, Yali yuan, Shi-Lin Wang, Guang Cheng
备注:

点击查看摘要

Abstract: As a modern ensemble technique, Deep Forest (DF) employs a cascading structure to construct deep models, providing stronger representational power compared to traditional decision forests. However, its greedy multi-layer learning procedure is prone to overfitting, limiting model effectiveness and generalizability. This paper presents an optimized Deep Forest, featuring learnable, layerwise data augmentation policy schedules. Specifically, We introduce the Cut Mix for Tabular data (CMT) augmentation technique to mitigate overfitting and develop a population-based search algorithm to tailor augmentation intensity for each layer. Additionally, we propose to incorporate outputs from intermediate layers into a checkpoint ensemble for more stable performance. Experimental results show that our method sets new state-of-the-art (SOTA) benchmarks in various tabular classification tasks, outperforming shallow tree ensembles, deep forests, deep neural network, and AutoML competitors. The learned policies also transfer effectively to Deep Forest variants, underscoring its potential for enhancing non-differentiable deep learning modules in tabular signal processing.

ML-62-标题: gym-saturation: Gymnasium environments for saturation provers (System description)

链接: https://arxiv.org/abs/2309.09022
作者: Boris Shminke
备注: 13 pages, 3 figures. This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: this https URL

点击查看摘要

Abstract: This work describes a new version of a previously published Python package - gym-saturation: a collection of OpenAI Gym environments for guiding saturation-style provers based on the given clause algorithm with reinforcement learning. We contribute usage examples with two different provers: Vampire and iProver. We also have decoupled the proof state representation from reinforcement learning per se and provided examples of using a known ast2vec Python code embedding model as a first-order logic representation. In addition, we demonstrate how environment wrappers can transform a prover into a problem similar to a multi-armed bandit. We applied two reinforcement learning algorithms (Thompson sampling and Proximal policy optimisation) implemented in Ray RLlib to show the ease of experimentation with the new release of our package.

ML-63-标题: Data-driven Reachability using Christoffel Functions and Conformal Prediction

链接: https://arxiv.org/abs/2309.08976
作者: Abdelmouaiz Tebjou, Goran Frehse, Faïcel Chamroukhi
备注:

点击查看摘要

Abstract: An important mathematical tool in the analysis of dynamical systems is the approximation of the reach set, i.e., the set of states reachable after a given time from a given initial state. This set is difficult to compute for complex systems even if the system dynamics are known and given by a system of ordinary differential equations with known coefficients. In practice, parameters are often unknown and mathematical models difficult to obtain. Data-based approaches are promised to avoid these difficulties by estimating the reach set based on a sample of states. If a model is available, this training set can be obtained through numerical simulation. In the absence of a model, real-life observations can be used instead. A recently proposed approach for data-based reach set approximation uses Christoffel functions to approximate the reach set. Under certain assumptions, the approximation is guaranteed to converge to the true solution. In this paper, we improve upon these results by notably improving the sample efficiency and relaxing some of the assumptions by exploiting statistical guarantees from conformal prediction with training and calibration sets. In addition, we exploit an incremental way to compute the Christoffel function to avoid the calibration set while maintaining the statistical convergence guarantees. Furthermore, our approach is robust to outliers in the training and calibration set.

ML-64-标题: Regularized Contrastive Pre-training for Few-shot Bioacoustic Sound Detection

链接: https://arxiv.org/abs/2309.08971
作者: Ilyass Moummad, Romain Serizel, Nicolas Farrugia
备注:

点击查看摘要

Abstract: Bioacoustic sound event detection allows for better understanding of animal behavior and for better monitoring biodiversity using audio. Deep learning systems can help achieve this goal, however it is difficult to acquire sufficient annotated data to train these systems from scratch. To address this limitation, the Detection and Classification of Acoustic Scenes and Events (DCASE) community has recasted the problem within the framework of few-shot learning and organize an annual challenge for learning to detect animal sounds from only five annotated examples. In this work, we regularize supervised contrastive pre-training to learn features that can transfer well on new target tasks with animal sounds unseen during training, achieving a high F-score of 61.52%(0.48) when no feature adaptation is applied, and an F-score of 68.19%(0.75) when we further adapt the learned features for each new target task. This work aims to lower the entry bar to few-shot bioacoustic sound event detection by proposing a simple and yet effective framework for this task, by also providing open-source code.

ML-65-标题: Multi agent Reinforcement Learning with an Attention Mechanism for Improving Energy Efficiency in LoRa Networks

链接: https://arxiv.org/abs/2309.08965
作者: Xu Zhang, Ziqi Lin, Shimin Gong, Bo Gu, Dusit Niyato
备注: 6 pages, 3 figures, This paper has been accepted for publication in IEEE Global Communications Conference (GLOBECOM) 2023

点击查看摘要

Abstract: Long Range (LoRa) wireless technology, characterized by low power consumption and a long communication range, is regarded as one of the enabling technologies for the Industrial Internet of Things (IIoT). However, as the network scale increases, the energy efficiency (EE) of LoRa networks decreases sharply due to severe packet collisions. To address this issue, it is essential to appropriately assign transmission parameters such as the spreading factor and transmission power for each end device (ED). However, due to the sporadic traffic and low duty cycle of LoRa networks, evaluating the system EE performance under different parameter settings is time-consuming. Therefore, we first formulate an analytical model to calculate the system EE. On this basis, we propose a transmission parameter allocation algorithm based on multiagent reinforcement learning (MALoRa) with the aim of maximizing the system EE of LoRa networks. Notably, MALoRa employs an attention mechanism to guide each ED to better learn how much ‘‘attention’’ should be given to the parameter assignments for relevant EDs when seeking to improve the system EE. Simulation results demonstrate that MALoRa significantly improves the system EE compared with baseline algorithms with an acceptable degradation in packet delivery rate (PDR).

ML-66-标题: UNIDEAL: Curriculum Knowledge Distillation Federated Learning ICASSP2024

链接: https://arxiv.org/abs/2309.08961
作者: Yuwen Yang, Chang Liu, Xun Cai, Suizhi Huang, Hongtao Lu, Yue Ding
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: Federated Learning (FL) has emerged as a promising approach to enable collaborative learning among multiple clients while preserving data privacy. However, cross-domain FL tasks, where clients possess data from different domains or distributions, remain a challenging problem due to the inherent heterogeneity. In this paper, we present UNIDEAL, a novel FL algorithm specifically designed to tackle the challenges of cross-domain scenarios and heterogeneous model architectures. The proposed method introduces Adjustable Teacher-Student Mutual Evaluation Curriculum Learning, which significantly enhances the effectiveness of knowledge distillation in FL settings. We conduct extensive experiments on various datasets, comparing UNIDEAL with state-of-the-art baselines. Our results demonstrate that UNIDEAL achieves superior performance in terms of both model accuracy and communication efficiency. Additionally, we provide a convergence analysis of the algorithm, showing a convergence rate of O(1/T) under non-convex conditions.

ML-67-标题: Reducing Memory Requirements for the IPU using Butterfly Factorizations

链接: https://arxiv.org/abs/2309.08946
作者: S.-Kazem Shekofteh, Christian Alles, Holger Fröning
备注:

点击查看摘要

Abstract: High Performance Computing (HPC) benefits from different improvements during last decades, specially in terms of hardware platforms to provide more processing power while maintaining the power consumption at a reasonable level. The Intelligence Processing Unit (IPU) is a new type of massively parallel processor, designed to speedup parallel computations with huge number of processing cores and on-chip memory components connected with high-speed fabrics. IPUs mainly target machine learning applications, however, due to the architectural differences between GPUs and IPUs, especially significantly less memory capacity on an IPU, methods for reducing model size by sparsification have to be considered. Butterfly factorizations are well-known replacements for fully-connected and convolutional layers. In this paper, we examine how butterfly structures can be implemented on an IPU and study their behavior and performance compared to a GPU. Experimental results indicate that these methods can provide 98.5% compression ratio to decrease the immense need for memory, the IPU implementation can benefit from 1.3x and 1.6x performance improvement for butterfly and pixelated butterfly, respectively. We also reach to 1.62x training time speedup on a real-word dataset such as CIFAR10.

ML-68-标题: Inverse classification with logistic and softmax classifiers: efficient optimization

链接: https://arxiv.org/abs/2309.08945
作者: Miguel Á. Carreira-Perpiñán, Suryabhan Singh Hada
备注:

点击查看摘要

Abstract: In recent years, a certain type of problems have become of interest where one wants to query a trained classifier. Specifically, one wants to find the closest instance to a given input instance such that the classifier’s predicted label is changed in a desired way. Examples of these ``inverse classification’’ problems are counterfactual explanations, adversarial examples and model inversion. All of them are fundamentally optimization problems over the input instance vector involving a fixed classifier, and it is of interest to achieve a fast solution for interactive or real-time applications. We focus on solving this problem efficiently for two of the most widely used classifiers: logistic regression and softmax classifiers. Owing to special properties of these models, we show that the optimization can be solved in closed form for logistic regression, and iteratively but extremely fast for the softmax classifier. This allows us to solve either case exactly (to nearly machine precision) in a runtime of milliseconds to around a second even for very high-dimensional instances and many classes.

ML-69-标题: DOMAIN: MilDly COnservative Model-BAsed OfflINe Reinforcement Learning

链接: https://arxiv.org/abs/2309.08925
作者: Xiao-Yin Liu, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Zhen-Qiu Feng, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Zeng-Guang Hou
备注: 13 pages, 6 figures

点击查看摘要

Abstract: Model-based reinforcement learning (RL), which learns environment model from offline dataset and generates more out-of-distribution model data, has become an effective approach to the problem of distribution shift in offline RL. Due to the gap between the learned and actual environment, conservatism should be incorporated into the algorithm to balance accurate offline data and imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and the previous methods ignore differences between the model data, which brings great conservatism. Therefore, this paper proposes a milDly cOnservative Model-bAsed offlINe RL algorithm (DOMAIN) without estimating model uncertainty to address the above issues. DOMAIN introduces adaptive sampling distribution of model samples, which can adaptively adjust the model data penalty. In this paper, we theoretically demonstrate that the Q value learned by the DOMAIN outside the region is a lower bound of the true Q value, the DOMAIN is less conservative than previous model-based offline RL algorithms and has the guarantee of security policy improvement. The results of extensive experiments show that DOMAIN outperforms prior RL algorithms on the D4RL dataset benchmark, and achieves better performance than other RL algorithms on tasks that require generalization.

ML-70-标题: Efficient Methods for Non-stationary Online Learning NEURIPS2022 STOC

链接: https://arxiv.org/abs/2309.08911
作者: Peng Zhao, Yan-Feng Xie, Lijun Zhang, Zhi-Hua Zhou
备注: preliminary conference version appeared at NeurIPS 2022; this extended version improves the paper presentation, further investigates the interval dynamic regret, and adds two applications (online non-stochastic control and online PCA)

点击查看摘要

Abstract: Non-stationary online learning has drawn much attention in recent years. In particular, dynamic regret and adaptive regret are proposed as two principled performance measures for online convex optimization in non-stationary environments. To optimize them, a two-layer online ensemble is usually deployed due to the inherent uncertainty of the non-stationarity, in which a group of base-learners are maintained and a meta-algorithm is employed to track the best one on the fly. However, the two-layer structure raises the concern about the computational complexity – those methods typically maintain \mathcalO(\log T) base-learners simultaneously for a T -round online game and thus perform multiple projections onto the feasible domain per round, which becomes the computational bottleneck when the domain is complicated. In this paper, we present efficient methods for optimizing dynamic regret and adaptive regret, which reduce the number of projections per round from \mathcalO(\log T) to 1 . Moreover, our obtained algorithms require only one gradient query and one function evaluation at each round. Our technique hinges on the reduction mechanism developed in parameter-free online learning and requires non-trivial twists on non-stationary online methods. Empirical studies verify our theoretical findings.

ML-71-标题: Robust Online Covariance and Sparse Precision Estimation Under Arbitrary Data Corruption

链接: https://arxiv.org/abs/2309.08884
作者: Tong Yao, Shreyas Sundaram
备注: 9 pages, 4 figures, 62nd IEEE Conference on Decision and Control (CDC)

点击查看摘要

Abstract: Gaussian graphical models are widely used to represent correlations among entities but remain vulnerable to data corruption. In this work, we introduce a modified trimmed-inner-product algorithm to robustly estimate the covariance in an online scenario even in the presence of arbitrary and adversarial data attacks. At each time step, data points, drawn nominally independently and identically from a multivariate Gaussian distribution, arrive. However, a certain fraction of these points may have been arbitrarily corrupted. We propose an online algorithm to estimate the sparse inverse covariance (i.e., precision) matrix despite this corruption. We provide the error-bound and convergence properties of the estimates to the true precision matrix under our algorithms.

ML-72-标题: Data-Driven H-infinity Control with a Real-Time and Efficient Reinforcement Learning Algorithm: An Application to Autonomous Mobility-on-Demand Systems

链接: https://arxiv.org/abs/2309.08880
作者: Ali Aalipour, Alireza Khani
备注:

点击查看摘要

Abstract: Reinforcement learning (RL) is a class of artificial intelligence algorithms being used to design adaptive optimal controllers through online learning. This paper presents a model-free, real-time, data-efficient Q-learning-based algorithm to solve the H _\infty control of linear discrete-time systems. The computational complexity is shown to reduce from \mathcalO(\underlineq^3) in the literature to \mathcalO(\underlineq^2) in the proposed algorithm, where \underlineq is quadratic in the sum of the size of state variables, control inputs, and disturbance. An adaptive optimal controller is designed and the parameters of the action and critic networks are learned online without the knowledge of the system dynamics, making the proposed algorithm completely model-free. Also, a sufficient probing noise is only needed in the first iteration and does not affect the proposed algorithm. With no need for an initial stabilizing policy, the algorithm converges to the closed-form solution obtained by solving the Riccati equation. A simulation study is performed by applying the proposed algorithm to real-time control of an autonomous mobility-on-demand (AMoD) system for a real-world case study to evaluate the effectiveness of the proposed algorithm.

ML-73-标题: Rethinking Learning Rate Tuning in the Era of Large Language Models

链接: https://arxiv.org/abs/2309.08859
作者: Hongpeng Jin, Wenqi Wei, Xuyu Wang, Wenbin Zhang, Yanzhao Wu
备注:

点击查看摘要

Abstract: Large Language Models (LLMs) represent the recent success of deep learning in achieving remarkable human-like predictive performance. It has become a mainstream strategy to leverage fine-tuning to adapt LLMs for various real-world applications due to the prohibitive expenses associated with LLM training. The learning rate is one of the most important hyperparameters in LLM fine-tuning with direct impacts on both fine-tuning efficiency and fine-tuned LLM quality. Existing learning rate policies are primarily designed for training traditional deep neural networks (DNNs), which may not work well for LLM fine-tuning. We reassess the research challenges and opportunities of learning rate tuning in the coming era of Large Language Models. This paper makes three original contributions. First, we revisit existing learning rate policies to analyze the critical challenges of learning rate tuning in the era of LLMs. Second, we present LRBench++ to benchmark learning rate policies and facilitate learning rate tuning for both traditional DNNs and LLMs. Third, our experimental analysis with LRBench++ demonstrates the key differences between LLM fine-tuning and traditional DNN training and validates our analysis.

ML-74-标题: Distributionally Robust Post-hoc Classifiers under Prior Shifts ICLR2023

链接: https://arxiv.org/abs/2309.08825
作者: Jiaheng Wei, Harikrishna Narasimhan, Ehsan Amid, Wen-Sheng Chu, Yang Liu, Abhishek Kumar
备注: Camera ready version, accepted at ICLR 2023

点击查看摘要

Abstract: The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired by a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributional robust post-hoc classifiers. An empirical implementation is available at this https URL.

ML-75-标题: SHAPNN: Shapley Value Regularized Tabular Neural Network

链接: https://arxiv.org/abs/2309.08799
作者: Qisen Cheng, Shuhui Qu, Janghwan Lee
备注: 9 pages, 8 figures

点击查看摘要

Abstract: We present SHAPNN, a novel deep tabular data modeling architecture designed for supervised learning. Our approach leverages Shapley values, a well-established technique for explaining black-box models. Our neural network is trained using standard backward propagation optimization methods, and is regularized with realtime estimated Shapley values. Our method offers several advantages, including the ability to provide valid explanations with no computational overhead for data instances and datasets. Additionally, prediction with explanation serves as a regularizer, which improves the model’s performance. Moreover, the regularized prediction enhances the model’s capability for continual learning. We evaluate our method on various publicly available datasets and compare it with state-of-the-art deep neural network models, demonstrating the superior performance of SHAPNN in terms of AUROC, transparency, as well as robustness to streaming data.

ML-76-标题: Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact Checking and Explanation Generation

链接: https://arxiv.org/abs/2309.08793
作者: Aman Rangapur, Haoran Wang, Kai Shu
备注: 8 pages, 4 figures, 4 tables

点击查看摘要

Abstract: Fact-checking in financial domain is under explored, and there is a shortage of quality dataset in this domain. In this paper, we propose Fin-Fact, a benchmark dataset for multimodal fact-checking within the financial domain. Notably, it includes professional fact-checker annotations and justifications, providing expertise and credibility. With its multimodal nature encompassing both textual and visual content, Fin-Fact provides complementary information sources to enhance factuality analysis. Its primary objective is combating misinformation in finance, fostering transparency, and building trust in financial reporting and news dissemination. By offering insightful explanations, Fin-Fact empowers users, including domain experts and end-users, to understand the reasoning behind fact-checking decisions, validating claim credibility, and fostering trust in the fact-checking process. The Fin-Fact dataset, along with our experimental codes is available at this https URL.

ML-77-标题: Beyond Labels: Leveraging Deep Learning and LLM s for Content Metadata

链接: https://arxiv.org/abs/2309.08787
作者: Saurabh Agrawal, John Trenkle, Jaya Kawale
备注:

点击查看摘要

Abstract: Content metadata plays a very important role in movie recommender systems as it provides valuable information about various aspects of a movie such as genre, cast, plot synopsis, box office summary, etc. Analyzing the metadata can help understand the user preferences to generate personalized recommendations and item cold starting. In this talk, we will focus on one particular type of metadata - \textitgenre labels. Genre labels associated with a movie or a TV series help categorize a collection of titles into different themes and correspondingly setting up the audience expectation. We present some of the challenges associated with using genre label information and propose a new way of examining the genre information that we call as the \textitGenre Spectrum. The Genre Spectrum helps capture the various nuanced genres in a title and our offline and online experiments corroborate the effectiveness of the approach. Furthermore, we also talk about applications of LLMs in augmenting content metadata which could eventually be used to achieve effective organization of recommendations in user’s 2-D home-grid.

ML-78-标题: Projected Task-Specific Layers for Multi-Task Reinforcement Learning

链接: https://arxiv.org/abs/2309.08776
作者: Josselin Somerville Roberts, Julia Di
备注:

点击查看摘要

Abstract: Multi-task reinforcement learning could enable robots to scale across a wide variety of manipulation tasks in homes and workplaces. However, generalizing from one task to another and mitigating negative task interference still remains a challenge. Addressing this challenge by successfully sharing information across tasks will depend on how well the structure underlying the tasks is captured. In this work, we introduce our new architecture, Projected Task-Specific Layers (PTSL), that leverages a common policy with dense task-specific corrections through task-specific layers to better express shared and variable task information. We then show that our model outperforms the state of the art on the MT10 and MT50 benchmarks of Meta-World consisting of 10 and 50 goal-conditioned tasks for a Sawyer arm.

ML-79-标题: Enhance audio generation controllability through representation similarity regularization

链接: https://arxiv.org/abs/2309.08773
作者: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra
备注: 5 pages

点击查看摘要

Abstract: This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regularization to ensure the alignment between the chosen text representation and the language model’s predictions. Our proposal involves the incorporation of audio and text representation regularization, particularly during the classifier-free guidance (CFG) phase, where the text condition is excluded from cross attention during language model training. The aim of this proposed representation regularization is to minimize discrepancies in audio and text similarity compared to other samples within the same training batch. Experimental results on both music and audio generation tasks demonstrate that our proposed methods lead to improvements in objective metrics for both audio and music generation, as well as an enhancement in the human perception for audio generation.

ML-80-标题: Circular Clustering with Polar Coordinate Reconstruction

链接: https://arxiv.org/abs/2309.08757
作者: Xiaoxiao Sun, Paul Sajda
备注: Manuscript is under review in IEEE Transactions on Computational Biology and Bioinformatics. Copyright holder is credited to IEEE

点击查看摘要

Abstract: There is a growing interest in characterizing circular data found in biological systems. Such data are wide ranging and varied, from signal phase in neural recordings to nucleotide sequences in round genomes. Traditional clustering algorithms are often inadequate due to their limited ability to distinguish differences in the periodic component. Current clustering schemes that work in a polar coordinate system have limitations, such as being only angle-focused or lacking generality. To overcome these limitations, we propose a new analysis framework that utilizes projections onto a cylindrical coordinate system to better represent objects in a polar coordinate system. Using the mathematical properties of circular data, we show our approach always finds the correct clustering result within the reconstructed dataset, given sufficient periodic repetitions of the data. Our approach is generally applicable and adaptable and can be incorporated into most state-of-the-art clustering algorithms. We demonstrate on synthetic and real data that our method generates more appropriate and consistent clustering results compared to standard methods. In summary, our proposed analysis framework overcomes the limitations of existing polar coordinate-based clustering methods and provides a more accurate and efficient way to cluster circular data.

ML-81-标题: Diverse Neural Audio Embeddings – Bringing Features back !

链接: https://arxiv.org/abs/2309.08751
作者: Prateek Verma
备注: 6 pages, 1 figure, 2 table

点击查看摘要

Abstract: With the advent of modern AI architectures, a shift has happened towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases/knowledge, optimized according to the task. We in this paper, learn audio embeddings via diverse feature representations, in this case, domain-specific. For the case of audio classification over hundreds of categories of sound, we learn robust separate embeddings for diverse audio properties such as pitch, timbre, and neural representation, along with also learning it via an end-to-end architecture. We observe handcrafted embeddings, e.g., pitch and timbre-based, although on their own, are not able to beat a fully end-to-end representation, yet adding these together with end-to-end embedding helps us, significantly improve performance. This work would pave the way to bring some domain expertise with end-to-end models to learn robust, diverse representations, surpassing the performance of just training end-to-end models.

ML-82-标题: Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

链接: https://arxiv.org/abs/2309.08748
作者: Yi Shen, Pan Xu, Michael M. Zavlanos
备注:

点击查看摘要

Abstract: Without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to over-fitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally computationally more expensive compared to KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite sample complexity and iteration complexity for our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stoke trial.

ML-83-标题: Experimental Assessment of a Forward-Collision Warning System Fusing Deep Learning and Decentralized Radio Sensing

链接: https://arxiv.org/abs/2309.08737
作者: Jorge D. Cardenas, Omar Contreras-Ponce, Carlos A. Gutierrez, Ruth Aguilar-Ponce, Francisco R. Castillo-Soria, Cesar A. Azurdia-Meza
备注:

点击查看摘要

Abstract: This paper presents the idea of an automatic forward-collision warning system based on a decentralized radio sensing (RS) approach. In this framework, a vehicle in receiving mode employs a continuous waveform (CW) transmitted by a second vehicle as a probe signal to detect oncoming vehicles and warn the driver of a potential forward collision. Such a CW can easily be incorporated as a pilot signal within the data frame of current multicarrier vehicular communication systems. Detection of oncoming vehicles is performed by a deep learning (DL) module that analyzes the features of the Doppler signature imprinted on the CW probe signal by a rapidly approaching vehicle. This decentralized CW RS approach was assessed experimentally using data collected by a series of field trials conducted in a two-lanes high-speed highway. Detection performance was evaluated for two different DL models: a long short-term memory network and a convolutional neural network. The obtained results demonstrate the feasibility of the envisioned forward-collision warning system based on the fusion of DL and decentralized CW RS.

ML-84-标题: Pointing the Way: Refining Radar-Lidar Localization Using Learned ICP Weights ICRA

链接: https://arxiv.org/abs/2309.08731
作者: Daniil Lisus, Johann Laconte, Keenan Burnett, Timothy D. Barfoot
备注: 8 pages (6 content, 2 references). 4 figures, submitted to the 2024 IEEE International Conference on Robotics and Automation (ICRA)

点击查看摘要

Abstract: This paper presents a novel deep-learning-based approach to improve localizing radar measurements against lidar maps. Although the state of the art for localization is matching lidar data to lidar maps, radar has been considered as a promising alternative, as it is potentially more resilient against adverse weather such as precipitation and heavy fog. To make use of existing high-quality lidar maps, while maintaining performance in adverse weather, matching radar data to lidar maps is of interest. However, owing in part to the unique artefacts present in radar measurements, radar-lidar localization has struggled to achieve comparable performance to lidar-lidar systems, preventing it from being viable for autonomous driving. This work builds on an ICP-based radar-lidar localization system by including a learned preprocessing step that weights radar points based on high-level scan information. Combining a proven analytical approach with a learned weight reduces localization errors in radar-lidar ICP results run on real-world autonomous driving data by up to 54.94% in translation and 68.39% in rotation, while maintaining interpretability and robustness.

ML-85-标题: Clustered Multi- Agent Linear Bandits

链接: https://arxiv.org/abs/2309.08710
作者: Hamza Cherkaoui, Merwan Barlier, Igor Colin
备注: 18 pages, 8 figures

点击查看摘要

Abstract: We address in this paper a particular instance of the multi-agent linear stochastic bandit problem, called clustered multi-agent linear bandits. In this setting, we propose a novel algorithm leveraging an efficient collaboration between the agents in order to accelerate the overall optimization problem. In this contribution, a network controller is responsible for estimating the underlying cluster structure of the network and optimizing the experiences sharing among agents within the same groups. We provide a theoretical analysis for both the regret minimization problem and the clustering quality. Through empirical evaluation against state-of-the-art algorithms on both synthetic and real data, we demonstrate the effectiveness of our approach: our algorithm significantly improves regret minimization while managing to recover the true underlying cluster partitioning.

ML-86-标题: Wasserstein Distributionally Robust Control Barrier Function using Conditional Value-at-Risk with Differentiable Convex Programming

链接: https://arxiv.org/abs/2309.08700
作者: Alaa Eddine Chriat, Chuangchuang Sun
备注:

点击查看摘要

Abstract: Control Barrier functions (CBFs) have attracted extensive attention for designing safe controllers for their deployment in real-world safety-critical systems. However, the perception of the surrounding environment is often subject to stochasticity and further distributional shift from the nominal one. In this paper, we present distributional robust CBF (DR-CBF) to achieve resilience under distributional shift while keeping the advantages of CBF, such as computational efficacy and forward invariance. To achieve this goal, we first propose a single-level convex reformulation to estimate the conditional value at risk (CVaR) of the safety constraints under distributional shift measured by a Wasserstein metric, which is by nature tri-level programming. Moreover, to construct a control barrier condition to enforce the forward invariance of the CVaR, the technique of differentiable convex programming is applied to enable differentiation through the optimization layer of CVaR estimation. We also provide an approximate variant of DR-CBF for higher-order systems. Simulation results are presented to validate the chance-constrained safety guarantee under the distributional shift in both first and second-order systems.

ML-87-标题: Modelling Irregularly Sampled Time Series Without Imputation

链接: https://arxiv.org/abs/2309.08698
作者: Rohit Agarwal, Aman Sinha, Dilip K. Prasad, Marianne Clausel, Alexander Horsch, Mathieu Constant, Xavier Coubez
备注:

点击查看摘要

Abstract: Modelling irregularly-sampled time series (ISTS) is challenging because of missing values. Most existing methods focus on handling ISTS by converting irregularly sampled data into regularly sampled data via imputation. These models assume an underlying missing mechanism leading to unwanted bias and sub-optimal performance. We present SLAN (Switch LSTM Aggregate Network), which utilizes a pack of LSTMs to model ISTS without imputation, eliminating the assumption of any underlying process. It dynamically adapts its architecture on the fly based on the measured sensors. SLAN exploits the irregularity information to capture each sensor’s local summary explicitly and maintains a global summary state throughout the observational period. We demonstrate the efficacy of SLAN on publicly available datasets, namely, MIMIC-III, Physionet 2012 and Physionet 2019. The code is available at this https URL.

ML-88-标题: Evaluating the Impact of Local Differential Privacy on Utility Loss via Influence Functions

链接: https://arxiv.org/abs/2309.08678
作者: Alycia N. Carey, Minh-Hao Van, Xintao Wu
备注: 11 pages, 2 figures

点击查看摘要

Abstract: How to properly set the privacy parameter in differential privacy (DP) has been an open question in DP research since it was first proposed in 2006. In this work, we demonstrate the ability of influence functions to offer insight into how a specific privacy parameter value will affect a model’s test loss in the randomized response-based local DP setting. Our proposed method allows a data curator to select the privacy parameter best aligned with their allowed privacy-utility trade-off without requiring heavy computation such as extensive model retraining and data privatization. We consider multiple common randomization scenarios, such as performing randomized response over the features, and/or over the labels, as well as the more complex case of applying a class-dependent label noise correction method to offset the noise incurred by randomization. Further, we provide a detailed discussion over the computational complexity of our proposed approach inclusive of an empirical analysis. Through empirical evaluations we show that for both binary and multi-class settings, influence functions are able to approximate the true change in test loss that occurs when randomized response is applied over features and/or labels with small mean absolute error, especially in cases where noise correction methods are applied.

ML-89-标题: A Stochastic Online Forecast-and-Optimize Framework for Real-Time Energy Dispatch in Virtual Power Plants under Uncertainty CIKM23

链接: https://arxiv.org/abs/2309.08642
作者: Wei Jiang, Zhongkai Yi, Li Wang, Hanwei Zhang, Jihai Zhang, Fangquan Lin, Cheng Yang
备注: Preprint. Accepted by CIKM 23

点击查看摘要

Abstract: Aggregating distributed energy resources in power systems significantly increases uncertainties, in particular caused by the fluctuation of renewable energy generation. This issue has driven the necessity of widely exploiting advanced predictive control techniques under uncertainty to ensure long-term economics and decarbonization. In this paper, we propose a real-time uncertainty-aware energy dispatch framework, which is composed of two key elements: (i) A hybrid forecast-and-optimize sequential task, integrating deep learning-based forecasting and stochastic optimization, where these two stages are connected by the uncertainty estimation at multiple temporal resolutions; (ii) An efficient online data augmentation scheme, jointly involving model pre-training and online fine-tuning stages. In this way, the proposed framework is capable to rapidly adapt to the real-time data distribution, as well as to target on uncertainties caused by data drift, model discrepancy and environment perturbations in the control process, and finally to realize an optimal and robust dispatch solution. The proposed framework won the championship in CityLearn Challenge 2022, which provided an influential opportunity to investigate the potential of AI application in the energy domain. In addition, comprehensive experiments are conducted to interpret its effectiveness in the real-life scenario of smart building energy management.

ML-90-标题: FedFNN: Faster Training Convergence Through Update Predictions in Federated Recommender Systems

链接: https://arxiv.org/abs/2309.08635
作者: Francesco Fabbri, Xianghang Liu, Jack R. McKenzie, Bartlomiej Twardowski, Tri Kurniawan Wijaya
备注:

点击查看摘要

Abstract: Federated Learning (FL) has emerged as a key approach for distributed machine learning, enhancing online personalization while ensuring user data privacy. Instead of sending private data to a central server as in traditional approaches, FL decentralizes computations: devices train locally and share updates with a global server. A primary challenge in this setting is achieving fast and accurate model training - vital for recommendation systems where delays can compromise user engagement. This paper introduces FedFNN, an algorithm that accelerates decentralized model training. In FL, only a subset of users are involved in each training epoch. FedFNN employs supervised learning to predict weight updates from unsampled users, using updates from the sampled set. Our evaluations, using real and synthetic data, show: 1. FedFNN achieves training speeds 5x faster than leading methods, maintaining or improving accuracy; 2. the algorithm’s performance is consistent regardless of client cluster variations; 3. FedFNN outperforms other methods in scenarios with limited client availability, converging more quickly.

ML-91-标题: Drifter: Efficient Online Feature Monitoring for Improved Data Integrity in Large-Scale Recommendation Systems RECSYS

链接: https://arxiv.org/abs/2309.08617
作者: Blaž Škrlj, Nir Ki-Tov, Lee Edelist, Natalia Silberstein, Hila Weisman-Zohar, Blaž Mramor, Davorin Kopič, Naama Ziporin
备注: Accepted to ORSUM RecSys workshop

点击查看摘要

Abstract: Real-world production systems often grapple with maintaining data quality in large-scale, dynamic streams. We introduce Drifter, an efficient and lightweight system for online feature monitoring and verification in recommendation use cases. Drifter addresses limitations of existing methods by delivering agile, responsive, and adaptable data quality monitoring, enabling real-time root cause analysis, drift detection and insights into problematic production events. Integrating state-of-the-art online feature ranking for sparse data and anomaly detection ideas, Drifter is highly scalable and resource-efficient, requiring only two threads and less than a gigabyte of RAM per production deployments that handle millions of instances per minute. Evaluation on real-world data sets demonstrates Drifter’s effectiveness in alerting and mitigating data quality issues, substantially improving reliability and performance of real-time live recommender systems.

ML-92-标题: Do the Frankenstein or how to achieve better out-of-distribution performance with manifold mixing model soup

链接: https://arxiv.org/abs/2309.08610
作者: Hannes Fassold
备注: Accepted for IMVIP 2023 conference

点击查看摘要

Abstract: The standard recipe applied in transfer learning is to finetune a pretrained model on the task-specific dataset with different hyperparameter settings and pick the model with the highest accuracy on the validation dataset. Unfortunately, this leads to models which do not perform well under distribution shifts, e.g. when the model is given graphical sketches of the object as input instead of photos. In order to address this, we propose the manifold mixing model soup, an algorithm which mixes together the latent space manifolds of multiple finetuned models in an optimal way in order to generate a fused model. We show that the fused model gives significantly better out-of-distribution performance (+3.5 % compared to best individual model) when finetuning a CLIP model for image classification. In addition, it provides also better accuracy on the original dataset where the finetuning has been done.

ML-93-标题: Des-q: a quantum algorithm to construct and efficiently retrain decision trees for regression and binary classification

链接: https://arxiv.org/abs/2309.09976
作者: Niraj Kumar, Romina Yalovetzky, Changhao Li, Pierre Minnsen, Marco Pistoia
备注: 48 pager, 4 figures, 4 tables

点击查看摘要

Abstract: Decision trees are widely used in machine learning due to their simplicity in construction and interpretability. However, as data sizes grow, traditional methods for construction and retraining decision trees become increasingly slow, scaling polynomially with the number of training examples. In this work, we introduce a novel quantum algorithm, named Des - q , for constructing and retraining decision trees in regression and binary classification tasks. Assuming the data stream produces small increments of new training examples, we demonstrate that our Des - q algorithm significantly reduces the time required for tree retraining, achieving a poly-logarithmic time complexity in the number of training examples, even accounting for the time needed to load the new examples into quantum-accessible memory. Our approach involves building a decision tree algorithm to perform k-piecewise linear tree splits at each internal node. These splits simultaneously generate multiple hyperplanes, dividing the feature space into k distinct regions. To determine the k suitable anchor points for these splits, we develop an efficient quantum-supervised clustering method, building upon the q-means algorithm of Kerenidis et al . Des - q first efficiently estimates each feature weight using a novel quantum technique to estimate the Pearson correlation. Subsequently, we employ weighted distance estimation to cluster the training examples in k disjoint regions and then proceed to expand the tree using the same procedure. We benchmark the performance of the simulated version of our algorithm against the state-of-the-art classical decision tree for regression and binary classification on multiple data sets with numerical features. Further, we showcase that the proposed algorithm exhibits similar performance to the state-of-the-art decision tree while significantly speeding up the periodic tree retraining.

ML-94-标题: Distilling Hu BERT with LSTMs via Decoupled Knowledge Distillation ICASSP2024

链接: https://arxiv.org/abs/2309.09920
作者: Danilo de Oliveira, Timo Gerkmann
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: Much research effort is being applied to the task of compressing the knowledge of self-supervised models, which are powerful, yet large and memory consuming. In this work, we show that the original method of knowledge distillation (and its more recently proposed extension, decoupled knowledge distillation) can be applied to the task of distilling HuBERT. In contrast to methods that focus on distilling internal features, this allows for more freedom in the network architecture of the compressed model. We thus propose to distill HuBERT’s Transformer layers into an LSTM-based distilled model that reduces the number of parameters even below DistilHuBERT and at the same time shows improved performance in automatic speech recognition.

ML-95-标题: Learning Nonparametric High-Dimensional Generative Models: The Empirical-Beta-Copula Autoencoder

链接: https://arxiv.org/abs/2309.09916
作者: Maximilian Coblenz, Oliver Grothe, Fabian Kächele
备注:

点击查看摘要

Abstract: By sampling from the latent space of an autoencoder and decoding the latent space samples to the original data space, any autoencoder can simply be turned into a generative model. For this to work, it is necessary to model the autoencoder’s latent space with a distribution from which samples can be obtained. Several simple possibilities (kernel density estimates, Gaussian distribution) and more sophisticated ones (Gaussian mixture models, copula models, normalization flows) can be thought of and have been tried recently. This study aims to discuss, assess, and compare various techniques that can be used to capture the latent space so that an autoencoder can become a generative model while striving for simplicity. Among them, a new copula-based method, the Empirical Beta Copula Autoencoder, is considered. Furthermore, we provide insights into further aspects of these methods, such as targeted sampling or synthesizing new data with specific features.

ML-96-标题: Learning to Generate Lumped Hydrological Models

链接: https://arxiv.org/abs/2309.09904
作者: Yang Yang, Ting Fong May Chui
备注:

点击查看摘要

Abstract: In a lumped hydrological model structure, the hydrological function of a catchment is characterized by only a few parameters. Given a set of parameter values, a numerical function useful for hydrological prediction is generated. Thus, this study assumes that the hydrological function of a catchment can be sufficiently well characterized by a small number of latent variables. By specifying the variable values, a numerical function resembling the hydrological function of a real-world catchment can be generated using a generative model. In this study, a deep learning method is used to learn both the generative model and the latent variable values of different catchments directly from their climate forcing and runoff data, without using catchment attributes. The generative models can be used similarly to a lumped model structure, i.e., by estimating the optimal parameter or latent variable values using a generic model calibration algorithm, an optimal numerical model can be derived. In this study, generative models using eight latent variables were learned from data from over 3,000 catchments worldwide, and the learned generative models were applied to model over 700 different catchments using a generic calibration algorithm. The quality of the resulting optimal models was generally comparable to or better than that obtained using 36 different types of lump model structures or using non-generative deep learning methods. In summary, this study presents a data-driven approach for representing the hydrological function of a catchment in low-dimensional space and a method for reconstructing specific hydrological functions from the representations.

ML-97-标题: Error Reduction from Stacked Regressions

链接: https://arxiv.org/abs/2309.09880
作者: Xin Chen, Jason M. Klusowski, Yan Shuo Tan
备注:

点击查看摘要

Abstract: Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy. The conventional approach uses cross-validation data to generate predictions from the constituent estimators, and least-squares with nonnegativity constraints to learn the combination weights. In this paper, we learn these weights analogously by minimizing an estimate of the population risk subject to a nonnegativity constraint. When the constituent estimators are linear least-squares projections onto nested subspaces separated by at least three dimensions, we show that thanks to a shrinkage effect, the resulting stacked estimator has strictly smaller population risk than best single estimator among them. Here ``best’’ refers to a model that minimizes a selection criterion such as AIC or BIC. In other words, in this setting, the best single estimator is inadmissible. Because the optimization problem can be reformulated as isotonic regression, the stacked estimator requires the same order of computation as the best single estimator, making it an attractive alternative in terms of both performance and implementation.

ML-98-标题: Domain Generalization with Fourier Transform and Soft Thresholding

链接: https://arxiv.org/abs/2309.09866
作者: Hongyi Pan, Bin Wang, Zheyuan Zhan, Xin Zhu, Debesh Jha, Ahmet Enis Cetin, Concetto Spampinato, Ulas Bagci
备注:

点击查看摘要

Abstract: Domain generalization aims to train models on multiple source domains so that they can generalize well to unseen target domains. Among many domain generalization methods, Fourier-transform-based domain generalization methods have gained popularity primarily because they exploit the power of Fourier transformation to capture essential patterns and regularities in the data, making the model more robust to domain shifts. The mainstream Fourier-transform-based domain generalization swaps the Fourier amplitude spectrum while preserving the phase spectrum between the source and the target images. However, it neglects background interference in the amplitude spectrum. To overcome this limitation, we introduce a soft-thresholding function in the Fourier domain. We apply this newly designed algorithm to retinal fundus image segmentation, which is important for diagnosing ocular diseases but the neural network’s performance can degrade across different sources due to domain shifts. The proposed technique basically enhances fundus image augmentation by eliminating small values in the Fourier domain and providing better generalization. The innovative nature of the soft thresholding fused with Fourier-transform-based domain generalization improves neural network models’ performance by reducing the target images’ background interference significantly. Experiments on public data validate our approach’s effectiveness over conventional and state-of-the-art methods with superior segmentation metrics.

ML-99-标题: Convolutional Deep Kernel Machines

链接: https://arxiv.org/abs/2309.09814
作者: Edward Milsom, Ben Anson, Laurence Aitchison
备注: 11 pages

点击查看摘要

Abstract: Deep kernel machines (DKMs) are a recently introduced kernel method with the flexibility of other deep models including deep NNs and deep Gaussian processes. DKMs work purely with kernels, never with features, and are therefore different from other methods ranging from NNs to deep kernel learning and even deep Gaussian processes, which all use features as a fundamental component. Here, we introduce convolutional DKMs, along with an efficient inter-domain inducing point approximation scheme. Further, we develop and experimentally assess a number of model variants, including 9 different types of normalisation designed for the convolutional DKMs, two likelihoods, and two different types of top-layer. The resulting models achieve around 99% test accuracy on MNIST, 92% on CIFAR-10 and 71% on CIFAR-100, despite training in only around 28 GPU hours, 1-2 orders of magnitude faster than full NNGP / NTK / Myrtle kernels, whilst achieving comparable performance.

ML-100-标题: The NFLikelihood: an unsupervised DNNLikelihood from Normalizing Flows

链接: https://arxiv.org/abs/2309.09743
作者: Humberto Reyes-Gonzalez, Riccardo Torre
备注: 15 pages, 5 figures, 11 tables

点击查看摘要

Abstract: We propose the NFLikelihood, an unsupervised version, based on Normalizing Flows, of the DNNLikelihood proposed in Ref.[1]. We show, through realistic examples, how Autoregressive Flows, based on affine and rational quadratic spline bijectors, are able to learn complicated high-dimensional Likelihoods arising in High Energy Physics (HEP) analyses. We focus on a toy LHC analysis example already considered in the literature and on two Effective Field Theory fits of flavor and electroweak observables, whose samples have been obtained throught the HEPFit code. We discuss advantages and disadvantages of the unsupervised approach with respect to the supervised one and discuss possible interplays of the two.

ML-101-标题: Neural Collapse for Unconstrained Feature Model under Cross-entropy Loss with Imbalanced Data

链接: https://arxiv.org/abs/2309.09725
作者: Wanli Hong, Shuyang Ling
备注: 38 pages, 10 figures

点击查看摘要

Abstract: Recent years have witnessed the huge success of deep neural networks (DNNs) in various tasks of computer vision and text processing. Interestingly, these DNNs with massive number of parameters share similar structural properties on their feature representation and last-layer classifier at terminal phase of training (TPT). Specifically, if the training data are balanced (each class shares the same number of samples), it is observed that the feature vectors of samples from the same class converge to their corresponding in-class mean features and their pairwise angles are the same. This fascinating phenomenon is known as Neural Collapse (N C), first termed by Papyan, Han, and Donoho in 2019. Many recent works manage to theoretically explain this phenomenon by adopting so-called unconstrained feature model (UFM). In this paper, we study the extension of N C phenomenon to the imbalanced data under cross-entropy loss function in the context of unconstrained feature model. Our contribution is multi-fold compared with the state-of-the-art results: (a) we show that the feature vectors exhibit collapse phenomenon, i.e., the features within the same class collapse to the same mean vector; (b) the mean feature vectors no longer form an equiangular tight frame. Instead, their pairwise angles depend on the sample size; © we also precisely characterize the sharp threshold on which the minority collapse (the feature vectors of the minority groups collapse to one single vector) will take place; (d) finally, we argue that the effect of the imbalance in datasize diminishes as the sample size grows. Our results provide a complete picture of the N C under the cross-entropy loss for the imbalanced data. Numerical experiments confirm our theoretical analysis.

ML-102-标题: Single and Few-step Diffusion for Generative Speech Enhancement

链接: https://arxiv.org/abs/2309.09677
作者: Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann
备注: 5 pages, 1 figure, 1 table

点击查看摘要

Abstract: Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in a slow inference process and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations through a two-stage training approach. In the first stage, we train the diffusion model the usual way using the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that using this second training stage enables achieving the same performance as the baseline model using only 5 function evaluations instead of 60 function evaluations. While the performance of usual generative diffusion algorithms drops dramatically when lowering the number of function evaluations (NFEs) to obtain single-step diffusion, we show that our proposed method keeps a steady performance and therefore largely outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart.

ML-103-标题: New Bounds on the Accuracy of Majority Voting for Multi-Class Classification

链接: https://arxiv.org/abs/2309.09564
作者: Sina Aeeneh, Nikola Zlatanov, Jiangshan Yu
备注:

点击查看摘要

Abstract: Majority voting is a simple mathematical function that returns the value that appears most often in a set. As a popular decision fusion technique, the majority voting function (MVF) finds applications in resolving conflicts, where a number of independent voters report their opinions on a classification problem. Despite its importance and its various applications in ensemble learning, data crowd-sourcing, remote sensing, and data oracles for blockchains, the accuracy of the MVF for the general multi-class classification problem has remained unknown. In this paper, we derive a new upper bound on the accuracy of the MVF for the multi-class classification problem. More specifically, we show that under certain conditions, the error rate of the MVF exponentially decays toward zero as the number of independent voters increases. Conversely, the error rate of the MVF exponentially grows with the number of independent voters if these conditions are not met. We first explore the problem for independent and identically distributed voters where we assume that every voter follows the same conditional probability distribution of voting for different classes, given the true classification of the data point. Next, we extend our results for the case where the voters are independent but non-identically distributed. Using the derived results, we then provide a discussion on the accuracy of the truth discovery algorithms. We show that in the best-case scenarios, truth discovery algorithms operate as an amplified MVF and thereby achieve a small error rate only when the MVF achieves a small error rate, and vice versa, achieve a large error rate when the MVF also achieves a large error rate. In the worst-case scenario, the truth discovery algorithms may achieve a higher error rate than the MVF. Finally, we confirm our theoretical results using numerical simulations.

ML-104-标题: Utilizing Whisper to Enhance Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids

链接: https://arxiv.org/abs/2309.09548
作者: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
备注:

点击查看摘要

Abstract: Automated assessment of speech intelligibility in hearing aid (HA) devices is of great importance. Our previous work introduced a non-intrusive multi-branched speech intelligibility prediction model called MBI-Net, which achieved top performance in the Clarity Prediction Challenge 2022. Based on the promising results of the MBI-Net model, we aim to further enhance its performance by leveraging Whisper embeddings to enrich acoustic features. In this study, we propose two improved models, namely MBI-Net+ and MBI-Net++. MBI-Net+ maintains the same model architecture as MBI-Net, but replaces self-supervised learning (SSL) speech embeddings with Whisper embeddings to deploy cross-domain features. On the other hand, MBI-Net++ further employs a more elaborate design, incorporating an auxiliary task to predict frame-level and utterance-level scores of the objective speech intelligibility metric HASPI (Hearing Aid Speech Perception Index) and multi-task learning. Experimental results confirm that both MBI-Net++ and MBI-Net+ achieve better prediction performance than MBI-Net in terms of multiple metrics, and MBI-Net++ is better than MBI-Net+.

ML-105-标题: Quantum Wasserstein GANs for State Preparation at Unseen Points of a Phase Diagram

链接: https://arxiv.org/abs/2309.09543
作者: Wiktor Jurasz, Christian B. Mendl
备注:

点击查看摘要

Abstract: Generative models and in particular Generative Adversarial Networks (GANs) have become very popular and powerful data generation tool. In recent years, major progress has been made in extending this concept into the quantum realm. However, most of the current methods focus on generating classes of states that were supplied in the input set and seen at the training time. In this work, we propose a new hybrid classical-quantum method based on quantum Wasserstein GANs that overcomes this limitation. It allows to learn the function governing the measurement expectations of the supplied states and generate new states, that were not a part of the input set, but which expectations follow the same underlying function.

ML-106-标题: Dynamic-SUPERB: Towards A Dynamic Collaborative and Comprehensive Instruction-Tuning Benchmark for Speech ICASSP2024

链接: https://arxiv.org/abs/2309.09510
作者: Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks. Moreover, the lack of standardized benchmarks hinders a fair comparison across different approaches. Thus, we present Dynamic-SUPERB, a benchmark designed for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To initiate, Dynamic-SUPERB features 55 evaluation instances by combining 33 tasks and 22 datasets. This spans a broad spectrum of dimensions, providing a comprehensive platform for evaluation. Additionally, we propose several approaches to establish benchmark baselines. These include the utilization of speech models, text language models, and the multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We also conducted an ablation study to assess the robustness and seek improvements in the performance. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.

ML-107-标题: Outlier-Insensitive Kalman Filtering: Theory and Applications

链接: https://arxiv.org/abs/2309.09505
作者: Shunit Truzman, Guy Revach, Nir Shlezinger, Itzik Klein
备注:

点击查看摘要

Abstract: State estimation of dynamical systems from noisy observations is a fundamental task in many applications. It is commonly addressed using the linear Kalman filter (KF), whose performance can significantly degrade in the presence of outliers in the observations, due to the sensitivity of its convex quadratic objective function. To mitigate such behavior, outlier detection algorithms can be applied. In this work, we propose a parameter-free algorithm which mitigates the harmful effect of outliers while requiring only a short iterative process of the standard update step of the KF. To that end, we model each potential outlier as a normal process with unknown variance and apply online estimation through either expectation maximization or alternating maximization algorithms. Simulations and field experiment evaluations demonstrate competitive performance of our method, showcasing its robustness to outliers in filtering scenarios compared to alternative algorithms.

ML-108-标题: On the Use of the Kantorovich-Rubinstein Distance for Dimensionality Reduction

链接: https://arxiv.org/abs/2309.09442
作者: Gaël Giordano
备注: 214 pages, 0 figures, This is a PhD thesis in mathematics under the supervision of Dr. Vladimir Pestov and Dr. George Wells submitted on May 1, 2023 at the University of Ottawa

点击查看摘要

Abstract: The goal of this thesis is to study the use of the Kantorovich-Rubinstein distance as to build a descriptor of sample complexity in classification problems. The idea is to use the fact that the Kantorovich-Rubinstein distance is a metric in the space of measures that also takes into account the geometry and topology of the underlying metric space. We associate to each class of points a measure and thus study the geometrical information that we can obtain from the Kantorovich-Rubinstein distance between those measures. We show that a large Kantorovich-Rubinstein distance between those measures allows to conclude that there exists a 1-Lipschitz classifier that classifies well the classes of points. We also discuss the limitation of the Kantorovich-Rubinstein distance as a descriptor.

ML-109-标题: Distributionally Time-Varying Online Stochastic Optimization under Polyak-Łojasiewicz Condition with Application in Conditional Value-at-Risk Statistical Learning

链接: https://arxiv.org/abs/2309.09411
作者: Yuen-Man Pun, Farhad Farokhi, Iman Shames
备注:

点击查看摘要

Abstract: In this work, we consider a sequence of stochastic optimization problems following a time-varying distribution via the lens of online optimization. Assuming that the loss function satisfies the Polyak-Łojasiewicz condition, we apply online stochastic gradient descent and establish its dynamic regret bound that is composed of cumulative distribution drifts and cumulative gradient biases caused by stochasticity. The distribution metric we adopt here is Wasserstein distance, which is well-defined without the absolute continuity assumption or with a time-varying support set. We also establish a regret bound of online stochastic proximal gradient descent when the objective function is regularized. Moreover, we show that the above framework can be applied to the Conditional Value-at-Risk (CVaR) learning problem. Particularly, we improve an existing proof on the discovery of the PL condition of the CVaR problem, resulting in a regret bound of online stochastic gradient descent.

ML-110-标题: An Automatic Tuning MPC with Application to Ecological Cruise Control

链接: https://arxiv.org/abs/2309.09358
作者: Mohammad Abtahi, Mahdis Rabbani, Shima Nazari
备注:

点击查看摘要

Abstract: Model predictive control (MPC) is a powerful tool for planning and controlling dynamical systems due to its capacity for handling constraints and taking advantage of preview information. Nevertheless, MPC performance is highly dependent on the choice of cost function tuning parameters. In this work, we demonstrate an approach for online automatic tuning of an MPC controller with an example application to an ecological cruise control system that saves fuel by using a preview of road grade. We solve the global fuel consumption minimization problem offline using dynamic programming and find the corresponding MPC cost function by solving the inverse optimization problem. A neural network fitted to these offline results is used to generate the desired MPC cost function weight during online operation. The effectiveness of the proposed approach is verified in simulation for different road geometries.

ML-111-标题: Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties

链接: https://arxiv.org/abs/2309.09355
作者: Shokirbek Shermukhamedov, Dilorom Mamurjonova, Michael Probst
备注: 11 pages, 10 figures, 3 tables, 43 references

点击查看摘要

Abstract: The application of machine learning (ML) techniques in computational chemistry has led to significant advances in predicting molecular properties, accelerating drug discovery, and material design. ML models can extract hidden patterns and relationships from complex and large datasets, allowing for the prediction of various chemical properties with high accuracy. The use of such methods has enabled the discovery of molecules and materials that were previously difficult to identify. This paper introduces a new ML model based on deep learning techniques, such as a multilayer encoder and decoder architecture, for classification tasks. We demonstrate the opportunities offered by our approach by applying it to various types of input data, including organic and inorganic compounds. In particular, we developed and tested the model using the Matbench and Moleculenet benchmarks, which include crystal properties and drug design-related benchmarks. We also conduct a comprehensive analysis of vector representations of chemical compounds, shedding light on the underlying patterns in molecular data. The models used in this work exhibit a high degree of predictive power, underscoring the progress that can be made with refined machine learning when applied to molecular and material datasets. For instance, on the Tox21 dataset, we achieved an average accuracy of 96%, surpassing the previous best result by 10%. Our code is publicly available at this https URL.

ML-112-标题: Simulation-based Inference for Exoplanet Atmospheric Retrieval: Insights from winning the Ariel Data Challenge 2023 using Normalizing Flows KDD2023 ECML

链接: https://arxiv.org/abs/2309.09337
作者: Mayeul Aubin (1,2), Carolina Cuesta-Lazaro (1), Ethan Tregidga (1,3), Javier Viaña (4), Cecilia Garraffo (1), Iouli E. Gordon (1), Mercedes López-Morales (1), Robert J. Hargreaves (1), Vladimir Yu. Makhnev (1), Jeremy J. Drake (1), Douglas P. Finkbeiner (1), Phillip Cargile (1) ( (1) Center for Astrophysics | Harvard & Smithsonian, (2) Ecole Polytechnique, (3) University of Southampton, (4) Kavli Institute for Astrophysics and Space Research | Massachusetts Institute of Technology)
备注: Conference proceeding for the ECML PKDD 2023

点击查看摘要

Abstract: Advancements in space telescopes have opened new avenues for gathering vast amounts of data on exoplanet atmosphere spectra. However, accurately extracting chemical and physical properties from these spectra poses significant challenges due to the non-linear nature of the underlying physics. This paper presents novel machine learning models developed by the AstroAI team for the Ariel Data Challenge 2023, where one of the models secured the top position among 293 competitors. Leveraging Normalizing Flows, our models predict the posterior probability distribution of atmospheric parameters under different atmospheric assumptions. Moreover, we introduce an alternative model that exhibits higher performance potential than the winning model, despite scoring lower in the challenge. These findings highlight the need to reevaluate the evaluation metric and prompt further exploration of more efficient and accurate approaches for exoplanet atmosphere spectra analysis. Finally, we present recommendations to enhance the challenge and models, providing valuable insights for future applications on real observational data. These advancements pave the way for more effective and timely analysis of exoplanet atmospheric properties, advancing our understanding of these distant worlds.

ML-113-标题: High-dimensional manifold of solutions in neural networks: insights from statistical physics

链接: https://arxiv.org/abs/2309.09240
作者: Enrico M. Malatesta
备注: 22 pages, 9 figures, based on a set of lectures done at the “School of the Italian Society of Statistical Physics”, IMT, Lucca

点击查看摘要

Abstract: In these pedagogic notes I review the statistical mechanics approach to neural networks, focusing on the paradigmatic example of the perceptron architecture with binary an continuous weights, in the classification setting. I will review the Gardner’s approach based on replica method and the derivation of the SAT/UNSAT transition in the storage setting. Then, I discuss some recent works that unveiled how the zero training error configurations are geometrically arranged, and how this arrangement changes as the size of the training set increases. I also illustrate how different regions of solution space can be explored analytically and how the landscape in the vicinity of a solution can be characterized. I give evidence how, in binary weight models, algorithmic hardness is a consequence of the disappearance of a clustered region of solutions that extends to very large distances. Finally, I demonstrate how the study of linear mode connectivity between solutions can give insights into the average shape of the solution manifold.

ML-114-标题: Provable learning of quantum states with graphical models

链接: https://arxiv.org/abs/2309.09235
作者: Liming Zhao, Naixu Guo, Ming-Xing Luo, Patrick Rebentrost
备注:

点击查看摘要

Abstract: The complete learning of an n -qubit quantum state requires samples exponentially in n . Several works consider subclasses of quantum states that can be learned in polynomial sample complexity such as stabilizer states or high-temperature Gibbs states. Other works consider a weaker sense of learning, such as PAC learning and shadow tomography. In this work, we consider learning states that are close to neural network quantum states, which can efficiently be represented by a graphical model called restricted Boltzmann machines (RBMs). To this end, we exhibit robustness results for efficient provable two-hop neighborhood learning algorithms for ferromagnetic and locally consistent RBMs. We consider the L_p -norm as a measure of closeness, including both total variation distance and max-norm distance in the limit. Our results allow certain quantum states to be learned with a sample complexity \textitexponentially better than naive tomography. We hence provide new classes of efficiently learnable quantum states and apply new strategies to learn them.

ML-115-标题: On the Connection Between Riemann Hypothesis and a Special Class of Neural Networks

链接: https://arxiv.org/abs/2309.09171
作者: Soufiane Hayou
备注:

点击查看摘要

Abstract: The Riemann hypothesis (RH) is a long-standing open problem in mathematics. It conjectures that non-trivial zeros of the zeta function all have real part equal to 1/2. The extent of the consequences of RH is far-reaching and touches a wide spectrum of topics including the distribution of prime numbers, the growth of arithmetic functions, the growth of Euler totient, etc. In this note, we revisit and extend an old analytic criterion of the RH known as the Nyman-Beurling criterion which connects the RH to a minimization problem that involves a special class of neural networks. This note is intended for an audience unfamiliar with RH. A gentle introduction to RH is provided.

ML-116-标题: Integration of geoelectric and geochemical data using Self-Organizing Maps (SOM) to characterize a landfill

链接: https://arxiv.org/abs/2309.09164
作者: Camila Juliao, Johan Diaz, Yosmely BermÚdez, Milagrosa Aldana
备注: 11 pages, 7 figures

点击查看摘要

Abstract: Leachates from garbage dumps can significantly compromise their surrounding area. Even if the distance between these and the populated areas could be considerable, the risk of affecting the aquifers for public use is imminent in most cases. For this reason, the delimitation and monitoring of the leachate plume are of significant importance. Geoelectric data (resistivity and IP), and surface methane measurements, are integrated and classified using an unsupervised Neural Network to identify possible risk zones in areas surrounding a landfill. The Neural Network used is a Kohonen type, which generates; as a result, Self-Organizing Classification Maps or SOM (Self-Organizing Map). Two graphic outputs were obtained from the training performed in which groups of neurons that presented a similar behaviour were selected. Contour maps corresponding to the location of these groups and the individual variables were generated to compare the classification obtained and the different anomalies associated with each of these variables. Two of the groups resulting from the classification are related to typical values of liquids percolated in the landfill for the parameters evaluated individually. In this way, a precise delimitation of the affected areas in the studied landfill was obtained, integrating the input variables via SOMs. The location of the study area is not detailed for confidentiality reasons.

ML-117-标题: Reducing sequential change detection to sequential estimation

链接: https://arxiv.org/abs/2309.09111
作者: Shubhanshu Shekhar, Aaditya Ramdas
备注: 11 pages

点击查看摘要

Abstract: We consider the problem of sequential change detection, where the goal is to design a scheme for detecting any changes in a parameter or functional \theta of the data stream distribution that has small detection delay, but guarantees control on the frequency of false alarms in the absence of changes. In this paper, we describe a simple reduction from sequential change detection to sequential estimation using confidence sequences: we begin a new (1-\alpha) -confidence sequence at each time step, and proclaim a change when the intersection of all active confidence sequences becomes empty. We prove that the average run length is at least 1/\alpha , resulting in a change detection scheme with minimal structural assumptions~(thus allowing for possibly dependent observations, and nonparametric distribution classes), but strong guarantees. Our approach bears an interesting parallel with the reduction from change detection to sequential testing of Lorden (1971) and the e-detector of Shin et al. (2022).

ML-118-标题: Fast Approximation of the Shapley Values Based on Order-of-Addition Experimental Designs

链接: https://arxiv.org/abs/2309.08923
作者: Liuqing Yang, Yongdao Zhou, Haoda Fu, Min-Qian Liu, Wei Zheng
备注:

点击查看摘要

Abstract: Shapley value is originally a concept in econometrics to fairly distribute both gains and costs to players in a coalition game. In the recent decades, its application has been extended to other areas such as marketing, engineering and machine learning. For example, it produces reasonable solutions for problems in sensitivity analysis, local model explanation towards the interpretable machine learning, node importance in social network, attribution models, etc. However, its heavy computational burden has been long recognized but rarely investigated. Specifically, in a d -player coalition game, calculating a Shapley value requires the evaluation of d! or 2^d marginal contribution values, depending on whether we are taking the permutation or combination formulation of the Shapley value. Hence it becomes infeasible to calculate the Shapley value when d is reasonably large. A common remedy is to take a random sample of the permutations to surrogate for the complete list of permutations. We find an advanced sampling scheme can be designed to yield much more accurate estimation of the Shapley value than the simple random sampling (SRS). Our sampling scheme is based on combinatorial structures in the field of design of experiments (DOE), particularly the order-of-addition experimental designs for the study of how the orderings of components would affect the output. We show that the obtained estimates are unbiased, and can sometimes deterministically recover the original Shapley value. Both theoretical and simulations results show that our DOE-based sampling scheme outperforms SRS in terms of estimation accuracy. Surprisingly, it is also slightly faster than SRS. Lastly, real data analysis is conducted for the C. elegans nervous system and the 9/11 terrorist network.

ML-119-标题: Intelligent machines work in unstructured environments by differential neural computing

链接: https://arxiv.org/abs/2309.08835
作者: Shengbo Wang, Shuo Gao, Chenyu Tang, Cong Li, Shurui Wang, Jiaqi Wang, Hubin Zhao, Guohua Hu, Arokia Nathan, Ravinder Dahiya, Luigi Occhipinti
备注: 17 pages, 5 figures

点击查看摘要

Abstract: Expecting intelligent machines to efficiently work in real world requires a new method to understand unstructured information in unknown environments with good accuracy, scalability and generalization, like human. Here, a memristive neural computing based perceptual signal differential processing and learning method for intelligent machines is presented, via extracting main features of environmental information and applying associated encoded stimuli to memristors, we successfully obtain human-like ability in processing unstructured environmental information, such as amplification (>720%) and adaptation (<50%) of mechanical stimuli. The method also exhibits good scalability and generalization, validated in two typical applications of intelligent machines: object grasping and autonomous driving. In the former, a robot hand experimentally realizes safe and stable grasping, through learning unknown object features (e.g., sharp corner and smooth surface) with a single memristor in 1 ms. In the latter, the decision-making information of 10 unstructured environments in autonomous driving (e.g., overtaking cars, pedestrians) are accurately (94%) extracted with a 40x25 memristor array. By mimicking the intrinsic nature of human low-level perception mechanisms in electronic memristive neural circuits, the proposed method is adaptable to diverse sensing technologies, helping intelligent machines to generate smart high-level decisions in real world.

ML-120-标题: Bioinspired LLM : Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials

链接: https://arxiv.org/abs/2309.08788
作者: Rachel K. Luu, Markus J. Buehler
备注:

点击查看摘要

Abstract: The study of biological materials and bio-inspired materials science is well established; however, surprisingly little knowledge has been systematically translated to engineering solutions. To accelerate discovery and guide insights, an open-source autoregressive transformer large language model, BioinspiredLLM, is reported. The model was finetuned with a corpus of over a thousand peer-reviewed articles in the field of structural biological and bio-inspired materials and can be prompted to actively and interactively recall information, assist with research tasks, and function as an engine for creativity. The model has proven by example that it is not only able to accurately recall information about biological materials when queried but also formulate biomaterials questions and answers that can evaluate its own performance. BioinspiredLLM also has been shown to develop sound hypotheses regarding biological materials design and remarkably so for materials that have never been explicitly studied before. Lastly, the model showed impressive promise in collaborating with other generative artificial intelligence models in a workflow that can reshape the traditional materials design process. This collaborative generative artificial intelligence method can stimulate and enhance bio-inspired materials design workflows. Biological materials is at a critical intersection of multiple scientific fields and models like BioinspiredLLM help to connect knowledge domains.

ML-121-标题: Mining Patents with Large Language Models Demonstrates Congruence of Functional Labels and Chemical Structures

链接: https://arxiv.org/abs/2309.08765
作者: Clayton W. Kosonocky, Claus O. Wilke, Edward M. Marcotte, Andrew D. Ellington
备注: Under review

点击查看摘要

Abstract: Predicting chemical function from structure is a major goal of the chemical sciences, from the discovery and repurposing of novel drugs to the creation of new materials. Recently, new machine learning algorithms are opening up the possibility of general predictive models spanning many different chemical functions. Here, we consider the challenge of applying large language models to chemical patents in order to consolidate and leverage the information about chemical functionality captured by these resources. Chemical patents contain vast knowledge on chemical function, but their usefulness as a dataset has historically been neglected due to the impracticality of extracting high-quality functional labels. Using a scalable ChatGPT-assisted patent summarization and word-embedding label cleaning pipeline, we derive a Chemical Function (CheF) dataset, containing 100K molecules and their patent-derived functional labels. The functional labels were validated to be of high quality, allowing us to detect a strong relationship between functional label and chemical structural spaces. Further, we find that the co-occurrence graph of the functional labels contains a robust semantic structure, which allowed us in turn to examine functional relatedness among the compounds. We then trained a model on the CheF dataset, allowing us to assign new functional labels to compounds. Using this model, we were able to retrodict approved Hepatitis C antivirals, uncover an antiviral mechanism undisclosed in the patent, and identify plausible serotonin-related drugs. The CheF dataset and associated model offers a promising new approach to predict chemical functionality.

ML-122-标题: Price of Safety in Linear Best Arm Identification

链接: https://arxiv.org/abs/2309.08709
作者: Xuedong Shang, Igor Colin, Merwan Barlier, Hamza Cherkaoui
备注: 20 pages, 1 figures

点击查看摘要

Abstract: We introduce the safe best-arm identification framework with linear feedback, where the agent is subject to some stage-wise safety constraint that linearly depends on an unknown parameter vector. The agent must take actions in a conservative way so as to ensure that the safety constraint is not violated with high probability at each round. Ways of leveraging the linear structure for ensuring safety has been studied for regret minimization, but not for best-arm identification to the best our knowledge. We propose a gap-based algorithm that achieves meaningful sample complexity while ensuring the stage-wise safety. We show that we pay an extra term in the sample complexity due to the forced exploration phase incurred by the additional safety constraint. Experimental illustrations are provided to justify the design of our algorithm.

ML-123-标题: Quantifying Credit Portfolio sensitivity to asset correlations with interpretable generative neural networks

链接: https://arxiv.org/abs/2309.08652
作者: Sergio Caprioli, Emanuele Cagliero, Riccardo Crupi
备注:

点击查看摘要

Abstract: In this research, we propose a novel approach for the quantification of credit portfolio Value-at-Risk (VaR) sensitivity to asset correlations with the use of synthetic financial correlation matrices generated with deep learning models. In previous work Generative Adversarial Networks (GANs) were employed to demonstrate the generation of plausible correlation matrices, that capture the essential characteristics observed in empirical correlation matrices estimated on asset returns. Instead of GANs, we employ Variational Autoencoders (VAE) to achieve a more interpretable latent space representation. Through our analysis, we reveal that the VAE latent space can be a useful tool to capture the crucial factors impacting portfolio diversification, particularly in relation to credit portfolio sensitivity to asset correlations changes.

ML-124-标题: Doubly High-Dimensional Contextual Bandits: An Interpretable Model for Joint Assortment-Pricing

链接: https://arxiv.org/abs/2309.08634
作者: Junhui Cai, Ran Chen, Martin J. Wainwright, Linda Zhao
备注:

点击查看摘要

Abstract: Key challenges in running a retail business include how to select products to present to consumers (the assortment problem), and how to price products (the pricing problem) to maximize revenue or profit. Instead of considering these problems in isolation, we propose a joint approach to assortment-pricing based on contextual bandits. Our model is doubly high-dimensional, in that both context vectors and actions are allowed to take values in high-dimensional spaces. In order to circumvent the curse of dimensionality, we propose a simple yet flexible model that captures the interactions between covariates and actions via a (near) low-rank representation matrix. The resulting class of models is reasonably expressive while remaining interpretable through latent factors, and includes various structured linear bandit and pricing models as particular cases. We propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator, and we prove bounds on its regret. Simulation results show that this method has lower regret than state-of-the-art methods applied to various standard bandit and pricing models. Real-world case studies on the assortment-pricing problem, from an industry-leading instant noodles company to an emerging beauty start-up, underscore the gains achievable using our method. In each case, we show at least three-fold gains in revenue or profit by our bandit method, as well as the interpretability of the latent factor models that are learned.

ML-125-标题: PCN: A Deep Learning Approach to Jet Tagging Utilizing Novel Graph Construction Methods and Chebyshev Graph Convolutions

链接: https://arxiv.org/abs/2309.08630
作者: Yash Semlani, Mihir Relan, Krithik Ramesh
备注: 13 pages, 2 figures, and 4 tables

点击查看摘要

Abstract: Jet tagging is a classification problem in high-energy physics experiments that aims to identify the collimated sprays of subatomic particles, jets, from particle collisions and tag them to their emitter particle. Advances in jet tagging present opportunities for searches of new physics beyond the Standard Model. Current approaches use deep learning to uncover hidden patterns in complex collision data. However, the representation of jets as inputs to a deep learning model have been varied, and often, informative features are withheld from models. In this study, we propose a graph-based representation of a jet that encodes the most information possible. To learn best from this representation, we design Particle Chebyshev Network (PCN), a graph neural network (GNN) using Chebyshev graph convolutions (ChebConv). ChebConv has been demonstrated as an effective alternative to classical graph convolutions in GNNs and has yet to be explored in jet tagging. PCN achieves a substantial improvement in accuracy over existing taggers and opens the door to future studies into graph-based representations of jets and ChebConv layers in high-energy physics experiments. Code is available at this https URL.

ML-126-标题: Balance Measures Derived from Insole Sensor Differentiate Prodromal Dementia with Lewy Bodies

链接: https://arxiv.org/abs/2309.08623
作者: Masatomo Kobayashi, Yasunori Yamada, Kaoru Shinkawa, Miyuki Nemoto, Miho Ota, Kiyotaka Nemoto, Tetsuaki Arai
备注:

点击查看摘要

Abstract: Dementia with Lewy bodies is the second most common type of neurodegenerative dementia, and identification at the prodromal stage - i.e., mild cognitive impairment due to Lewy bodies (MCI-LB) - is important for providing appropriate care. However, MCI-LB is often underrecognized because of its diversity in clinical manifestations and similarities with other conditions such as mild cognitive impairment due to Alzheimer’s disease (MCI-AD). In this study, we propose a machine learning-based automatic pipeline that helps identify MCI-LB by exploiting balance measures acquired with an insole sensor during a 30-s standing task. An experiment with 98 participants (14 MCI-LB, 38 MCI-AD, 46 cognitively normal) showed that the resultant models could discriminate MCI-LB from the other groups with up to 78.0% accuracy (AUC: 0.681), which was 6.8% better than the accuracy of a reference model based on demographic and clinical neuropsychological measures. Our findings may open up a new approach for timely identification of MCI-LB, enabling better care for patients.

ML-127-标题: Variance Reduction of Resampling for Sequential Monte Carlo

链接: https://arxiv.org/abs/2309.08620
作者: Xiongming Dai, Gerald Baumgartner
备注: 13 pages, 6 figures

点击查看摘要

Abstract: A resampling scheme provides a way to switch low-weight particles for sequential Monte Carlo with higher-weight particles representing the objective distribution. The less the variance of the weight distribution is, the more concentrated the effective particles are, and the quicker and more accurate it is to approximate the hidden Markov model, especially for the nonlinear case. We propose a repetitive deterministic domain with median ergodicity for resampling and have achieved the lowest variances compared to the other resampling methods. As the size of the deterministic domain M\ll N (the size of population), given a feasible size of particles, our algorithm is faster than the state of the art, which is verified by theoretical deduction and experiments of a hidden Markov model in both the linear and non-linear cases.

计算机视觉

CV-0-标题: General In-Hand Object Rotation with Vision and Touch

链接: https://arxiv.org/abs/2309.09979
作者: Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik
备注: CoRL 2023; Website: this https URL

点击查看摘要

Abstract: We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and the importance of visual and tactile sensing.

CV-1-标题: GEDepth: Ground Embedding for Monocular Depth Estimation ICCV2023

链接: https://arxiv.org/abs/2309.09975
作者: Xiaodong Yang, Zhuang Ma, Zhiyu Ji, Zhe Ren
备注: ICCV 2023

点击查看摘要

Abstract: Monocular depth estimation is an ill-posed problem as the same 2D image can be projected from infinite 3D scenes. Although the leading algorithms in this field have reported significant improvement, they are essentially geared to the particular compound of pictorial observations and camera parameters (i.e., intrinsics and extrinsics), strongly limiting their generalizability in real-world scenarios. To cope with this challenge, this paper proposes a novel ground embedding module to decouple camera parameters from pictorial cues, thus promoting the generalization capability. Given camera parameters, the proposed module generates the ground depth, which is stacked with the input image and referenced in the final depth prediction. A ground attention is designed in the module to optimally combine ground depth with residual depth. Our ground embedding is highly flexible and lightweight, leading to a plug-in module that is amenable to be integrated into various depth estimation networks. Experiments reveal that our approach achieves the state-of-the-art results on popular benchmarks, and more importantly, renders significant generalization improvement on a wide range of cross-domain tests.

CV-2-标题: End-to-End Learned Event- and Image-based Visual Odometry

链接: https://arxiv.org/abs/2309.09947
作者: Roberto Pellerito, Marco Cannici, Daniel Gehrig, Joris Belhadj, Olivier Dubois-Matra, Massimo Casasco, Davide Scaramuzza
备注: 7 pages, 5 figures

点击查看摘要

Abstract: Visual Odometry (VO) is crucial for autonomous robotic navigation, especially in GPS-denied environments like planetary terrains. While standard RGB cameras struggle in low-light or high-speed motion, event-based cameras offer high dynamic range and low latency. However, seamlessly integrating asynchronous event data with synchronous frames remains challenging. We introduce RAMP-VO, the first end-to-end learned event- and image-based VO system. It leverages novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders that are 8x faster and 20% more accurate than existing asynchronous encoders. RAMP-VO further employs a novel pose forecasting technique to predict future poses for initialization. Despite being trained only in simulation, RAMP-VO outperforms image- and event-based methods by 52% and 20%, respectively, on traditional, real-world benchmarks as well as newly introduced Apollo and Malapert landing sequences, paving the way for robust and asynchronous VO in space.

CV-3-标题: What is a Fair Diffusion Model? Designing Generative Text-To-Image Models to Incorporate Various Worldviews

链接: https://arxiv.org/abs/2309.09944
作者: Zoe De Simone, Angie Boggust, Arvind Satyanarayan, Ashia Wilson
备注: 20 pages, 5 figures

点击查看摘要

Abstract: Generative text-to-image (GTI) models produce high-quality images from short textual descriptions and are widely used in academic and creative domains. However, GTI models frequently amplify biases from their training data, often producing prejudiced or stereotypical images. Yet, current bias mitigation strategies are limited and primarily focus on enforcing gender parity across occupations. To enhance GTI bias mitigation, we introduce DiffusionWorldViewer, a tool to analyze and manipulate GTI models’ attitudes, values, stories, and expectations of the world that impact its generated images. Through an interactive interface deployed as a web-based GUI and Jupyter Notebook plugin, DiffusionWorldViewer categorizes existing demographics of GTI-generated images and provides interactive methods to align image demographics with user worldviews. In a study with 13 GTI users, we find that DiffusionWorldViewer allows users to represent their varied viewpoints about what GTI outputs are fair and, in doing so, challenges current notions of fairness that assume a universal worldview.

CV-4-标题: Hierarchical Attention and Graph Neural Networks: Toward Drift-Free Pose Estimation

链接: https://arxiv.org/abs/2309.09934
作者: Kathia Melbouci, Fawzi Nashashibi
备注:

点击查看摘要

Abstract: The most commonly used method for addressing 3D geometric registration is the iterative closet-point algorithm, this approach is incremental and prone to drift over multiple consecutive frames. The Common strategy to address the drift is the pose graph optimization subsequent to frame-to-frame registration, incorporating a loop closure process that identifies previously visited places. In this paper, we explore a framework that replaces traditional geometric registration and pose graph optimization with a learned model utilizing hierarchical attention mechanisms and graph neural networks. We propose a strategy to condense the data flow, preserving essential information required for the precise estimation of rigid poses. Our results, derived from tests on the KITTI Odometry dataset, demonstrate a significant improvement in pose estimation accuracy. This improvement is especially notable in determining rotational components when compared with results obtained through conventional multi-way registration via pose graph optimization. The code will be made available upon completion of the review process.

CV-5-标题: On Model Explanations with Transferable Neural Pathways

链接: https://arxiv.org/abs/2309.09887
作者: Xinmiao Lin, Wentao Bao, Qi Yu, Yu Kong
备注: Arxiv preprint

点击查看摘要

Abstract: Neural pathways as model explanations consist of a sparse set of neurons that provide the same level of prediction performance as the whole model. Existing methods primarily focus on accuracy and sparsity but the generated pathways may offer limited interpretability thus fall short in explaining the model behavior. In this paper, we suggest two interpretability criteria of neural pathways: (i) same-class neural pathways should primarily consist of class-relevant neurons; (ii) each instance’s neural pathway sparsity should be optimally determined. To this end, we propose a Generative Class-relevant Neural Pathway (GEN-CNP) model that learns to predict the neural pathways from the target model’s feature maps. We propose to learn class-relevant information from features of deep and shallow layers such that same-class neural pathways exhibit high similarity. We further impose a faithfulness criterion for GEN-CNP to generate pathways with instance-specific sparsity. We propose to transfer the class-relevant neural pathways to explain samples of the same class and show experimentally and qualitatively their faithfulness and interpretability.

CV-6-标题: RaLF: Flow-based Global and Metric Radar Localization in LiDAR Maps

链接: https://arxiv.org/abs/2309.09875
作者: Abhijeet Nayak, Daniele Cattaneo, Abhinav Valada
备注:

点击查看摘要

Abstract: Localization is paramount for autonomous robots. While camera and LiDAR-based approaches have been extensively investigated, they are affected by adverse illumination and weather conditions. Therefore, radar sensors have recently gained attention due to their intrinsic robustness to such conditions. In this paper, we propose RaLF, a novel deep neural network-based approach for localizing radar scans in a LiDAR map of the environment, by jointly learning to address both place recognition and metric localization. RaLF is composed of radar and LiDAR feature encoders, a place recognition head that generates global descriptors, and a metric localization head that predicts the 3-DoF transformation between the radar scan and the map. We tackle the place recognition task by learning a shared embedding space between the two modalities via cross-modal metric learning. Additionally, we perform metric localization by predicting pixel-level flow vectors that align the query radar scan with the LiDAR map. We extensively evaluate our approach on multiple real-world driving datasets and show that RaLF achieves state-of-the-art performance for both place recognition and metric localization. Moreover, we demonstrate that our approach can effectively generalize to different cities and sensor setups than the ones used during training. We make the code and trained models publicly available at this http URL.

CV-7-标题: Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based Agile Flight

链接: https://arxiv.org/abs/2309.09865
作者: Jiaxu Xing, Leonard Bauersfeld, Yunlong Song, Chunwei Xing, Davide Scaramuzza
备注:

点击查看摘要

Abstract: Scene transfer for vision-based mobile robotics applications is a highly relevant and challenging problem. The utility of a robot greatly depends on its ability to perform a task in the real world, outside of a well-controlled lab environment. Existing scene transfer end-to-end policy learning approaches often suffer from poor sample efficiency or limited generalization capabilities, making them unsuitable for mobile robotics applications. This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment. Control policies relying on the embedding are able to operate in unseen environments without the need for finetuning in the deployment environment. We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight. Extensive simulation and real-world experiments demonstrate that our approach successfully generalizes beyond the training domain and outperforms all baselines.

CV-8-标题: Unsupervised Open-Vocabulary Object Localization in Videos ICCV2023

链接: https://arxiv.org/abs/2309.09858
作者: Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He
备注: Accepted by ICCV 2023

点击查看摘要

Abstract: In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.

CV-9-标题: PseudoCal: Towards Initialisation-Free Deep Learning-Based Camera-LiDAR Self-Calibration BMVC

链接: https://arxiv.org/abs/2309.09855
作者: Mathieu Cocheteux, Julien Moreau, Franck Davoine
备注: British Machine Vision Conference (BMVC) 2023

点击查看摘要

Abstract: Camera-LiDAR extrinsic calibration is a critical task for multi-sensor fusion in autonomous systems, such as self-driving vehicles and mobile robots. Traditional techniques often require manual intervention or specific environments, making them labour-intensive and error-prone. Existing deep learning-based self-calibration methods focus on small realignments and still rely on initial estimates, limiting their practicality. In this paper, we present PseudoCal, a novel self-calibration method that overcomes these limitations by leveraging the pseudo-LiDAR concept and working directly in the 3D space instead of limiting itself to the camera field of view. In typical autonomous vehicle and robotics contexts and conventions, PseudoCal is able to perform one-shot calibration quasi-independently of initial parameter estimates, addressing extreme cases that remain unsolved by existing approaches.

CV-10-标题: CC-SGG: Corner Case Scenario Generation using Learned Scene Graphs

链接: https://arxiv.org/abs/2309.09844
作者: George Drayson, Efimia Panagiotaki, Daniel Omeiza, Lars Kunze
备注: The first two authors contributed equally to this work

点击查看摘要

Abstract: Corner case scenarios are an essential tool for testing and validating the safety of autonomous vehicles (AVs). As these scenarios are often insufficiently present in naturalistic driving datasets, augmenting the data with synthetic corner cases greatly enhances the safe operation of AVs in unique situations. However, the generation of synthetic, yet realistic, corner cases poses a significant challenge. In this work, we introduce a novel approach based on Heterogeneous Graph Neural Networks (HGNNs) to transform regular driving scenarios into corner cases. To achieve this, we first generate concise representations of regular driving scenes as scene graphs, minimally manipulating their structure and properties. Our model then learns to perturb those graphs to generate corner cases using attention and triple embeddings. The input and perturbed graphs are then imported back into the simulation to generate corner case scenarios. Our model successfully learned to produce corner cases from input scene graphs, achieving 89.9% prediction accuracy on our testing dataset. We further validate the generated scenarios on baseline autonomous driving methods, demonstrating our model’s ability to effectively create critical situations for the baselines.

CV-11-标题: Grasp-Anything: Large-scale Grasp Dataset from Foundation Models

链接: https://arxiv.org/abs/2309.09818
作者: An Dinh Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, Anh Nguyen
备注: Project page: this https URL

点击查看摘要

Abstract: Foundation models such as ChatGPT have made significant strides in robotic tasks due to their universal representation of real-world domains. In this paper, we leverage foundation models to tackle grasp detection, a persistent challenge in robotics with broad industrial applications. Despite numerous grasp datasets, their object diversity remains limited compared to real-world figures. Fortunately, foundation models possess an extensive repository of real-world knowledge, including objects we encounter in our daily lives. As a consequence, a promising solution to the limited representation in previous grasp datasets is to harness the universal knowledge embedded in these foundation models. We present Grasp-Anything, a new large-scale grasp dataset synthesized from foundation models to implement this solution. Grasp-Anything excels in diversity and magnitude, boasting 1M samples with text descriptions and more than 3M objects, surpassing prior datasets. Empirically, we show that Grasp-Anything successfully facilitates zero-shot grasp detection on vision-based tasks and real-world robotic experiments. Our dataset and code are available at this https URL.

CV-12-标题: R2Gen GPT : Radiology Report Generation with Frozen LLM s

链接: https://arxiv.org/abs/2309.09812
作者: Zhanyu Wang, Lingqiao Liu, Lei Wang, Luping Zhou
备注: Submitted to meta-radiology

点击查看摘要

Abstract: Large Language Models (LLMs) have consistently showcased remarkable generalization capabilities when applied to various language tasks. Nonetheless, harnessing the full potential of LLMs for Radiology Report Generation (R2Gen) still presents a challenge, stemming from the inherent disparity in modality between LLMs and the R2Gen task. To bridge this gap effectively, we propose R2GenGPT, which is a novel solution that aligns visual features with the word embedding space of LLMs using an efficient visual alignment module. This innovative approach empowers the previously static LLM to seamlessly integrate and process image information, marking a step forward in optimizing R2Gen performance. R2GenGPT offers the following benefits. First, it attains state-of-the-art (SOTA) performance by training only the lightweight visual alignment module while freezing all the parameters of LLM. Second, it exhibits high training efficiency, as it requires the training of an exceptionally minimal number of parameters while achieving rapid convergence. By employing delta tuning, our model only trains 5M parameters (which constitute just 0.07% of the total parameter count) to achieve performance close to the SOTA levels. Our code is available at this https URL.

CV-13-标题: VisualProg Distiller: Learning to Fine-tune Non-differentiable Visual Programming Frameworks

链接: https://arxiv.org/abs/2309.09809
作者: Wentao Wan, Zeqing Wang, Nan Kang, Keze Wang, Zhiyu Shen, Liang Lin
备注: 12 pages, 10 figures, 6 tables

点击查看摘要

Abstract: As an interpretable and universal neuro-symbolic paradigm based on Large Language Models, visual programming (VisualProg) can execute compositional visual tasks without training, but its performance is markedly inferior compared to task-specific supervised learning models. To increase its practicality, the performance of VisualProg on specific tasks needs to be improved. However, the non-differentiability of VisualProg limits the possibility of employing the fine-tuning strategy on specific tasks to achieve further improvements. In our analysis, we discovered that significant performance issues in VisualProg’s execution originated from errors made by the sub-modules at corresponding visual sub-task steps. To address this, we propose ``VisualProg Distiller", a method of supplementing and distilling process knowledge to optimize the performance of each VisualProg sub-module on decoupled visual sub-tasks, thus enhancing the overall task performance. Specifically, we choose an end-to-end model that is well-performed on the given task as the teacher and further distill the knowledge of the teacher into the invoked visual sub-modules step-by-step based on the execution flow of the VisualProg-generated programs. In this way, our method is capable of facilitating the fine-tuning of the non-differentiable VisualProg frameworks effectively. Extensive and comprehensive experimental evaluations demonstrate that our method can achieve a substantial performance improvement of VisualProg, and outperforms all the compared state-of-the-art methods by large margins. Furthermore, to provide valuable process supervision for the GQA task, we construct a large-scale dataset by utilizing the distillation process of our method.

CV-14-标题: DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

链接: https://arxiv.org/abs/2309.09777
作者: Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiwen Lu
备注: Project Page: this https URL

点击查看摘要

Abstract: World models, especially in autonomous driving, are trending and drawing extensive attention due to their capacity for comprehending driving environments. The established world model holds immense potential for the generation of high-quality driving videos, and driving policies for safe maneuvering. However, a critical limitation in relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios. Regarding that modeling the world in intricate driving scenes entails an overwhelming search space, we propose harnessing the powerful diffusion model to construct a comprehensive representation of the complex environment. Furthermore, we introduce a two-stage training pipeline. In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states. The proposed DriveDreamer is the first world model established from real-world driving scenarios. We instantiate DriveDreamer on the challenging nuScenes benchmark, and extensive experiments verify that DriveDreamer empowers precise, controllable video generation that faithfully captures the structural constraints of real-world traffic scenarios. Additionally, DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.

CV-15-标题: Towards Self-Adaptive Pseudo-Label Filtering for Semi-Supervised Learning NEURIPS2021

链接: https://arxiv.org/abs/2309.09774
作者: Lei Zhu, Zhanghan Ke, Rynson Lau
备注: This paper was first submitted to NeurIPS 2021

点击查看摘要

Abstract: Recent semi-supervised learning (SSL) methods typically include a filtering strategy to improve the quality of pseudo labels. However, these filtering strategies are usually hand-crafted and do not change as the model is updated, resulting in a lot of correct pseudo labels being discarded and incorrect pseudo labels being selected during the training process. In this work, we observe that the distribution gap between the confidence values of correct and incorrect pseudo labels emerges at the very beginning of the training, which can be utilized to filter pseudo labels. Based on this observation, we propose a Self-Adaptive Pseudo-Label Filter (SPF), which automatically filters noise in pseudo labels in accordance with model evolvement by modeling the confidence distribution throughout the training process. Specifically, with an online mixture model, we weight each pseudo-labeled sample by the posterior of it being correct, which takes into consideration the confidence distribution at that time. Unlike previous handcrafted filters, our SPF evolves together with the deep neural network without manual tuning. Extensive experiments demonstrate that incorporating SPF into the existing SSL methods can help improve the performance of SSL, especially when the labeled data is extremely scarce.

CV-16-标题: Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays

链接: https://arxiv.org/abs/2309.09773
作者: Sivaramakrishnan Rajaraman, Ghada Zamzmi, Feng Yang, Zhaohui Liang, Zhiyun Xue, Sameer Antani
备注: 3 Tables, 11 Figures, 20 pages

点击查看摘要

Abstract: Deep learning (DL) has demonstrated its innate capacity to independently learn hierarchical features from complex and multi-dimensional data. A common understanding is that its performance scales up with the amount of training data. Another data attribute is the inherent variety. It follows, therefore, that semantic redundancy, which is the presence of similar or repetitive information, would tend to lower performance and limit generalizability to unseen data. In medical imaging data, semantic redundancy can occur due to the presence of multiple images that have highly similar presentations for the disease of interest. Further, the common use of augmentation methods to generate variety in DL training may be limiting performance when applied to semantically redundant data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. We demonstrate using the publicly available NIH chest X-ray dataset that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set, during both internal (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.

CV-17-标题: Localization-Guided Track: A Deep Association Multi-Object Tracking Framework Based on Localization Confidence of Detections

链接: https://arxiv.org/abs/2309.09765
作者: Ting Meng, Chunyun Fu, Mingguang Huang, Xiyang Wang, Jiawei He, Tao Huang, Wankai Shi
备注: 11 pages, 4 figures

点击查看摘要

Abstract: In currently available literature, no tracking-by-detection (TBD) paradigm-based tracking method has considered the localization confidence of detection boxes. In most TBD-based methods, it is considered that objects of low detection confidence are highly occluded and thus it is a normal practice to directly disregard such objects or to reduce their priority in matching. In addition, appearance similarity is not a factor to consider for matching these objects. However, in terms of the detection confidence fusing classification and localization, objects of low detection confidence may have inaccurate localization but clear appearance; similarly, objects of high detection confidence may have inaccurate localization or unclear appearance; yet these objects are not further classified. In view of these issues, we propose Localization-Guided Track (LG-Track). Firstly, localization confidence is applied in MOT for the first time, with appearance clarity and localization accuracy of detection boxes taken into account, and an effective deep association mechanism is designed; secondly, based on the classification confidence and localization confidence, a more appropriate cost matrix can be selected and used; finally, extensive experiments have been conducted on MOT17 and MOT20 datasets. The results show that our proposed method outperforms the compared state-of-art tracking methods. For the benefit of the community, our code has been made publicly at this https URL.

CV-18-标题: Application-driven Validation of Posteriors in Inverse Problems

链接: https://arxiv.org/abs/2309.09764
作者: Tim J. Adler, Jan-Hinrich Nölke, Annika Reinke, Minu Dietlinde Tizabi, Sebastian Gruber, Dasha Trofimova, Lynton Ardizzone, Paul F. Jaeger, Florian Buettner, Ullrich Köthe, Lena Maier-Hein
备注: Shared first authors: Tim J. Adler and Jan-Hinrich N"olke. 16 pages, 8 figures, 1 table

点击查看摘要

Abstract: Current deep learning-based solutions for image analysis tasks are commonly incapable of handling problems to which multiple different plausible solutions exist. In response, posterior-based methods such as conditional Diffusion Models and Invertible Neural Networks have emerged; however, their translation is hampered by a lack of research on adequate validation. In other words, the way progress is measured often does not reflect the needs of the driving practical application. Closing this gap in the literature, we present the first systematic framework for the application-driven validation of posterior-based methods in inverse problems. As a methodological novelty, it adopts key principles from the field of object detection validation, which has a long history of addressing the question of how to locate and match multiple object instances in an image. Treating modes as instances enables us to perform mode-centric validation, using well-interpretable metrics from the application perspective. We demonstrate the value of our framework through instantiations for a synthetic toy example and two medical vision use cases: pose estimation in surgery and imaging-based quantification of functional tissue parameters for diagnostics. Our framework offers key advantages over common approaches to posterior validation in all three examples and could thus revolutionize performance assessment in inverse problems.

CV-19-标题: Privileged to Predicted: Towards Sensorimotor Reinforcement Learning for Urban Driving

链接: https://arxiv.org/abs/2309.09756
作者: Ege Onat Özsüer, Barış Akgün, Fatma Güney
备注: 7 pages

点击查看摘要

Abstract: Reinforcement Learning (RL) has the potential to surpass human performance in driving without needing any expert supervision. Despite its promise, the state-of-the-art in sensorimotor self-driving is dominated by imitation learning methods due to the inherent shortcomings of RL algorithms. Nonetheless, RL agents are able to discover highly successful policies when provided with privileged ground truth representations of the environment. In this work, we investigate what separates privileged RL agents from sensorimotor agents for urban driving in order to bridge the gap between the two. We propose vision-based deep learning models to approximate the privileged representations from sensor data. In particular, we identify aspects of state representation that are crucial for the success of the RL agent such as desired route generation and stop zone prediction, and propose solutions to gradually develop less privileged RL agents. We also observe that bird’s-eye-view models trained on offline datasets do not generalize to online RL training due to distribution mismatch. Through rigorous evaluation on the CARLA simulation environment, we shed light on the significance of the state representations in RL for autonomous driving and point to unresolved challenges for future research.

CV-20-标题: Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels

链接: https://arxiv.org/abs/2309.09742
作者: David Tschirschwitz, Christian Benz, Morris Florek, Henrik Norderhus, Benno Stein, Volker Rodehorst
备注:

点击查看摘要

Abstract: The reliability of supervised machine learning systems depends on the accuracy and availability of ground truth labels. However, the process of human annotation, being prone to error, introduces the potential for noisy labels, which can impede the practicality of these systems. While training with noisy labels is a significant consideration, the reliability of test data is also crucial to ascertain the dependability of the results. A common approach to addressing this issue is repeated labeling, where multiple annotators label the same example, and their labels are combined to provide a better estimate of the true label. In this paper, we propose a novel localization algorithm that adapts well-established ground truth estimation methods for object detection and instance segmentation tasks. The key innovation of our method lies in its ability to transform combined localization and classification tasks into classification-only problems, thus enabling the application of techniques such as Expectation-Maximization (EM) or Majority Voting (MJV). Although our main focus is the aggregation of unique ground truth for test data, our algorithm also shows superior performance during training on the TexBiG dataset, surpassing both noisy label training and label aggregation using Weighted Boxes Fusion (WBF). Our experiments indicate that the benefits of repeated labels emerge under specific dataset and annotation configurations. The key factors appear to be (1) dataset complexity, the (2) annotator consistency, and (3) the given annotation budget constraints.

CV-21-标题: Improving Neural Indoor Surface Reconstruction with Mask-Guided Adaptive Consistency Constraints

链接: https://arxiv.org/abs/2309.09739
作者: Xinyi Yu, Liqin Lu, Jintao Rong, Guangkai Xu, Linlin Ou
备注:

点击查看摘要

Abstract: 3D scene reconstruction from 2D images has been a long-standing task. Instead of estimating per-frame depth maps and fusing them in 3D, recent research leverages the neural implicit surface as a unified representation for 3D reconstruction. Equipped with data-driven pre-trained geometric cues, these methods have demonstrated promising performance. However, inaccurate prior estimation, which is usually inevitable, can lead to suboptimal reconstruction quality, particularly in some geometrically complex regions. In this paper, we propose a two-stage training process, decouple view-dependent and view-independent colors, and leverage two novel consistency constraints to enhance detail reconstruction performance without requiring extra priors. Additionally, we introduce an essential mask scheme to adaptively influence the selection of supervision constraints, thereby improving performance in a self-supervised paradigm. Experiments on synthetic and real-world datasets show the capability of reducing the interference from prior estimation errors and achieving high-quality scene reconstruction with rich geometric details.

CV-22-标题: Moving Object Detection and Tracking with 4D Radar Point Cloud

链接: https://arxiv.org/abs/2309.09737
作者: Zhijun Pan, Fangqiang Ding, Hantao Zhong, Chris Xiaoxuan Lu
备注: 8 pages, 4 figures. Co-first authorship for Zhijun Pan, Fangqiang Ding and Hantao Zhong

点击查看摘要

Abstract: Mobile autonomy relies on the precise perception of dynamic environments. Robustly tracking moving objects in 3D world thus plays a pivotal role for applications like trajectory prediction, obstacle avoidance, and path planning. While most current methods utilize LiDARs or cameras for Multiple Object Tracking (MOT), the capabilities of 4D imaging radars remain largely unexplored. Recognizing the challenges posed by radar noise and point sparsity in 4D radar data, we introduce RaTrack, an innovative solution tailored for radar-based tracking. Bypassing the typical reliance on specific object types and 3D bounding boxes, our method focuses on motion segmentation and clustering, enriched by a motion estimation module. Evaluated on the View-of-Delft dataset, RaTrack showcases superior tracking precision of moving objects, largely surpassing the performance of the state of the art.

CV-23-标题: Scribble-based 3D Multiple Abdominal Organ Segmentation via Triple-branch Multi-dilated Network with Pixel- and Class-wise Consistency MICCAI2023

链接: https://arxiv.org/abs/2309.09730
作者: Meng Han, Xiangde Luo, Wenjun Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
备注: 10 pages, 3 figures, MICCAI2023

点击查看摘要

Abstract: Multi-organ segmentation in abdominal Computed Tomography (CT) images is of great importance for diagnosis of abdominal lesions and subsequent treatment planning. Though deep learning based methods have attained high performance, they rely heavily on large-scale pixel-level annotations that are time-consuming and labor-intensive to obtain. Due to its low dependency on annotation, weakly supervised segmentation has attracted great attention. However, there is still a large performance gap between current weakly-supervised methods and fully supervised learning, leaving room for exploration. In this work, we propose a novel 3D framework with two consistency constraints for scribble-supervised multiple abdominal organ segmentation from CT. Specifically, we employ a Triple-branch multi-Dilated network (TDNet) with one encoder and three decoders using different dilation rates to capture features from different receptive fields that are complementary to each other to generate high-quality soft pseudo labels. For more stable unsupervised learning, we use voxel-wise uncertainty to rectify the soft pseudo labels and then supervise the outputs of each decoder. To further regularize the network, class relationship information is exploited by encouraging the generated class affinity matrices to be consistent across different decoders under multi-view projection. Experiments on the public WORD dataset show that our method outperforms five existing scribble-supervised methods.

CV-24-标题: Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering ICCV2023

链接: https://arxiv.org/abs/2309.09724
作者: Chi Zhang, Wei Yin, Gang Yu, Zhibin Wang, Tao Chen, Bin Fu, Joey Tianyi Zhou, Chunhua Shen
备注: Accepted by ICCV2023

点击查看摘要

Abstract: In this study, we address the challenge of 3D scene structure recovery from monocular depth estimation. While traditional depth estimation methods leverage labeled datasets to directly predict absolute depth, recent advancements advocate for mix-dataset training, enhancing generalization across diverse scenes. However, such mixed dataset training yields depth predictions only up to an unknown scale and shift, hindering accurate 3D reconstructions. Existing solutions necessitate extra 3D datasets or geometry-complete depth annotations, constraints that limit their versatility. In this paper, we propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations. To produce realistic 3D structures, we render novel views of the reconstructed scenes and design loss functions to promote depth estimation consistency across different views. Comprehensive experiments underscore our framework’s superior generalization capabilities, surpassing existing state-of-the-art methods on several benchmark datasets without leveraging extra training information. Moreover, our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients using solely unlabeled images.

CV-25-标题: Traffic Scene Similarity: a Graph-based Contrastive Learning Approach

链接: https://arxiv.org/abs/2309.09720
作者: Maximilian Zipfl, Moritz Jarosch, J. Marius Zöllner
备注:

点击查看摘要

Abstract: Ensuring validation for highly automated driving poses significant obstacles to the widespread adoption of highly automated vehicles. Scenario-based testing offers a potential solution by reducing the homologation effort required for these systems. However, a crucial prerequisite, yet unresolved, is the definition and reduction of the test space to a finite number of scenarios. To tackle this challenge, we propose an extension to a contrastive learning approach utilizing graphs to construct a meaningful embedding space. Our approach demonstrates the continuous mapping of scenes using scene-specific features and the formation of thematically similar clusters based on the resulting embeddings. Based on the found clusters, similar scenes could be identified in the subsequent test process, which can lead to a reduction in redundant test runs.

CV-26-标题: CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

链接: https://arxiv.org/abs/2309.09709
作者: Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xun
备注:

点击查看摘要

Abstract: Audio-visual video segmentation~(AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block, which, when stacked, enables capturing audio-visual fine-grained combinatorial-dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring the decoded mask adheres to the sounds. Experimental results confirm our approach’s effectiveness, with our framework achieving a new SOTA performance on all three datasets using two backbones. The code is available at \urlthis https URL

CV-27-标题: Ugly Ducklings or Swans: A Tiered Quadruplet Network with Patient-Specific Mining for Improved Skin Lesion Classification

链接: https://arxiv.org/abs/2309.09689
作者: Nathasha Naranpanawa, H. Peter Soyer, Adam Mothershaw, Gayan K. Kulatilleke, Zongyuan Ge, Brigid Betz-Stablein, Shekhar S. Chandra
备注: 12 pages, 6 figures

点击查看摘要

Abstract: An ugly duckling is an obviously different skin lesion from surrounding lesions of an individual, and the ugly duckling sign is a criterion used to aid in the diagnosis of cutaneous melanoma by differentiating between highly suspicious and benign lesions. However, the appearance of pigmented lesions, can change drastically from one patient to another, resulting in difficulties in visual separation of ugly ducklings. Hence, we propose DMT-Quadruplet - a deep metric learning network to learn lesion features at two tiers - patient-level and lesion-level. We introduce a patient-specific quadruplet mining approach together with a tiered quadruplet network, to drive the network to learn more contextual information both globally and locally between the two tiers. We further incorporate a dynamic margin within the patient-specific mining to allow more useful quadruplets to be mined within individuals. Comprehensive experiments show that our proposed method outperforms traditional classifiers, achieving 54% higher sensitivity than a baseline ResNet18 CNN and 37% higher than a naive triplet network in classifying ugly duckling lesions. Visualisation of the data manifold in the metric space further illustrates that DMT-Quadruplet is capable of classifying ugly duckling lesions in both patient-specific and patient-agnostic manner successfully.

CV-28-标题: Conditioning Latent-Space Clusters for Real-World Anomaly Classification

链接: https://arxiv.org/abs/2309.09676
作者: Daniel Bogdoll, Svetlana Pavlitska, Simon Klaus, J. Marius Zöllner
备注: Daniel Bogdoll, Svetlana Pavlitska, and Simon Klaus contributed equally. Accepted for publication at SSCI 2023

点击查看摘要

Abstract: Anomalies in the domain of autonomous driving are a major hindrance to the large-scale deployment of autonomous vehicles. In this work, we focus on high-resolution camera data from urban scenes that include anomalies of various types and sizes. Based on a Variational Autoencoder, we condition its latent space to classify samples as either normal data or anomalies. In order to emphasize especially small anomalies, we perform experiments where we provide the VAE with a discrepancy map as an additional input, evaluating its impact on the detection performance. Our method separates normal data and anomalies into isolated clusters while still reconstructing high-quality images, leading to meaningful latent representations.

CV-29-标题: DGM-DR: Domain Generalization with Mutual Information Regularized Diabetic Retinopathy Classification

链接: https://arxiv.org/abs/2309.09670
作者: Aleksandr Matsun, Dana O. Mohamed, Sharon Chokuwa, Muhammad Ridzuan, Mohammad Yaqub
备注: DART: 5th Workshop on Domain Adaptation and Representation Transfer 2023 camera ready

点击查看摘要

Abstract: The domain shift between training and testing data presents a significant challenge for training generalizable deep learning models. As a consequence, the performance of models trained with the independent and identically distributed (i.i.d) assumption deteriorates when deployed in the real world. This problem is exacerbated in the medical imaging context due to variations in data acquisition across clinical centers, medical apparatus, and patients. Domain generalization (DG) aims to address this problem by learning a model that generalizes well to any unseen target domain. Many domain generalization techniques were unsuccessful in learning domain-invariant representations due to the large domain shift. Furthermore, multiple tasks in medical imaging are not yet extensively studied in existing literature when it comes to DG point of view. In this paper, we introduce a DG method that re-establishes the model objective function as a maximization of mutual information with a large pretrained model to the medical imaging field. We re-visit the problem of DG in Diabetic Retinopathy (DR) classification to establish a clear benchmark with a correct model selection strategy and to achieve robust domain-invariant representation for an improved generalization. Moreover, we conduct extensive experiments on public datasets to show that our proposed method consistently outperforms the previous state-of-the-art by a margin of 5.25% in average accuracy and a lower standard deviation. Source code available at this https URL

CV-30-标题: DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

链接: https://arxiv.org/abs/2309.09668
作者: Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, Qibin Hou
备注:

点击查看摘要

Abstract: We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) Unlike previous works that aim to encode RGB features,DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design; 2) We pre-train the backbone using image-depth pairs from ImageNet-1K, and thus the DFormer is endowed with the capacity to encode RGB-D representations. It avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pre-trained backbones, which widely lies in existing methods but has not been resolved. We fine-tune the pre-trained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D segmentation datasets and five RGB-D saliency datasets. Our code is available at: this https URL.

CV-31-标题: Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

链接: https://arxiv.org/abs/2309.09667
作者: Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang
备注:

点击查看摘要

Abstract: Detecting and grounding multi-modal media manipulation (DGM^4) has become increasingly crucial due to the widespread dissemination of face forgery and text misinformation. In this paper, we present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM^4 problem. Unlike previous state-of-the-art methods that solely focus on the image (RGB) domain to describe visual forgery features, we additionally introduce the frequency domain as a complementary viewpoint. By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts. Then, our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands. Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations. Finally, based on visual and textual forgery features, we propose a unified decoder that comprises two symmetric cross-modal interaction modules responsible for gathering modality-specific forgery information, along with a fusing interaction module for aggregation of both modalities. The proposed unified decoder formulates our UFAFormer as a unified framework, ultimately simplifying the overall architecture and facilitating the optimization process. Experimental results on the DGM^4 dataset, containing several perturbations, demonstrate the superior performance of our framework compared to previous methods, setting a new benchmark in the field.

CV-32-标题: HiT: Building Mapping with Hierarchical Transformer s

链接: https://arxiv.org/abs/2309.09643
作者: Mingming Zhang, Qingjie Liu, Yunhong Wang
备注:

点击查看摘要

Abstract: Deep learning-based methods have been extensively explored for automatic building mapping from high-resolution remote sensing images over recent years. While most building mapping models produce vector polygons of buildings for geographic and mapping systems, dominant methods typically decompose polygonal building extraction in some sub-problems, including segmentation, polygonization, and regularization, leading to complex inference procedures, low accuracy, and poor generalization. In this paper, we propose a simple and novel building mapping method with Hierarchical Transformers, called HiT, improving polygonal building mapping quality from high-resolution remote sensing images. HiT builds on a two-stage detection architecture by adding a polygon head parallel to classification and bounding box regression heads. HiT simultaneously outputs building bounding boxes and vector polygons, which is fully end-to-end trainable. The polygon head formulates a building polygon as serialized vertices with the bidirectional characteristic, a simple and elegant polygon representation avoiding the start or end vertex hypothesis. Under this new perspective, the polygon head adopts a transformer encoder-decoder architecture to predict serialized vertices supervised by the designed bidirectional polygon loss. Furthermore, a hierarchical attention mechanism combined with convolution operation is introduced in the encoder of the polygon head, providing more geometric structures of building polygons at vertex and edge levels. Comprehensive experiments on two benchmarks (the CrowdAI and Inria datasets) demonstrate that our method achieves a new state-of-the-art in terms of instance segmentation and polygonal metrics compared with state-of-the-art methods. Moreover, qualitative results verify the superiority and effectiveness of our model under complex scenes.

CV-33-标题: Designing a Hybrid Neural System to Learn Real-world Crack Segmentation from Fractal-based Simulation

链接: https://arxiv.org/abs/2309.09637
作者: Achref Jaziri, Martin Mundt, Andres Fernandez Rodriguez, Visvanathan Ramesh
备注:

点击查看摘要

Abstract: Identification of cracks is essential to assess the structural integrity of concrete infrastructure. However, robust crack segmentation remains a challenging task for computer vision systems due to the diverse appearance of concrete surfaces, variable lighting and weather conditions, and the overlapping of different defects. In particular recent data-driven methods struggle with the limited availability of data, the fine-grained and time-consuming nature of crack annotation, and face subsequent difficulty in generalizing to out-of-distribution samples. In this work, we move past these challenges in a two-fold way. We introduce a high-fidelity crack graphics simulator based on fractals and a corresponding fully-annotated crack dataset. We then complement the latter with a system that learns generalizable representations from simulation, by leveraging both a pointwise mutual information estimate along with adaptive instance normalization as inductive biases. Finally, we empirically highlight how different design choices are symbiotic in bridging the simulation to real gap, and ultimately demonstrate that our introduced system can effectively handle real-world crack segmentation.

CV-34-标题: Holistic Geometric Feature Learning for Structured Reconstruction

链接: https://arxiv.org/abs/2309.09622
作者: Ziqiong Lu, Linxi Huan, Qiyuan Ma, Xianwei Zheng
备注:

点击查看摘要

Abstract: The inference of topological principles is a key problem in structured reconstruction. We observe that wrongly predicted topological relationships are often incurred by the lack of holistic geometry clues in low-level features. Inspired by the fact that massive signals can be compactly described with frequency analysis, we experimentally explore the efficiency and tendency of learning structure geometry in the frequency domain. Accordingly, we propose a frequency-domain feature learning strategy (F-Learn) to fuse scattered geometric fragments holistically for topology-intact structure reasoning. Benefiting from the parsimonious design, the F-Learn strategy can be easily deployed into a deep reconstructor with a lightweight model modification. Experiments demonstrate that the F-Learn strategy can effectively introduce structure awareness into geometric primitive detection and topology inference, bringing significant performance improvement to final structured reconstruction. Code and pre-trained models are available at this https URL.

CV-35-标题: Gradpaint: Gradient-Guided Inpainting with Diffusion Models

链接: https://arxiv.org/abs/2309.09614
作者: Asya Grechka, Guillaume Couairon, Matthieu Cord
备注:

点击查看摘要

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved remarkable results in conditional and unconditional image generation. The pre-trained models can be adapted without further training to different downstream tasks, by guiding their iterative denoising process at inference time to satisfy additional constraints. For the specific task of image inpainting, the current guiding mechanism relies on copying-and-pasting the known regions from the input image at each denoising step. However, diffusion models are strongly conditioned by the initial random noise, and therefore struggle to harmonize predictions inside the inpainting mask with the real parts of the input image, often producing results with unnatural artifacts. Our method, dubbed GradPaint, steers the generation towards a globally coherent image. At each step in the denoising process, we leverage the model’s “denoised image estimation” by calculating a custom loss measuring its coherence with the masked input image. Our guiding mechanism uses the gradient obtained from backpropagating this loss through the diffusion model itself. GradPaint generalizes well to diffusion models trained on various datasets, improving upon current state-of-the-art supervised and unsupervised methods.

CV-36-标题: Collaborative Three-Stream Transformer s for Video Captioning

链接: https://arxiv.org/abs/2309.09611
作者: Hao Wang, Libo Zhang, Heng Fan, Tiejian Luo
备注: Accepted by CVIU

点击查看摘要

Abstract: As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, then the three branches of transformers can support each other to exploit the most discriminative semantic information of different granularities for accurate predictions of captions. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods.

CV-37-标题: MEDL-U: Uncertainty-aware 3D Automatic Annotator based on Evidential Deep Learning

链接: https://arxiv.org/abs/2309.09599
作者: Helbert Paat, Qing Lian, Weilong Yao, Tong Zhang
备注: 6 pages + 1 page reference

点击查看摘要

Abstract: Advancements in deep learning-based 3D object detection necessitate the availability of large-scale datasets. However, this requirement introduces the challenge of manual annotation, which is often both burdensome and time-consuming. To tackle this issue, the literature has seen the emergence of several weakly supervised frameworks for 3D object detection which can automatically generate pseudo labels for unlabeled data. Nevertheless, these generated pseudo labels contain noise and are not as accurate as those labeled by humans. In this paper, we present the first approach that addresses the inherent ambiguities present in pseudo labels by introducing an Evidential Deep Learning (EDL) based uncertainty estimation framework. Specifically, we propose MEDL-U, an EDL framework based on MTrans, which not only generates pseudo labels but also quantifies the associated uncertainties. However, applying EDL to 3D object detection presents three primary challenges: (1) relatively lower pseudolabel quality in comparison to other autolabelers; (2) excessively high evidential uncertainty estimates; and (3) lack of clear interpretability and effective utilization of uncertainties for downstream tasks. We tackle these issues through the introduction of an uncertainty-aware IoU-based loss, an evidence-aware multi-task loss function, and the implementation of a post-processing stage for uncertainty refinement. Our experimental results demonstrate that probabilistic detectors trained using the outputs of MEDL-U surpass deterministic detectors trained using outputs from previous 3D annotators on the KITTI val set for all difficulty levels. Moreover, MEDL-U achieves state-of-the-art results on the KITTI official test set compared to existing 3D automatic annotators.

CV-38-标题: Mutual Information-calibrated Conformal Feature Fusion for Uncertainty-Aware Multimodal 3D Object Detection at the Edge

链接: https://arxiv.org/abs/2309.09593
作者: Alex C. Stutts, Danilo Erricolo, Sathya Ravi, Theja Tulabandhula, Amit Ranjan Trivedi
备注:

点击查看摘要

Abstract: In the expanding landscape of AI-enabled robotics, robust quantification of predictive uncertainties is of great importance. Three-dimensional (3D) object detection, a critical robotics operation, has seen significant advancements; however, the majority of current works focus only on accuracy and ignore uncertainty quantification. Addressing this gap, our novel study integrates the principles of conformal inference (CI) with information theoretic measures to perform lightweight, Monte Carlo-free uncertainty estimation within a multimodal framework. Through a multivariate Gaussian product of the latent variables in a Variational Autoencoder (VAE), features from RGB camera and LiDAR sensor data are fused to improve the prediction accuracy. Normalized mutual information (NMI) is leveraged as a modulator for calibrating uncertainty bounds derived from CI based on a weighted loss function. Our simulation results show an inverse correlation between inherent predictive uncertainty and NMI throughout the model’s training. The framework demonstrates comparable or better performance in KITTI 3D object detection benchmarks to similar methods that are not uncertainty-aware, making it suitable for real-time edge robotics.

CV-39-标题: Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based Action Recognition

链接: https://arxiv.org/abs/2309.09592
作者: Ming-Zhe Li, Zhen Jia, Zhang Zhang, Zhanyu Ma, Liang Wang
备注: Accepted by ICIG 2023 (The code has been released at this https URL)

点击查看摘要

Abstract: Generalized zero-shot skeleton-based action recognition (GZSSAR) is a new challenging problem in computer vision community, which requires models to recognize actions without any training samples. Previous studies only utilize the action labels of verb phrases as the semantic prototypes for learning the mapping from skeleton-based actions to a shared semantic space. However, the limited semantic information of action labels restricts the generalization ability of skeleton features for recognizing unseen actions. In order to solve this dilemma, we propose a multi-semantic fusion (MSF) model for improving the performance of GZSSAR, where two kinds of class-level textual descriptions (i.e., action descriptions and motion descriptions), are collected as auxiliary semantic information to enhance the learning efficacy of generalizable skeleton features. Specially, a pre-trained language encoder takes the action descriptions, motion descriptions and original class labels as inputs to obtain rich semantic features for each action class, while a skeleton encoder is implemented to extract skeleton features. Then, a variational autoencoder (VAE) based generative module is performed to learn a cross-modal alignment between skeleton and semantic features. Finally, a classification module is built to recognize the action categories of input samples, where a seen-unseen classification gate is adopted to predict whether the sample comes from seen action classes or not in GZSSAR. The superior performance in comparisons with previous models validates the effectiveness of the proposed MSF model on GZSSAR.

CV-40-标题: An Autonomous Vision-Based Algorithm for Interplanetary Navigation

链接: https://arxiv.org/abs/2309.09590
作者: Eleonora Andreis, Paolo Panicucci, Francesco Topputo
备注:

点击查看摘要

Abstract: The surge of deep-space probes makes it unsustainable to navigate them with standard radiometric tracking. Self-driving interplanetary satellites represent a solution to this problem. In this work, a full vision-based navigation algorithm is built by combining an orbit determination method with an image processing pipeline suitable for interplanetary transfers of autonomous platforms. To increase the computational efficiency of the algorithm, a non-dimensional extended Kalman filter is selected as state estimator, fed by the positions of the planets extracted from deep-space images. An enhancement of the estimation accuracy is performed by applying an optimal strategy to select the best pair of planets to track. Moreover, a novel analytical measurement model for deep-space navigation is developed providing a first-order approximation of the light-aberration and light-time effects. Algorithm performance is tested on a high-fidelity, Earth–Mars interplanetary transfer, showing the algorithm applicability for deep-space navigation.

CV-41-标题: Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

链接: https://arxiv.org/abs/2309.09571
作者: Ziming Wang, Shumin Han, Xiaodi Wang, Jing Hao, Xianbin Cao, Baochang Zhang
备注:

点击查看摘要

Abstract: Small CNN-based models usually require transferring knowledge from a large model before they are deployed in computationally resource-limited edge devices. Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. The reason is mainly due to the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modeling. Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models, which can be pre-trained using advanced generative methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, in the Imagenet 1K dataset, H-GKD improves the accuracy of Resnet50 (sparse) from 76.98% to 80.01%.

CV-42-标题: RIDE: Self-Supervised Learning of Rotation-Equivariant Keypoint Detection and Invariant Description for Endoscopy

链接: https://arxiv.org/abs/2309.09563
作者: Mert Asim Karaoglu, Viktoria Markova, Nassir Navab, Benjamin Busam, Alexander Ladikos
备注: 8 pages, 5 figures

点击查看摘要

Abstract: Unlike in natural images, in endoscopy there is no clear notion of an up-right camera orientation. Endoscopic videos therefore often contain large rotational motions, which require keypoint detection and description algorithms to be robust to these conditions. While most classical methods achieve rotation-equivariant detection and invariant description by design, many learning-based approaches learn to be robust only up to a certain degree. At the same time learning-based methods under moderate rotations often outperform classical approaches. In order to address this shortcoming, in this paper we propose RIDE, a learning-based method for rotation-equivariant detection and invariant description. Following recent advancements in group-equivariant learning, RIDE models rotation-equivariance implicitly within its architecture. Trained in a self-supervised manner on a large curation of endoscopic images, RIDE requires no manual labeling of training data. We test RIDE in the context of surgical tissue tracking on the SuPeR dataset as well as in the context of relative pose estimation on a repurposed version of the SCARED dataset. In addition we perform explicit studies showing its robustness to large rotations. Our comparison against recent learning-based and classical approaches shows that RIDE sets a new state-of-the-art performance on matching and relative pose estimation tasks and scores competitively on surgical tissue tracking.

CV-43-标题: Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis ICASSP2024

链接: https://arxiv.org/abs/2309.09553
作者: Tianyi Song (1), Jiuxin Cao (1), Kun Wang (1), Bo Liu (1), Xiaofeng Zhang (2) ((1) Southeast University (2) Shanghai Jiao Tong University)
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: The excellent text-to-image synthesis capability of diffusion models has driven progress in synthesizing coherent visual stories. The current state-of-the-art method combines the features of historical captions, historical frames, and the current captions as conditions for generating the current frame. However, this method treats each historical frame and caption as the same contribution. It connects them in order with equal weights, ignoring that not all historical conditions are associated with the generation of the current frame. To address this issue, we propose Causal-Story. This model incorporates a local causal attention mechanism that considers the causal relationship between previous captions, frames, and current captions. By assigning weights based on this relationship, Causal-Story generates the current frame, thereby improving the global consistency of story generation. We evaluated our model on the PororoSV and FlintstonesSV datasets and obtained state-of-the-art FID scores, and the generated frames also demonstrate better storytelling in visuals. The source code of Causal-Story can be obtained from this https URL.

CV-44-标题: Selective Volume Mixup for Video Action Recognition

链接: https://arxiv.org/abs/2309.09534
作者: Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Xiangnan He, Tao Mei
备注:

点击查看摘要

Abstract: The recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from the overfitting effect on small-scale datasets with a limited number of training videos. A common solution is to exploit the existing image augmentation strategies for each frame individually including Mixup, Cutmix, and RandAugment, which are not particularly optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes the volumes up to achieve a new training video. Technically, we propose two new modules, i.e., a spatial selective module to select the local patches for each spatial position, and a temporal selective module to mix the entire frames for each timestamp and maintain the spatial pattern. At each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boot the performances of both CNN-based and transformer-based models.

CV-45-标题: Decompose Semantic Shifts for Composed Image Retrieval

链接: https://arxiv.org/abs/2309.09531
作者: Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
备注:

点击查看摘要

Abstract: Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user’s shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. Codes will be publicly available.

CV-46-标题: DFIL: Deepfake Incremental Learning by Exploiting Domain-invariant Forgery Clues

链接: https://arxiv.org/abs/2309.09526
作者: Kun Pan, Yin Yifang, Yao Wei, Feng Lin, Zhongjie Ba, Zhenguang Liu, ZhiBo Wang, Lorenzo Cavallaro, Kui Ren
备注: Accepted by ACMMM2023

点击查看摘要

Abstract: The malicious use and widespread dissemination of deepfake pose a significant crisis of trust. Current deepfake detection models can generally recognize forgery images by training on a large dataset. However, the accuracy of detection models degrades significantly on images generated by new deepfake methods due to the difference in data distribution. To tackle this issue, we present a novel incremental learning framework that improves the generalization of deepfake detection models by continual learning from a small number of new samples. To cope with different data distributions, we propose to learn a domain-invariant representation based on supervised contrastive learning, preventing overfit to the insufficient new data. To mitigate catastrophic forgetting, we regularize our model in both feature-level and label-level based on a multi-perspective knowledge distillation approach. Finally, we propose to select both central and hard representative samples to update the replay set, which is beneficial for both domain-invariant representation learning and rehearsal-based knowledge preserving. We conduct extensive experiments on four benchmark datasets, obtaining the new state-of-the-art average forgetting rate of 7.01 and average accuracy of 85.49 on FF++, DFDC-P, DFD, and CDF2. Our code is released at this https URL.

CV-47-标题: NOMAD: A Natural Occluded Multi-scale Aerial Dataset for Emergency Response Scenarios

链接: https://arxiv.org/abs/2309.09518
作者: Arturo Miguel Russell Bernal, Walter Scheirer, Jane Cleland-Huang
备注:

点击查看摘要

Abstract: With the increasing reliance on small Unmanned Aerial Systems (sUAS) for Emergency Response Scenarios, such as Search and Rescue, the integration of computer vision capabilities has become a key factor in mission success. Nevertheless, computer vision performance for detecting humans severely degrades when shifting from ground to aerial views. Several aerial datasets have been created to mitigate this problem, however, none of them has specifically addressed the issue of occlusion, a critical component in Emergency Response Scenarios. Natural Occluded Multi-scale Aerial Dataset (NOMAD) presents a benchmark for human detection under occluded aerial views, with five different aerial distances and rich imagery variance. NOMAD is composed of 100 different Actors, all performing sequences of walking, laying and hiding. It includes 42,825 frames, extracted from 5.4k resolution videos, and manually annotated with a bounding box and a label describing 10 different visibility levels, categorized according to the percentage of the human body visible inside the bounding box. This allows computer vision models to be evaluated on their detection performance across different ranges of occlusion. NOMAD is designed to improve the effectiveness of aerial search and rescue and to enhance collaboration between sUAS and humans, by providing a new benchmark dataset for human detection under occluded aerial views.

CV-48-标题: Sparse and Privacy-enhanced Representation for Human Pose Estimation BMVC’23

链接: https://arxiv.org/abs/2309.09515
作者: Ting-Ying Lin, Lin-Yung Hsieh, Fu-En Wang, Wen-Shen Wuen, Min Sun
备注: BMVC’23; project page: this https URL

点击查看摘要

Abstract: We propose a sparse and privacy-enhanced representation for Human Pose Estimation (HPE). Given a perspective camera, we use a proprietary motion vector sensor(MVS) to extract an edge image and a two-directional motion vector image at each time frame. Both edge and motion vector images are sparse and contain much less information (i.e., enhancing human privacy). We advocate that edge information is essential for HPE, and motion vectors complement edge information during fast movements. We propose a fusion network leveraging recent advances in sparse convolution used typically for 3D voxels to efficiently process our proposed sparse representation, which achieves about 13x speed-up and 96% reduction in FLOPs. We collect an in-house edge and motion vector dataset with 16 types of actions by 40 users using the proprietary MVS. Our method outperforms individual modalities using only edge or motion vector images. Finally, we validate the privacy-enhanced quality of our sparse representation through face recognition on CelebA (a large face dataset) and a user study on our in-house dataset.

CV-49-标题: PanoMixSwap Panorama Mixing via Structural Swapping for Indoor Scene Understanding BMVC’23

链接: https://arxiv.org/abs/2309.09514
作者: Yu-Cheng Hsieh, Cheng Sun, Suraj Dengale, Min Sun
备注: BMVC’23

点击查看摘要

Abstract: The volume and diversity of training data are critical for modern deep learningbased methods. Compared to the massive amount of labeled perspective images, 360 panoramic images fall short in both volume and diversity. In this paper, we propose PanoMixSwap, a novel data augmentation technique specifically designed for indoor panoramic images. PanoMixSwap explicitly mixes various background styles, foreground furniture, and room layouts from the existing indoor panorama datasets and generates a diverse set of new panoramic images to enrich the datasets. We first decompose each panoramic image into its constituent parts: background style, foreground furniture, and room layout. Then, we generate an augmented image by mixing these three parts from three different images, such as the foreground furniture from one image, the background style from another image, and the room structure from the third image. Our method yields high diversity since there is a cubical increase in image combinations. We also evaluate the effectiveness of PanoMixSwap on two indoor scene understanding tasks: semantic segmentation and layout estimation. Our experiments demonstrate that state-of-the-art methods trained with PanoMixSwap outperform their original setting on both tasks consistently.

CV-50-标题: Learning Parallax for Stereo Event-based Motion Deblurring

链接: https://arxiv.org/abs/2309.09513
作者: Mingyuan Lin, Chi Zhang, Chu He, Lei Yu
备注:

点击查看摘要

Abstract: Due to the extremely low latency, events have been recently exploited to supplement lost information for motion deblurring. Existing approaches largely rely on the perfect pixel-wise alignment between intensity images and events, which is not always fulfilled in the real world. To tackle this problem, we propose a novel coarse-to-fine framework, named NETwork of Event-based motion Deblurring with STereo event and intensity cameras (St-EDNet), to recover high-quality images directly from the misaligned inputs, consisting of a single blurry image and the concurrent event streams. Specifically, the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths. Then, a dual-feature embedding architecture is proposed to gradually build the fine bidirectional association of the coarsely aligned data and reconstruct the sequence of the latent sharp images. Furthermore, we build a new dataset with STereo Event and Intensity Cameras (StEIC), containing real-world events, intensity images, and dense disparity maps. Experiments on real-world datasets demonstrate the superiority of the proposed network over state-of-the-art methods.

CV-51-标题: RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

链接: https://arxiv.org/abs/2309.09502
作者: Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Li Liu, Shanghang Zhang
备注:

点击查看摘要

Abstract: 3D occupancy prediction holds significant promise in the fields of robot perception and autonomous driving, which quantifies 3D scenes into grid cells with semantic labels. Recent works mainly utilize complete occupancy labels in 3D voxel space for supervision. However, the expensive annotation process and sometimes ambiguous labels have severely constrained the usability and scalability of 3D occupancy models. To address this, we present RenderOcc, a novel paradigm for training 3D occupancy models only using 2D labels. Specifically, we extract a NeRF-style 3D volume representation from multi-view images, and employ volume rendering techniques to establish 2D renderings, thus enabling direct 3D supervision from 2D semantics and depth labels. Additionally, we introduce an Auxiliary Ray method to tackle the issue of sparse viewpoints in autonomous driving scenarios, which leverages sequential frames to construct comprehensive 2D rendering for each object. To our best knowledge, RenderOcc is the first attempt to train multi-view 3D occupancy models only using 2D labels, reducing the dependence on costly 3D occupancy annotations. Extensive experiments demonstrate that RenderOcc achieves comparable performance to models fully supervised with 3D labels, underscoring the significance of this approach in real-world applications.

CV-52-标题: Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation IJCAI2023

链接: https://arxiv.org/abs/2309.09501
作者: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu
备注: Accepted by IJCAI 2023

点击查看摘要

Abstract: Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video. To distinguish the sounding objects from silent ones, both audio-visual semantic correspondence and temporal interaction are required. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information and associate each of them to particular sounding objects. Explicit object-level semantic correspondence between audio and visual modalities is established by gathering object information from visual features with predefined audio queries. Besides, an Audio-Bridged Temporal Interaction module is proposed to exchange sounding object-relevant information among multiple frames with the bridge of audio features. Extensive experiments are conducted on two AVS benchmarks to show that our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting.

CV-53-标题: CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

链接: https://arxiv.org/abs/2309.09496
作者: Yating liu, Yaowei Li, Zimo Liu, Wenming Yang, Yaowei Wang, Qingmin Liao
备注:

点击查看摘要

Abstract: Text-based Person Retrieval aims to retrieve the target person images given a textual query. The primary challenge lies in bridging the substantial gap between vision and language modalities, especially when dealing with limited large-scale datasets. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer(CSKT) approach for TBPR. Specifically, to explore the CLIP’s knowledge on input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed by text-to-image and image-to-text bidirectional prompts and coupling projections. Secondly, Dual Adapters Transferring (DAT) is designed to transfer knowledge on output side of Multi-Head Self-Attention (MHSA) in vision and language. This synergistic two-way collaborative mechanism promotes the early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms the state-of-the-art approaches across three benchmark datasets when the training parameters merely account for 7.4% of the entire model, demonstrating its remarkable efficiency, effectiveness and generalization.

CV-54-标题: Target-aware Bi- Transformer for Few-shot Segmentation

链接: https://arxiv.org/abs/2309.09492
作者: Xianglin Wang, Xiaoliu Luo, Taiping Zhang
备注:

点击查看摘要

Abstract: Traditional semantic segmentation tasks require a large number of labels and are difficult to identify unlearned categories. Few-shot semantic segmentation (FSS) aims to use limited labeled support images to identify the segmentation of new classes of objects, which is very practical in the real world. Previous researches were primarily based on prototypes or correlations. Due to colors, textures, and styles are similar in the same image, we argue that the query image can be regarded as its own support image. In this paper, we proposed the Target-aware Bi-Transformer Network (TBTNet) to equivalent treat of support images and query image. A vigorous Target-aware Transformer Layer (TTL) also be designed to distill correlations and force the model to focus on foreground information. It treats the hypercorrelation as a feature, resulting a significant reduction in the number of feature channels. Benefit from this characteristic, our model is the lightest up to now with only 0.4M learnable parameters. Futhermore, TBTNet converges in only 10% to 25% of the training epochs compared to traditional methods. The excellent performance on standard FSS benchmarks of PASCAL-5i and COCO-20i proves the efficiency of our method. Extensive ablation studies were also carried out to evaluate the effectiveness of Bi-Transformer architecture and TTL.

CV-55-标题: Distributional Estimation of Data Uncertainty for Surveillance Face Anti-spoofing

链接: https://arxiv.org/abs/2309.09485
作者: Mouxiao Huang
备注:

点击查看摘要

Abstract: Face recognition systems have become increasingly vulnerable to security threats in recent years, prompting the use of Face Anti-spoofing (FAS) to protect against various types of attacks, such as phone unlocking, face payment, and self-service security inspection. While FAS has demonstrated its effectiveness in traditional settings, securing it in long-distance surveillance scenarios presents a significant challenge. These scenarios often feature low-quality face images, necessitating the modeling of data uncertainty to improve stability under extreme conditions. To address this issue, this work proposes Distributional Estimation (DisE), a method that converts traditional FAS point estimation to distributional estimation by modeling data uncertainty during training, including feature (mean) and uncertainty (variance). By adjusting the learning strength of clean and noisy samples for stability and accuracy, the learned uncertainty enhances DisE’s performance. The method is evaluated on SuHiFiMask [1], a large-scale and challenging FAS dataset in surveillance scenarios. Results demonstrate that DisE achieves comparable performance on both ACER and AUC metrics.

CV-56-标题: Spatio-temporal Co-attention Fusion Network for Video Splicing Localization

链接: https://arxiv.org/abs/2309.09482
作者: Man Lin, Gang Cao, Zijie Lou
备注:

点击查看摘要

Abstract: Digital video splicing has become easy and ubiquitous. Malicious users copy some regions of a video and paste them to another video for creating realistic forgeries. It is significant to blindly detect such forgery regions in videos. In this paper, a spatio-temporal co-attention fusion network (SCFNet) is proposed for video splicing localization. Specifically, a three-stream network is used as an encoder to capture manipulation traces across multiple frames. The deep interaction and fusion of spatio-temporal forensic features are achieved by the novel parallel and cross co-attention fusion modules. A lightweight multilayer perceptron (MLP) decoder is adopted to yield a pixel-level tampering localization map. A new large-scale video splicing dataset is created for training the SCFNet. Extensive tests on benchmark datasets show that the localization and generalization performances of our SCFNet outperform the state-of-the-art. Code and datasets will be available at this https URL.

CV-57-标题: Stealthy Physical Masked Face Recognition Attack via Adversarial Style Optimization

链接: https://arxiv.org/abs/2309.09480
作者: Huihui Gong, Minjing Dong, Siqi Ma, Seyit Camtepe, Surya Nepal, Chang Xu
备注: 11 pages, 7 figures

点击查看摘要

Abstract: Deep neural networks (DNNs) have achieved state-of-the-art performance on face recognition (FR) tasks in the last decade. In real scenarios, the deployment of DNNs requires taking various face accessories into consideration, like glasses, hats, and masks. In the COVID-19 pandemic era, wearing face masks is one of the most effective ways to defend against the novel coronavirus. However, DNNs are known to be vulnerable to adversarial examples with a small but elaborated perturbation. Thus, a facial mask with adversarial perturbations may pose a great threat to the widely used deep learning-based FR models. In this paper, we consider a challenging adversarial setting: targeted attack against FR models. We propose a new stealthy physical masked FR attack via adversarial style optimization. Specifically, we train an adversarial style mask generator that hides adversarial perturbations inside style masks. Moreover, to ameliorate the phenomenon of sub-optimization with one fixed style, we propose to discover the optimal style given a target through style optimization in a continuous relaxation manner. We simultaneously optimize the generator and the style selection for generating strong and stealthy adversarial style masks. We evaluated the effectiveness and transferability of our proposed method via extensive white-box and black-box digital experiments. Furthermore, we also conducted physical attack experiments against local FR models and online platforms.

CV-58-标题: Self-supervised Multi-view Clustering in Computer Vision: A Survey

链接: https://arxiv.org/abs/2309.09473
作者: Jiatai Wang, Zhiwei Xu, Xuewen Yang, Hailong Li, Bo Li, Xuying Meng
备注:

点击查看摘要

Abstract: Multi-view clustering (MVC) has had significant implications in cross-modal representation learning and data-driven decision-making in recent years. It accomplishes this by leveraging the consistency and complementary information among multiple views to cluster samples into distinct groups. However, as contrastive learning continues to evolve within the field of computer vision, self-supervised learning has also made substantial research progress and is progressively becoming dominant in MVC methods. It guides the clustering process by designing proxy tasks to mine the representation of image and video data itself as supervisory information. Despite the rapid development of self-supervised MVC, there has yet to be a comprehensive survey to analyze and summarize the current state of research progress. Therefore, this paper explores the reasons and advantages of the emergence of self-supervised MVC and discusses the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods. This paper does not only introduce the mechanisms for each category of methods but also gives a few examples of how these techniques are used. In the end, some open problems are pointed out for further investigation and development.

CV-59-标题: Reconstructing Existing Levels through Level Inpainting

链接: https://arxiv.org/abs/2309.09472
作者: Johor Jara Gonzalez, Mathew Guzdial
备注: 8 pages, 5 figures, Artificial Intelligence and Interactive Digital Entertainment

点击查看摘要

Abstract: Procedural Content Generation (PCG) and Procedural Content Generation via Machine Learning (PCGML) have been used in prior work for generating levels in various games. This paper introduces Content Augmentation and focuses on the subproblem of level inpainting, which involves reconstructing and extending video game levels. Drawing inspiration from image inpainting, we adapt two techniques from this domain to address our specific use case. We present two approaches for level inpainting: an Autoencoder and a U-net. Through a comprehensive case study, we demonstrate their superior performance compared to a baseline method and discuss their relative merits. Furthermore, we provide a practical demonstration of both approaches for the level inpainting task and offer insights into potential directions for future research.

CV-60-标题: Progressive Text-to-Image Diffusion with Soft Latent Direction

链接: https://arxiv.org/abs/2309.09466
作者: YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang
备注: 14 pages, 15 figures

点击查看摘要

Abstract: In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations-namely insertion, editing, and erasing-we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field’s performance standards.

CV-61-标题: Reducing Adversarial Training Cost with Gradient Approximation

链接: https://arxiv.org/abs/2309.09464
作者: Huihui Gong, Shuo Yang, Siqi Ma, Seyit Camtepe, Surya Nepal, Chang Xu
备注: 7 pages, 2 figures

点击查看摘要

Abstract: Deep learning models have achieved state-of-the-art performances in various domains, while they are vulnerable to the inputs with well-crafted but small perturbations, which are named after adversarial examples (AEs). Among many strategies to improve the model robustness against AEs, Projected Gradient Descent (PGD) based adversarial training is one of the most effective methods. Unfortunately, the prohibitive computational overhead of generating strong enough AEs, due to the maximization of the loss function, sometimes makes the regular PGD adversarial training impractical when using larger and more complicated models. In this paper, we propose that the adversarial loss can be approximated by the partial sum of Taylor series. Furthermore, we approximate the gradient of adversarial loss and propose a new and efficient adversarial training method, adversarial training with gradient approximation (GAAT), to reduce the cost of building up robust models. Additionally, extensive experiments demonstrate that this efficiency improvement can be achieved without any or with very little loss in accuracy on natural and adversarial examples, which show that our proposed method saves up to 60% of the training time with comparable model test accuracy on MNIST, CIFAR-10 and CIFAR-100 datasets.

CV-62-标题: Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

链接: https://arxiv.org/abs/2309.09456
作者: Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu, Kai Chen
备注: 17 pages, 7 figures, 9 tables

点击查看摘要

Abstract: Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set. It is extremely challenging because of the limited data and annotations (bounding boxes with class labels or text descriptions) of 3D scenes. Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics but require an extra alignment process between 2D images and 3D points, limiting the open-vocabulary ability of 3D detectors. Instead of leveraging 2D images, we propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection. Object2Scene inserts objects from different sources into 3D scenes to enrich the vocabulary of 3D scene datasets and generates text descriptions for the newly inserted objects. We further introduce a framework that unifies 3D detection and visual grounding, named L3Det, and propose a cross-domain category-level contrastive learning approach to mitigate the domain gap between 3D objects from different datasets. Extensive experiments on existing open-vocabulary 3D object detection benchmarks show that Object2Scene obtains superior performance over existing methods. We further verify the effectiveness of Object2Scene on a new benchmark OV-ScanNet-200, by holding out all rare categories as novel categories not seen during training.

CV-63-标题: Scalable Label-efficient Footpath Network Generation Using Remote Sensing Data and Self-supervised Learning

链接: https://arxiv.org/abs/2309.09446
作者: Xinye Wanyan, Sachith Seneviratne, Kerry Nice, Jason Thompson, Marcus White, Nano Langenheim, Mark Stevenson
备注:

点击查看摘要

Abstract: Footpath mapping, modeling, and analysis can provide important geospatial insights to many fields of study, including transport, health, environment and urban planning. The availability of robust Geographic Information System (GIS) layers can benefit the management of infrastructure inventories, especially at local government level with urban planners responsible for the deployment and maintenance of such infrastructure. However, many cities still lack real-time information on the location, connectivity, and width of footpaths, and/or employ costly and manual survey means to gather this information. This work designs and implements an automatic pipeline for generating footpath networks based on remote sensing images using machine learning models. The annotation of segmentation tasks, especially labeling remote sensing images with specialized requirements, is very expensive, so we aim to introduce a pipeline requiring less labeled data. Considering supervised methods require large amounts of training data, we use a self-supervised method for feature representation learning to reduce annotation requirements. Then the pre-trained model is used as the encoder of the U-Net for footpath segmentation. Based on the generated masks, the footpath polygons are extracted and converted to footpath networks which can be loaded and visualized by geographic information systems conveniently. Validation results indicate considerable consistency when compared to manually collected GIS layers. The footpath network generation pipeline proposed in this work is low-cost and extensible, and it can be applied where remote sensing images are available. Github: this https URL.

CV-64-标题: FactoFormer: Factorized Hyperspectral Transformer s with Self-Supervised Pre-Training

链接: https://arxiv.org/abs/2309.09431
作者: Shaheer Mohamed, Maryam Haghighat, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Peyman Moghadam
备注: Pre-print of article in 2023 IEEE Trans. on Geoscience and Remote Sensing

点击查看摘要

Abstract: Hyperspectral images (HSIs) contain rich spectral and spatial information. Motivated by the success of transformers in the field of natural language processing and computer vision where they have shown the ability to learn long range dependencies within input data, recent research has focused on using transformers for HSIs. However, current state-of-the-art hyperspectral transformers only tokenize the input HSI sample along the spectral dimension, resulting in the under-utilization of spatial information. Moreover, transformers are known to be data-hungry and their performance relies heavily on large-scale pre-training, which is challenging due to limited annotated hyperspectral data. Therefore, the full potential of HSI transformers has not been fully realized. To overcome these limitations, we propose a novel factorized spectral-spatial transformer that incorporates factorized self-supervised pre-training procedures, leading to significant improvements in performance. The factorization of the inputs allows the spectral and spatial transformers to better capture the interactions within the hyperspectral data cubes. Inspired by masked image modeling pre-training, we also devise efficient masking strategies for pre-training each of the spectral and spatial transformers. We conduct experiments on three publicly available datasets for HSI classification task and demonstrate that our model achieves state-of-the-art performance in all three datasets. The code for our model will be made available at this https URL.

CV-65-标题: TransTouch: Learning Transparent Objects Depth Sensing Through Sparse Touches IROS

链接: https://arxiv.org/abs/2309.09427
作者: Liuyu Bian, Pengyang Shi, Weihang Chen, Jing Xu, Li Yi, Rui Chen
备注: Accepted to the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

点击查看摘要

Abstract: Transparent objects are common in daily life. However, depth sensing for transparent objects remains a challenging problem. While learning-based methods can leverage shape priors to improve the sensing quality, the labor-intensive data collection in the real world and the sim-to-real domain gap restrict these methods’ scalability. In this paper, we propose a method to finetune a stereo network with sparse depth labels automatically collected using a probing system with tactile feedback. We present a novel utility function to evaluate the benefit of touches. By approximating and optimizing the utility function, we can optimize the probing locations given a fixed touching budget to better improve the network’s performance on real objects. We further combine tactile depth supervision with a confidence-based regularization to prevent over-fitting during finetuning. To evaluate the effectiveness of our method, we construct a real-world dataset including both diffuse and transparent objects. Experimental results on this dataset show that our method can significantly improve real-world depth sensing accuracy, especially for transparent objects.

CV-66-标题: Cross-attention-based saliency inference for predicting cancer metastasis on whole slide images

链接: https://arxiv.org/abs/2309.09412
作者: Ziyu Su, Mostafa Rezapour, Usama Sajjad, Shuo Niu, Metin Nafi Gurcan, Muhammad Khalid Khan Niazi
备注:

点击查看摘要

Abstract: Although multiple instance learning (MIL) methods are widely used for automatic tumor detection on whole slide images (WSI), they suffer from the extreme class imbalance within the small tumor WSIs. This occurs when the tumor comprises only a few isolated cells. For early detection, it is of utmost importance that MIL algorithms can identify small tumors, even when they are less than 1% of the size of the WSI. Existing studies have attempted to address this issue using attention-based architectures and instance selection-based methodologies, but have not yielded significant improvements. This paper proposes cross-attention-based salient instance inference MIL (CASiiMIL), which involves a novel saliency-informed attention mechanism, to identify breast cancer lymph node micro-metastasis on WSIs without the need for any annotations. Apart from this new attention mechanism, we introduce a negative representation learning algorithm to facilitate the learning of saliency-informed attention weights for improved sensitivity on tumor WSIs. The proposed model outperforms the state-of-the-art MIL methods on two popular tumor metastasis detection datasets, and demonstrates great cross-center generalizability. In addition, it exhibits excellent accuracy in classifying WSIs with small tumor lesions. Moreover, we show that the proposed model has excellent interpretability attributed to the saliency-informed attention weights. We strongly believe that the proposed method will pave the way for training algorithms for early tumor detection on large datasets where acquiring fine-grained annotations is practically impossible.

CV-67-标题: a critical analysis of internal reliability for uncertainty quantification of dense image matching in multi-view stereo

链接: https://arxiv.org/abs/2309.09379
作者: Debao Huang, Rongjun Qin
备注: Figure 8

点击查看摘要

Abstract: Nowadays, photogrammetrically derived point clouds are widely used in many civilian applications due to their low cost and flexibility in acquisition. Typically, photogrammetric point clouds are assessed through reference data such as LiDAR point clouds. However, when reference data are not available, the assessment of photogrammetric point clouds may be challenging. Since these point clouds are algorithmically derived, their accuracies and precisions are highly varying with the camera networks, scene complexity, and dense image matching (DIM) algorithms, and there is no standard error metric to determine per-point errors. The theory of internal reliability of camera networks has been well studied through first-order error estimation of Bundle Adjustment (BA), which is used to understand the errors of 3D points assuming known measurement errors. However, the measurement errors of the DIM algorithms are intricate to an extent that every single point may have its error function determined by factors such as pixel intensity, texture entropy, and surface smoothness. Despite the complexity, there exist a few common metrics that may aid the process of estimating the posterior reliability of the derived points, especially in a multi-view stereo (MVS) setup when redundancies are present. In this paper, by using an aerial oblique photogrammetric block with LiDAR reference data, we analyze several internal matching metrics within a common MVS framework, including statistics in ray convergence, intersection angles, DIM energy, etc.

CV-68-标题: Enhancing Knee Osteoarthritis severity level classification using diffusion augmented images CEC

链接: https://arxiv.org/abs/2309.09328
作者: Paleti Nikhil Chowdary, Gorantla V N S L Vishnu Vardhan, Menta Sai Akshay, Menta Sai Aashish, Vadlapudi Sai Aravind, Garapati Venkata Krishna Rayalu, Aswathy P
备注: Paper has been accepted to be presented at ICACECS 2023 and the final version will be published by Atlantis Highlights in Computer Science (AHCS) , Atlantis Press(part of Springer Nature)

点击查看摘要

Abstract: This research paper explores the classification of knee osteoarthritis (OA) severity levels using advanced computer vision models and augmentation techniques. The study investigates the effectiveness of data preprocessing, including Contrast-Limited Adaptive Histogram Equalization (CLAHE), and data augmentation using diffusion models. Three experiments were conducted: training models on the original dataset, training models on the preprocessed dataset, and training models on the augmented dataset. The results show that data preprocessing and augmentation significantly improve the accuracy of the models. The EfficientNetB3 model achieved the highest accuracy of 84% on the augmented dataset. Additionally, attention visualization techniques, such as Grad-CAM, are utilized to provide detailed attention maps, enhancing the understanding and trustworthiness of the models. These findings highlight the potential of combining advanced models with augmented data and attention visualization for accurate knee OA severity classification.

CV-69-标题: Active Learning for Semantic Segmentation with Multi-class Label Query

链接: https://arxiv.org/abs/2309.09319
作者: Sehyun Hwang, Sohyun Lee, Hoyoung Kim, Minhyeon Oh, Jungseul Ok, Suha Kwak
备注:

点击查看摘要

Abstract: This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each of such regions, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training since it assigns partial labels (i.e., a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperformed previous work on Cityscapes and PASCAL VOC 2012 while spending less annotation cost.

CV-70-标题: MOVIN: Real-time Motion Capture using a Single LiDAR

链接: https://arxiv.org/abs/2309.09314
作者: Deok-Kyeong Jang, Dongseok Yang, Deok-Yun Jang, Byeoli Choi, Taeil Jin, Sung-Hee Lee
备注:

点击查看摘要

Abstract: Recent advancements in technology have brought forth new forms of interactive applications, such as the social metaverse, where end users interact with each other through their virtual avatars. In such applications, precise full-body tracking is essential for an immersive experience and a sense of embodiment with the virtual avatar. However, current motion capture systems are not easily accessible to end users due to their high cost, the requirement for special skills to operate them, or the discomfort associated with wearable devices. In this paper, we present MOVIN, the data-driven generative method for real-time motion capture with global tracking, using a single LiDAR sensor. Our autoregressive conditional variational autoencoder (CVAE) model learns the distribution of pose variations conditioned on the given 3D point cloud from this http URL a central factor for high-accuracy motion capture, we propose a novel feature encoder to learn the correlation between the historical 3D point cloud data and global, local pose features, resulting in effective learning of the pose prior. Global pose features include root translation, rotation, and foot contacts, while local features comprise joint positions and rotations. Subsequently, a pose generator takes into account the sampled latent variable along with the features from the previous frame to generate a plausible current pose. Our framework accurately predicts the performer’s 3D global information and local joint details while effectively considering temporally coherent movements across frames. We demonstrate the effectiveness of our architecture through quantitative and qualitative evaluations, comparing it against state-of-the-art methods. Additionally, we implement a real-time application to showcase our method in real-world scenarios. MOVIN dataset is available at \urlthis https URL.

CV-71-标题: Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention BMVC

链接: https://arxiv.org/abs/2309.09311
作者: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
备注: Accepted by the British Machine Vision Conference (BMVC) 2023. Project Page: this https URL

点击查看摘要

Abstract: Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from the learning and inference bias issue, as recent research suggests in other text-video-related tasks. For instance, spatial appearance features on action recognition or temporal object co-occurrences on video scene graph generation could induce spurious correlations. In this work, we present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips, which is the first such attempt for a text-video retrieval task, to the best of our knowledge. We first hypothesise and verify the bias on how it would affect the model illustrated with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model overpasses the baseline and SOTA on nDCG, a semantic-relevancy-focused evaluation metric which proves the bias is mitigated, as well as on the other conventional metrics.

CV-72-标题: UGC: Unified GAN Compression for Efficient Image-to-Image Translation

链接: https://arxiv.org/abs/2309.09310
作者: Yuxi Ren, Jie Wu, Peng Zhang, Manlin Zhang, Xuefeng Xiao, Qian He, Rui Wang, Min Zheng, Xin Pan
备注:

点击查看摘要

Abstract: Recent years have witnessed the prevailing progress of Generative Adversarial Networks (GANs) in image-to-image translation. However, the success of these GAN models hinges on ponderous computational costs and labor-expensive training data. Current efficient GAN learning techniques often fall into two orthogonal aspects: i) model slimming via reduced calculation costs; ii)data/label-efficient learning with fewer training data/labels. To combine the best of both worlds, we propose a new learning paradigm, Unified GAN Compression (UGC), with a unified optimization objective to seamlessly prompt the synergy of model-efficient and label-efficient learning. UGC sets up semi-supervised-driven network architecture search and adaptive online semi-supervised distillation stages sequentially, which formulates a heterogeneous mutual learning scheme to obtain an architecture-flexible, label-efficient, and performance-excellent model.

CV-73-标题: Effective Image Tampering Localization via Enhanced Transformer and Co-attention Fusion

链接: https://arxiv.org/abs/2309.09306
作者: Kun Guo, Haochen Zhu, Gang Cao
备注:

点击查看摘要

Abstract: Powerful manipulation techniques have made digital image forgeries be easily created and widespread without leaving visual anomalies. The blind localization of tampered regions becomes quite significant for image forensics. In this paper, we propose an effective image tampering localization network (EITLNet) based on a two-branch enhanced transformer encoder with attention-based feature fusion. Specifically, a feature enhancement module is designed to enhance the feature representation ability of the transformer encoder. The features extracted from RGB and noise streams are fused effectively by the coordinate attention-based fusion module at multiple scales. Extensive experimental results verify that the proposed scheme achieves the state-of-the-art generalization ability and robustness in various benchmark datasets. Code will be public at this https URL.

CV-74-标题: RenderIH: A Large-scale Synthetic Dataset for 3D Interacting Hand Pose Estimation ICCV2023

链接: https://arxiv.org/abs/2309.09301
作者: Lijun Li, Linrui Tian1, Xindi Zhang, Qi Wang, Bang Zhang, Liefeng Bo, Mengyuan Liu, Chen Chen
备注: Accepted by ICCV 2023

点击查看摘要

Abstract: The current interacting hand (IH) datasets are relatively simplistic in terms of background and texture, with hand joints being annotated by a machine annotator, which may result in inaccuracies, and the diversity of pose distribution is limited. However, the variability of background, pose distribution, and texture can greatly influence the generalization ability. Therefore, we present a large-scale synthetic dataset RenderIH for interacting hands with accurate and diverse pose annotations. The dataset contains 1M photo-realistic images with varied backgrounds, perspectives, and hand textures. To generate natural and diverse interacting poses, we propose a new pose optimization algorithm. Additionally, for better pose estimation accuracy, we introduce a transformer-based pose estimation network, TransHand, to leverage the correlation between interacting hands and verify the effectiveness of RenderIH in improving results. Our dataset is model-agnostic and can improve more accuracy of any hand pose estimation method in comparison to other real or synthetic datasets. Experiments have shown that pretraining on our synthetic data can significantly decrease the error from 6.76mm to 5.79mm, and our Transhand surpasses contemporary methods. Our dataset and code are available at this https URL.

CV-75-标题: Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera

链接: https://arxiv.org/abs/2309.09297
作者: Jiahang Cao, Xu Zheng, Yuanhuiyi Lyu, Jiaxu Wang, Renjing Xu, Lin Wang
备注: Under submission

点击查看摘要

Abstract: The ability to detect objects in all lighting (i.e., normal-, over-, and under-exposed) conditions is crucial for real-world applications, such as self-driving.Traditional RGB-based detectors often fail under such varying lighting conditions.Therefore, recent works utilize novel event cameras to supplement or guide the RGB modality; however, these methods typically adopt asymmetric network structures that rely predominantly on the RGB modality, resulting in limited robustness for all-day detection. In this paper, we propose EOLO, a novel object detection framework that achieves robust and efficient all-day detection by fusing both RGB and event modalities. Our EOLO framework is built based on a lightweight spiking neural network (SNN) to efficiently leverage the asynchronous property of events. Buttressed by it, we first introduce an Event Temporal Attention (ETA) module to learn the high temporal information from events while preserving crucial edge information. Secondly, as different modalities exhibit varying levels of importance under diverse lighting conditions, we propose a novel Symmetric RGB-Event Fusion (SREF) module to effectively fuse RGB-Event features without relying on a specific modality, thus ensuring a balanced and adaptive fusion for all-day detection. In addition, to compensate for the lack of paired RGB-Event datasets for all-day training and evaluation, we propose an event synthesis approach based on the randomized optical flow that allows for directly generating the event frame from a single exposure image. We further build two new datasets, E-MSCOCO and E-VOC based on the popular benchmarks MSCOCO and PASCAL VOC. Extensive experiments demonstrate that our EOLO outperforms the state-of-the-art detectors,e.g.,RENet,by a substantial margin (+3.74% mAP50) in all lighting conditions.Our code and datasets will be available at this https URL

CV-76-标题: LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation ICCV2023

链接: https://arxiv.org/abs/2309.09294
作者: Yihao Zhi, Xiaodong Cun, Xuelin Chen, Xi Shen, Wen Guo, Shaoli Huang, Shenghua Gao
备注: Accepted by ICCV 2023

点击查看摘要

Abstract: Gestures are non-verbal but important behaviors accompanying people’s speech. While previous methods are able to generate speech rhythm-synchronized gestures, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are indeed the key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. In particular, our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement. Specifically, the script-based gesture generation leverages the pre-trained CLIP text embeddings as the guidance for generating gestures that are highly semantically aligned with the script. Then, we devise a simple but effective diffusion-based gesture generation backbone simply using pure MLPs, that is conditioned on only audio signals and learns to gesticulate with realistic motions. We utilize such powerful prior to rhyme the script-guided gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style, editing the co-speech gestures via textual prompting, and controlling the semantic awareness and rhythm alignment with guided diffusion. Extensive experiments demonstrate the advantages of the proposed framework over competing methods. In addition, our core diffusion-based generative model also achieves state-of-the-art performance on two benchmarks. The code and model will be released to facilitate future research.

CV-77-标题: MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification

链接: https://arxiv.org/abs/2309.09276
作者: Junjie Zhu, Yiying Li, Chunping Qiu, Ke Yang, Naiyang Guan, Xiaodong Yi
备注: SUBMIT TO IEEE TRANSACTIONS

点击查看摘要

Abstract: Vision Transformer (ViT) models have recently emerged as powerful and versatile models for various visual tasks. Recently, a work called PMF has achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models. However, PMF employs full fine-tuning for learning the downstream tasks, leading to significant overfitting and storage issues, especially in the remote sensing domain. In order to tackle these issues, we turn to the recently proposed parameter-efficient tuning methods, such as VPT, which updates only the newly added prompt parameters while keeping the pre-trained backbone frozen. Inspired by VPT, we propose the Meta Visual Prompt Tuning (MVP) method. Specifically, we integrate the VPT method into the meta-learning framework and tailor it to the remote sensing domain, resulting in an efficient framework for Few-Shot Remote Sensing Scene Classification (FS-RSSC). Furthermore, we introduce a novel data augmentation strategy based on patch embedding recombination to enhance the representation and diversity of scenes for classification purposes. Experiment results on the FS-RSSC benchmark demonstrate the superior performance of the proposed MVP over existing methods in various settings, such as various-way-various-shot, various-way-one-shot, and cross-domain adaptation.

CV-78-标题: Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation

链接: https://arxiv.org/abs/2309.09272
作者: Boya Wang, Shuo Wang, Ziwen Dou, Dong Ye
备注:

点击查看摘要

Abstract: With the frequent use of self-supervised monocular depth estimation in robotics and autonomous driving, the model’s efficiency is becoming increasingly important. Most current approaches apply much larger and more complex networks to improve the precision of depth estimation. Some researchers incorporated Transformer into self-supervised monocular depth estimation to achieve better performance. However, this method leads to high parameters and high computation. We present a fully convolutional depth estimation network using contextual feature fusion. Compared to UNet++ and HRNet, we use high-resolution and low-resolution features to reserve information on small targets and fast-moving objects instead of long-range fusion. We further promote depth estimation results employing lightweight channel attention based on convolution in the decoder stage. Our method reduces the parameters without sacrificing accuracy. Experiments on the KITTI benchmark show that our method can get better results than many large models, such as Monodepth2, with only 30 parameters. The source code is available at this https URL.

CV-79-标题: LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2309.09256
作者: Kazuto Nakashima, Ryo Kurazume
备注:

点击查看摘要

Abstract: Generative modeling of 3D LiDAR data is an emerging task with promising applications for autonomous mobile robots, such as scalable simulation, scene manipulation, and sparse-to-dense completion of LiDAR point clouds. Existing approaches have shown the feasibility of image-based LiDAR data generation using deep generative models while still struggling with the fidelity of generated data and training instability. In this work, we present R2DM, a novel generative model for LiDAR data that can generate diverse and high-fidelity 3D scene point clouds based on the image representation of range and reflectance intensity. Our method is based on the denoising diffusion probabilistic models (DDPMs), which have demonstrated impressive results among generative model frameworks and have been significantly progressing in recent years. To effectively train DDPMs on the LiDAR domain, we first conduct an in-depth analysis regarding data representation, training objective, and spatial inductive bias. Based on our designed model R2DM, we also introduce a flexible LiDAR completion pipeline using the powerful properties of DDPMs. We demonstrate that our method outperforms the baselines on the generation task of KITTI-360 and KITTI-Raw datasets and the upsampling task of KITTI-360 datasets. Our code and pre-trained weights will be available at this https URL.

CV-80-标题: Convex Latent-Optimized Adversarial Regularizers for Imaging Inverse Problems

链接: https://arxiv.org/abs/2309.09250
作者: Huayu Wang, Chen Luo, Taofeng Xie, Qiyu Jin, Guoqing Chen, Zhuo-Xu Cui, Dong Liang
备注:

点击查看摘要

Abstract: Recently, data-driven techniques have demonstrated remarkable effectiveness in addressing challenges related to MR imaging inverse problems. However, these methods still exhibit certain limitations in terms of interpretability and robustness. In response, we introduce Convex Latent-Optimized Adversarial Regularizers (CLEAR), a novel and interpretable data-driven paradigm. CLEAR represents a fusion of deep learning (DL) and variational regularization. Specifically, we employ a latent optimization technique to adversarially train an input convex neural network, and its set of minima can fully represent the real data manifold. We utilize it as a convex regularizer to formulate a CLEAR-informed variational regularization model that guides the solution of the imaging inverse problem on the real data manifold. Leveraging its inherent convexity, we have established the convergence of the projected subgradient descent algorithm for the CLEAR-informed regularization model. This convergence guarantees the attainment of a unique solution to the imaging inverse problem, subject to certain assumptions. Furthermore, we have demonstrated the robustness of our CLEAR-informed model, explicitly showcasing its capacity to achieve stable reconstruction even in the presence of measurement interference. Finally, we illustrate the superiority of our approach using MRI reconstruction as an example. Our method consistently outperforms conventional data-driven techniques and traditional regularization approaches, excelling in both reconstruction quality and robustness.

CV-81-标题: LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

链接: https://arxiv.org/abs/2309.09249
作者: Qingmao Wei, Bi Zeng, Jianqi Liu, Li He, Guotian Zeng
备注:

点击查看摘要

Abstract: The recent advancements in transformer-based visual trackers have led to significant progress, attributed to their strong modeling capabilities. However, as performance improves, running latency correspondingly increases, presenting a challenge for real-time robotics applications, especially on edge devices with computational constraints. In response to this, we introduce LiteTrack, an efficient transformer-based tracking model optimized for high-speed operations across various devices. It achieves a more favorable trade-off between accuracy and efficiency than the other lightweight trackers. The main innovations of LiteTrack encompass: 1) asynchronous feature extraction and interaction between the template and search region for better feature fushion and cutting redundant computation, and 2) pruning encoder layers from a heavy tracker to refine the balnace between performance and speed. As an example, our fastest variant, LiteTrack-B4, achieves 65.2% AO on the GOT-10k benchmark, surpassing all preceding efficient trackers, while running over 100 fps with ONNX on the Jetson Orin NX edge device. Moreover, our LiteTrack-B9 reaches competitive 72.2% AO on GOT-10k and 82.4% AUC on TrackingNet, and operates at 171 fps on an NVIDIA 2080Ti GPU. The code and demo materials will be available at this https URL.

CV-82-标题: Image-level supervision and self-training for transformer -based cross-modality tumor segmentation

链接: https://arxiv.org/abs/2309.09246
作者: Malo de Boisredon, Eugene Vorontsov, William Trung Le, Samuel Kadoury
备注: 17 pages, 10 figures, 5 tables

点击查看摘要

Abstract: Deep neural networks are commonly used for automated medical image segmentation, but models will frequently struggle to generalize well across different imaging modalities. This issue is particularly problematic due to the limited availability of annotated data, making it difficult to deploy these models on a larger scale. To overcome these challenges, we propose a new semi-supervised training strategy called MoDATTS. Our approach is designed for accurate cross-modality 3D tumor segmentation on unpaired bi-modal datasets. An image-to-image translation strategy between imaging modalities is used to produce annotated pseudo-target volumes and improve generalization to the unannotated target modality. We also use powerful vision transformer architectures and introduce an iterative self-training procedure to further close the domain gap between modalities. MoDATTS additionally allows the possibility to extend the training to unannotated target data by exploiting image-level labels with an unsupervised objective that encourages the model to perform 3D diseased-to-healthy translation by disentangling tumors from the background. The proposed model achieves superior performance compared to other methods from participating teams in the CrossMoDA 2022 challenge, as evidenced by its reported top Dice score of 0.87+/-0.04 for the VS segmentation. MoDATTS also yields consistent improvements in Dice scores over baselines on a cross-modality brain tumor segmentation task composed of four different contrasts from the BraTS 2020 challenge dataset, where 95% of a target supervised model performance is reached. We report that 99% and 100% of this maximum performance can be attained if 20% and 50% of the target data is additionally annotated, which further demonstrates that MoDATTS can be leveraged to reduce the annotation burden.

CV-83-标题: Detection and Localization of Firearm Carriers in Complex Scenes for Improved Safety Measures

链接: https://arxiv.org/abs/2309.09236
作者: Arif Mahmood, Abdul Basit, M. Akhtar Munir, Mohsen Ali
备注: This paper is accepted in IEEE Transactions on Computational Social Systems

点击查看摘要

Abstract: Detecting firearms and accurately localizing individuals carrying them in images or videos is of paramount importance in security, surveillance, and content customization. However, this task presents significant challenges in complex environments due to clutter and the diverse shapes of firearms. To address this problem, we propose a novel approach that leverages human-firearm interaction information, which provides valuable clues for localizing firearm carriers. Our approach incorporates an attention mechanism that effectively distinguishes humans and firearms from the background by focusing on relevant areas. Additionally, we introduce a saliency-driven locality-preserving constraint to learn essential features while preserving foreground information in the input image. By combining these components, our approach achieves exceptional results on a newly proposed dataset. To handle inputs of varying sizes, we pass paired human-firearm instances with attention masks as channels through a deep network for feature computation, utilizing an adaptive average pooling layer. We extensively evaluate our approach against existing methods in human-object interaction detection and achieve significant results (AP=77.8%) compared to the baseline approach (AP=63.1%). This demonstrates the effectiveness of leveraging attention mechanisms and saliency-driven locality preservation for accurate human-firearm interaction detection. Our findings contribute to advancing the fields of security and surveillance, enabling more efficient firearm localization and identification in diverse scenarios.

CV-84-标题: CryoAlign: feature-based method for global and local 3D alignment of EM density maps

链接: https://arxiv.org/abs/2309.09217
作者: Bintao He, Fa Zhang, Chenjie Feng, Jianyi Yang, Xin Gao, Renmin Han
备注:

点击查看摘要

Abstract: Advances on cryo-electron imaging technologies have led to a rapidly increasing number of density maps. Alignment and comparison of density maps play a crucial role in interpreting structural information, such as conformational heterogeneity analysis using global alignment and atomic model assembly through local alignment. Here, we propose a fast and accurate global and local cryo-electron microscopy density map alignment method CryoAlign, which leverages local density feature descriptors to capture spatial structure similarities. CryoAlign is the first feature-based EM map alignment tool, in which the employment of feature-based architecture enables the rapid establishment of point pair correspondences and robust estimation of alignment parameters. Extensive experimental evaluations demonstrate the superiority of CryoAlign over the existing methods in both alignment accuracy and speed.

CV-85-标题: Neural Gradient Learning and Optimization for Oriented Point Normal Estimation SIGGRAPH

链接: https://arxiv.org/abs/2309.09211
作者: Qing Li, Huifang Feng, Kanle Shi, Yi Fang, Yu-Shen Liu, Zhizhong Han
备注: accepted by SIGGRAPH Asia 2023

点击查看摘要

Abstract: We propose Neural Gradient Learning (NGL), a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation. It has excellent gradient approximation properties for the underlying geometry of the data. We utilize a simple neural network to parameterize the objective function to produce gradients at points using a global implicit representation. However, the derived gradients usually drift away from the ground-truth oriented normals due to the lack of local detail descriptions. Therefore, we introduce Gradient Vector Optimization (GVO) to learn an angular distance field based on local plane geometry to refine the coarse gradient vectors. Finally, we formulate our method with a two-phase pipeline of coarse estimation followed by refinement. Moreover, we integrate two weighting functions, i.e., anisotropic kernel and inlier score, into the optimization to improve the robust and detail-preserving performance. Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description. This leads to a state-of-the-art normal estimator that is robust to noise, outliers and point density variations. Extensive evaluations show that our method outperforms previous works in both unoriented and oriented normal estimation on widely used benchmarks. The source code and pre-trained models are available at this https URL.

CV-86-标题: Differentiable SLAM Helps Deep Learning-based LiDAR Perception Tasks BMVC2023

链接: https://arxiv.org/abs/2309.09206
作者: Prashant Kumar, Dheeraj Vattikonda, Vedang Bhupesh Shenvi Nadkarni, Erqun Dong, Sabyasachi Sahoo
备注: 15 pages,6 Tables, 3 figures. Accepted at BMVC 2023

点击查看摘要

Abstract: We investigate a new paradigm that uses differentiable SLAM architectures in a self-supervised manner to train end-to-end deep learning models in various LiDAR based applications. To the best of our knowledge there does not exist any work that leverages SLAM as a training signal for deep learning based models. We explore new ways to improve the efficiency, robustness, and adaptability of LiDAR systems with deep learning techniques. We focus on the potential benefits of differentiable SLAM architectures for improving performance of deep learning tasks such as classification, regression as well as SLAM. Our experimental results demonstrate a non-trivial increase in the performance of two deep learning applications - Ground Level Estimation and Dynamic to Static LiDAR Translation, when used with differentiable SLAM architectures. Overall, our findings provide important insights that enhance the performance of LiDAR based navigation systems. We demonstrate that this new paradigm of using SLAM Loss signal while training LiDAR based models can be easily adopted by the community.

CV-87-标题: Efficient Pyramid Channel Attention Network for Pathological Myopia Detection

链接: https://arxiv.org/abs/2309.09196
作者: Xiaoqing Zhang, Jilu Zhao, Richu Jin, Yan Li, Hao Wu, Xiangtian Zhou, Jiang Liu
备注: 12 pages

点击查看摘要

Abstract: Pathological myopia (PM) is the leading ocular disease for impaired vision and blindness worldwide. The key to detecting PM as early as possible is to detect informative features in global and local lesion regions, such as fundus tessellation, atrophy and maculopathy. However, applying classical convolutional neural networks (CNNs) to efficiently highlight global and local lesion context information in feature maps is quite challenging. To tackle this issue, we aim to fully leverage the potential of global and local lesion information with attention module design. Based on this, we propose an efficient pyramid channel attention (EPCA) module, which dynamically explores the relative importance of global and local lesion context information in feature maps. Then we combine the EPCA module with the backbone network to construct EPCA-Net for automatic PM detection based on fundus images. In addition, we construct a PM dataset termed PM-fundus by collecting fundus images of PM from publicly available datasets (e.g., the PALM dataset and ODIR dataset). The comprehensive experiments are conducted on three datasets, demonstrating that our EPCA-Net outperforms state-of-the-art methods in detecting PM. Furthermore, motivated by the recent pretraining-and-finetuning paradigm, we attempt to adapt pre-trained natural image models for PM detection by freezing them and treating the EPCA module and other attention modules as the adapters. The results show that our method with the pretraining-and-finetuning paradigm achieves competitive performance through comparisons to part of methods with traditional fine-tuning methods with fewer tunable parameters.

CV-88-标题: CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation

链接: https://arxiv.org/abs/2309.09183
作者: Chen Jiang, Yuchen Yang, Martin Jagersand
备注:

点击查看摘要

Abstract: The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Both methods fail to match natural human communication and convey rich semantics in manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem by using referring expression segmentation, which is a prompt-based approach, to provide more in-depth information for robot perception. To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr - a new CLIP-driven referring expression segmentation network. CLIPUNetr leverages CLIP’s strong vision-language representations to segment regions from referring expressions, while utilizing its ``U-shaped’’ encoder-decoder architecture to generate predictions with sharper boundaries and finer structures. Furthermore, we propose a new pipeline to integrate CLIPUNetr into UIBVS and apply it to control robots in real-world environments. In experiments, our method improves boundary and structure measurements by an average of 120% and can successfully assist real-world UIBVS control in an unstructured manipulation environment.

CV-89-标题: Syntax Tree Constrained Graph Network for Visual Question Answering

链接: https://arxiv.org/abs/2309.09179
作者: Xiangrui Su, Qi Zhang, Chongyang Shi, Jiachang Liu, Liang Hu
备注:

点击查看摘要

Abstract: Visual Question Answering (VQA) aims to automatically answer natural language questions related to given image content. Existing VQA methods integrate vision modeling and language understanding to explore the deep semantics of the question. However, these methods ignore the significant syntax information of the question, which plays a vital role in understanding the essential semantics of the question and guiding the visual feature refinement. To fill the gap, we suggested a novel Syntax Tree Constrained Graph Network (STCGN) for VQA based on entity message passing and syntax tree. This model is able to extract a syntax tree from questions and obtain more precise syntax information. Specifically, we parse questions and obtain the question syntax tree using the Stanford syntax parsing tool. From the word level and phrase level, syntactic phrase features and question features are extracted using a hierarchical tree convolutional network. We then design a message-passing mechanism for phrase-aware visual entities and capture entity features according to a given visual context. Extensive experiments on VQA2.0 datasets demonstrate the superiority of our proposed model.

CV-90-标题: FDCNet: Feature Drift Compensation Network for Class-Incremental Weakly Supervised Object Localization

链接: https://arxiv.org/abs/2309.09122
作者: Sejin Park, Taehyung Lee, Yeejin Lee, Byeongkeun Kang
备注: ACM Multimedia 2023

点击查看摘要

Abstract: This work addresses the task of class-incremental weakly supervised object localization (CI-WSOL). The goal is to incrementally learn object localization for novel classes using only image-level annotations while retaining the ability to localize previously learned classes. This task is important because annotating bounding boxes for every new incoming data is expensive, although object localization is crucial in various applications. To the best of our knowledge, we are the first to address this task. Thus, we first present a strong baseline method for CI-WSOL by adapting the strategies of class-incremental classifiers to mitigate catastrophic forgetting. These strategies include applying knowledge distillation, maintaining a small data set from previous tasks, and using cosine normalization. We then propose the feature drift compensation network to compensate for the effects of feature drifts on class scores and localization maps. Since updating network parameters to learn new tasks causes feature drifts, compensating for the final outputs is necessary. Finally, we evaluate our proposed method by conducting experiments on two publicly available datasets (ImageNet-100 and CUB-200). The experimental results demonstrate that the proposed method outperforms other baseline methods.

CV-91-标题: Uncertainty-aware 3D Object-Level Mapping with Deep Shape Priors ICRA2024

链接: https://arxiv.org/abs/2309.09118
作者: Ziwei Liao, Jun Yang, Jingxing Qian, Angela P. Schoellig, Steven L. Waslander
备注: Manuscript submitted to ICRA 2024

点击查看摘要

Abstract: 3D object-level mapping is a fundamental problem in robotics, which is especially challenging when object CAD models are unavailable during inference. In this work, we propose a framework that can reconstruct high-quality object-level maps for unknown objects. Our approach takes multiple RGB-D images as input and outputs dense 3D shapes and 9-DoF poses (including 3 scale parameters) for detected objects. The core idea of our approach is to leverage a learnt generative model for shape categories as a prior and to formulate a probabilistic, uncertainty-aware optimization framework for 3D reconstruction. We derive a probabilistic formulation that propagates shape and pose uncertainty through two novel loss functions. Unlike current state-of-the-art approaches, we explicitly model the uncertainty of the object shapes and poses during our optimization, resulting in a high-quality object-level mapping system. Moreover, the resulting shape and pose uncertainties, which we demonstrate can accurately reflect the true errors of our object maps, can also be useful for downstream robotics tasks such as active vision. We perform extensive evaluations on indoor and outdoor real-world datasets, achieving achieves substantial improvements over state-of-the-art methods. Our code will be available at this https URL.

CV-92-标题: FrameRS: A Video Frame Compression Model Composed by Self supervised Video Frame Reconstructor and Key Frame Selector

链接: https://arxiv.org/abs/2309.09083
作者: Qiqian Fu, Guanhong Wang, Gaoang Wang
备注:

点击查看摘要

Abstract: In this paper, we present frame reconstruction model: FrameRS. It consists self-supervised video frame reconstructor and key frame selector. The frame reconstructor, FrameMAE, is developed by adapting the principles of the Masked Autoencoder for Images (MAE) for video context. The key frame selector, Frame Selector, is built on CNN architecture. By taking the high-level semantic information from the encoder of FrameMAE as its input, it can predicted the key frames with low computation costs. Integrated with our bespoke Frame Selector, FrameMAE can effectively compress a video clip by retaining approximately 30% of its pivotal frames. Performance-wise, our model showcases computational efficiency and competitive accuracy, marking a notable improvement over traditional Key Frame Extract algorithms. The implementation is available on Github

CV-93-标题: Multi-camera Birds Eye View Perception for Autonomous Driving

链接: https://arxiv.org/abs/2309.09080
作者: David Unger, Nikhil Gosala, Varun Ravi Kumar, Shubhankar Borse, Abhinav Valada, Senthil Yogamani
备注: Taylor & Francis (CRC Press) book chapter. Book title: Computer Vision: Challenges, Trends, and Opportunities

点击查看摘要

Abstract: Most automated driving systems comprise a diverse sensor set, including several cameras, Radars, and LiDARs, ensuring a complete 360°coverage in near and far regions. Unlike Radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable the spatial reasoning of other agents and structures for optimal path planning. The 3D space is typically simplified to the BEV space by omitting the less relevant Z-coordinate, which corresponds to the height dimension.The most basic approach to achieving the desired BEV representation from a camera image is IPM, assuming a flat ground surface. Surround vision systems that are pretty common in new vehicles use the IPM principle to generate a BEV image and to show it on display to the driver. However, this approach is not suited for autonomous driving since there are severe distortions introduced by this too-simplistic transformation method.

CV-94-标题: Unsupervised Green Object Tracker (GOT) without Offline Pre-training

链接: https://arxiv.org/abs/2309.09078
作者: Zhiruo Zhou, Suya You, C.-C. Jay Kuo
备注:

点击查看摘要

Abstract: Supervised trackers trained on labeled data dominate the single object tracking field for superior tracking accuracy. The labeling cost and the huge computational complexity hinder their applications on edge devices. Unsupervised learning methods have also been investigated to reduce the labeling cost but their complexity remains high. Aiming at lightweight high-performance tracking, feasibility without offline pre-training, and algorithmic transparency, we propose a new single object tracking method, called the green object tracker (GOT), in this work. GOT conducts an ensemble of three prediction branches for robust box tracking: 1) a global object-based correlator to predict the object location roughly, 2) a local patch-based correlator to build temporal correlations of small spatial units, and 3) a superpixel-based segmentator to exploit the spatial information of the target frame. GOT offers competitive tracking accuracy with state-of-the-art unsupervised trackers, which demand heavy offline pre-training, at a lower computation cost. GOT has a tiny model size (<3k parameters) and low inference complexity (around 58M FLOPs per frame). Since its inference complexity is between 0.1%-10% of DL trackers, it can be easily deployed on mobile and edge devices.

CV-95-标题: MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer

链接: https://arxiv.org/abs/2309.09067
作者: Fudong Lin, Summer Crawford, Kaleb Guillot, Yihe Zhang, Yan Chen, Xu Yuan, Li Chen, Shelby Willams, Robert Minvielle, Xiangming Xiao, Drew Gholson, Nicolas Ashwell, Tri Setiyono, Brenda Tubana, Lu Peng, Magdy Bayoumi, Nian-Feng Tzeng
备注:

点击查看摘要

Abstract: Precise crop yield prediction provides valuable information for agricultural planning and decision-making processes. However, timely predicting crop yields remains challenging as crop growth is sensitive to growing season weather variation and climate change. In this work, we develop a deep learning-based solution, namely Multi-Modal Spatial-Temporal Vision Transformer (MMST-ViT), for predicting crop yields at the county level across the United States, by considering the effects of short-term meteorological variations during the growing season and the long-term climate change on crops. Specifically, our MMST-ViT consists of a Multi-Modal Transformer, a Spatial Transformer, and a Temporal Transformer. The Multi-Modal Transformer leverages both visual remote sensing data and short-term meteorological data for modeling the effect of growing season weather variations on crop growth. The Spatial Transformer learns the high-resolution spatial dependency among counties for accurate agricultural tracking. The Temporal Transformer captures the long-range temporal dependency for learning the impact of long-term climate change on crops. Meanwhile, we also devise a novel multi-modal contrastive learning technique to pre-train our model without extensive human supervision. Hence, our MMST-ViT captures the impacts of both short-term weather variations and long-term climate change on crops by leveraging both satellite images and meteorological data. We have conducted extensive experiments on over 200 counties in the United States, with the experimental results exhibiting that our MMST-ViT outperforms its counterparts under three performance metrics of interest.

CV-96-标题: Sub-action Prototype Learning for Point-level Weakly-supervised Temporal Action Localization

链接: https://arxiv.org/abs/2309.09060
作者: Yueyang Li, Yonghong Hou, Wanqing Li
备注:

点击查看摘要

Abstract: Point-level weakly-supervised temporal action localization (PWTAL) aims to localize actions with only a single timestamp annotation for each action instance. Existing methods tend to mine dense pseudo labels to alleviate the label sparsity, but overlook the potential sub-action temporal structures, resulting in inferior performance. To tackle this problem, we propose a novel sub-action prototype learning framework (SPL-Loc) which comprises Sub-action Prototype Clustering (SPC) and Ordered Prototype Alignment (OPA). SPC adaptively extracts representative sub-action prototypes which are capable to perceive the temporal scale and spatial content variation of action instances. OPA selects relevant prototypes to provide completeness clue for pseudo label generation by applying a temporal alignment loss. As a result, pseudo labels are derived from alignment results to improve action boundary prediction. Extensive experiments on three popular benchmarks demonstrate that the proposed SPL-Loc significantly outperforms existing SOTA PWTAL methods.

CV-97-标题: Microscale 3-D Capacitance Tomography with a CMOS Sensor Array

链接: https://arxiv.org/abs/2309.09039
作者: anar Abdelatty, Joseph Incandela, Kangping Hu, Joseph W. Larkin, Sherief Reda, Jacob K. Rosenstein
备注:

点击查看摘要

Abstract: Electrical capacitance tomography (ECT) is a nonoptical imaging technique in which a map of the interior permittivity of a volume is estimated by making capacitance measurements at its boundary and solving an inverse problem. While previous ECT demonstrations have often been at centimeter scales, ECT is not limited to macroscopic systems. In this paper, we demonstrate ECT imaging of polymer microspheres and bacterial biofilms using a CMOS microelectrode array, achieving spatial resolution of 10 microns. Additionally, we propose a deep learning architecture and an improved multi-objective training scheme for reconstructing out-of-plane permittivity maps from the sensor measurements. Experimental results show that the proposed approach is able to resolve microscopic 3-D structures, achieving 91.5% prediction accuracy on the microsphere dataset and 82.7% on the biofilm dataset, including an average of 4.6% improvement over baseline computational methods.

CV-98-标题: A store-and-forward cloud-based telemonitoring system for automatic assessing dysarthria evolution in neurological diseases from video-recording analysis

链接: https://arxiv.org/abs/2309.09038
作者: Lucia Migliorelli, Daniele Berardini, Kevin Cela, Michela Coccia, Laura Villani, Emanuele Frontoni, Sara Moccia
备注:

点击查看摘要

Abstract: Background and objectives: Patients suffering from neurological diseases may develop dysarthria, a motor speech disorder affecting the execution of speech. Close and quantitative monitoring of dysarthria evolution is crucial for enabling clinicians to promptly implement patient management strategies and maximizing effectiveness and efficiency of communication functions in term of restoring, compensating or adjusting. In the clinical assessment of orofacial structures and functions, at rest condition or during speech and non-speech movements, a qualitative evaluation is usually performed, throughout visual observation. Methods: To overcome limitations posed by qualitative assessments, this work presents a store-and-forward self-service telemonitoring system that integrates, within its cloud architecture, a convolutional neural network (CNN) for analyzing video recordings acquired by individuals with dysarthria. This architecture, called facial landmark Mask RCNN, aims at locating facial landmarks as a prior for assessing the orofacial functions related to speech and examining dysarthria evolution in neurological diseases. Results: When tested on the Toronto NeuroFace dataset, a publicly available annotated dataset of video recordings from patients with amyotrophic lateral sclerosis (ALS) and stroke, the proposed CNN achieved a normalized mean error equal to 1.79 on localizing the facial landmarks. We also tested our system in a real-life scenario on 11 bulbar-onset ALS subjects, obtaining promising outcomes in terms of facial landmark position estimation. Discussion and conclusions: This preliminary study represents a relevant step towards the use of remote tools to support clinicians in monitoring the evolution of dysarthria.

CV-99-标题: RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN- Transformer Hybrid Framework

链接: https://arxiv.org/abs/2309.09003
作者: Yuelei Wang, Ting Zhang, Liangjin Zhao, Lin Hu, Zhechao Wang, Ziqing Niu, Peirui Cheng, Kaiqiang Chen, Xuan Zeng, Zhirui Wang, Hongqi Wang, Xian Sun
备注:

点击查看摘要

Abstract: In recent years, remote sensing (RS) vision foundation models such as RingMo have emerged and achieved excellent performance in various downstream tasks. However, the high demand for computing resources limits the application of these models on edge devices. It is necessary to design a more lightweight foundation model to support on-orbit RS image interpretation. Existing methods face challenges in achieving lightweight solutions while retaining generalization in RS image interpretation. This is due to the complex high and low-frequency spectral components in RS images, which make traditional single CNN or Vision Transformer methods unsuitable for the task. Therefore, this paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework, which effectively exploits the frequency-domain properties of RS to optimize the interpretation process. It is combined by the Transformer module as a low-pass filter to extract global features of RS images through a dual-branch structure, and the CNN module as a stacked high-pass filter to extract fine-grained details effectively. Furthermore, in the pretraining stage, the designed frequency-domain masked image modeling (FD-MIM) combines each image patch’s high-frequency and low-frequency characteristics, effectively capturing the latent feature representation in RS data. As shown in Fig. 1, compared with RingMo, the proposed RingMo-lite reduces the parameters over 60% in various RS image interpretation tasks, the average accuracy drops by less than 2% in most of the scenes and achieves SOTA performance compared to models of the similar size. In addition, our work will be integrated into the MindSpore computing platform in the near future.

CV-100-标题: OmniLRS: A Photorealistic Simulator for Lunar Robotics

链接: https://arxiv.org/abs/2309.08997
作者: Antoine Richard, Junnosuke Kamohara, Kentaro Uno, Shreya Santra, Dave van der Meer, Miguel Olivares-Mendez, Kazuya Yoshida
备注: 7 pages, 4 figures

点击查看摘要

Abstract: Developing algorithms for extra-terrestrial robotic exploration has always been challenging. Along with the complexity associated with these environments, one of the main issues remains the evaluation of said algorithms. With the regained interest in lunar exploration, there is also a demand for quality simulators that will enable the development of lunar robots. % In this paper, we explain how we built a Lunar simulator based on Isaac Sim, Nvidia’s robotic simulator. In this paper, we propose Omniverse Lunar Robotic-Sim (OmniLRS) that is a photorealistic Lunar simulator based on Nvidia’s robotic simulator. This simulation provides fast procedural environment generation, multi-robot capabilities, along with synthetic data pipeline for machine-learning applications. It comes with ROS1 and ROS2 bindings to control not only the robots, but also the environments. This work also performs sim-to-real rock instance segmentation to show the effectiveness of our simulator for image-based perception. Trained on our synthetic data, a yolov8 model achieves performance close to a model trained on real-world data, with 5% performance gap. When finetuned with real data, the model achieves 14% higher average precision than the model trained on real-world data, demonstrating our simulator’s photorealism.% to realize sim-to-real. The code is fully open-source, accessible here: this https URL, and comes with demonstrations.

CV-101-标题: RMP: A Random Mask Pretrain Framework for Motion Prediction ITSC2023

链接: https://arxiv.org/abs/2309.08989
作者: Yi Yang, Qingwen Zhang, Thomas Gilles, Nazre Batool, John Folkesson
备注: IEEE International Conference on Intelligent Transportation Systems (ITSC 2023)

点击查看摘要

Abstract: As the pretraining technique is growing in popularity, little work has been done on pretrained learning-based motion prediction methods in autonomous driving. In this paper, we propose a framework to formalize the pretraining task for trajectory prediction of traffic participants. Within our framework, inspired by the random masked model in natural language processing (NLP) and computer vision (CV), objects’ positions at random timesteps are masked and then filled in by the learned neural network (NN). By changing the mask profile, our framework can easily switch among a range of motion-related tasks. We show that our proposed pretraining framework is able to deal with noisy inputs and improves the motion prediction accuracy and miss rate, especially for objects occluded over time by evaluating it on Argoverse and NuScenes datasets.

CV-102-标题: FF-LOGO: Cross-Modality Point Cloud Registration with Feature Filtering and Local to Global Optimization

链接: https://arxiv.org/abs/2309.08966
作者: Nan Ma, Mohan Wang, Yiheng Han, Yong-Jin Liu
备注: 7 pages, 2 figures

点击查看摘要

Abstract: Cross-modality point cloud registration is confronted with significant challenges due to inherent differences in modalities between different sensors. We propose a cross-modality point cloud registration framework FF-LOGO: a cross-modality point cloud registration method with feature filtering and local-global optimization. The cross-modality feature correlation filtering module extracts geometric transformation-invariant features from cross-modality point clouds and achieves point selection by feature matching. We also introduce a cross-modality optimization process, including a local adaptive key region aggregation module and a global modality consistency fusion optimization module. Experimental results demonstrate that our two-stage optimization significantly improves the registration accuracy of the feature association and selection module. Our method achieves a substantial increase in recall rate compared to the current state-of-the-art methods on the 3DCSR dataset, improving from 40.59% to 75.74%. Our code will be available at this https URL.

CV-103-标题: Tightening Classification Boundaries in Open Set Domain Adaptation through Unknown Exploitation

链接: https://arxiv.org/abs/2309.08964
作者: Lucas Fernando Alvarenga e Silva, Nicu Sebe, Jurandy Almeida
备注:

点击查看摘要

Abstract: Convolutional Neural Networks (CNNs) have brought revolutionary advances to many research areas due to their capacity of learning from raw data. However, when those methods are applied to non-controllable environments, many different factors can degrade the model’s expected performance, such as unlabeled datasets with different levels of domain shift and category shift. Particularly, when both issues occur at the same time, we tackle this challenging setup as Open Set Domain Adaptation (OSDA) problem. In general, existing OSDA approaches focus their efforts only on aligning known classes or, if they already extract possible negative instances, use them as a new category learned with supervision during the course of training. We propose a novel way to improve OSDA approaches by extracting a high-confidence set of unknown instances and using it as a hard constraint to tighten the classification boundaries of OSDA methods. Especially, we adopt a new loss constraint evaluated in three different means, (1) directly with the pristine negative instances; (2) with randomly transformed negatives using data augmentation techniques; and (3) with synthetically generated negatives containing adversarial features. We assessed all approaches in an extensive set of experiments based on OVANet, where we could observe consistent improvements for two public benchmarks, the Office-31 and Office-Home datasets, yielding absolute gains of up to 1.3% for both Accuracy and H-Score on Office-31 and 5.8% for Accuracy and 4.7% for H-Score on Office-Home.

CV-104-标题: ExBluRF: Efficient Radiance Fields for Extreme Motion Blurred Images

链接: https://arxiv.org/abs/2309.08957
作者: Dongwoo Lee, Jeongtaek Oh, Jaesung Lim, Sunghyun Cho, Kyoung Mu Lee
备注:

点击查看摘要

Abstract: We present ExBluRF, a novel view synthesis method for extreme motion blurred images based on efficient radiance fields optimization. Our approach consists of two main components: 6-DOF camera trajectory-based motion blur formulation and voxel-based radiance fields. From extremely blurred images, we optimize the sharp radiance fields by jointly estimating the camera trajectories that generate the blurry images. In training, multiple rays along the camera trajectory are accumulated to reconstruct single blurry color, which is equivalent to the physical motion blur operation. We minimize the photo-consistency loss on blurred image space and obtain the sharp radiance fields with camera trajectories that explain the blur of all images. The joint optimization on the blurred image space demands painfully increasing computation and resources proportional to the blur size. Our method solves this problem by replacing the MLP-based framework to low-dimensional 6-DOF camera poses and voxel-based radiance fields. Compared with the existing works, our approach restores much sharper 3D scenes from challenging motion blurred views with the order of 10 times less training time and GPU memory consumption.

CV-105-标题: IntelliBeeHive: An Automated Honey Bee Pollen and Varroa Destructor Monitoring System

链接: https://arxiv.org/abs/2309.08955
作者: Christian I. Narcia-Macias, Joselito Guardado, Jocell Rodriguez, Joanne Rampersad-Ammons, Erik Enriquez, Dong-Chul Kim
备注:

点击查看摘要

Abstract: Utilizing computer vision and the latest technological advancements, in this study, we developed a honey bee monitoring system that aims to enhance our understanding of Colony Collapse Disorder, honey bee behavior, population decline, and overall hive health. The system is positioned at the hive entrance providing real-time data, enabling beekeepers to closely monitor the hive’s activity and health through an account-based website. Using machine learning, our monitoring system can accurately track honey bees, monitor pollen-gathering activity, and detect Varroa mites, all without causing any disruption to the honey bees. Moreover, we have ensured that the development of this monitoring system utilizes cost-effective technology, making it accessible to apiaries of various scales, including hobbyists, commercial beekeeping businesses, and researchers. The inference models used to detect honey bees, pollen, and mites are based on the YOLOv7-tiny architecture trained with our own data. The F1-score for honey bee model recognition is 0.95 and the precision and recall value is 0.981. For our pollen and mite object detection model F1-score is 0.95 and the precision and recall value is 0.821 for pollen and 0.996 for “mite”. The overall performance of our IntelliBeeHive system demonstrates its effectiveness in monitoring the honey bee’s activity, achieving an accuracy of 96.28 % in tracking and our pollen model achieved a F1-score of 0.831.

CV-106-标题: Robust Backdoor Attacks on Object Detection in Real World

链接: https://arxiv.org/abs/2309.08953
作者: Yaguan Qian, Boyuan Ji, Shuke He, Shenhui Huang, Xiang Ling, Bin Wang, Wei Wang
备注: 22 pages, 13figures

点击查看摘要

Abstract: Deep learning models are widely deployed in many applications, such as object detection in various security fields. However, these models are vulnerable to backdoor attacks. Most backdoor attacks were intensively studied on classified models, but little on object detection. Previous works mainly focused on the backdoor attack in the digital world, but neglect the real world. Especially, the backdoor attack’s effect in the real world will be easily influenced by physical factors like distance and illumination. In this paper, we proposed a variable-size backdoor trigger to adapt to the different sizes of attacked objects, overcoming the disturbance caused by the distance between the viewing point and attacked object. In addition, we proposed a backdoor training named malicious adversarial training, enabling the backdoor object detector to learn the feature of the trigger with physical noise. The experiment results show this robust backdoor attack (RBA) could enhance the attack success rate in the real world.

CV-107-标题: Staged Contact-Aware Global Human Motion Forecasting BMVC23

链接: https://arxiv.org/abs/2309.08947
作者: Luca Scofano, Alessio Sampieri, Elisabeth Schiele, Edoardo De Matteis, Laura Leal-Taixé, Fabio Galasso
备注: 15 pages, 7 figures, BMVC23 oral

点击查看摘要

Abstract: Scene-aware global human motion forecasting is critical for manifold applications, including virtual reality, robotics, and sports. The task combines human trajectory and pose forecasting within the provided scene context, which represents a significant challenge. So far, only Mao et al. NeurIPS’22 have addressed scene-aware global motion, cascading the prediction of future scene contact points and the global motion estimation. They perform the latter as the end-to-end forecasting of future trajectories and poses. However, end-to-end contrasts with the coarse-to-fine nature of the task and it results in lower performance, as we demonstrate here empirically. We propose a STAGed contact-aware global human motion forecasting STAG, a novel three-stage pipeline for predicting global human motion in a 3D environment. We first consider the scene and the respective human interaction as contact points. Secondly, we model the human trajectory forecasting within the scene, predicting the coarse motion of the human body as a whole. The third and last stage matches a plausible fine human joint motion to complement the trajectory considering the estimated contacts. Compared to the state-of-the-art (SoA), STAG achieves a 1.8% and 16.2% overall improvement in pose and trajectory prediction, respectively, on the scene-aware GTA-IM dataset. A comprehensive ablation study confirms the advantages of staged modeling over end-to-end approaches. Furthermore, we establish the significance of a newly proposed temporal counter called the “time-to-go”, which tells how long it is before reaching scene contact and endpoints. Notably, STAG showcases its ability to generalize to datasets lacking a scene and achieves a new state-of-the-art performance on CMU-Mocap, without leveraging any social cues. Our code is released at: this https URL

CV-108-标题: Universal Metric Learning with Parameter-Efficient Transfer Learning

链接: https://arxiv.org/abs/2309.08944
作者: Sungyeon Kim, Donghyun Kim, Suha Kwak
备注:

点击查看摘要

Abstract: A common practice in metric learning is to train and test an embedding model for each dataset. This dataset-specific approach fails to simulate real-world scenarios that involve multiple heterogeneous distributions of data. In this regard, we introduce a novel metric learning paradigm, called Universal Metric Learning (UML), which learns a unified distance metric capable of capturing relations across multiple data distributions. UML presents new challenges, such as imbalanced data distribution and bias towards dominant distributions. To address these challenges, we propose Parameter-efficient Universal Metric leArning (PUMA), which consists of a pre-trained frozen model and two additional modules, stochastic adapter and prompt pool. These modules enable to capture dataset-specific knowledge while avoiding bias towards dominant distributions. Additionally, we compile a new universal metric learning benchmark with a total of 8 different datasets. PUMA outperformed the state-of-the-art dataset-specific models while using about 69 times fewer trainable parameters.

CV-109-标题: AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose ICCV2023

链接: https://arxiv.org/abs/2309.08942
作者: Juntao Jian, Xiuping Liu, Manyi Li, Ruizhen Hu, Jian Liu
备注: Accepted by ICCV 2023

点击查看摘要

Abstract: How human interact with objects depends on the functional roles of the target objects, which introduces the problem of affordance-aware hand-object interaction. It requires a large number of human demonstrations for the learning and understanding of plausible and appropriate hand-object interactions. In this work, we present AffordPose, a large-scale dataset of hand-object interactions with affordance-driven hand pose. We first annotate the specific part-level affordance labels for each object, e.g. twist, pull, handle-grasp, etc, instead of the general intents such as use or handover, to indicate the purpose and guide the localization of the hand-object interactions. The fine-grained hand-object interactions reveal the influence of hand-centered affordances on the detailed arrangement of the hand poses, yet also exhibit a certain degree of diversity. We collect a total of 26.7K hand-object interactions, each including the 3D object shape, the part-level affordance label, and the manually adjusted hand poses. The comprehensive data analysis shows the common characteristics and diversity of hand-object interactions per affordance via the parameter statistics and contacting computation. We also conduct experiments on the tasks of hand-object affordance understanding and affordance-oriented hand-object interaction generation, to validate the effectiveness of our dataset in learning the fine-grained hand-object interactions. Project page: this https URL.

CV-110-标题: Semantics-aware LiDAR-Only Pseudo Point Cloud Generation for 3D Object Detection

链接: https://arxiv.org/abs/2309.08932
作者: Tiago Cortinhal, Idriss Gouigah, Eren Erdal Aksoy
备注:

点击查看摘要

Abstract: Although LiDAR sensors are crucial for autonomous systems due to providing precise depth information, they struggle with capturing fine object details, especially at a distance, due to sparse and non-uniform data. Recent advances introduced pseudo-LiDAR, i.e., synthetic dense point clouds, using additional modalities such as cameras to enhance 3D object detection. We present a novel LiDAR-only framework that augments raw scans with denser pseudo point clouds by solely relying on LiDAR sensors and scene semantics, omitting the need for cameras. Our framework first utilizes a segmentation model to extract scene semantics from raw point clouds, and then employs a multi-modal domain translator to generate synthetic image segments and depth cues without real cameras. This yields a dense pseudo point cloud enriched with semantic information. We also introduce a new semantically guided projection method, which enhances detection performance by retaining only relevant pseudo points. We applied our framework to different advanced 3D object detection methods and reported up to 2.9% performance upgrade. We also obtained comparable results on the KITTI 3D object detection dataset, in contrast to other state-of-the-art LiDAR-only detectors.

CV-111-标题: A Novel Neural-symbolic System under Statistical Relational Learning

链接: https://arxiv.org/abs/2309.08931
作者: Dongran Yu, Xueyan Liu, Shirui Pan, Anchen Li, Bo Yang
备注:

点击查看摘要

Abstract: A key objective in field of artificial intelligence is to develop cognitive models that can exhibit human-like intellectual capabilities. One promising approach to achieving this is through neural-symbolic systems, which combine the strengths of deep learning and symbolic reasoning. However, current approaches in this area have been limited in their combining way, generalization and interpretability. To address these limitations, we propose a general bi-level probabilistic graphical reasoning framework called GBPGR. This framework leverages statistical relational learning to effectively integrate deep learning models and symbolic reasoning in a mutually beneficial manner. In GBPGR, the results of symbolic reasoning are utilized to refine and correct the predictions made by the deep learning models. At the same time, the deep learning models assist in enhancing the efficiency of the symbolic reasoning process. Through extensive experiments, we demonstrate that our approach achieves high performance and exhibits effective generalization in both transductive and inductive tasks.

CV-112-标题: In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval ICCV2023

链接: https://arxiv.org/abs/2309.08928
作者: Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne
备注: Published at ICCV 2023, code: this https URL

点击查看摘要

Abstract: Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos without any paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles. To this end, we introduce a multi-style contrastive training procedure that improves the generalizability over several datasets simultaneously. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval and improve state-of-the-art performance on zero-shot text-video retrieval.

CV-113-标题: DynaMoN: Motion-Aware Fast And Robust Camera Localization for Dynamic NeRF

链接: https://arxiv.org/abs/2309.08927
作者: Mert Asim Karaoglu, Hannah Schieber, Nicolas Schischka, Melih Görgülü, Florian Grötzner, Alexander Ladikos, Daniel Roth, Nassir Navab, Benjamin Busam
备注: 6 pages, 4 figures

点击查看摘要

Abstract: Dynamic reconstruction with neural radiance fields (NeRF) requires accurate camera poses. These are often hard to retrieve with existing structure-from-motion (SfM) pipelines as both camera and scene content can change. We propose DynaMoN that leverages simultaneous localization and mapping (SLAM) jointly with motion masking to handle dynamic scene content. Our robust SLAM-based tracking module significantly accelerates the training process of the dynamic NeRF while improving the quality of synthesized views at the same time. Extensive experimental validation on TUM RGB-D, BONN RGB-D Dynamic and the DyCheck’s iPhone dataset, three real-world datasets, shows the advantages of DynaMoN both for camera pose estimation and novel view synthesis.

CV-114-标题: Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution

链接: https://arxiv.org/abs/2309.08919
作者: Wenyu Zhang, Xin Deng, Baojun Jia, Xingtong Yu, Yifan Chen, jin Ma, Qing Ding, Xinming Zhang
备注:

点击查看摘要

Abstract: Current Scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves 2-3 orders of magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss ( \mathcalL_lca ) to enhance the model’s perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7% and 2.6%, respectively, increasing the performance from 52.6% and 53.7% to 53.3% and 56.3%. The code is available at this https URL.

CV-115-标题: Delving into Multimodal Prompting for Fine-grained Visual Classification

链接: https://arxiv.org/abs/2309.08912
作者: Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li
备注: The first two authors contributed equally to this work

点击查看摘要

Abstract: Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pertaining (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlights the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.

CV-116-标题: V2CE: Video to Continuous Events Simulator

链接: https://arxiv.org/abs/2309.08891
作者: Zhongyang Zhang, Shuyang Cui, Kaidong Chai, Haowen Yu, Subhasis Dasgupta, Upal Mahbub, Tauhidur Rahman
备注: 6 pages, 7 figures

点击查看摘要

Abstract: Dynamic Vision Sensor (DVS)-based solutions have recently garnered significant interest across various computer vision tasks, offering notable benefits in terms of dynamic range, temporal resolution, and inference speed. However, as a relatively nascent vision sensor compared to Active Pixel Sensor (APS) devices such as RGB cameras, DVS suffers from a dearth of ample labeled datasets. Prior efforts to convert APS data into events often grapple with issues such as a considerable domain shift from real events, the absence of quantified validation, and layering problems within the time axis. In this paper, we present a novel method for video-to-events stream conversion from multiple perspectives, considering the specific characteristics of DVS. A series of carefully designed losses helps enhance the quality of generated event voxels significantly. We also propose a novel local dynamic-aware timestamp inference strategy to accurately recover event timestamps from event voxels in a continuous fashion and eliminate the temporal layering problem. Results from rigorous validation through quantified metrics at all stages of the pipeline establish our method unquestionably as the current state-of-the-art (SOTA).

CV-117-标题: GCL: Gradient-Guided Contrastive Learning for Medical Image Segmentation with Multi-Perspective Meta Labels

链接: https://arxiv.org/abs/2309.08888
作者: Yixuan Wu, Jintai Chen, Jiahuan Yan, Yiheng Zhu, Danny Z. Chen, Jian Wu
备注:

点击查看摘要

Abstract: Since annotating medical images for segmentation tasks commonly incurs expensive costs, it is highly desirable to design an annotation-efficient method to alleviate the annotation burden. Recently, contrastive learning has exhibited a great potential in learning robust representations to boost downstream tasks with limited labels. In medical imaging scenarios, ready-made meta labels (i.e., specific attribute information of medical images) inherently reveal semantic relationships among images, which have been used to define positive pairs in previous work. However, the multi-perspective semantics revealed by various meta labels are usually incompatible and can incur intractable “semantic contradiction” when combining different meta labels. In this paper, we tackle the issue of “semantic contradiction” in a gradient-guided manner using our proposed Gradient Mitigator method, which systematically unifies multi-perspective meta labels to enable a pre-trained model to attain a better high-level semantic recognition ability. Moreover, we emphasize that the fine-grained discrimination ability is vital for segmentation-oriented pre-training, and develop a novel method called Gradient Filter to dynamically screen pixel pairs with the most discriminating power based on the magnitude of gradients. Comprehensive experiments on four medical image segmentation datasets verify that our new method GCL: (1) learns informative image representations and considerably boosts segmentation performance with limited labels, and (2) shows promising generalizability on out-of-distribution datasets.

CV-118-标题: Enhancing Visual Perception in Novel Environments via Incremental Data Augmentation Based on Style Transfer

链接: https://arxiv.org/abs/2309.08851
作者: Abhibha Gupta, Rully Agus Hendrawan, Mansur Arief
备注:

点击查看摘要

Abstract: The deployment of autonomous agents in real-world scenarios is challenged by “unknown unknowns”, i.e. novel unexpected environments not encountered during training, such as degraded signs. While existing research focuses on anomaly detection and class imbalance, it often fails to address truly novel scenarios. Our approach enhances visual perception by leveraging the Variational Prototyping Encoder (VPE) to adeptly identify and handle novel inputs, then incrementally augmenting data using neural style transfer to enrich underrepresented data. By comparing models trained solely on original datasets with those trained on a combination of original and augmented datasets, we observed a notable improvement in the performance of the latter. This underscores the critical role of data augmentation in enhancing model robustness. Our findings suggest the potential benefits of incorporating generative models for domain-specific augmentation strategies.

CV-119-标题: MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation

链接: https://arxiv.org/abs/2309.08842
作者: Cheng Chen, Juzheng Miao, Dufan Wu, Zhiling Yan, Sekeun Kim, Jiang Hu, Aoxiao Zhong, Zhengliang Liu, Lichao Sun, Xiang Li, Tianming Liu, Pheng-Ann Heng, Quanzheng Li
备注:

点击查看摘要

Abstract: The Segment Anything Model (SAM), a foundation model for general image segmentation, has demonstrated impressive zero-shot performance across numerous natural image segmentation tasks. However, SAM’s performance significantly declines when applied to medical images, primarily due to the substantial disparity between natural and medical image domains. To effectively adapt SAM to medical images, it is important to incorporate critical third-dimensional information, i.e., volumetric or temporal knowledge, during fine-tuning. Simultaneously, we aim to harness SAM’s pre-trained weights within its original 2D backbone to the fullest extent. In this paper, we introduce a modality-agnostic SAM adaptation framework, named as MA-SAM, that is applicable to various volumetric and video medical data. Our method roots in the parameter-efficient fine-tuning strategy to update only a small portion of weight increments while preserving the majority of SAM’s pre-trained weights. By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from input data. The effectiveness of our method has been comprehensively evaluated on four medical image segmentation tasks, by using 10 public datasets across CT, MRI, and surgical video data. Remarkably, without using any prompt, our method consistently outperforms various state-of-the-art 3D approaches, surpassing nnU-Net by 0.9%, 2.6%, and 9.9% in Dice for CT multi-organ segmentation, MRI prostate segmentation, and surgical scene segmentation respectively. Our model also demonstrates strong generalization, and excels in challenging tumor segmentation when prompts are used. Our code is available at: this https URL.

CV-120-标题: AOSR-Net: All-in-One Sandstorm Removal Network ICTAI2023

链接: https://arxiv.org/abs/2309.08838
作者: Yazhong Si, Xulong Zhang, Fan Yang, Jianzong Wang, Ning Cheng, Jing Xiao
备注: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)

点击查看摘要

Abstract: Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure too complex. To solve the issue, we introduce a novel image restoration model, named all-in-one sandstorm removal network (AOSR-Net). This model is developed based on a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating intermediate parameters. Such integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms.

CV-121-标题: Dual-Camera Joint Deblurring-Denoising

链接: https://arxiv.org/abs/2309.08826
作者: Shayan Shekarforoush, Amanpreet Walia, Marcus A. Brubaker, Konstantinos G. Derpanis, Alex Levinshtein
备注: Project webpage: this http URL

点击查看摘要

Abstract: Recent image enhancement methods have shown the advantages of using a pair of long and short-exposure images for low-light photography. These image modalities offer complementary strengths and weaknesses. The former yields an image that is clean but blurry due to camera or object motion, whereas the latter is sharp but noisy due to low photon count. Motivated by the fact that modern smartphones come equipped with multiple rear-facing camera sensors, we propose a novel dual-camera method for obtaining a high-quality image. Our method uses a synchronized burst of short exposure images captured by one camera and a long exposure image simultaneously captured by another. Having a synchronized short exposure burst alongside the long exposure image enables us to (i) obtain better denoising by using a burst instead of a single image, (ii) recover motion from the burst and use it for motion-aware deblurring of the long exposure image, and (iii) fuse the two results to further enhance quality. Our method is able to achieve state-of-the-art results on synthetic dual-camera images from the GoPro dataset with five times fewer training parameters compared to the next best method. We also show that our method qualitatively outperforms competing approaches on real synchronized dual-camera captures.

CV-122-标题: EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding ICCV2023

链接: https://arxiv.org/abs/2309.08816
作者: Chenchen Zhu, Fanyi Xiao, Andres Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Chang Culatana, Roshan Sumbaly, Zhicheng Yan
备注: ICCV 2023 final version and supplement. See more details in project page: this https URL

点击查看摘要

Abstract: Object understanding in egocentric visual data is arguably a fundamental research topic in egocentric vision. However, existing object datasets are either non-egocentric or have limitations in object categories, visual content, and annotation granularities. In this work, we introduce EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. Its Pilot version contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. Unlike prior datasets containing only object category labels, EgoObjects also annotates each object with an instance-level identifier, and includes over 14K unique object instances. EgoObjects was designed to capture the same object under diverse background complexities, surrounding objects, distance, lighting and camera motion. In parallel to the data collection, we conducted data annotation by developing a multi-stage federated annotation process to accommodate the growing nature of the dataset. To bootstrap the research on EgoObjects, we present a suite of 4 benchmark tasks around the egocentric object understanding, including a novel instance level- and the classical category level object detection. Moreover, we also introduce 2 novel continual learning object detection tasks. The dataset and API are available at this https URL.

CV-123-标题: D3: Data Diversity Design for Systematic Generalization in Visual Question Answering

链接: https://arxiv.org/abs/2309.08798
作者: Amir Rahimi, Vanessa D’Amario, Moyuru Yamada, Kentaro Takemoto, Tomotake Sasaki, Xavier Boix
备注: Under review, 15 pages

点击查看摘要

Abstract: Systematic generalization is a crucial aspect of intelligence, which refers to the ability to generalize to novel tasks by combining known subtasks and concepts. One critical factor that has been shown to influence systematic generalization is the diversity of training data. However, diversity can be defined in various ways, as data have many factors of variation. A more granular understanding of how different aspects of data diversity affect systematic generalization is lacking. We present new evidence in the problem of Visual Question Answering (VQA) that reveals that the diversity of simple tasks (i.e. tasks formed by a few subtasks and concepts) plays a key role in achieving systematic generalization. This implies that it may not be essential to gather a large and varied number of complex tasks, which could be costly to obtain. We demonstrate that this result is independent of the similarity between the training and testing data and applies to well-known families of neural network architectures for VQA (i.e. monolithic architectures and neural module networks). Additionally, we observe that neural module networks leverage all forms of data diversity we evaluated, while monolithic architectures require more extensive amounts of data to do so. These findings provide a first step towards understanding the interactions between data diversity design, neural network architectures, and systematic generalization capabilities.

CV-124-标题: Privacy-preserving Early Detection of Epileptic Seizures in Videos MICCAI2023

链接: https://arxiv.org/abs/2309.08794
作者: Deval Mehta, Shobi Sivathamboo, Hugh Simpson, Patrick Kwan, Terence O`Brien, Zongyuan Ge
备注: Accepted to MICCAI 2023

点击查看摘要

Abstract: In this work, we contribute towards the development of video-based epileptic seizure classification by introducing a novel framework (SETR-PKD), which could achieve privacy-preserved early detection of seizures in videos. Specifically, our framework has two significant components - (1) It is built upon optical flow features extracted from the video of a seizure, which encodes the seizure motion semiotics while preserving the privacy of the patient; (2) It utilizes a transformer based progressive knowledge distillation, where the knowledge is gradually distilled from networks trained on a longer portion of video samples to the ones which will operate on shorter portions. Thus, our proposed framework addresses the limitations of the current approaches which compromise the privacy of the patients by directly operating on the RGB video of a seizure as well as impede real-time detection of a seizure by utilizing the full video sample to make a prediction. Our SETR-PKD framework could detect tonic-clonic seizures (TCSs) in a privacy-preserving manner with an accuracy of 83.9% while they are only half-way into their progression. Our data and code is available at this https URL

CV-125-标题: Rethinking Cross-Domain Pedestrian Detection: A Background-Focused Distribution Alignment Framework for Instance-Free One-Stage Detectors

链接: https://arxiv.org/abs/2309.08771
作者: Yancheng Cai, Bo Zhang, Baopu Li, Tao Chen, Hongliang Yan, Jingdong Zhang, Jiahao Xu
备注: This paper published on IEEE Transactions on Image Processing on August 2023.See this https URL

点击查看摘要

Abstract: Cross-domain pedestrian detection aims to generalize pedestrian detectors from one label-rich domain to another label-scarce domain, which is crucial for various real-world applications. Most recent works focus on domain alignment to train domain-adaptive detectors either at the instance level or image level. From a practical point of view, one-stage detectors are faster. Therefore, we concentrate on designing a cross-domain algorithm for rapid one-stage detectors that lacks instance-level proposals and can only perform image-level feature alignment. However, pure image-level feature alignment causes the foreground-background misalignment issue to arise, i.e., the foreground features in the source domain image are falsely aligned with background features in the target domain image. To address this issue, we systematically analyze the importance of foreground and background in image-level cross-domain alignment, and learn that background plays a more critical role in image-level cross-domain alignment. Therefore, we focus on cross-domain background feature alignment while minimizing the influence of foreground features on the cross-domain alignment stage. This paper proposes a novel framework, namely, background-focused distribution alignment (BFDA), to train domain adaptive onestage pedestrian detectors. Specifically, BFDA first decouples the background features from the whole image feature maps and then aligns them via a novel long-short-range discriminator.

CV-126-标题: The Use of Multi-Scale Fiducial Markers To Aid Takeoff and Landing Navigation by Rotorcraft

链接: https://arxiv.org/abs/2309.08769
作者: Jongwon Lee, Su Yeon Choi, Timothy Bretl
备注: Extended abstract accepted at the 2024 AIAA SciTech

点击查看摘要

Abstract: This paper quantifies the impact of adverse environmental conditions on the detection of fiducial markers (i.e., artificial landmarks) by color cameras mounted on rotorcraft. We restrict our attention to square markers with a black-and-white pattern of grid cells that can be nested to allow detection at multiple scales. These markers have the potential to enhance the reliability of precision takeoff and landing at vertiports by flying vehicles in urban settings. Prior work has shown, in particular, that these markers can be detected with high precision (i.e., few false positives) and high recall (i.e., few false negatives). However, most of this prior work has been based on image sequences collected indoors with hand-held cameras. Our work is based on image sequences collected outdoors with cameras mounted on a quadrotor during semi-autonomous takeoff and landing operations under adverse environmental conditions that include variations in temperature, illumination, wind speed, humidity, visibility, and precipitation. In addition to precision and recall, performance measures include continuity, availability, robustness, resiliency, and coverage volume. We release both our dataset and the code we used for analysis to the public as open source.

CV-127-标题: Biased Attention: Do Vision Transformer s Amplify Gender Bias More than Convolutional Neural Networks?

链接: https://arxiv.org/abs/2309.08760
作者: Abhishek Mandal, Susan Leavy, Suzanne Little
备注:

点击查看摘要

Abstract: Deep neural networks used in computer vision have been shown to exhibit many social biases such as gender bias. Vision Transformers (ViTs) have become increasingly popular in computer vision applications, outperforming Convolutional Neural Networks (CNNs) in many tasks such as image classification. However, given that research on mitigating bias in computer vision has primarily focused on CNNs, it is important to evaluate the effect of a different network architecture on the potential for bias amplification. In this paper we therefore introduce a novel metric to measure bias in architectures, Accuracy Difference. We examine bias amplification when models belonging to these two architectures are used as a part of large multimodal models, evaluating the different image encoders of Contrastive Language Image Pretraining which is an important model used in many generative models such as DALL-E and Stable Diffusion. Our experiments demonstrate that architecture can play a role in amplifying social biases due to the different techniques employed by the models for feature extraction and embedding as well as their different learning properties. This research found that ViTs amplified gender bias to a greater extent than CNNs

CV-128-标题: Unified Brain MR-Ultrasound Synthesis using Multi-Modal Hierarchical Representations MICCAI2023

链接: https://arxiv.org/abs/2309.08747
作者: Reuben Dorent, Nazim Haouchie, Fryderyk Kögl, Samuel Joutard, Parikshit Juvekar, Erickson Torio, Alexandra Golby, Sebastien Ourselin, Sarah Frisken, Tom Vercauteren, Tina Kapur, William M. Wells
备注: Accepted at MICCAI 2023

点击查看摘要

Abstract: We introduce MHVAE, a deep hierarchical variational auto-encoder (VAE) that synthesizes missing images from various modalities. Extending multi-modal VAEs with a hierarchical latent structure, we introduce a probabilistic formulation for fusing multi-modal images in a common latent representation while having the flexibility to handle incomplete image sets as input. Moreover, adversarial learning is employed to generate sharper images. Extensive experiments are performed on the challenging problem of joint intra-operative ultrasound (iUS) and Magnetic Resonance (MR) synthesis. Our model outperformed multi-modal VAEs, conditional GANs, and the current state-of-the-art unified method (ResViT) for synthesizing missing images, demonstrating the advantage of using a hierarchical latent representation and a principled probabilistic fusion operation. Our code is publicly available \urlthis https URL.

CV-129-标题: Improved Breast Cancer Diagnosis through Transfer Learning on Hematoxylin and Eosin Stained Histology Images

链接: https://arxiv.org/abs/2309.08745
作者: Fahad Ahmed, Reem Abdel-Salam, Leon Hamnett, Mary Adewunmi, Temitope Ayano
备注: 11 pages, 4 figures

点击查看摘要

Abstract: Breast cancer is one of the leading causes of death for women worldwide. Early screening is essential for early identification, but the chance of survival declines as the cancer progresses into advanced stages. For this study, the most recent BRACS dataset of histological (H&E) stained images was used to classify breast cancer tumours, which contains both the whole-slide images (WSI) and region-of-interest (ROI) images, however, for our study we have considered ROI images. We have experimented using different pre-trained deep learning models, such as Xception, EfficientNet, ResNet50, and InceptionResNet, pre-trained on the ImageNet weights. We pre-processed the BRACS ROI along with image augmentation, upsampling, and dataset split strategies. For the default dataset split, the best results were obtained by ResNet50 achieving 66% f1-score. For the custom dataset split, the best results were obtained by performing upsampling and image augmentation which results in 96.2% f1-score. Our second approach also reduced the number of false positive and false negative classifications to less than 3% for each class. We believe that our study significantly impacts the early diagnosis and identification of breast cancer tumors and their subtypes, especially atypical and malignant tumors, thus improving patient outcomes and reducing patient mortality rates. Overall, this study has primarily focused on identifying seven (7) breast cancer tumor subtypes, and we believe that the experimental models can be fine-tuned further to generalize over previous breast cancer histology datasets as well.

CV-130-标题: Personalized Food Image Classification: Benchmark Dataset s and New Baseline

链接: https://arxiv.org/abs/2309.08744
作者: Xinyue Pan, Jiangpeng He, Fengqing Zhu
备注: Accepted by IEEE Asilomar conference (2023)

点击查看摘要

Abstract: Food image classification is a fundamental step of image-based dietary assessment, enabling automated nutrient analysis from food images. Many current methods employ deep neural networks to train on generic food image datasets that do not reflect the dynamism of real-life food consumption patterns, in which food images appear sequentially over time, reflecting the progression of what an individual consumes. Personalized food classification aims to address this problem by training a deep neural network using food images that reflect the consumption pattern of each individual. However, this problem is under-explored and there is a lack of benchmark datasets with individualized food consumption patterns due to the difficulty in data collection. In this work, we first introduce two benchmark personalized datasets including the Food101-Personal, which is created based on surveys of daily dietary patterns from participants in the real world, and the VFNPersonal, which is developed based on a dietary study. In addition, we propose a new framework for personalized food image classification by leveraging self-supervised learning and temporal image feature information. Our method is evaluated on both benchmark datasets and shows improved performance compared to existing works. The dataset has been made available at: this https URL

CV-131-标题: Active Learning for Fine-Grained Sketch-Based Image Retrieval BMVC2023

链接: https://arxiv.org/abs/2309.08743
作者: Himanshu Thakur, Soumitri Chattopadhyay
备注: Accepted at BMVC 2023

点击查看摘要

Abstract: The ability to retrieve a photo by mere free-hand sketching highlights the immense potential of Fine-grained sketch-based image retrieval (FG-SBIR). However, its rapid practical adoption, as well as scalability, is limited by the expense of acquiring faithful sketches for easily available photo counterparts. A solution to this problem is Active Learning, which could minimise the need for labeled sketches while maximising performance. Despite extensive studies in the field, there exists no work that utilises it for reducing sketching effort in FG-SBIR tasks. To this end, we propose a novel active learning sampling technique that drastically minimises the need for drawing photo sketches. Our proposed approach tackles the trade-off between uncertainty and diversity by utilising the relationship between the existing photo-sketch pair to a photo that does not have its sketch and augmenting this relation with its intermediate representations. Since our approach relies only on the underlying data distribution, it is agnostic of the modelling approach and hence is applicable to other cross-modal instance-level retrieval tasks as well. With experimentation over two publicly available fine-grained SBIR datasets ChairV2 and ShoeV2, we validate our approach and reveal its superiority over adapted baselines.

CV-132-标题: Concept explainability for plant diseases classification

链接: https://arxiv.org/abs/2309.08739
作者: Jihen Amara, Birgitta König-Ries, Sheeba Samuel
备注: Accepted at VISAPP 2023

点击查看摘要

Abstract: Plant diseases remain a considerable threat to food security and agricultural sustainability. Rapid and early identification of these diseases has become a significant concern motivating several studies to rely on the increasing global digitalization and the recent advances in computer vision based on deep learning. In fact, plant disease classification based on deep convolutional neural networks has shown impressive performance. However, these methods have yet to be adopted globally due to concerns regarding their robustness, transparency, and the lack of explainability compared with their human experts counterparts. Methods such as saliency-based approaches associating the network output to perturbations of the input pixels have been proposed to give insights into these algorithms. Still, they are not easily comprehensible and not intuitive for human users and are threatened by bias. In this work, we deploy a method called Testing with Concept Activation Vectors (TCAV) that shifts the focus from pixels to user-defined concepts. To the best of our knowledge, our paper is the first to employ this method in the field of plant disease classification. Important concepts such as color, texture and disease related concepts were analyzed. The results suggest that concept-based explanation methods can significantly benefit automated plant disease identification.

CV-133-标题: AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

链接: https://arxiv.org/abs/2309.08738
作者: Xingjian Diao, Ming Cheng, Shitong Cheng
备注:

点击查看摘要

Abstract: Learning high-quality video representation has shown significant applications in computer vision and remains challenging. Previous work based on mask autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of learning representations in images and videos through reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when dealing with low-resolution and blurry original videos. Based on this, we propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our result of the video classification task on the UCF101 dataset outperforms the existing work and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.

CV-134-标题: Segmentation of Tubular Structures Using Iterative Training with Tailored Samples ICCV

链接: https://arxiv.org/abs/2309.08727
作者: Wei Liao
备注: Accepted to IEEE/CVF International Conference on Computer Vision (ICCV), Paris, 2023

点击查看摘要

Abstract: We propose a minimal path method to simultaneously compute segmentation masks and extract centerlines of tubular structures with line-topology. Minimal path methods are commonly used for the segmentation of tubular structures in a wide variety of applications. Recent methods use features extracted by CNNs, and often outperform methods using hand-tuned features. However, for CNN-based methods, the samples used for training may be generated inappropriately, so that they can be very different from samples encountered during inference. We approach this discrepancy by introducing a novel iterative training scheme, which enables generating better training samples specifically tailored for the minimal path methods without changing existing annotations. In our method, segmentation masks and centerlines are not determined after one another by post-processing, but obtained using the same steps. Our method requires only very few annotated training images. Comparison with seven previous approaches on three public datasets, including satellite images and medical images, shows that our method achieves state-of-the-art results both for segmentation masks and centerlines.

CV-135-标题: Performance Metrics for Probabilistic Ordinal Classifiers MICCAI2023

链接: https://arxiv.org/abs/2309.08701
作者: Adrian Galdran
备注: Accepted to MICCAI 2023

点击查看摘要

Abstract: Ordinal classification models assign higher penalties to predictions further away from the true class. As a result, they are appropriate for relevant diagnostic tasks like disease progression prediction or medical image grading. The consensus for assessing their categorical predictions dictates the use of distance-sensitive metrics like the Quadratic-Weighted Kappa score or the Expected Cost. However, there has been little discussion regarding how to measure performance of probabilistic predictions for ordinal classifiers. In conventional classification, common measures for probabilistic predictions are Proper Scoring Rules (PSR) like the Brier score, or Calibration Errors like the ECE, yet these are not optimal choices for ordinal classification. A PSR named Ranked Probability Score (RPS), widely popular in the forecasting field, is more suitable for this task, but it has received no attention in the image analysis community. This paper advocates the use of the RPS for image grading tasks. In addition, we demonstrate a counter-intuitive and questionable behavior of this score, and propose a simple fix for it. Comprehensive experiments on four large-scale biomedical image grading problems over three different datasets show that the RPS is a more suitable performance metric for probabilistic ordinal predictions. Code to reproduce our experiments can be found at this https URL .

CV-136-标题: BANSAC: A dynamic BAyesian Network for adaptive SAmple Consensus ICCV2023

链接: https://arxiv.org/abs/2309.08690
作者: Valter Piedade, Pedro Miraldo
备注: ICCV 2023 paper

点击查看摘要

Abstract: RANSAC-based algorithms are the standard techniques for robust estimation in computer vision. These algorithms are iterative and computationally expensive; they alternate between random sampling of data, computing hypotheses, and running inlier counting. Many authors tried different approaches to improve efficiency. One of the major improvements is having a guided sampling, letting the RANSAC cycle stop sooner. This paper presents a new adaptive sampling process for RANSAC. Previous methods either assume no prior information about the inlier/outlier classification of data points or use some previously computed scores in the sampling. In this paper, we derive a dynamic Bayesian network that updates individual data points’ inlier scores while iterating RANSAC. At each iteration, we apply weighted sampling using the updated scores. Our method works with or without prior data point scorings. In addition, we use the updated inlier/outlier scoring for deriving a new stopping criterion for the RANSAC loop. We test our method in multiple real-world datasets for several applications and obtain state-of-the-art results. Our method outperforms the baselines in accuracy while needing less computational time.

CV-137-标题: An inspection technology of inner surface of the fine hole based on machine vision

链接: https://arxiv.org/abs/2309.08649
作者: Rongfang He, Weibin Zhang, Guofang Gao
备注:

点击查看摘要

Abstract: Fine holes are an important structural component of industrial components, and their inner surface quality is closely related to their this http URL order to detect the quality of the inner surface of the fine hole,a special optical measurement system was investigated in this paper. A sight pipe is employed to guide the external illumination light into the fine hole and output the relevant images simultaneously. A flexible light array is introduced to suit the narrow space, and the effective field of view is analyzed. Besides, the arc surface projection error and manufacturing assembly error of the device are analyzed, then compensated or ignored if small enough. In the test of prefabricated circular defects with the diameter \phi0.1mm, \phi0.2mm, 0.4mm distance distribution and the fissure defects with the width 0.3mm, the maximum measurement error standard deviation are all about 10\mum. The minimum diameter of the measured fine hole is 4mm and the depth can reach 47mm.

CV-138-标题: Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild ICCV2023

链接: https://arxiv.org/abs/2309.08644
作者: Sungchan Park, Eunyi You, Inhoe Lee, Joonseok Lee
备注: Published at ICCV 2023

点击查看摘要

Abstract: 3D pose estimation is an invaluable task in computer vision with various practical applications. Especially, 3D pose estimation for multi-person from a monocular video (3DMPPE) is particularly challenging and is still largely uncharted, far from applying to in-the-wild scenarios yet. We pose three unresolved issues with the existing methods: lack of robustness on unseen views during training, vulnerability to occlusion, and severe jittering in the output. As a remedy, we propose POTR-3D, the first realization of a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE, powered by a novel geometry-aware data augmentation strategy, capable of generating unbounded data with a variety of views while caring about the ground plane and occlusions. Through extensive experiments, we verify that the proposed model and data augmentation robustly generalizes to diverse unseen views, robustly recovers the poses against heavy occlusions, and reliably generates more natural and smoother outputs. The effectiveness of our approach is verified not only by achieving the state-of-the-art performance on public benchmarks, but also by qualitative results on more challenging in-the-wild videos. Demo videos are available at this https URL.

CV-139-标题: Monitoring Urban Changes in Mariupol/Ukraine in 2022/23

链接: https://arxiv.org/abs/2309.08607
作者: Georg Zitzlsberger, Michal Podhoranyi
备注:

点击查看摘要

Abstract: The ability to constantly monitor urban changes is of large socio-economic interest. Previous works have already shown approaches in this field with the use of Deep Neural Networks (DNNs) and transfer learning. However, they fell short in demonstrating temporal scale outside of either the training or transfer domain. This work builds on existing research and proves that transfer learning with the use of historic data is a feasible solution, which still allows the urban change monitoring of later years. We considered a case with limited access to public and free Very High Resolution (VHR) imagery to guide the transfer. To provide a high temporal resolution, the core data of our monitoring method comprised multi-modal Synthetic Aperture Radar (SAR) and optical multispectral observations from Sentinel 1 and Sentinel 2, respectively. We chose a practical application of our methods for monitoring urban-related changes in the city of Mariupol in Ukraine during the beginning of the Russo-Ukrainian War in 2022/23. During this conflict, availability of VHR data was limited and hence an inexpensive direct transfer to the years 2022/23 was rendered impossible. Instead, a transfer was made for the years 2017-2020 that provided sufficient public and free VHR data with an application of the transferred model in the years late 2021 to mid-2023. It was shown that transferring for the years 2017-2020 with this inexpensive historical VHR data enabled monitoring during times of war in 2022/23. An ablation study on the impact of the frequency of observations showed our method as resilient to even a large loss of observations. However, it also indicated that our method, despite the multi-modal input, was more dependent on optical observations than SAR observations. Neither the indirect transfer, nor the malfunction of Sentinel 1B had a significant impact on the monitoring capabilities of our method.

CV-140-标题: vSHARP: variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse-Problems

链接: https://arxiv.org/abs/2309.09954
作者: George Yiasemis, Nikita Moriakov, Jan-Jakob Sonke, Jonas Teuwen
备注:

点击查看摘要

Abstract: Medical Imaging (MI) tasks, such as accelerated Parallel Magnetic Resonance Imaging (MRI), often involve reconstructing an image from noisy or incomplete measurements. This amounts to solving ill-posed inverse problems, where a satisfactory closed-form analytical solution is not available. Traditional methods such as Compressed Sensing (CS) in MRI reconstruction can be time-consuming or prone to obtaining low-fidelity images. Recently, a plethora of supervised and self-supervised Deep Learning (DL) approaches have demonstrated superior performance in inverse-problem solving, surpassing conventional methods. In this study, we propose vSHARP (variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse Problems), a novel DL-based method for solving ill-posed inverse problems arising in MI. vSHARP utilizes the Half-Quadratic Variable Splitting method and employs the Alternating Direction Method of Multipliers (ADMM) to unroll the optimization process. For data consistency, vSHARP unrolls a differentiable gradient descent process in the image domain, while a DL-based denoiser, such as a U-Net architecture, is applied to enhance image quality. vSHARP also employs a dilated-convolution DL-based model to predict the Lagrange multipliers for the ADMM initialization. We evaluate the proposed model by applying it to the task of accelerated Parallel MRI Reconstruction on two distinct datasets. We present a comparative analysis of our experimental results with state-of-the-art approaches, highlighting the superior performance of vSHARP.

CV-141-标题: Quantum Vision Clustering

链接: https://arxiv.org/abs/2309.09907
作者: Xuan Bac Nguyen, Benjamin Thompson, Hugh Churchill, Khoa Luu, Samee U. Khan
备注:

点击查看摘要

Abstract: Unsupervised visual clustering has recently received considerable attention. It aims to explain distributions of unlabeled visual images by clustering them via a parameterized appearance model. From a different perspective, the clustering algorithms can be treated as assignment problems, often NP-hard. They can be solved precisely for small instances on current hardware. Adiabatic quantum computing (AQC) offers a solution, as it can soon provide a considerable speedup on a range of NP-hard optimization problems. However, current clustering formulations are unsuitable for quantum computing due to their scaling properties. Consequently, in this work, we propose the first clustering formulation designed to be solved with AQC. We employ an Ising model representing the quantum mechanical system implemented on the AQC. Our approach is competitive compared to state-of-the-art optimization-based approaches, even using of-the-shelf integer programming solvers. Finally, we demonstrate that our clustering problem is already solvable on the current generation of real quantum computers for small examples and analyze the properties of the measured solutions.

CV-142-标题: Self-supervised TransUNet for Ultrasound regional segmentation of the distal radius in children

链接: https://arxiv.org/abs/2309.09490
作者: Yuyue Zhou, Jessica Knight, Banafshe Felfeliyan, Christopher Keen, Abhilash Rakkunedeth Hareendranathan, Jacob L. Jaremko
备注:

点击查看摘要

Abstract: Supervised deep learning offers great promise to automate analysis of medical images from segmentation to diagnosis. However, their performance highly relies on the quality and quantity of the data annotation. Meanwhile, curating large annotated datasets for medical images requires a high level of expertise, which is time-consuming and expensive. Recently, to quench the thirst for large data sets with high-quality annotation, self-supervised learning (SSL) methods using unlabeled domain-specific data, have attracted attention. Therefore, designing an SSL method that relies on minimal quantities of labeled data has far-reaching significance in medical images. This paper investigates the feasibility of deploying the Masked Autoencoder for SSL (SSL-MAE) of TransUNet, for segmenting bony regions from children’s wrist ultrasound scans. We found that changing the embedding and loss function in SSL-MAE can produce better downstream results compared to the original SSL-MAE. In addition, we determined that only pretraining TransUNet embedding and encoder with SSL-MAE does not work as well as TransUNet without SSL-MAE pretraining on downstream segmentation tasks.

CV-143-标题: An Accurate and Efficient Neural Network for OCTA Vessel Segmentation and a New Dataset

链接: https://arxiv.org/abs/2309.09483
作者: Haojian Ning, Chengliang Wang, Xinrun Chen, Shiying Li
备注:

点击查看摘要

Abstract: Optical coherence tomography angiography (OCTA) is a noninvasive imaging technique that can reveal high-resolution retinal vessels. In this work, we propose an accurate and efficient neural network for retinal vessel segmentation in OCTA images. The proposed network achieves accuracy comparable to other SOTA methods, while having fewer parameters and faster inference speed (e.g. 110x lighter and 1.3x faster than U-Net), which is very friendly for industrial applications. This is achieved by applying the modified Recurrent ConvNeXt Block to a full resolution convolutional network. In addition, we create a new dataset containing 918 OCTA images and their corresponding vessel annotations. The data set is semi-automatically annotated with the help of Segment Anything Model (SAM), which greatly improves the annotation speed. For the benefit of the community, our code and dataset can be obtained from this https URL.

CV-144-标题: Joint Demosaicing and Denoising with Double Deep Image Priors

链接: https://arxiv.org/abs/2309.09426
作者: Taihui Li, Anish Lahiri, Yutong Dai, Owen Mayer
备注:

点击查看摘要

Abstract: Demosaicing and denoising of RAW images are crucial steps in the processing pipeline of modern digital cameras. As only a third of the color information required to produce a digital image is captured by the camera sensor, the process of demosaicing is inherently ill-posed. The presence of noise further exacerbates this problem. Performing these two steps sequentially may distort the content of the captured RAW images and accumulate errors from one step to another. Recent deep neural-network-based approaches have shown the effectiveness of joint demosaicing and denoising to mitigate such challenges. However, these methods typically require a large number of training samples and do not generalize well to different types and intensities of noise. In this paper, we propose a novel joint demosaicing and denoising method, dubbed JDD-DoubleDIP, which operates directly on a single RAW image without requiring any training data. We validate the effectiveness of our method on two popular datasets – Kodak and McMaster – with various noises and noise intensities. The experimental results show that our method consistently outperforms other compared methods in terms of PSNR, SSIM, and qualitative visual perception.

CV-145-标题: BRONCO: Automated modelling of the bronchovascular bundle using the Computed Tomography Images

链接: https://arxiv.org/abs/2309.09410
作者: Wojciech Prażuch, Marek Socha, Anna Mrukwa, Aleksandra Suwalska, Agata Durawa, Malgorzata Jelitto-Górska, Katarzyna Dziadziuszko, Edyta Szurowska, Pawel Bożek, Michal Marczyk, Witold Rzyman, Joanna Polanska
备注:

点击查看摘要

Abstract: Segmentation of the bronchovascular bundle within the lung parenchyma is a key step for the proper analysis and planning of many pulmonary diseases. It might also be considered the preprocessing step when the goal is to segment the nodules from the lung parenchyma. We propose a segmentation pipeline for the bronchovascular bundle based on the Computed Tomography images, returning either binary or labelled masks of vessels and bronchi situated in the lung parenchyma. The method consists of two modules, modeling of the bronchial tree and vessels. The core revolves around a similar pipeline, the determination of the initial perimeter by the GMM method, skeletonization, and hierarchical analysis of the created graph. We tested our method on both low-dose CT and standard-dose CT, with various pathologies, reconstructed with various slice thicknesses, and acquired from various machines. We conclude that the method is invariant with respect to the origin and parameters of the CT series. Our pipeline is best suited for studies with healthy patients, patients with lung nodules, and patients with emphysema.

CV-146-标题: Deep conditional generative models for longitudinal single-slice abdominal computed tomography harmonization

链接: https://arxiv.org/abs/2309.09392
作者: Xin Yu, Qi Yang, Yucheng Tang, Riqiang Gao, Shunxing Bao, Leon Y. Cai, Ho Hin Lee, Yuankai Huo, Ann Zenobia Moore, Luigi Ferrucci, Bennett A. Landman
备注:

点击查看摘要

Abstract: Two-dimensional single-slice abdominal computed tomography (CT) provides a detailed tissue map with high resolution allowing quantitative characterization of relationships between health conditions and aging. However, longitudinal analysis of body composition changes using these scans is difficult due to positional variation between slices acquired in different years, which leading to different organs/tissues captured. To address this issue, we propose C-SliceGen, which takes an arbitrary axial slice in the abdominal region as a condition and generates a pre-defined vertebral level slice by estimating structural changes in the latent space. Our experiments on 2608 volumetric CT data from two in-house datasets and 50 subjects from the 2015 Multi-Atlas Abdomen Labeling Challenge dataset (BTCV) Challenge demonstrate that our model can generate high-quality images that are realistic and similar. We further evaluate our method’s capability to harmonize longitudinal positional variation on 1033 subjects from the Baltimore Longitudinal Study of Aging (BLSA) dataset, which contains longitudinal single abdominal slices, and confirmed that our method can harmonize the slice positional variance in terms of visceral fat area. This approach provides a promising direction for mapping slices from different vertebral levels to a target slice and reducing positional variance for single-slice longitudinal analysis. The source code is available at: this https URL.

CV-147-标题: All-optical image denoising using a diffractive visual processor

链接: https://arxiv.org/abs/2309.09215
作者: Cagatay Isıl, Tianyi Gan, F. Onuralp Ardic, Koray Mentesoglu, Jagrit Digani, Huseyin Karaca, Hanlong Chen, Jingxi Li, Deniz Mengu, Mona Jarrahi, Kaan Akşit, Aydogan Ozcan
备注: 21 Pages, 7 Figures

点击查看摘要

Abstract: Image denoising, one of the essential inverse problems, targets to remove noise/artifacts from input images. In general, digital image denoising algorithms, executed on computers, present latency due to several iterations implemented in, e.g., graphics processing units (GPUs). While deep learning-enabled methods can operate non-iteratively, they also introduce latency and impose a significant computational burden, leading to increased power consumption. Here, we introduce an analog diffractive image denoiser to all-optically and non-iteratively clean various forms of noise and artifacts from input images - implemented at the speed of light propagation within a thin diffractive visual processor. This all-optical image denoiser comprises passive transmissive layers optimized using deep learning to physically scatter the optical modes that represent various noise features, causing them to miss the output image Field-of-View (FoV) while retaining the object features of interest. Our results show that these diffractive denoisers can efficiently remove salt and pepper noise and image rendering-related spatial artifacts from input phase or intensity images while achieving an output power efficiency of ~30-40%. We experimentally demonstrated the effectiveness of this analog denoiser architecture using a 3D-printed diffractive visual processor operating at the terahertz spectrum. Owing to their speed, power-efficiency, and minimal computational overhead, all-optical diffractive denoisers can be transformative for various image display and projection systems, including, e.g., holographic displays.

信息检索

IR-0-标题: Predictive Uncertainty-based Bias Mitigation in Ranking

链接: https://arxiv.org/abs/2309.09833
作者: Maria Heuss, Daniel Cohen, Masoud Mansoury, Maarten de Rijke, Carsten Eickhoff
备注:

点击查看摘要

Abstract: Societal biases that are contained in retrieved documents have received increased interest. Such biases, which are often prevalent in the training data and learned by the model, can cause societal harms, by misrepresenting certain groups, and by enforcing stereotypes. Mitigating such biases demands algorithms that balance the trade-off between maximized utility for the user with fairness objectives, which incentivize unbiased rankings. Prior work on bias mitigation often assumes that ranking scores, which correspond to the utility that a document holds for a user, can be accurately determined. In reality, there is always a degree of uncertainty in the estimate of expected document utility. This uncertainty can be approximated by viewing ranking models through a Bayesian perspective, where the standard deterministic score becomes a distribution. In this work, we investigate whether uncertainty estimates can be used to decrease the amount of bias in the ranked results, while minimizing loss in measured utility. We introduce a simple method that uses the uncertainty of the ranking scores for an uncertainty-aware, post hoc approach to bias mitigation. We compare our proposed method with existing baselines for bias mitigation with respect to the utility-fairness trade-off, the controllability of methods, and computational costs. We show that an uncertainty-based approach can provide an intuitive and flexible trade-off that outperforms all baselines without additional training requirements, allowing for the post hoc use of this approach on top of arbitrary retrieval models.

IR-1-标题: How Much Freedom Does An Effectiveness Metric Really Have?

链接: https://arxiv.org/abs/2309.09477
作者: Alistair Moffat, Joel Mackenzie
备注:

点击查看摘要

Abstract: It is tempting to assume that because effectiveness metrics have free choice to assign scores to search engine result pages (SERPs) there must thus be a similar degree of freedom as to the relative order that SERP pairs can be put into. In fact that second freedom is, to a considerable degree, illusory. That’s because if one SERP in a pair has been given a certain score by a metric, fundamental ordering constraints in many cases then dictate that the score for the second SERP must be either not less than, or not greater than, the score assigned to the first SERP. We refer to these fixed relationships as innate pairwise SERP orderings. Our first goal in this work is to describe and defend those pairwise SERP relationship constraints, and tabulate their relative occurrence via both exhaustive and empirical experimentation. We then consider how to employ such innate pairwise relationships in IR experiments, leading to a proposal for a new measurement paradigm. Specifically, we argue that tables of results in which many different metrics are listed for champion versus challenger system comparisons should be avoided; and that instead a single metric be argued for in principled terms, with any relationships identified by that metric then reinforced via an assessment of the innate relationship as to whether other metrics - indeed, all other metrics - are likely to yield the same system-vs-system outcome.

IR-2-标题: Fairness for All: Investigating Harms to Within-Group Individuals in Producer Fairness Re-ranking Optimization – A Reproducibility Study

链接: https://arxiv.org/abs/2309.09277
作者: Giovanni Pellegrini, Vittorio Maria Faraco, Yashar Deldjoo
备注:

点击查看摘要

Abstract: Recommender systems are widely used to provide personalized recommendations to users. Recent research has shown that recommender systems may be subject to different types of biases, such as popularity bias, leading to an uneven distribution of recommendation exposure among producer groups. To mitigate this, producer-centered fairness re-ranking (PFR) approaches have been proposed to ensure equitable recommendation utility across groups. However, these approaches overlook the harm they may cause to within-group individuals associated with colder items, which are items with few or no interactions. This study reproduces previous PFR approaches and shows that they significantly harm colder items, leading to a fairness gap for these items in both advantaged and disadvantaged groups. Surprisingly, the unfair base recommendation models were providing greater exposure opportunities to these individual cold items, even though at the group level, they appeared to be unfair. To address this issue, the study proposes an amendment to the PFR approach that regulates the number of colder items recommended by the system. This modification achieves a balance between accuracy and producer fairness while optimizing the selection of colder items within each group, thereby preventing or reducing harm to within-group individuals and augmenting the novelty of all recommended items. The proposed method is able to register an increase in sub-group fairness (SGF) from 0.3104 to 0.3782, 0.6156, and 0.9442 while also improving group-level fairness (GF) (112% and 37% with respect to base models and traditional PFR). Moreover, the proposed method achieves these improvements with minimal or no reduction in accuracy (or even an increase sometimes). We evaluate the proposed method on various recommendation datasets and demonstrate promising results independent of the underlying model or datasets.

IR-3-标题: Leveraging Large Language Models for Sequential Recommendation

链接: https://arxiv.org/abs/2309.09261
作者: Jesse Harte, Wouter Zorgdrager, Panos Louridas, Asterios Katsifodimos, Dietmar Jannach, Marios Fragkoulis
备注: 9 pages

点击查看摘要

Abstract: Sequential recommendation problems have received increasing attention in research during the past few years, leading to the inception of a large variety of algorithmic approaches. In this work, we explore how large language models (LLMs), which are nowadays introducing disruptive effects in many AI-based applications, can be used to build or improve sequential recommendation approaches. Specifically, we devise and evaluate three approaches to leverage the power of LLMs in different ways. Our results from experiments on two datasets show that initializing the state-of-the-art sequential recommendation model BERT4Rec with embeddings obtained from an LLM improves NDCG by 15-20% compared to the vanilla BERT4Rec model. Furthermore, we find that a simple approach that leverages LLM embeddings for producing recommendations, can provide competitive performance by highlighting semantically related items. We publicly share the code and data of our experiments to ensure reproducibility.

IR-4-标题: SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription ICASSP2024

链接: https://arxiv.org/abs/2309.09085
作者: Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan
备注: Submitted to ICASSP2024

点击查看摘要

Abstract: Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education and entertainment. Existing datasets are limited in size and scope, causing state-of-the-art GTT models trained on such datasets to suffer from overfitting and to fail in generalization across datasets. To address this issue, we developed a methodology for synthesizing SynthTab, a large-scale guitar tablature transcription dataset using multiple commercial acoustic and electric guitar plugins. This dataset is built on tablatures from DadaGP, which offers a vast collection and the degree of specificity we wish to transcribe. The proposed synthesis pipeline produces audio which faithfully adheres to the original fingerings, styles, and techniques specified in the tablature with diverse timbre. Experiments show that pre-training state-of-the-art GTT model on SynthTab improves transcription accuracy in same-dataset tests. More importantly, it significantly mitigates overfitting problems of GTT models in cross-dataset evaluation.

IR-5-标题: Bridging Dense and Sparse Maximum Inner Product Search

链接: https://arxiv.org/abs/2309.09013
作者: Sebastian Bruch, Franco Maria Nardini, Amir Ingber, Edo Liberty
备注:

点击查看摘要

Abstract: Maximum inner product search (MIPS) over dense and sparse vectors have progressed independently in a bifurcated literature for decades; the latter is better known as top- k retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals. That is despite the fact that they are manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top- k retrieval methods. We study IVF-based retrieval where vectors are partitioned into clusters and only a fraction of clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that IVF serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and demonstrate their potential. First, we cast the IVF paradigm as a dynamic pruning technique and turn that insight into a novel organization of the inverted index for approximate MIPS for general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, and show its robustness to query distributions.

IR-6-标题: Analysis and Extraction of Tempo-Spatial Events in an Efficient Archival CDN with Emphasis on Telegram

链接: https://arxiv.org/abs/2309.08924
作者: Melika Bahman-Abadi, M. B. Ghaznavi-Ghoushchi
备注:

点击查看摘要

Abstract: This paper presents an efficient archival framework for exploring and tracking cyberspace large-scale data called Tempo-Spatial Content Delivery Network (TS-CDN). Social media data streams are renewing in time and spatial dimensions. Various types of websites and social networks (i.e., channels, groups, pages, etc.) are considered spatial in cyberspace. Accurate analysis entails encompassing the bulk of data. In TS-CDN by applying the hash function on big data an efficient content delivery network is created. Using hash function rebuffs data redundancy and leads to conclude unique data archive in large-scale. This framework based on entered query allows for apparent monitoring and exploring data in tempo-spatial dimension based on TF-IDF score. Also by conformance from i18n standard, the Unicode problem has been dissolved. For evaluation of TS-CDN framework, a dataset from Telegram news channels from March 23, 2020 (1399-01-01), to September 21, 2020 (1399-06-31) on topics including Coronavirus (COVID-19), vaccine, school reopening, flood, earthquake, justice shares, petroleum, and quarantine exploited. By applying hash on Telegram dataset in the mentioned time interval, a significant reduction in media files such as 39.8% for videos (from 79.5 GB to 47.8 GB), and 10% for images (from 4 GB to 3.6 GB) occurred. TS-CDN infrastructure in a web-based approach has been presented as a service-oriented system. Experiments conducted on enormous time series data, including different spatial dimensions (i.e., Khabare Fouri, Khabarhaye Fouri, Akhbare Rouze Iran, and Akhbare Rasmi Telegram news channels), demonstrate the efficiency and applicability of the implemented TS-CDN framework.

IR-7-标题: Reproducible Domain-Specific Knowledge Graphs in the Life Sciences: a Systematic Literature Review

链接: https://arxiv.org/abs/2309.08754
作者: Samira Babalou, Sheeba Samuel, Birgitta König-Ries
备注:

点击查看摘要

Abstract: Knowledge graphs (KGs) are widely used for representing and organizing structured knowledge in diverse domains. However, the creation and upkeep of KGs pose substantial challenges. Developing a KG demands extensive expertise in data modeling, ontology design, and data curation. Furthermore, KGs are dynamic, requiring continuous updates and quality control to ensure accuracy and relevance. These intricacies contribute to the considerable effort required for their development and maintenance. One critical dimension of KGs that warrants attention is reproducibility. The ability to replicate and validate KGs is fundamental for ensuring the trustworthiness and sustainability of the knowledge they represent. Reproducible KGs not only support open science by allowing others to build upon existing knowledge but also enhance transparency and reliability in disseminating information. Despite the growing number of domain-specific KGs, a comprehensive analysis concerning their reproducibility has been lacking. This paper addresses this gap by offering a general overview of domain-specific KGs and comparing them based on various reproducibility criteria. Our study over 19 different domains shows only eight out of 250 domain-specific KGs (3.2%) provide publicly available source code. Among these, only one system could successfully pass our reproducibility assessment (14.3%). These findings highlight the challenges and gaps in achieving reproducibility across domain-specific KGs. Our finding that only 0.4% of published domain-specific KGs are reproducible shows a clear need for further research and a shift in cultural practices.

人工智能

AI-0-标题: Artificial Intelligence for Web 3.0: A Comprehensive Survey

链接: https://arxiv.org/abs/2309.09972
作者: Meng Shen, Zhehui Tan, Dusit Niyato, Yuzhi Liu, Jiawen Kang, Zehui Xiong, Liehuang Zhu, Wei Wang, Xuemin (Sherman) Shen
备注:

点击查看摘要

Abstract: Web 3.0 is the new generation of the Internet that is reconstructed with distributed technology, which focuses on data ownership and value expression. Also, it operates under the principle that data and digital assets should be owned and controlled by users rather than large corporations. In this survey, we explore the current development state of Web 3.0 and the application of AI Technology in Web 3.0. Through investigating the existing applications and components of Web 3.0, we propose an architectural framework for Web 3.0 from the perspective of ecological application scenarios. We outline and divide the ecology of Web 3.0 into four layers. The main functions of each layer are data management, value circulation, ecological governance, and application scenarios. Our investigation delves into the major challenges and issues present in each of these layers. In this context, AI has shown its strong potential to solve existing problems of Web 3.0. We illustrate the crucial role of AI in the foundation and growth of Web 3.0. We begin by providing an overview of AI, including machine learning algorithms and deep learning techniques. Then, we thoroughly analyze the current state of AI technology applications in the four layers of Web 3.0 and offer some insights into its potential future development direction.

AI-1-标题: Mind Agent : Emergent Gaming Interaction

链接: https://arxiv.org/abs/2309.09971
作者: Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao
备注: The first three authors contributed equally. 27 pages

点击查看摘要

Abstract: Large Language Models (LLMs) have the capacity of performing complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community has insufficient benchmarks towards building general multi-agents collaboration infrastructure that encompass both LLM and human-NPCs collaborations. In this work, we propose a novel infrastructure - MindAgent - to evaluate planning and coordination emergent capabilities for gaming interaction. In particular, our infrastructure leverages existing gaming framework, to i) require understanding of the coordinator for a multi-agent system, ii) collaborate with human players via un-finetuned proper instructions, and iii) establish an in-context learning on few-shot prompt with feedback. Furthermore, we introduce CUISINEWORLD, a new gaming scenario and related benchmark that dispatch a multi-agent collaboration efficiency and supervise multiple agents playing the game simultaneously. We conduct comprehensive evaluations with new auto-metric CoS for calculating the collaboration efficiency. Finally, our infrastructure can be deployed into real-world gaming scenarios in a customized VR version of CUISINEWORLD and adapted in existing broader Minecraft gaming domain. We hope our findings on LLMs and the new infrastructure for general-purpose scheduling and coordination can help shed light on how such skills can be obtained by learning from large language corpora.

AI-2-标题: Plug in the Safety Chip: Enforcing Constraints for LLM -driven Robot Agent s

链接: https://arxiv.org/abs/2309.09919
作者: Ziyi Yang, Shreyas S. Raman, Ankit Shah, Stefanie Tellex
备注: 6 pages, 4 figures

点击查看摘要

Abstract: Recent advancements in large language models (LLMs) have enabled a new research domain, LLM agents, for solving robotics and planning tasks by leveraging the world knowledge and general reasoning abilities of LLMs obtained during pretraining. However, while considerable effort has been made to teach the robot the “dos,” the “don’ts” received relatively less attention. We argue that, for any practical usage, it is as crucial to teach the robot the “don’ts”: conveying explicit instructions about prohibited actions, assessing the robot’s comprehension of these restrictions, and, most importantly, ensuring compliance. Moreover, verifiable safe operation is essential for deployments that satisfy worldwide standards such as ISO 61508, which defines standards for safely deploying robots in industrial factory environments worldwide. Aiming at deploying the LLM agents in a collaborative environment, we propose a queryable safety constraint module based on linear temporal logic (LTL) that simultaneously enables natural language (NL) to temporal constraints encoding, safety violation reasoning and explaining, and unsafe action pruning. To demonstrate the effectiveness of our system, we conducted experiments in VirtualHome environment and on a real robot. The experimental results show that our system strictly adheres to the safety constraints and scales well with complex safety constraints, highlighting its potential for practical utility.

AI-3-标题: The role of causality in explainable artificial intelligence

链接: https://arxiv.org/abs/2309.09901
作者: Gianluca Carloni, Andrea Berti, Sara Colantonio
备注:

点击查看摘要

Abstract: Causality and eXplainable Artificial Intelligence (XAI) have developed as separate fields in computer science, even though the underlying concepts of causation and explanation share common ancient roots. This is further enforced by the lack of review works jointly covering these two fields. In this paper, we investigate the literature to try to understand how and to what extent causality and XAI are intertwined. More precisely, we seek to uncover what kinds of relationships exist between the two concepts and how one can benefit from them, for instance, in building trust in AI systems. As a result, three main perspectives are identified. In the first one, the lack of causality is seen as one of the major limitations of current AI and XAI approaches, and the “optimal” form of explanations is investigated. The second is a pragmatic perspective and considers XAI as a tool to foster scientific exploration for causal inquiry, via the identification of pursue-worthy experimental manipulations. Finally, the third perspective supports the idea that causality is propaedeutic to XAI in three possible manners: exploiting concepts borrowed from causality to support or improve XAI, utilizing counterfactuals for explainability, and considering accessing a causal model as explaining itself. To complement our analysis, we also provide relevant software solutions used to automate causal tasks. We believe our work provides a unified view of the two fields of causality and XAI by highlighting potential domain bridges and uncovering possible limitations.

AI-4-标题: Towards Ontology Construction with Language Models ISWC2023

链接: https://arxiv.org/abs/2309.09898
作者: Maurice Funk, Simon Hosemann, Jean Christoph Jung, Carsten Lutz
备注: KBC-LM’23: Knowledge Base Construction from Pre-trained Language Models workshop at ISWC 2023

点击查看摘要

Abstract: We present a method for automatically constructing a concept hierarchy for a given domain by querying a large language model. We apply this method to various domains using OpenAI’s GPT 3.5. Our experiments indicate that LLMs can be of considerable help for constructing concept hierarchies.

AI-5-标题: EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal Learning ICSE2024

链接: https://arxiv.org/abs/2309.09867
作者: Liuqing Chen, Yunnong Chen, Shuhong Xiao, Yaxuan Song, Lingyun Sun, Yankun Zhen, Tingting Zhou, Yanfang Chang
备注: Accepted to 46th International Conference on Software Engineering (ICSE 2024)

点击查看摘要

Abstract: When translating UI design prototypes to code in industry, automatically generating code from design prototypes can expedite the development of applications and GUI iterations. However, in design prototypes without strict design specifications, UI components may be composed of fragmented elements. Grouping these fragmented elements can greatly improve the readability and maintainability of the generated code. Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements. Unfortunately, the performance of these methods is not satisfying due to visually overlapped and tiny UI elements. In this study, we propose EGFE, a novel method for automatically End-to-end Grouping Fragmented Elements via UI sequence prediction. To facilitate the UI understanding, we innovatively construct a Transformer encoder to model the relationship between the UI elements with multi-modal representation learning. The evaluation on a dataset of 4606 UI prototypes collected from professional UI designers shows that our method outperforms the state-of-the-art baselines in the precision (by 29.75%), recall (by 31.07%), and F1-score (by 30.39%) at edit distance threshold of 4. In addition, we conduct an empirical study to assess the improvement of the generated front-end code. The results demonstrate the effectiveness of our method on a real software engineering application. Our end-to-end fragmented elements grouping method creates opportunities for improving UI-related software engineering tasks.

AI-6-标题: Learning Spatial and Temporal Hierarchies: Hierarchical Active Inference for navigation in Multi-Room Maze Environments IROS2023

链接: https://arxiv.org/abs/2309.09864
作者: Daria de Tinguy, Toon Van de Maele, Tim Verbelen, Bart Dhoedt
备注: IROS 2023 Workshop World Models and Predictive Coding in Cognitive Robotics

点击查看摘要

Abstract: Cognitive maps play a crucial role in facilitating flexible behaviour by representing spatial and conceptual relationships within an environment. The ability to learn and infer the underlying structure of the environment is crucial for effective exploration and navigation. This paper introduces a hierarchical active inference model addressing the challenge of inferring structure in the world from pixel-based observations. We propose a three-layer hierarchical model consisting of a cognitive map, an allocentric, and an egocentric world model, combining curiosity-driven exploration with goal-oriented behaviour at the different levels of reasoning from context to place to motion. This allows for efficient exploration and goal-directed search in room-structured mini-grid environments.

AI-7-标题: Bias of AI-Generated Content: An Examination of News Produced by Large Language Models

链接: https://arxiv.org/abs/2309.09825
作者: Xiao Fang, Shangkun Che, Minjia Mao, Hongzhe Zhang, Ming Zhao, Xiaohang Zhao
备注:

点击查看摘要

Abstract: Large language models (LLMs) have the potential to transform our lives and work through the content they generate, known as AI-Generated Content (AIGC). To harness this transformation, we need to understand the limitations of LLMs. Here, we investigate the bias of AIGC produced by seven representative LLMs, including ChatGPT and LLaMA. We collect news articles from The New York Times and Reuters, both known for delivering relatively unbiased news. We then apply each examined LLM to generate news content with headlines of these news articles as prompts, and evaluate the gender and racial biases of the AIGC produced by the LLM by comparing the AIGC and the original news articles. We further analyze the gender bias of each LLM under biased prompts by adding gender-biased messages to prompts constructed from these news headlines. Our study reveals that the AIGC produced by each examined LLM demonstrates substantial gender and racial biases. Moreover, the AIGC generated by each LLM exhibits notable discrimination against females and individuals of the Black race. Among the LLMs, the AIGC generated by ChatGPT demonstrates the lowest level of bias, and ChatGPT is the sole model capable of declining content generation when provided with biased prompts.

AI-8-标题: Efficient Concept Drift Handling for Batch Android Malware Detection Models

链接: https://arxiv.org/abs/2309.09807
作者: Molina-Coronado B., Mori U., Mendiburu A., Miguel-Alonso J
备注: 18 pages

点击查看摘要

Abstract: The rapidly evolving nature of Android apps poses a significant challenge to static batch machine learning algorithms employed in malware detection systems, as they quickly become obsolete. Despite this challenge, the existing literature pays limited attention to addressing this issue, with many advanced Android malware detection approaches, such as Drebin, DroidDet and MaMaDroid, relying on static models. In this work, we show how retraining techniques are able to maintain detector capabilities over time. Particularly, we analyze the effect of two aspects in the efficiency and performance of the detectors: 1) the frequency with which the models are retrained, and 2) the data used for retraining. In the first experiment, we compare periodic retraining with a more advanced concept drift detection method that triggers retraining only when necessary. In the second experiment, we analyze sampling methods to reduce the amount of data used to retrain models. Specifically, we compare fixed sized windows of recent data and state-of-the-art active learning methods that select those apps that help keep the training dataset small but diverse. Our experiments show that concept drift detection and sample selection mechanisms result in very efficient retraining strategies which can be successfully used to maintain the performance of the static Android malware state-of-the-art detectors in changing environments.

AI-9-标题: How to Data in Datathons

链接: https://arxiv.org/abs/2309.09770
作者: Carlos Mougan, Richard Plant, Clare Teng, Marya Bazzi, Alvaro Cabregas Ejea, Ryan Sze-Yin Chan, David Salvador Jasin, Martin Stoffel, Kirstie Jane Whitaker, Jules Manser
备注:

点击查看摘要

Abstract: The rise of datathons, also known as data or data science hackathons, has provided a platform to collaborate, learn, and innovate in a short timeframe. Despite their significant potential benefits, organizations often struggle to effectively work with data due to a lack of clear guidelines and best practices for potential issues that might arise. Drawing on our own experiences and insights from organizing >80 datathon challenges with >60 partnership organizations since 2016, we provide guidelines and recommendations that serve as a resource for organizers to navigate the data-related complexities of datathons. We apply our proposed framework to 10 case studies.

AI-10-标题: Securing Fixed Neural Network Steganography

链接: https://arxiv.org/abs/2309.09700
作者: Zicong Luo, Sheng Li, Guobiao Li, Zhenxing Qian, Xinpeng Zhang
备注:

点击查看摘要

Abstract: Image steganography is the art of concealing secret information in images in a way that is imperceptible to unauthorized parties. Recent advances show that is possible to use a fixed neural network (FNN) for secret embedding and extraction. Such fixed neural network steganography (FNNS) achieves high steganographic performance without training the networks, which could be more useful in real-world applications. However, the existing FNNS schemes are vulnerable in the sense that anyone can extract the secret from the stego-image. To deal with this issue, we propose a key-based FNNS scheme to improve the security of the FNNS, where we generate key-controlled perturbations from the FNN for data embedding. As such, only the receiver who possesses the key is able to correctly extract the secret from the stego-image using the FNN. In order to improve the visual quality and undetectability of the stego-image, we further propose an adaptive perturbation optimization strategy by taking the perturbation cost into account. Experimental results show that our proposed scheme is capable of preventing unauthorized secret extraction from the stego-images. Furthermore, our scheme is able to generate stego-images with higher visual quality than the state-of-the-art FNNS scheme, especially when the FNN is a neural network for ordinary learning tasks.

AI-11-标题: Distributed course allocation with asymmetric friendships

链接: https://arxiv.org/abs/2309.09684
作者: Ilya Khakhiashvili, Lihi Dery, Tal Grinshpoun
备注:

点击查看摘要

Abstract: Students’ decisions on whether to take a class are strongly affected by whether their friends plan to take the class with them. A student may prefer to be assigned to a course they likes less, just to be with their friends, rather than taking a more preferred class alone. It has been shown that taking classes with friends positively affects academic performance. Thus, academic institutes should prioritize friendship relations when assigning course seats. The introduction of friendship relations results in several non-trivial changes to current course allocation methods. This paper explores how course allocation mechanisms can account for friendships between students and provide a unique, distributed solution. In particular, we model the problem as an asymmetric distributed constraint optimization problem and develop a new dedicated algorithm. Our extensive evaluation includes both simulated data and data derived from a user study on 177 students’ preferences over courses and friends. The results show that our algorithm obtains high utility for the students while keeping the solution fair and observing courses’ seat capacity limitations.

AI-12-标题: Adaptive Reorganization of Neural Pathways for Continual Learning with Hybrid Spiking Neural Networks

链接: https://arxiv.org/abs/2309.09550
作者: Bing Han, Feifei Zhao, Wenxuan Pan, Zhaoya Zhao, Xianqi Li, Qingqun Kong, Yi Zeng
备注:

点击查看摘要

Abstract: The human brain can self-organize rich and diverse sparse neural pathways to incrementally master hundreds of cognitive tasks. However, most existing continual learning algorithms for deep artificial and spiking neural networks are unable to adequately auto-regulate the limited resources in the network, which leads to performance drop along with energy consumption rise as the increase of tasks. In this paper, we propose a brain-inspired continual learning algorithm with adaptive reorganization of neural pathways, which employs Self-Organizing Regulation networks to reorganize the single and limited Spiking Neural Network (SOR-SNN) into rich sparse neural pathways to efficiently cope with incremental tasks. The proposed model demonstrates consistent superiority in performance, energy consumption, and memory capacity on diverse continual learning tasks ranging from child-like simple to complex tasks, as well as on generalized CIFAR100 and ImageNet datasets. In particular, the SOR-SNN model excels at learning more complex tasks as well as more tasks, and is able to integrate the past learned knowledge with the information from the current task, showing the backward transfer ability to facilitate the old tasks. Meanwhile, the proposed model exhibits self-repairing ability to irreversible damage and for pruned networks, could automatically allocate new pathway from the retained network to recover memory for forgotten knowledge.

AI-13-标题: A performance characteristic curve for model evaluation: the application in information diffusion prediction

链接: https://arxiv.org/abs/2309.09537
作者: Wenjin Xie, Xiaomeng Wang, Radosław Michalsk, Tao Jia
备注:

点击查看摘要

Abstract: The information diffusion prediction on social networks aims to predict future recipients of a message, with practical applications in marketing and social media. While different prediction models all claim to perform well, general frameworks for performance evaluation remain limited. Here, we aim to identify a performance characteristic curve for a model, which captures its performance on tasks of different complexity. We propose a metric based on information entropy to quantify the randomness in diffusion data, then identify a scaling pattern between the randomness and the prediction accuracy of the model. Data points in the patterns by different sequence lengths, system sizes, and randomness all collapse into a single curve, capturing a model’s inherent capability of making correct predictions against increased uncertainty. Given that this curve has such important properties that it can be used to evaluate the model, we define it as the performance characteristic curve of the model. The validity of the curve is tested by three prediction models in the same family, reaching conclusions in line with existing studies. Also, the curve is successfully applied to evaluate two distinct models from the literature. Our work reveals a pattern underlying the data randomness and prediction accuracy. The performance characteristic curve provides a new way to systematically evaluate models’ performance, and sheds light on future studies on other frameworks for model evaluation.

AI-14-标题: Prompt ST: Prompt -Enhanced Spatio-Temporal Multi-Attribute Prediction

链接: https://arxiv.org/abs/2309.09500
作者: Zijian Zhang, Xiangyu Zhao, Qidong Liu, Chunxu Zhang, Qian Ma, Wanyu Wang, Hongwei Zhao, Yiqi Wang, Zitao Liu
备注:

点击查看摘要

Abstract: In the era of information explosion, spatio-temporal data mining serves as a critical part of urban management. Considering the various fields demanding attention, e.g., traffic state, human activity, and social event, predicting multiple spatio-temporal attributes simultaneously can alleviate regulatory pressure and foster smart city construction. However, current research can not handle the spatio-temporal multi-attribute prediction well due to the complex relationships between diverse attributes. The key challenge lies in how to address the common spatio-temporal patterns while tackling their distinctions. In this paper, we propose an effective solution for spatio-temporal multi-attribute prediction, PromptST. We devise a spatio-temporal transformer and a parameter-sharing training scheme to address the common knowledge among different spatio-temporal attributes. Then, we elaborate a spatio-temporal prompt tuning strategy to fit the specific attributes in a lightweight manner. Through the pretrain and prompt tuning phases, our PromptST is able to enhance the specific spatio-temoral characteristic capture by prompting the backbone model to fit the specific target attribute while maintaining the learned common knowledge. Extensive experiments on real-world datasets verify that our PromptST attains state-of-the-art performance. Furthermore, we also prove PromptST owns good transferability on unseen spatio-temporal attributes, which brings promising application potential in urban computing. The implementation code is available to ease reproducibility.

AI-15-标题: Are You Worthy of My Trust?: A Socioethical Perspective on the Impacts of Trustworthy AI Systems on the Environment and Human Society

链接: https://arxiv.org/abs/2309.09450
作者: Jamell Dacon
备注: Under review

点击查看摘要

Abstract: With ubiquitous exposure of AI systems today, we believe AI development requires crucial considerations to be deemed trustworthy. While the potential of AI systems is bountiful, though, is still unknown-as are their risks. In this work, we offer a brief, high-level overview of societal impacts of AI systems. To do so, we highlight the requirement of multi-disciplinary governance and convergence throughout its lifecycle via critical systemic examinations (e.g., energy consumption), and later discuss induced effects on the environment (i.e., carbon footprint) and its users (i.e., social development). In particular, we consider these impacts from a multi-disciplinary perspective: computer science, sociology, environmental science, and so on to discuss its inter-connected societal risks and inability to simultaneously satisfy aspects of well-being. Therefore, we accentuate the necessity of holistically addressing pressing concerns of AI systems from a socioethical impact assessment perspective to explicate its harmful societal effects to truly enable humanity-centered Trustworthy AI.

AI-16-标题: A Schedule of Duties in the Cloud Space Using a Modified Salp Swarm Algorithm

链接: https://arxiv.org/abs/2309.09441
作者: Hossein Jamali, Ponkoj Chandra Shill, David Feil-Seifer, Frederick C. Harris, Jr., Sergiu M. Dascalu
备注: 15 pages, 6 figures, 2023 IFIP International Internet of Things Conference. Dallas-Fort Worth Metroplex, Texas, USA

点击查看摘要

Abstract: Cloud computing is a concept introduced in the information technology era, with the main components being the grid, distributed, and valuable computing. The cloud is being developed continuously and, naturally, comes up with many challenges, one of which is scheduling. A schedule or timeline is a mechanism used to optimize the time for performing a duty or set of duties. A scheduling process is accountable for choosing the best resources for performing a duty. The main goal of a scheduling algorithm is to improve the efficiency and quality of the service while at the same time ensuring the acceptability and effectiveness of the targets. The task scheduling problem is one of the most important NP-hard issues in the cloud domain and, so far, many techniques have been proposed as solutions, including using genetic algorithms (GAs), particle swarm optimization, (PSO), and ant colony optimization (ACO). To address this problem, in this paper, one of the collective intelligence algorithms, called the Salp Swarm Algorithm (SSA), has been expanded, improved, and applied. The performance of the proposed algorithm has been compared with that of GAs, PSO, continuous ACO, and the basic SSA. The results show that our algorithm has generally higher performance than the other algorithms. For example, compared to the basic SSA, the proposed method has an average reduction of approximately 21% in makespan.

AI-17-标题: Causal Discovery and Prediction: Methods and Algorithms

链接: https://arxiv.org/abs/2309.09416
作者: Gilles Blondel
备注: PhD Thesis, 101 pages. arXiv admin note: text overlap with arXiv:1610.05556

点击查看摘要

Abstract: We are not only observers but also actors of reality. Our capability to intervene and alter the course of some events in the space and time surrounding us is an essential component of how we build our model of the world. In this doctoral thesis we introduce a generic a-priori assessment of each possible intervention, in order to select the most cost-effective interventions only, and avoid unnecessary systematic experimentation on the real world. Based on this a-priori assessment, we propose an active learning algorithm that identifies the causal relations in any given causal model, using a least cost sequence of interventions. There are several novel aspects introduced by our algorithm. It is, in most case scenarios, able to discard many causal model candidates using relatively inexpensive interventions that only test one value of the intervened variables. Also, the number of interventions performed by the algorithm can be bounded by the number of causal model candidates. Hence, fewer initial candidates (or equivalently, more prior knowledge) lead to fewer interventions for causal discovery. Causality is intimately related to time, as causes appear to precede their effects. Cyclical causal processes are a very interesting case of causality in relation to time. In this doctoral thesis we introduce a formal analysis of time cyclical causal settings by defining a causal analog to the purely observational Dynamic Bayesian Networks, and provide a sound and complete algorithm for the identification of causal effects in the cyclic setting. We introduce the existence of two types of hidden confounder variables in this framework, which affect in substantially different ways the identification procedures, a distinction with no analog in either Dynamic Bayesian Networks or standard causal graphs.

AI-18-标题: (Deployed Application) Promoting Research Collaboration with Open Data Driven Team Recommendation in Response to Call for Proposals

链接: https://arxiv.org/abs/2309.09404
作者: Siva Likitha Valluru, Biplav Srivastava, Sai Teja Paladi, Siwen Yan, Sriraam Natarajan
备注: 9 pages, 2 figures

点击查看摘要

Abstract: Building teams and promoting collaboration are two very common business activities. An example of these are seen in the TeamingForFunding problem, where research institutions and researchers are interested to identify collaborative opportunities when applying to funding agencies in response to latter’s calls for proposals. We describe a novel system to recommend teams using a variety of AI methods, such that (1) each team achieves the highest possible skill coverage that is demanded by the opportunity, and (2) the workload of distributing the opportunities is balanced amongst the candidate members. We address these questions by extracting skills latent in open data of proposal calls (demand) and researcher profiles (supply), normalizing them using taxonomies, and creating efficient algorithms that match demand to supply. We create teams to maximize goodness along a novel metric balancing short- and long-term objectives. We validate the success of our algorithms (1) quantitatively, by evaluating the recommended teams using a goodness score and find that more informed methods lead to recommendations of smaller number of teams but higher goodness, and (2) qualitatively, by conducting a large-scale user study at a college-wide level, and demonstrate that users overall found the tool very useful and relevant. Lastly, we evaluate our system in two diverse settings in US and India (of researchers and proposal calls) to establish generality of our approach, and deploy it at a major US university for routine use.

AI-19-标题: Selecting which Dense Retriever to use for Zero-Shot Search

链接: https://arxiv.org/abs/2309.09403
作者: Ekaterina Khramtsova, Shengyao Zhuang, Mahsa Baktashmotlagh, Xi Wang, Guido Zuccon
备注:

点击查看摘要

Abstract: We propose the new problem of choosing which dense retrieval model to use when searching on a new collection for which no labels are available, i.e. in a zero-shot setting. Many dense retrieval models are readily available. Each model however is characterized by very differing search effectiveness – not just on the test portion of the datasets in which the dense representations have been learned but, importantly, also across different datasets for which data was not used to learn the dense representations. This is because dense retrievers typically require training on a large amount of labeled data to achieve satisfactory search effectiveness in a specific dataset or domain. Moreover, effectiveness gains obtained by dense retrievers on datasets for which they are able to observe labels during training, do not necessarily generalise to datasets that have not been observed during training. This is however a hard problem: through empirical experimentation we show that methods inspired by recent work in unsupervised performance evaluation with the presence of domain shift in the area of computer vision and machine learning are not effective for choosing highly performing dense retrievers in our setup. The availability of reliable methods for the selection of dense retrieval models in zero-shot settings that do not require the collection of labels for evaluation would allow to streamline the widespread adoption of dense retrieval. This is therefore an important new problem we believe the information retrieval community should consider. Implementation of methods, along with raw result files and analysis scripts are made publicly available at this https URL.

AI-20-标题: ChatGPT Hallucinates when Attributing Answers

链接: https://arxiv.org/abs/2309.09401
作者: Guido Zuccon, Bevan Koopman, Razia Shaik
备注:

点击查看摘要

Abstract: Can ChatGPT provide evidence to support its answers? Does the evidence it suggests actually exist and does it really support its answer? We investigate these questions using a collection of domain-specific knowledge-based questions, specifically prompting ChatGPT to provide both an answer and supporting evidence in the form of references to external sources. We also investigate how different prompts impact answers and evidence. We find that ChatGPT provides correct or partially correct answers in about half of the cases (50.6% of the times), but its suggested references only exist 14% of the times. We further provide insights on the generated references that reveal common traits among the references that ChatGPT generates, and show how even if a reference provided by the model does exist, this reference often does not support the claims ChatGPT attributes to it. Our findings are important because (1) they are the first systematic analysis of the references created by ChatGPT in its answers; (2) they suggest that the model may leverage good quality information in producing correct answers, but is unable to attribute real evidence to support its answers. Prompts, raw result files and manual analysis are made publicly available.

AI-21-标题: Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agent s

链接: https://arxiv.org/abs/2309.09346
作者: Carson Yu Liu, Gelareh Mohammadi, Yang Song, Wafa Johal
备注: RO-MAN’23, 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), August 2023, Busan, South Korea

点击查看摘要

Abstract: Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread. In human-human interactions, humans use nonverbal behaviours to convey their attitudes, feelings, and intentions. Therefore, this capability is also required for embodied agents in order to enhance the quality and effectiveness of their interactions with humans. In this paper, we propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances. Based on a conditional Generative Adversarial Network (GAN), our proposed neural network model learns the relationships between the co-speech gestures and both semantic and acoustic features from the speech input. In order to train our neural network model, we employ a public dataset containing co-speech gestures with corresponding speech audio utterances, which were captured from a single male native English speaker. The results from both objective and subjective evaluations demonstrate the efficacy of our gesture-generation framework for Robots and Embodied Agents.

AI-22-标题: Sim-to-Real Deep Reinforcement Learning with Manipulators for Pick-and-place

链接: https://arxiv.org/abs/2309.09247
作者: Wenxing Liu, Hanlin Niu, Robert Skilton, Joaquin Carrasco
备注:

点击查看摘要

Abstract: When transferring a Deep Reinforcement Learning model from simulation to the real world, the performance could be unsatisfactory since the simulation cannot imitate the real world well in many circumstances. This results in a long period of fine-tuning in the real world. This paper proposes a self-supervised vision-based DRL method that allows robots to pick and place objects effectively and efficiently when directly transferring a training model from simulation to the real world. A height-sensitive action policy is specially designed for the proposed method to deal with crowded and stacked objects in challenging environments. The training model with the proposed approach can be applied directly to a real suction task without any fine-tuning from the real world while maintaining a high suction success rate. It is also validated that our model can be deployed to suction novel objects in a real experiment with a suction success rate of 90% without any real-world fine-tuning. The experimental video is available at: this https URL.

AI-23-标题: From Cooking Recipes to Robot Task Trees – Improving Planning Correctness and Task Efficiency by Leveraging LLM s with a Knowledge Network

链接: https://arxiv.org/abs/2309.09181
作者: Md Sadman Sakib, Yu Sun
备注:

点击查看摘要

Abstract: Task planning for robotic cooking involves generating a sequence of actions for a robot to prepare a meal successfully. This paper introduces a novel task tree generation pipeline producing correct planning and efficient execution for cooking tasks. Our method first uses a large language model (LLM) to retrieve recipe instructions and then utilizes a fine-tuned GPT-3 to convert them into a task tree, capturing sequential and parallel dependencies among subtasks. The pipeline then mitigates the uncertainty and unreliable features of LLM outputs using task tree retrieval. We combine multiple LLM task tree outputs into a graph and perform a task tree retrieval to avoid questionable nodes and high-cost nodes to improve planning correctness and improve execution efficiency. Our evaluation results show its superior performance compared to previous works in task planning accuracy and efficiency.

AI-24-标题: Enhancing Quantised End-to-End ASR Models via Personalisation ICASSP2024

链接: https://arxiv.org/abs/2309.09136
作者: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
备注: 5 pages, submitted to ICASSP 2024

点击查看摘要

Abstract: Recent end-to-end automatic speech recognition (ASR) models have become increasingly larger, making them particularly challenging to be deployed on resource-constrained devices. Model quantisation is an effective solution that sometimes causes the word error rate (WER) to increase. In this paper, a novel strategy of personalisation for a quantised model (PQM) is proposed, which combines speaker adaptive training (SAT) with model quantisation to improve the performance of heavily compressed models. Specifically, PQM uses a 4-bit NormalFloat Quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for SAT. Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size and 1% additional speaker-specific parameters, 15.1% and 23.3% relative WER reductions were achieved on quantised Whisper and Conformer-based attention-based encoder-decoder ASR models respectively, comparing to the original full precision models.

AI-25-标题: ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

链接: https://arxiv.org/abs/2309.09128
作者: Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, Elena Glassman
备注: 23 pages, 7 figures, in submission

点击查看摘要

Abstract: Evaluating outputs of large language models (LLMs) is challenging, requiring making – and making sense of – many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.

AI-26-标题: Using Reinforcement Learning to Simplify Mealtime Insulin Dosing for People with Type 1 Diabetes: In-Silico Experiments

链接: https://arxiv.org/abs/2309.09125
作者: Anas El Fathi, Marc D. Breton
备注: 6 pages, 4 figures, conference

点击查看摘要

Abstract: People with type 1 diabetes (T1D) struggle to calculate the optimal insulin dose at mealtime, especially when under multiple daily injections (MDI) therapy. Effectively, they will not always perform rigorous and precise calculations, but occasionally, they might rely on intuition and previous experience. Reinforcement learning (RL) has shown outstanding results in outperforming humans on tasks requiring intuition and learning from experience. In this work, we propose an RL agent that recommends the optimal meal-accompanying insulin dose corresponding to a qualitative meal (QM) strategy that does not require precise carbohydrate counting (CC) (e.g., a usual meal at noon.). The agent is trained using the soft actor-critic approach and comprises long short-term memory (LSTM) neurons. For training, eighty virtual subjects (VS) of the FDA-accepted UVA/Padova T1D adult population were simulated using MDI therapy and QM strategy. For validation, the remaining twenty VS were examined in 26-week scenarios, including intra- and inter-day variabilities in glucose. \textitIn-silico results showed that the proposed RL approach outperforms a baseline run-to-run approach and can replace the standard CC approach. Specifically, after 26 weeks, the time-in-range ( 70-180 mg/dL) and time-in-hypoglycemia ( <70 mg/dL) were 73.1\pm11.6 % and 2.0\pm 1.8 % using the RL-optimized QM strategy compared to 70.6\pm14.8 % and 1.5\pm 1.5 % using CC. Such an approach can simplify diabetes treatment, resulting in improved quality of life and glycemic outcomes.

AI-27-标题: Public Perceptions of Gender Bias in Large Language Models : Cases of ChatGPT and Ernie

链接: https://arxiv.org/abs/2309.09120
作者: Kyrie Zhixuan Zhou, Madelyn Rose Sanfilippo
备注:

点击查看摘要

Abstract: Large language models are quickly gaining momentum, yet are found to demonstrate gender bias in their responses. In this paper, we conducted a content analysis of social media discussions to gauge public perceptions of gender bias in LLMs which are trained in different cultural contexts, i.e., ChatGPT, a US-based LLM, or Ernie, a China-based LLM. People shared both observations of gender bias in their personal use and scientific findings about gender bias in LLMs. A difference between the two LLMs was seen – ChatGPT was more often found to carry implicit gender bias, e.g., associating men and women with different profession titles, while explicit gender bias was found in Ernie’s responses, e.g., overly promoting women’s pursuit of marriage over career. Based on the findings, we reflect on the impact of culture on gender bias and propose governance recommendations to regulate gender bias in LLMs.

AI-28-标题: GenDOM: Generalizable One-shot Deformable Object Manipulation with Parameter-Aware Policy

链接: https://arxiv.org/abs/2309.09051
作者: So Kuroki, Jiaxian Guo, Tatsuya Matsushima, Takuya Okubo, Masato Kobayashi, Yuya Ikeda, Ryosuke Takanami, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa
备注: arXiv admin note: substantial text overlap with arXiv:2306.09872

点击查看摘要

Abstract: Due to the inherent uncertainty in their deformability during motion, previous methods in deformable object manipulation, such as rope and cloth, often required hundreds of real-world demonstrations to train a manipulation policy for each object, which hinders their applications in our ever-changing world. To address this issue, we introduce GenDOM, a framework that allows the manipulation policy to handle different deformable objects with only a single real-world demonstration. To achieve this, we augment the policy by conditioning it on deformable object parameters and training it with a diverse range of simulated deformable objects so that the policy can adjust actions based on different object parameters. At the time of inference, given a new object, GenDOM can estimate the deformable object parameters with only a single real-world demonstration by minimizing the disparity between the grid density of point clouds of real-world demonstrations and simulations in a differentiable physics simulator. Empirical validations on both simulated and real-world object manipulation setups clearly show that our method can manipulate different objects with a single demonstration and significantly outperforms the baseline in both environments (a 62% improvement for in-domain ropes and a 15% improvement for out-of-distribution ropes in simulation, as well as a 26% improvement for ropes and a 50% improvement for cloths in the real world), demonstrating the effectiveness of our approach in one-shot deformable object manipulation.

AI-29-标题: Generative AI-Driven Storytelling: A New Era for Marketing

链接: https://arxiv.org/abs/2309.09048
作者: Marko Vidrih, Shiva Mayahi
备注: 17 pages, 2 figures

点击查看摘要

Abstract: This paper delves into the transformative power of Generative AI-driven storytelling in the realm of marketing. Generative AI, distinct from traditional machine learning, offers the capability to craft narratives that resonate with consumers on a deeply personal level. Through real-world examples from industry leaders like Google, Netflix and Stitch Fix, we elucidate how this technology shapes marketing strategies, personalizes consumer experiences, and navigates the challenges it presents. The paper also explores future directions and recommendations for generative AI-driven storytelling, including prospective applications such as real-time personalized storytelling, immersive storytelling experiences, and social media storytelling. By shedding light on the potential and impact of generative AI-driven storytelling in marketing, this paper contributes to the understanding of this cutting-edge approach and its transformative power in the field of marketing.

AI-30-标题: Deliberative Context-Aware Ambient Intelligence System for Assisted Living Homes

链接: https://arxiv.org/abs/2309.08984
作者: Mohannad Babli, Jaime A Rincon, Eva Onaindia, Carlos Carrascosa, Vicente Julian
备注:

点击查看摘要

Abstract: Monitoring wellbeing and stress is one of the problems covered by ambient intelligence, as stress is a significant cause of human illnesses directly affecting our emotional state. The primary aim was to propose a deliberation architecture for an ambient intelligence healthcare application. The architecture provides a plan for comforting stressed seniors suffering from negative emotions in an assisted living home and executes the plan considering the environment’s dynamic nature. Literature was reviewed to identify the convergence between deliberation and ambient intelligence and the latter’s latest healthcare trends. A deliberation function was designed to achieve context-aware dynamic human-robot interaction, perception, planning capabilities, reactivity, and context-awareness with regard to the environment. A number of experimental case studies in a simulated assisted living home scenario were conducted to demonstrate the approach’s behavior and validity. The proposed methods were validated to show classification accuracy. The validation showed that the deliberation function has effectively achieved its deliberative objectives.

AI-31-标题: Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations

链接: https://arxiv.org/abs/2309.08978
作者: Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang
备注:

点击查看摘要

Abstract: Web applications are increasingly becoming the primary platform for AI service delivery, making in-browser deep learning (DL) inference more prominent. However, current in-browser inference systems fail to effectively utilize advanced web programming techniques and customize kernels for various client devices, leading to suboptimal performance. To address the issues, this paper presents the first in-browser inference system, nn-JIT.web, which enables just-in-time (JIT) auto-generation of optimized kernels for both CPUs and GPUs during inference. The system achieves this by using two novel web programming techniques that can significantly reduce kernel generation time, compared to other tensor compilers such as TVM, while maintaining or even improving performance. The first technique, Tensor-Web Compiling Co-Design, lowers compiling costs by unifying tensor and web compiling and eliminating redundant and ineffective compiling passes. The second technique, Web-Specific Lite Kernel Optimization Space Design, reduces kernel tuning costs by focusing on web programming requirements and efficient hardware resource utilization, limiting the optimization space to only dozens. nn-JIT.web is evaluated for modern transformer models on a range of client devices, including the mainstream CPUs and GPUs from ARM, Intel, AMD and Nvidia. Results show that nn-JIT.web can achieve up to 8.2x faster within 30 seconds compared to the baselines across various models.

AI-32-标题: An Unified Search and Recommendation Foundation Model for Cold-Start Scenario CIKM2023

链接: https://arxiv.org/abs/2309.08939
作者: Yuqi Gong, Xichen Ding, Yehui Su, Kaiming Shen, Zhongyi Liu, Guannan Zhang
备注: CIKM 2023,6 pages

点击查看摘要

Abstract: In modern commercial search engines and recommendation systems, data from multiple domains is available to jointly train the multi-domain model. Traditional methods train multi-domain models in the multi-task setting, with shared parameters to learn the similarity of multiple tasks, and task-specific parameters to learn the divergence of features, labels, and sample distributions of individual tasks. With the development of large language models, LLM can extract global domain-invariant text features that serve both search and recommendation tasks. We propose a novel framework called S&R Multi-Domain Foundation, which uses LLM to extract domain invariant features, and Aspect Gating Fusion to merge the ID feature, domain invariant text features and task-specific heterogeneous sparse features to obtain the representations of query and item. Additionally, samples from multiple search and recommendation scenarios are trained jointly with Domain Adaptive Multi-Task module to obtain the multi-domain foundation model. We apply the S&R Multi-Domain foundation model to cold start scenarios in the pretrain-finetune manner, which achieves better performance than other SOTA transfer learning methods. The S&R Multi-Domain Foundation model has been successfully deployed in Alipay Mobile Application’s online services, such as content query recommendation and service card recommendation, etc.

AI-33-标题: Exploration of TPUs for AI Applications

链接: https://arxiv.org/abs/2309.08918
作者: Diego Sanmartín Carrión, Vera Prohaska
备注: Research done by the Robotics & AI Club at IE University

点击查看摘要

Abstract: Tensor Processing Units (TPUs) are specialized hardware accelerators for deep learning developed by Google. This paper explores the performance of TPU with a focus on AI and its implementation in edge computing. It first provides an overview of TPUs, specifically their design in relation to neural networks, their general architecture, compilation techniques and supporting frameworks. Furthermore, we provide a comparative analysis of Cloud and Edge TPU performance against other counterpart chip architectures. It is then discussed how TPUs can be used to speed up AI workloads. The results show that TPUs can provide significant performance improvements both in cloud and edge computing. Additionally, we address the need for further research for the deployment of more architectures in the Edge TPU, as well as the need for the development of more robust comparisons in edge computing.

AI-34-标题: Bidirectional Graph GAN: Representing Brain Structure-Function Connections for Alzheimers Disease

链接: https://arxiv.org/abs/2309.08916
作者: Shuqiang Wang, Chen Ding
备注:

点击查看摘要

Abstract: The relationship between brain structure and function is critical for revealing the pathogenesis of brain disease, including Alzheimer’s disease (AD). However, it is a great challenge to map brain structure-function connections due to various reasons. In this work, a bidirectional graph generative adversarial networks (BGGAN) is proposed to represent brain structure-function connections. Specifically, by designing a module incorporating inner graph convolution network (InnerGCN), the generators of BGGAN can employ features of direct and indirect brain regions to learn the mapping function between structural domain and functional domain. Besides, a new module named Balancer is designed to counterpoise the optimization between generators and discriminators. By introducing the Balancer into BGGAN, both the structural generator and functional generator can not only alleviate the issue of mode collapse but also learn complementarity of structural and functional features. Experimental results using ADNI datasets show that the both the generated structure connections and generated function connections can improve the identification accuracy of AD. More importantly, based the proposed model, it is found that the relationship between brain structure and function is not a complete one-to-one correspondence. Brain structure is the basis of brain function. The strong structural connections are almost accompanied by strong functional connections.

AI-35-标题: Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration With Provable Guarantees

链接: https://arxiv.org/abs/2309.08883
作者: Jinzhao Li, Nan Jiang, Yexiang Xue
备注:

点击查看摘要

Abstract: Satisfiability Modulo Counting (SMC) encompasses problems that require both symbolic decision-making and statistical reasoning. Its general formulation captures many real-world problems at the intersection of symbolic and statistical Artificial Intelligence. SMC searches for policy interventions to control probabilistic outcomes. Solving SMC is challenging because of its highly intractable nature( \textNP^\textPP -complete), incorporating statistical inference and symbolic reasoning. Previous research on SMC solving lacks provable guarantees and/or suffers from sub-optimal empirical performance, especially when combinatorial constraints are present. We propose XOR-SMC, a polynomial algorithm with access to NP-oracles, to solve highly intractable SMC problems with constant approximation guarantees. XOR-SMC transforms the highly intractable SMC into satisfiability problems, by replacing the model counting in SMC with SAT formulae subject to randomized XOR constraints. Experiments on solving important SMC problems in AI for social good demonstrate that XOR-SMC finds solutions close to the true optimum, outperforming several baselines which struggle to find good approximations for the intractable model counting in SMC.

AI-36-标题: ChatGPT -4 with Code Interpreter can be used to solve introductory college-level vector calculus and electromagnetism problems

链接: https://arxiv.org/abs/2309.08881
作者: Tanuj Kumar, Mikhail A. Kats
备注: Main text and appendices

点击查看摘要

Abstract: We evaluated ChatGPT 3.5, 4, and 4 with Code Interpreter on a set of college-level engineering-math and electromagnetism problems, such as those often given to sophomore electrical engineering majors. We selected a set of 13 problems, and had ChatGPT solve them multiple times, using a fresh instance (chat) each time. We found that ChatGPT-4 with Code Interpreter was able to satisfactorily solve most problems we tested most of the time – a major improvement over the performance of ChatGPT-4 (or 3.5) without Code Interpreter. The performance of ChatGPT was observed to be somewhat stochastic, and we found that solving the same problem N times in new ChatGPT instances and taking the most-common answer was an effective strategy. Based on our findings and observations, we provide some recommendations for instructors and students of classes at this level.

AI-37-标题: Trajectory Tracking Control of Skid-Steering Mobile Robots with Slip and Skid Compensation using Sliding-Mode Control and Deep Learning

链接: https://arxiv.org/abs/2309.08863
作者: Payam Nourizadeh, Fiona J Stevens McFadden, Will N Browne
备注:

点击查看摘要

Abstract: Slip and skid compensation is crucial for mobile robots’ navigation in outdoor environments and uneven terrains. In addition to the general slipping and skidding hazards for mobile robots in outdoor environments, slip and skid cause uncertainty for the trajectory tracking system and put the validity of stability analysis at risk. Despite research in this field, having a real-world feasible online slip and skid compensation is still challenging due to the complexity of wheel-terrain interaction in outdoor environments. This paper presents a novel trajectory tracking technique with real-world feasible online slip and skid compensation at the vehicle-level for skid-steering mobile robots in outdoor environments. The sliding mode control technique is utilized to design a robust trajectory tracking system to be able to consider the parameter uncertainty of this type of robot. Two previously developed deep learning models [1], [2] are integrated into the control feedback loop to estimate the robot’s slipping and undesired skidding and feed the compensator in a real-time manner. The main advantages of the proposed technique are (1) considering two slip-related parameters rather than the conventional three slip parameters at the wheel-level, and (2) having an online real-world feasible slip and skid compensator to be able to reduce the tracking errors in unforeseen environments. The experimental results show that the proposed controller with the slip and skid compensator improves the performance of the trajectory tracking system by more than 27%.

AI-38-标题: GPT as a Baseline for Recommendation Explanation Texts RECSYS2023

链接: https://arxiv.org/abs/2309.08817
作者: Joyce Zhou, Thorsten Joachims
备注: 8 pages, 4 tables/figures. Accepted in current form to IntRS@RecSys2023 workshop. Intending on making noticeable in-place revisions on ArXiv for future submission, including potential title change

点击查看摘要

Abstract: In this work, we establish a baseline potential for how modern model-generated text explanations of movie recommendations may help users, and explore what different components of these text explanations that users like or dislike, especially in contrast to existing human movie reviews. We found that participants gave no significantly different rankings between movies, nor did they give significantly different individual quality scores to reviews of movies that they had never seen before. However, participants did mark reviews as significantly better when they were movies they had seen before. We also explore specific aspects of movie review texts that participants marked as important for each quality. Overall, we establish that modern LLMs are a promising source of recommendation explanations, and we intend on further exploring personalizable text explanations in the future.

AI-39-标题: URA*: Uncertainty-aware Path Planning using Image-based Aerial-to-Ground Traversability Estimation for Off-road Environments

链接: https://arxiv.org/abs/2309.08814
作者: Charles Moore, Shaswata Mitra, Nisha Pillai, Marc Moore, Sudip Mittal, Cindy Bethel, Jingdao Chen
备注:

点击查看摘要

Abstract: A major challenge with off-road autonomous navigation is the lack of maps or road markings that can be used to plan a path for autonomous robots. Classical path planning methods mostly assume a perfectly known environment without accounting for the inherent perception and sensing uncertainty from detecting terrain and obstacles in off-road environments. Recent work in computer vision and deep neural networks has advanced the capability of terrain traversability segmentation from raw images; however, the feasibility of using these noisy segmentation maps for navigation and path planning has not been adequately explored. To address this problem, this research proposes an uncertainty-aware path planning method, URA* using aerial images for autonomous navigation in off-road environments. An ensemble convolutional neural network (CNN) model is first used to perform pixel-level traversability estimation from aerial images of the region of interest. The traversability predictions are represented as a grid of traversal probability values. An uncertainty-aware planner is then applied to compute the best path from a start point to a goal point given these noisy traversal probability estimates. The proposed planner also incorporates replanning techniques to allow rapid replanning during online robot operation. The proposed method is evaluated on the Massachusetts Road Dataset, the DeepGlobe dataset, as well as a dataset of aerial images from off-road proving grounds at Mississippi State University. Results show that the proposed image segmentation and planning methods outperform conventional planning algorithms in terms of the quality and feasibility of the initial path, as well as the quality of replanned paths.

AI-40-标题: SculptBot: Pre-Trained Models for 3D Deformable Object Manipulation

链接: https://arxiv.org/abs/2309.08728
作者: Alison Bartsch, Charlotte Avra, Amir Barati Farimani
备注:

点击查看摘要

Abstract: Deformable object manipulation presents a unique set of challenges in robotic manipulation by exhibiting high degrees of freedom and severe self-occlusion. State representation for materials that exhibit plastic behavior, like modeling clay or bread dough, is also difficult because they permanently deform under stress and are constantly changing shape. In this work, we investigate each of these challenges using the task of robotic sculpting with a parallel gripper. We propose a system that uses point clouds as the state representation and leverages pre-trained point cloud reconstruction Transformer to learn a latent dynamics model to predict material deformations given a grasp action. We design a novel action sampling algorithm that reasons about geometrical differences between point clouds to further improve the efficiency of model-based planners. All data and experiments are conducted entirely in the real world. Our experiments show the proposed system is able to successfully capture the dynamics of clay, and is able to create a variety of simple shapes.

AI-41-标题: Representation Learning in Low-rank Slate-based Recommender Systems ICML2023

链接: https://arxiv.org/abs/2309.08622
作者: Yijia Dai, Wen Sun
备注: in MFPL, ICML 2023. arXiv admin note: text overlap with arXiv:1905.12767 by other authors

点击查看摘要

Abstract: Reinforcement learning (RL) in recommendation systems offers the potential to optimize recommendations for long-term user engagement. However, the environment often involves large state and action spaces, which makes it hard to efficiently learn and explore. In this work, we propose a sample-efficient representation learning algorithm, using the standard slate recommendation setup, to treat this as an online RL problem with low-rank Markov decision processes (MDPs). We also construct the recommender simulation environment with the proposed setup and sampling method.

AI-42-标题: Exploring Social Choice Mechanisms for Recommendation Fairness in SCRUF

链接: https://arxiv.org/abs/2309.08621
作者: Amanda Aird, Cassidy All, Paresha Farastu, Elena Stefancova, Joshua Sun, Nicholas Mattei, Robin Burke
备注:

点击查看摘要

Abstract: Fairness problems in recommender systems often have a complexity in practice that is not adequately captured in simplified research formulations. A social choice formulation of the fairness problem, operating within a multi-agent architecture of fairness concerns, offers a flexible and multi-aspect alternative to fairness-aware recommendation approaches. Leveraging social choice allows for increased generality and the possibility of tapping into well-studied social choice algorithms for resolving the tension between multiple, competing fairness concerns. This paper explores a range of options for choice mechanisms in multi-aspect fairness applications using both real and synthetic data and shows that different classes of choice and allocation mechanisms yield different but consistent fairness / accuracy tradeoffs. We also show that a multi-agent formulation offers flexibility in adapting to user population dynamics.

AI-43-标题: Energy Concerns with HPC Systems and Applications

链接: https://arxiv.org/abs/2309.08615
作者: Roblex Nana, Claude Tadonki, Petr Dokladal, Youssef Mesri
备注: 20 pages

点击查看摘要

Abstract: For various reasons including those related to climate changes, \em energy has become a critical concern in all relevant activities and technical designs. For the specific case of computer activities, the problem is exacerbated with the emergence and pervasiveness of the so called \em intelligent devices. From the application side, we point out the special topic of \em Artificial Intelligence, who clearly needs an efficient computing support in order to succeed in its purpose of being a \em ubiquitous assistant. There are mainly two contexts where \em energy is one of the top priority concerns: \em embedded computing and \em supercomputing. For the former, power consumption is critical because the amount of energy that is available for the devices is limited. For the latter, the heat dissipated is a serious source of failure and the financial cost related to energy is likely to be a significant part of the maintenance budget. On a single computer, the problem is commonly considered through the electrical power consumption. This paper, written in the form of a survey, we depict the landscape of energy concerns in computer activities, both from the hardware and the software standpoints.

AI-44-标题: Maneuver Decision-Making Through Proximal Policy Optimization And Monte Carlo Tree Search

链接: https://arxiv.org/abs/2309.08611
作者: Zhang Hong-Peng
备注:

点击查看摘要

Abstract: Maneuver decision-making can be regarded as a Markov decision process and can be address by reinforcement learning. However, original reinforcement learning algorithms can hardly solve the maneuvering decision-making problem. One reason is that agents use random actions in the early stages of training, which makes it difficult to get rewards and learn how to make effective decisions. To address this issue, a method based on proximal policy optimization and Monte Carlo tree search is proposed. The method uses proximal policy optimization to train the agent, and regards the results of air combat as targets to train the value network. Then, based on the value network and the visit count of each node, Monte Carlo tree search is used to find the actions with more expected returns than random actions, which can improve the training performance. The ablation studies and simulation experiments indicate that agents trained by the proposed method can make different decisions according to different states, which demonstrates that the method can solve the maneuvering decision problem that the original reinforcement learning algorithm cannot solve.

AI-45-标题: A Quantum Optimization Case Study for a Transport Robot Scheduling Problem

链接: https://arxiv.org/abs/2309.09736
作者: Dominik Leib, Tobias Seidel, Sven Jäger, Raoul Heese, Caitlin Isobel Jones, Abhishek Awasthi, Astrid Niederle, Michael Bortz
备注:

点击查看摘要

Abstract: We present a comprehensive case study comparing the performance of D-Waves’ quantum-classical hybrid framework, Fujitsu’s quantum-inspired digital annealer, and Gurobi’s state-of-the-art classical solver in solving a transport robot scheduling problem. This problem originates from an industrially relevant real-world scenario. We provide three different models for our problem following different design philosophies. In our benchmark, we focus on the solution quality and end-to-end runtime of different model and solver combinations. We find promising results for the digital annealer and some opportunities for the hybrid quantum annealer in direct comparison with Gurobi. Our study provides insights into the workflow for solving an application-oriented optimization problem with different strategies, and can be useful for evaluating the strengths and weaknesses of different approaches.

AI-46-标题: HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

链接: https://arxiv.org/abs/2309.09493
作者: Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
备注:

点击查看摘要

Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only 1/6 of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high quality speech synthesis.

AI-47-标题: Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning

链接: https://arxiv.org/abs/2309.09270
作者: Zilu Guo, Jun Du, CHin-Hui Lee
备注:

点击查看摘要

Abstract: In this paper, we explore a continuous modeling approach for deep-learning-based speech enhancement, focusing on the denoising process. We use a state variable to indicate the denoising process. The starting state is noisy speech and the ending state is clean speech. The noise component in the state variable decreases with the change of the state index until the noise component is 0. During training, a UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. In testing, we introduce a controlling factor as an embedding, ranging from zero to one, to the neural network, allowing us to control the level of noise reduction. This approach enables controllable speech enhancement and is adaptable to various application scenarios. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement, as evidenced by improvements in both objective speech measures and automatic speech recognition performance.

AI-48-标题: Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

链接: https://arxiv.org/abs/2309.09220
作者: Ahmed Adel Attia, Yashish M. Siriwardena, Carol Espy-Wilson
备注:

点击查看摘要

Abstract: The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representation has the potential to boost models’ performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of utilizing speech representations acquired via self-supervised learning (SSL) models, such as HuBERT compared to conventional acoustic features. Additionally, we investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model. By combining these two approaches, we improve the Pearson product-moment correlation (PPMC) scores which evaluate the accuracy of TV estimation of the SI system from 0.7452 to 0.8141, a 6.9% increase. Our findings underscore the profound influence of rich feature representations from SSL models and improved geometric transformations with target TVs on the enhanced functionality of SI systems.

AI-49-标题: Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture ICASSP2024

链接: https://arxiv.org/abs/2309.09180
作者: Gaobin Yang, Maokui He, Shutong Niu, Ruoyu Wang, Yanyan Yue, Shuangqing Qian, Shilong Wu, Jun Du, Chin-Hui Lee
备注: Submitted to ICASSP 2024

点击查看摘要

Abstract: We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and sequence-to-sequence (Seq2Seq) architecture, leading to improvement in both efficiency and performance. Next, we further decrease the memory occupation of decoding by incorporating input features fusion and then employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, which signifies a relative improvement of 49% over the official baseline system, and is the key technique for us to achieve the best performance for the main track of CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in MA-MSE module to better retrieve a cleaner and more discriminative multi-speaker embedding, enabling the current model to outperform the system we used in the CHiME-7 DASR Challenge. Our code will be available at this https URL.

AI-50-标题: Earth Virtualization Engines – A Technical Perspective

链接: https://arxiv.org/abs/2309.09002
作者: Torsten Hoefler, Bjorn Stevens, Andreas F. Prein, Johanna Baehr, Thomas Schulthess, Thomas F. Stocker, John Taylor, Daniel Klocke, Pekka Manninen, Piers M. Forster, Tobias Kölling, Nicolas Gruber, Hartwig Anzt, Claudia Frauen, Florian Ziemen, Milan Klöwer, Karthik Kashinath, Christoph Schär, Oliver Fuhrer, Bryan N. Lawrence
备注:

点击查看摘要

Abstract: Participants of the Berlin Summit on Earth Virtualization Engines (EVEs) discussed ideas and concepts to improve our ability to cope with climate change. EVEs aim to provide interactive and accessible climate simulations and data for a wide range of users. They combine high-resolution physics-based models with machine learning techniques to improve the fidelity, efficiency, and interpretability of climate projections. At their core, EVEs offer a federated data layer that enables simple and fast access to exabyte-sized climate data through simple interfaces. In this article, we summarize the technical challenges and opportunities for developing EVEs, and argue that they are essential for addressing the consequences of climate change.

AI-51-标题: Emerging Approaches for THz Array Imaging: A Tutorial Review and Software Tool

链接: https://arxiv.org/abs/2309.08844
作者: Josiah W. Smith, Murat Torlak
备注: Submitted to Proceedings of IEEE

点击查看摘要

Abstract: Accelerated by the increasing attention drawn by 5G, 6G, and Internet of Things applications, communication and sensing technologies have rapidly evolved from millimeter-wave (mmWave) to terahertz (THz) in recent years. Enabled by significant advancements in electromagnetic (EM) hardware, mmWave and THz frequency regimes spanning 30 GHz to 300 GHz and 300 GHz to 3000 GHz, respectively, can be employed for a host of applications. The main feature of THz systems is high-bandwidth transmission, enabling ultra-high-resolution imaging and high-throughput communications; however, challenges in both the hardware and algorithmic arenas remain for the ubiquitous adoption of THz technology. Spectra comprising mmWave and THz frequencies are well-suited for synthetic aperture radar (SAR) imaging at sub-millimeter resolutions for a wide spectrum of tasks like material characterization and nondestructive testing (NDT). This article provides a tutorial review of systems and algorithms for THz SAR in the near-field with an emphasis on emerging algorithms that combine signal processing and machine learning techniques. As part of this study, an overview of classical and data-driven THz SAR algorithms is provided, focusing on object detection for security applications and SAR image super-resolution. We also discuss relevant issues, challenges, and future research directions for emerging algorithms and THz SAR, including standardization of system and algorithm benchmarking, adoption of state-of-the-art deep learning techniques, signal processing-optimized machine learning, and hybrid data-driven signal processing algorithms…

附件下载

点击下载今日全部论文列表