本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,每天早上11:30点定时自动更新,主要按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从arxiv网站获取,每天早上11:30左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天11:30左右邮件定时自动发送。

目录

概览 (2024-04-29)

今日共更新366篇论文,其中:

  • 41篇自然语言处理(NLP: cs.CL)
  • 78篇计算机视觉(CV: cs.CV)
  • 77篇机器学习(ML: cs.LG)
  • 22篇人工智能(AI: cs.AI)
  • 4篇信息检索(IR: cs.IR)
  • 其它主题144篇

自然语言处理

NLP-0-标题: Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo

链接: https://arxiv.org/abs/2404.17546
作者: Stephen Zhao, Rob Brekelmans, Alireza Makhzani, Roger Grosse
备注:

点击查看摘要

Abstract:Numerous capability and safety techniques of Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full sequence. In this work, we leverage the rich toolkit of Sequential Monte Carlo (SMC) for these probabilistic inference problems. In particular, we use learned twist functions to estimate the expected future value of the potential at each timestep, which enables us to focus inference-time computation on promising partial sequences. We propose a novel contrastive method for learning the twist functions, and establish connections with the rich literature of soft reinforcement learning. As a complementary application of our twisted SMC framework, we present methods for evaluating the accuracy of language model inference techniques using novel bidirectional SMC bounds on the log partition function. These bounds can be used to estimate the KL divergence between the inference and target distributions in both directions. We apply our inference evaluation techniques to show that twisted SMC is effective for sampling undesirable outputs from a pretrained model (a useful component of harmlessness training and automated red-teaming), generating reviews with varied sentiment, and performing infilling tasks.

NLP-1-标题: Large Language Model Agent as a Mechanical Designer

链接: https://arxiv.org/abs/2404.17525
作者: Yayati Jadhav, Amir Barati Farimani
备注:

点击查看摘要

Abstract:Conventional mechanical design paradigms rely on experts systematically refining concepts through experience-guided modification and FEA to meet specific requirements. However, this approach can be time-consuming and heavily dependent on prior knowledge and experience. While numerous machine learning models have been developed to streamline this intensive and expert-driven iterative process, these methods typically demand extensive training data and considerable computational resources. Furthermore, methods based on deep learning are usually restricted to the specific domains and tasks for which they were trained, limiting their applicability across different tasks. This creates a trade-off between the efficiency of automation and the demand for resources. In this study, we present a novel approach that integrates pre-trained LLMs with a FEM module. The FEM module evaluates each design and provides essential feedback, guiding the LLMs to continuously learn, plan, generate, and optimize designs without the need for domain-specific training. We demonstrate the effectiveness of our proposed framework in managing the iterative optimization of truss structures, showcasing its capability to reason about and refine designs according to structured feedback and criteria. Our results reveal that these LLM-based agents can successfully generate truss designs that comply with natural language specifications with a success rate of up to 90%, which varies according to the applied constraints. By employing prompt-based optimization techniques we show that LLM based agents exhibit optimization behavior when provided with solution-score pairs to iteratively refine designs to meet specifications. This ability of LLM agents to produce viable designs and optimize them based on their inherent reasoning capabilities highlights their potential to develop and implement effective design strategies autonomously.

NLP-2-标题: On the Use of Large Language Models to Generate Capability Ontologies

链接: https://arxiv.org/abs/2404.17524
作者: Luis Miguel Vieira da Silva, Aljosha Köcher, Felix Gehlhoff, Alexander Fay
备注:

点击查看摘要

Abstract:Capability ontologies are increasingly used to model functionalities of systems or machines. The creation of such ontological models with all properties and constraints of capabilities is very complex and can only be done by ontology experts. However, Large Language Models (LLMs) have shown that they can generate machine-interpretable models from natural language text input and thus support engineers / ontology experts. Therefore, this paper investigates how LLMs can be used to create capability ontologies. We present a study with a series of experiments in which capabilities with varying complexities are generated using different prompting techniques and with different LLMs. Errors in the generated ontologies are recorded and compared. To analyze the quality of the generated ontologies, a semi-automated approach based on RDF syntax checking, OWL reasoning, and SHACL constraints is used. The results of this study are very promising because even for complex capabilities, the generated ontologies are almost free of errors.

NLP-3-标题: A Comprehensive Evaluation on Event Reasoning of Large Language Models

链接: https://arxiv.org/abs/2404.17513
作者: Zhengwei Tao, Zhi Jin, Yifan Zhang, Xiancai Chen, Xiaoying Bai, Yue Fang, Haiyan Zhao, Jia Li, Chongyang Tao
备注:

点击查看摘要

Abstract:Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and needs to deal with the diversity of the inter-event relations and the reasoning paradigms. How well LLMs accomplish event reasoning on various relations and reasoning paradigms remains unknown. To mitigate this disparity, we comprehensively evaluate the abilities of event reasoning of LLMs. We introduce a novel benchmark EV2 for EValuation of EVent reasoning. EV2 consists of two levels of evaluation of schema and instance and is comprehensive in relations and reasoning paradigms. We conduct extensive experiments on EV2. We find that LLMs have abilities to accomplish event reasoning but their performances are far from satisfactory. We also notice the imbalance of event reasoning abilities in LLMs. Besides, LLMs have event schema knowledge, however, they’re not aligned with humans on how to utilize the knowledge. Based on these findings, we introduce two methods to guide the LLMs to utilize the event schema knowledge. Both methods achieve improvements.

NLP-4-标题: ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations LREC COLING2024

链接: https://arxiv.org/abs/2404.17481
作者: Tyler Loakman, Chenghua Lin
备注: Accepted to HumEval at LREC-Coling 2024

点击查看摘要

Abstract:This paper presents a partial reproduction of Generating Fact Checking Explanations by Anatanasova et al (2020) as part of the ReproHum element of the ReproNLP shared task to reproduce the findings of NLP research regarding human evaluation. This shared task aims to investigate the extent to which NLP as a field is becoming more or less reproducible over time. Following the instructions provided by the task organisers and the original authors, we collect relative rankings of 3 fact-checking explanations (comprising a gold standard and the outputs of 2 models) for 40 inputs on the criteria of Coverage. The results of our reproduction and reanalysis of the original work’s raw results lend support to the original findings, with similar patterns seen between the original work and our reproduction. Whilst we observe slight variation from the original results, our findings support the main conclusions drawn by the original authors pertaining to the efficacy of their proposed models.

NLP-5-标题: CEval: A Benchmark for Evaluating Counterfactual Text Generation

链接: https://arxiv.org/abs/2404.17475
作者: Van Bach Nguyen, Jörg Schlötterer, Christin Seifert
备注:

点击查看摘要

Abstract:Counterfactual text generation aims to minimally change a text, such that it is classified differently. Judging advancements in method development for counterfactual text generation is hindered by a non-uniform usage of data sets and metrics in related work. We propose CEval, a benchmark for comparing counterfactual text generation methods. CEval unifies counterfactual and text quality metrics, includes common counterfactual datasets with human annotations, standard baselines (MICE, GDBA, CREST) and the open-source language model LLAMA-2. Our experiments found no perfect method for generating counterfactual text. Methods that excel at counterfactual metrics often produce lower-quality text while LLMs with simple prompts generate high-quality text but struggle with counterfactual criteria. By making CEval available as an open-source Python library, we encourage the community to contribute more methods and maintain consistent evaluation in future work.

NLP-6-标题: Ruffle&Riley: Insights from Designing and Evaluating a Large Language Model-Based Conversational Tutoring System

链接: https://arxiv.org/abs/2404.17460
作者: Robin Schmucker, Meng Xia, Amos Azaria, Tom Mitchell
备注: arXiv admin note: substantial text overlap with arXiv:2310.01420

点击查看摘要

Abstract:Conversational tutoring systems (CTSs) offer learning experiences through interactions based on natural language. They are recognized for promoting cognitive engagement and improving learning outcomes, especially in reasoning tasks. Nonetheless, the cost associated with authoring CTS content is a major obstacle to widespread adoption and to research on effective instructional design. In this paper, we discuss and evaluate a novel type of CTS that leverages recent advances in large language models (LLMs) in two ways: First, the system enables AI-assisted content authoring by inducing an easily editable tutoring script automatically from a lesson text. Second, the system automates the script orchestration in a learning-by-teaching format via two LLM-based agents (Ruffle&Riley) acting as a student and a professor. The system allows for free-form conversations that follow the ITS-typical inner and outer loop structure. We evaluate Ruffle&Riley’s ability to support biology lessons in two between-subject online user studies (N = 200) comparing the system to simpler QA chatbots and reading activity. Analyzing system usage patterns, pre/post-test scores and user experience surveys, we find that Ruffle&Riley users report high levels of engagement, understanding and perceive the offered support as helpful. Even though Ruffle&Riley users require more time to complete the activity, we did not find significant differences in short-term learning gains over the reading activity. Our system architecture and user study provide various insights for designers of future CTSs. We further open-source our system to support ongoing research on effective instructional design of LLM-based learning technologies.

NLP-7-标题: Evaluation of Geographical Distortions in Language Models: A Crucial Step Towards Equitable Representations

链接: https://arxiv.org/abs/2404.17401
作者: Rémy Decoupes, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire, Sarah Valentin
备注:

点击查看摘要

Abstract:Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language Processing, five sources of bias are well-identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, thus leading to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions, by comparing geographical and semantic distances. Experiments are conducted from these four indicators with ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.

NLP-8-标题: Child Speech Recognition in Human-Robot Interaction: Problem Solved?

链接: https://arxiv.org/abs/2404.17394
作者: Ruben Janssens, Eva Verhelst, Giulio Antonio Abbo, Qiaoqiao Ren, Maria Jose Pinto Bernal, Tony Belpaeme
备注: Presented at 2024 International Symposium on Technological Advances in Human-Robot Interaction

点击查看摘要

Abstract:Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children’s speech. This has long sat in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. While transcription is not perfect yet, the best model recognises 60.3% of sentences correctly barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.

NLP-9-标题: A Bionic Natural Language Parser Equivalent to a Pushdown Automaton IJCNN2024

链接: https://arxiv.org/abs/2404.17343
作者: Zhenghao Wei, Kehua Lin, Jianlin Feng
备注: to be published in IJCNN 2024

点击查看摘要

Abstract:Assembly Calculus (AC), proposed by Papadimitriou et al., aims to reproduce advanced cognitive functions through simulating neural activities, with several applications based on AC having been developed, including a natural language parser proposed by Mitropolsky et al. However, this parser lacks the ability to handle Kleene closures, preventing it from parsing all regular languages and rendering it weaker than Finite Automata (FA). In this paper, we propose a new bionic natural language parser (BNLP) based on AC and integrates two new biologically rational structures, Recurrent Circuit and Stack Circuit which are inspired by RNN and short-term memory mechanism. In contrast to the original parser, the BNLP can fully handle all regular languages and Dyck languages. Therefore, leveraging the Chomsky-Sch űtzenberger theorem, the BNLP which can parse all Context-Free Languages can be constructed. We also formally prove that for any PDA, a Parser Automaton corresponding to BNLP can always be formed, ensuring that BNLP has a description ability equal to that of PDA and addressing the deficiencies of the original parser.

NLP-10-标题: Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

链接: https://arxiv.org/abs/2404.17342
作者: Teresa Lynn, Malik H. Altakrori, Samar Mohamed Magdy, Rocktim Jyoti Das, Chenyang Lyu, Mohamed Nasr, Younes Samih, Alham Fikri Aji, Preslav Nakov, Shantanu Godbole, Salim Roukos, Radu Florian, Nizar Habash
备注: Paper 8 pages, Appendix 12 pages. Submitted to ARR

点击查看摘要

Abstract:The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be underestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing existing datasets for a new NLP task: we repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share our insights from the process, which we hope will contribute to a deeper understanding of the challenges and the opportunities associated with task reformulation in NLP research.

NLP-11-标题: Metronome: tracing variation in poetic meters via local sequence alignment

链接: https://arxiv.org/abs/2404.17337
作者: Ben Nagy, Artjoms Šeļa, Mirella De Sisto, Petr Plecháč
备注:

点击查看摘要

Abstract:All poetic forms come from somewhere. Prosodic templates can be copied for generations, altered by individuals, imported from foreign traditions, or fundamentally changed under the pressures of language evolution. Yet these relationships are notoriously difficult to trace across languages and times. This paper introduces an unsupervised method for detecting structural similarities in poems using local sequence alignment. The method relies on encoding poetic texts as strings of prosodic features using a four-letter alphabet; these sequences are then aligned to derive a distance measure based on weighted symbol (mis)matches. Local alignment allows poems to be clustered according to emergent properties of their underlying prosodic patterns. We evaluate method performance on a meter recognition tasks against strong baselines and show its potential for cross-lingual and historical research using three short case studies: 1) mutations in quantitative meter in classical Latin, 2) European diffusion of the Renaissance hendecasyllable, and 3) comparative alignment of modern meters in 18–19th century Czech, German and Russian. We release an implementation of the algorithm as a Python package with an open license.

NLP-12-标题: Introducing cosmos GPT : Monolingual Training for Turkish Language Models

链接: https://arxiv.org/abs/2404.17336
作者: H. Toprak Kesgin, M. Kaan Yuce, Eren Dogan, M. Egemen Uzun, Atahan Uz, H. Emre Seyrek, Ahmed Zeer, M. Fatih Amasyali
备注:

点击查看摘要

Abstract:The number of open source language models that can produce Turkish is increasing day by day, as in other languages. In order to create the basic versions of such models, the training of multilingual models is usually continued with Turkish corpora. The alternative is to train the model with only Turkish corpora. In this study, we first introduce the cosmosGPT models that we created with this alternative method. Then, we introduce new finetune datasets for basic language models to fulfill user requests and new evaluation datasets for measuring the capabilities of Turkish language models. Finally, a comprehensive comparison of the adapted Turkish language models on different capabilities is presented. The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.

NLP-13-标题: When to Trust LLM s: Aligning Confidence with Response Quality

链接: https://arxiv.org/abs/2404.17287
作者: Shuchang Tao, Liuyi Yao, Hanxing Ding, Yuexiang Xie, Qi Cao, Fei Sun, Jinyang Gao, Huawei Shen, Bolin Ding
备注:

点击查看摘要

Abstract:Despite the success of large language models (LLMs) in natural language generation, much evidence shows that LLMs may produce incorrect or nonsensical text. This limitation highlights the importance of discerning when to trust LLMs, especially in safety-critical domains. Existing methods, which rely on verbalizing confidence to tell the reliability by inducing top-k responses and sampling-aggregating multiple responses, often fail, due to the lack of objective guidance of confidence. To address this, we propose CONfidence-Quality-ORDerpreserving alignment approach (CONQORD), leveraging reinforcement learning with a tailored dual-component reward function. This function encompasses quality reward and orderpreserving alignment reward functions. Specifically, the order-preserving reward incentivizes the model to verbalize greater confidence for responses of higher quality to align the order of confidence and quality. Experiments demonstrate that our CONQORD significantly improves the alignment performance between confidence levels and response accuracy, without causing the model to become over-cautious. Furthermore, the aligned confidence provided by CONQORD informs when to trust LLMs, and acts as a determinant for initiating the retrieval process of external knowledge. Aligning confidence with response quality ensures more transparent and reliable responses, providing better trustworthiness.

NLP-14-标题: Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM COLING2024

链接: https://arxiv.org/abs/2404.17283
作者: Xuan Zhang, Wei Gao
备注: Accepted by COLING 2024

点击查看摘要

Abstract:Retrieval-augmented language models have exhibited promising performance across various areas of natural language processing (NLP), including fact-critical tasks. However, due to the black-box nature of advanced large language models (LLMs) and the non-retrieval-oriented supervision signal of specific tasks, the training of retrieval model faces significant challenges under the setting of black-box LLM. We propose an approach leveraging Fine-grained Feedback with Reinforcement Retrieval (FFRR) to enhance fact-checking on news claims by using black-box LLM. FFRR adopts a two-level strategy to gather fine-grained feedback from the LLM, which serves as a reward for optimizing the retrieval policy, by rating the retrieved documents based on the non-retrieval ground truth of the task. We evaluate our model on two public datasets for real-world news claim verification, and the results demonstrate that FFRR achieves significant improvements over strong LLM-enabled and non-LLM baselines.

NLP-15-标题: Prompting Techniques for Reducing Social Bias in LLM s through System 1 and System 2 Cognitive Processes

链接: https://arxiv.org/abs/2404.17218
作者: Mahammed Kamruzzaman, Gene Louis Kim
备注:

点击查看摘要

Abstract:Dual process theory posits that human cognition arises via two systems. System 1, which is a quick, emotional, and intuitive process, which is subject to cognitive biases, and System 2, a slow, onerous, and deliberate process. NLP researchers often compare zero-shot prompting in LLMs to System 1 reasoning and chain-of-thought (CoT) prompting to System 2. In line with this interpretation, prior research has found that using CoT prompting in LLMs leads to reduced gender bias. We investigate the relationship between bias, CoT prompting, and dual process theory in LLMs directly. We compare zero-shot, CoT, and a variety of dual process theory-based prompting strategies on two bias datasets spanning nine different social bias categories. We also use human and machine personas to determine whether the effects of dual process theory in LLMs are based on modeling human cognition or inherent to the system. We find that a human persona, System 2, and CoT prompting all tend to reduce social biases in LLMs, though the best combination of features depends on the exact model and bias category – resulting in up to a 13 percent drop in stereotypical judgments by an LLM.

NLP-16-标题: Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot

链接: https://arxiv.org/abs/2404.17216
作者: Michelle Terblanche, Kayode Olaleye, Vukosi Marivate
备注: To be published in the Proceedings of SIGUL 2024: 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages

点击查看摘要

Abstract:Many multilingual communities, including numerous in Africa, frequently engage in code-switching during conversations. This behaviour stresses the need for natural language processing technologies adept at processing code-switched text. However, data scarcity, particularly in African languages, poses a significant challenge, as many are low-resourced and under-represented. In this study, we prompted GPT 3.5 to generate Afrikaans–English and Yoruba–English code-switched sentences, enhancing diversity using topic-keyword pairs, linguistic guidelines, and few-shot examples. Our findings indicate that the quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate. There is therefore a notable opportunity to refine prompting guidelines to yield sentences suitable for the fine-tuning of language models. We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT and propose leveraging this technology to mitigate data scarcity in low-resourced languages, underscoring the essential role of native speakers in this process.

NLP-17-标题: TIGQA:An Expert Annotated Question Answering Dataset in Tigrinya

链接: https://arxiv.org/abs/2404.17194
作者: Hailay Teklehaymanot, Dren Fazlija, Niloy Ganguly, Gourab K. Patro, Wolfgang Nejdl
备注: 9 pages,3 figures, 7 tables,2 listings

点击查看摘要

Abstract:The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources.This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models.The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.

NLP-18-标题: Prevalent Frequency of Emotional and Physical Symptoms in Social Anxiety using Zero Shot Classification: An Observational Study

链接: https://arxiv.org/abs/2404.17183
作者: Muhammad Rizwan, Jure Demšar
备注:

点击查看摘要

Abstract:Social anxiety represents a prevalent challenge in modern society, affecting individuals across personal and professional spheres. Left unaddressed, this condition can yield substantial negative consequences, impacting social interactions and performance. Further understanding its diverse physical and emotional symptoms becomes pivotal for comprehensive diagnosis and tailored therapeutic interventions. This study analyze prevalence and frequency of social anxiety symptoms taken from Mayo Clinic, exploring diverse human experiences from utilizing a large Reddit dataset dedicated to this issue. Leveraging these platforms, the research aims to extract insights and examine a spectrum of physical and emotional symptoms linked to social anxiety disorder. Upholding ethical considerations, the study maintains strict user anonymity within the dataset. By employing a novel approach, the research utilizes BART-based multi-label zero-shot classification to identify and measure symptom prevalence and significance in the form of probability score for each symptom under consideration. Results uncover distinctive patterns: “Trembling” emerges as a prevalent physical symptom, while emotional symptoms like “Fear of being judged negatively” exhibit high frequencies. These findings offer insights into the multifaceted nature of social anxiety, aiding clinical practices and interventions tailored to its diverse expressions.

NLP-19-标题: A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

链接: https://arxiv.org/abs/2404.17178
作者: Haojie Zhang, Yimeng Zhuang
备注:

点击查看摘要

Abstract:Few-shot Named Entity Recognition (NER) aims to extract named entities using only a limited number of labeled examples. Existing contrastive learning methods often suffer from insufficient distinguishability in context vector representation because they either solely rely on label semantics or completely disregard them. To tackle this issue, we propose a unified label-aware token-level contrastive learning framework. Our approach enriches the context by utilizing label semantics as suffix prompts. Additionally, it simultaneously optimizes context-context and context-label contrastive learning objectives to enhance generalized discriminative contextual representations.Extensive experiments on various traditional test domains (OntoNotes, CoNLL’03, WNUT’17, GUM, I2B2) and the large-scale few-shot NER dataset (FEWNERD) demonstrate the effectiveness of our approach. It outperforms prior state-of-the-art models by a significant margin, achieving an average absolute gain of 7% in micro F1 scores across most scenarios. Further analysis reveals that our model benefits from its powerful transfer capability and improved contextual representations.

NLP-20-标题: Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls

链接: https://arxiv.org/abs/2404.17143
作者: Shotaro Ishihara
备注: TrustNLP: Fourth Workshop on Trustworthy Natural Language Processing (Non-Archival)

点击查看摘要

Abstract:Dominant pre-trained language models (PLMs) have been successful in high-quality natural language generation. However, the analysis of their generation is not mature: do they acquire generalizable linguistic abstractions, or do they simply memorize and recover substrings of the training data? Especially, few studies focus on domain-specific PLM. In this study, we pre-trained domain-specific GPT-2 models using a limited corpus of Japanese newspaper articles and quantified memorization of training data by comparing them with general Japanese GPT-2 models. Our experiments revealed that domain-specific PLMs sometimes “copy and paste” on a large scale. Furthermore, we replicated the empirical finding that memorization is related to duplication, model size, and prompt length, in Japanese the same as in previous English studies. Our evaluations are relieved from data contamination concerns by focusing on newspaper paywalls, which prevent their use as training data. We hope that our paper encourages a sound discussion such as the security and copyright of PLMs.

NLP-21-标题: Small Language Models Need Strong Verifiers to Self-Correct Reasoning

链接: https://arxiv.org/abs/2404.17140
作者: Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
备注:

点击查看摘要

Abstract:Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether smaller-size (<= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

NLP-22-标题: Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study

链接: https://arxiv.org/abs/2404.17136
作者: Yang Wu, Yao Wan, Hongyu Zhang, Yulei Sui, Wucai Wei, Wei Zhao, Guandong Xu, Hai Jin
备注:

点击查看摘要

Abstract:The Natural Language to Visualization (NL2Vis) task aims to transform natural-language descriptions into visual representations for a grounded table, enabling users to gain insights from vast amounts of data. Recently, many deep learning-based approaches have been developed for NL2Vis. Despite the considerable efforts made by these approaches, challenges persist in visualizing data sourced from unseen databases or spanning multiple tables. Taking inspiration from the remarkable generation capabilities of Large Language Models (LLMs), this paper conducts an empirical study to evaluate their potential in generating visualizations, and explore the effectiveness of in-context learning prompts for enhancing this task. In particular, we first explore the ways of transforming structured tabular data into sequential text prompts, as to feed them into LLMs and analyze which table content contributes most to the NL2Vis. Our findings suggest that transforming structured tabular data into programs is effective, and it is essential to consider the table schema when formulating prompts. Furthermore, we evaluate two types of LLMs: finetuned models (e.g., T5-Small) and inference-only models (e.g., GPT-3.5), against state-of-the-art methods, using the NL2Vis benchmarks (i.e., nvBench). The experimental results reveal that LLMs outperform baselines, with inference-only models consistently exhibiting performance improvements, at times even surpassing fine-tuned models when provided with certain few-shot demonstrations through in-context learning. Finally, we analyze when the LLMs fail in NL2Vis, and propose to iteratively update the results using strategies such as chain-of-thought, role-playing, and code-interpreter. The experimental results confirm the efficacy of iterative updates and hold great potential for future study.

NLP-23-标题: Text Sentiment Analysis and Classification Based on Bidirectional Gated Recurrent Units (GRUs) Model

链接: https://arxiv.org/abs/2404.17123
作者: Wei Xu, Jianlong Chen, Zhicheng Ding, Jinyin Wang
备注:

点击查看摘要

Abstract:This paper explores the importance of text sentiment analysis and classification in the field of natural language processing, and proposes a new approach to sentiment analysis and classification based on the bidirectional gated recurrent units (GRUs) model. The study firstly analyses the word cloud model of the text with six sentiment labels, and then carries out data preprocessing, including the steps of removing special symbols, punctuation marks, numbers, stop words and non-alphabetic parts. Subsequently, the data set is divided into training set and test set, and through model training and testing, it is found that the accuracy of the validation set is increased from 85% to 93% with training, which is an increase of 8%; at the same time, the loss value of the validation set decreases from 0.7 to 0.1 and tends to be stable, and the model is gradually close to the actual value, which can effectively classify the text emotions. The confusion matrix shows that the accuracy of the model on the test set reaches 94.8%, the precision is 95.9%, the recall is 99.1%, and the F1 score is 97.4%, which proves that the model has good generalisation ability and classification effect. Overall, the study demonstrated an effective method for text sentiment analysis and classification with satisfactory results.

NLP-24-标题: 2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion

链接: https://arxiv.org/abs/2404.17122
作者: Dongsheng Wang, Xiaoqin Feng, Zeming Liu, Chuan Wang
备注: 20 pages

点击查看摘要

Abstract:Named entity recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying entities in sentences into pre-defined types. It plays a crucial role in various research fields, including entity linking, question answering, and online product recommendation. Recent studies have shown that incorporating multilingual and multimodal datasets can enhance the effectiveness of NER. This is due to language transfer learning and the presence of shared implicit features across different modalities. However, the lack of a dataset that combines multilingualism and multimodality has hindered research exploring the combination of these two aspects, as multimodality can help NER in multiple languages simultaneously. In this paper, we aim to address a more challenging task: multilingual and multimodal named entity recognition (MMNER), considering its potential value and influence. Specifically, we construct a large-scale MMNER dataset with four languages (English, French, German and Spanish) and two modalities (text and image). To tackle this challenging MMNER task on the dataset, we introduce a new model called 2M-NER, which aligns the text and image representations using contrastive learning and integrates a multimodal collaboration module to effectively depict the interactions between the two modalities. Extensive experimental results demonstrate that our model achieves the highest F1 score in multilingual and multimodal NER tasks compared to some comparative and representative baselines. Additionally, in a challenging analysis, we discovered that sentence-level alignment interferes a lot with NER models, indicating the higher level of difficulty in our dataset.

NLP-25-标题: Talking Nonsense: Probing Large Language Models Understanding of Adversarial Gibberish Inputs

链接: https://arxiv.org/abs/2404.17120
作者: Valeriia Cherepanova, James Zou
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit excellent ability to understand human languages, but do they also understand their own language that appears gibberish to us? In this work we delve into this question, aiming to uncover the mechanisms underlying such behavior in LLMs. We employ the Greedy Coordinate Gradient optimizer to craft prompts that compel LLMs to generate coherent responses from seemingly nonsensical inputs. We call these inputs LM Babel and this work systematically studies the behavior of LLMs manipulated by these prompts. We find that the manipulation efficiency depends on the target text’s length and perplexity, with the Babel prompts often located in lower loss minima compared to natural prompts. We further examine the structure of the Babel prompts and evaluate their robustness. Notably, we find that guiding the model to generate harmful texts is not more difficult than into generating benign texts, suggesting lack of alignment for out-of-distribution prompts.

NLP-26-标题: Player-Driven Emergence in LLM -Driven Game Narrative

链接: https://arxiv.org/abs/2404.17027
作者: Xiangyu Peng, Jessica Quaye, Weijia Xu, Chris Brockett, Bill Dolan, Nebojsa Jojic, Gabriel DesGarennes, Ken Lobb, Michael Xu, Jorge Leandro, Claire Jin, Sudha Rao
备注:

点击查看摘要

Abstract:We explore how interaction with large language models (LLMs) can give rise to emergent behaviors, empowering players to participate in the evolution of game narratives. Our testbed is a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise, but can freely interact with non-player characters generated by GPT-4, a large language model. We recruit 28 gamers to play the game and use GPT-4 to automatically convert the game logs into a node-graph representing the narrative in the player’s gameplay. We find that through their interactions with the non-deterministic behavior of the LLM, players are able to discover interesting new emergent nodes that were not a part of the original narrative but have potential for being fun and engaging. Players that created the most emergent nodes tended to be those that often enjoy games that facilitate discovery, exploration and experimentation.

NLP-27-标题: Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models

链接: https://arxiv.org/abs/2404.17010
作者: Eren Dogan, M. Egemen Uzun, Atahan Uz, H. Emre Seyrek, Ahmed Zeer, Ezgi Sevi, H. Toprak Kesgin, M. Kaan Yuce, M. Fatih Amasyali
备注: in Turkish language. Baz{\i} \c{c}al{\i}\c{s}malar{\i} i\c{c}ermedi\u{g}ini s"oyleyen hakem yorumu nedeniyle bir konferanstan kabul almad{\i}. Ancak hakemin bahsetti\u{g}i \c{c}al{\i}\c{s}malar bildiri g"onderme son tarihinde yay{\i}nlanmam{\i}\c{s}t{\i}

点击查看摘要

Abstract:The developments that language models have provided in fulfilling almost all kinds of tasks have attracted the attention of not only researchers but also the society and have enabled them to become products. There are commercially successful language models available. However, users may prefer open-source language models due to cost, data privacy, or regulations. Yet, despite the increasing number of these models, there is no comprehensive comparison of their performance for Turkish. This study aims to fill this gap in the literature. A comparison is made among seven selected language models based on their contextual learning and question-answering abilities. Turkish datasets for contextual learning and question-answering were prepared, and both automatic and human evaluations were conducted. The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish and that in-context learning performances do not much related to question-answering performances.

NLP-28-标题: Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models

链接: https://arxiv.org/abs/2404.17000
作者: Bradley P. Allen, Paul T. Groth
备注: 11 pages, 1 figure, 2 tables, accepted at the European Semantic Web Conference Special Track on Large Language Models for Knowledge Engineering, Hersonissos, Crete, GR, May 2024, for associated code and data, see this https URL

点击查看摘要

Abstract:A backbone of knowledge graphs are their class membership relations, which assign entities to a given class. As part of the knowledge engineering process, we propose a new method for evaluating the quality of these relations by processing descriptions of a given entity and class using a zero-shot chain-of-thought classifier that uses a natural language intensional definition of a class. We evaluate the method using two publicly available knowledge graphs, Wikidata and CaLiGraph, and 7 large language models. Using the gpt-4-0125-preview large language model, the method’s classification performance achieves a macro-averaged F1-score of 0.830 on data from Wikidata and 0.893 on data from CaLiGraph. Moreover, a manual analysis of the classification errors shows that 40.9% of errors were due to the knowledge graphs, with 16.0% due to missing relations and 24.9% due to incorrectly asserted relations. These results show how large language models can assist knowledge engineers in the process of knowledge graph refinement. The code and data are available on Github.

NLP-29-标题: Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

链接: https://arxiv.org/abs/2404.16966
作者: Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
备注:

点击查看摘要

Abstract:Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model’s average performance across the test prompts of a benchmark to evaluate the model’s performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.

NLP-30-标题: A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice ACL

链接: https://arxiv.org/abs/2404.16958
作者: Juri Opitz
备注: to appear in TACL, this is a pre-MIT Press publication version

点击查看摘要

Abstract:Classification systems are evaluated in a countless number of papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without arguments, and blurry terminology invites misconceptions. For instance, many works use so-called ‘macro’ metrics to rank systems (e.g., ‘macro F1’) but do not clearly specify what they would expect from such a ‘macro’ metric. This is problematic, since picking a metric can affect paper findings as well as shared task rankings, and thus any clarity in the process should be maximized. Starting from the intuitive concepts of bias and prevalence, we perform an analysis of common evaluation metrics, considering expectations as found expressed in papers. Equipped with a thorough understanding of the metrics, we survey metric selection in recent shared tasks of Natural Language Processing. The results show that metric choices are often not supported with convincing arguments, an issue that can make any ranking seem arbitrary. This work aims at providing overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.

NLP-31-标题: A Survey of Generative Search and Recommendation in the Era of Large Language Models

链接: https://arxiv.org/abs/2404.16924
作者: Yongqi Li, Xinyu Lin, Wenjie Wang, Fuli Feng, Liang Pang, Wenjie Li, Liqiang Nie, Xiangnan He, Tat-Seng Chua
备注:

点击查看摘要

Abstract:With the information explosion on the Web, search and recommendation are foundational infrastructures to satisfying users’ information needs. As the two sides of the same coin, both revolve around the same core research problem, matching queries with documents or users with items. In the recent few decades, search and recommendation have experienced synchronous technological paradigm shifts, including machine learning-based and deep learning-based paradigms. Recently, the superintelligent generative large language models have sparked a new paradigm in search and recommendation, i.e., generative search (retrieval) and recommendation, which aims to address the matching problem in a generative manner. In this paper, we provide a comprehensive survey of the emerging paradigm in information systems and summarize the developments in generative search and recommendation from a unified perspective. Rather than simply categorizing existing works, we abstract a unified framework for the generative paradigm and break down the existing works into different stages within this framework to highlight the strengths and weaknesses. And then, we distinguish generative search and recommendation with their unique challenges, identify open problems and future directions, and envision the next information-seeking paradigm.

NLP-32-标题: A Short Survey of Human Mobility Prediction in Epidemic Modeling from Transformer s to LLM s

链接: https://arxiv.org/abs/2404.16921
作者: Christian N. Mayemba, D’Jeff K. Nkashama, Jean Marie Tshimula, Maximilien V. Dialufuma, Jean Tshibangu Muabila, Mbuyi Mukendi Didier, Hugues Kanda, René Manassé Galekwa, Heber Dibwe Fita, Serge Mundele, Kalonji Kalala, Aristarque Ilunga, Lambert Mukendi Ntobo, Dominique Muteba, Aaron Aruna Abedi
备注:

点击查看摘要

Abstract:This paper provides a comprehensive survey of recent advancements in leveraging machine learning techniques, particularly Transformer models, for predicting human mobility patterns during epidemics. Understanding how people move during epidemics is essential for modeling the spread of diseases and devising effective response strategies. Forecasting population movement is crucial for informing epidemiological models and facilitating effective response planning in public health emergencies. Predicting mobility patterns can enable authorities to better anticipate the geographical and temporal spread of diseases, allocate resources more efficiently, and implement targeted interventions. We review a range of approaches utilizing both pretrained language models like BERT and Large Language Models (LLMs) tailored specifically for mobility prediction tasks. These models have demonstrated significant potential in capturing complex spatio-temporal dependencies and contextual patterns in textual data.

NLP-33-标题: Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing

链接: https://arxiv.org/abs/2404.16914
作者: Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, Tong Yang
备注:

点击查看摘要

Abstract:MoE facilitates the development of large models by making the computational complexity of the model no longer scale linearly with increasing parameters. The learning sparse gating network selects a set of experts for each token to be processed; however, this may lead to differences in the number of tokens processed by each expert over several successive iterations, i.e., the expert load fluctuations, which reduces computational parallelization and resource utilization. To this end, we traced and analyzed loads of each expert in the training iterations for several large language models in this work, and defined the transient state with “obvious load fluctuation” and the stable state with “temporal locality”. Moreover, given the characteristics of these two states and the computational overhead, we deployed three classical prediction algorithms that achieve accurate expert load prediction results. For the GPT3 350M model, the average error rates for predicting the expert load proportion over the next 1,000 and 2,000 steps are approximately 1.3% and 1.8%, respectively. This work can provide valuable guidance for expert placement or resource allocation for MoE model training. Based on this work, we will propose an expert placement scheme for transient and stable states in our coming work.

NLP-34-标题: Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

链接: https://arxiv.org/abs/2404.16905
作者: Shen Zhang, Haojie Zhang, Jing Zhang, Xudong Zhang, Yimeng Zhuang, Jinting Wu
备注:

点击查看摘要

Abstract:In human-computer interaction, it is crucial for agents to respond to human by understanding their emotions. Unraveling the causes of emotions is more challenging. A new task named Multimodal Emotion-Cause Pair Extraction in Conversations is responsible for recognizing emotion and identifying causal expressions. In this study, we propose a multi-stage framework to generate emotion and extract the emotion causal pairs given the target emotion. In the first stage, Llama-2-based InstructERC is utilized to extract the emotion category of each utterance in a conversation. After emotion recognition, a two-stream attention model is employed to extract the emotion causal pairs given the target emotion for subtask 2 while MuTEC is employed to extract causal span for subtask 1. Our approach achieved first place for both of the two subtasks in the competition.

NLP-35-标题: Attacks on Third-Party APIs of Large Language Models ICLR2024

链接: https://arxiv.org/abs/2404.16891
作者: Wanru Zhao, Vidit Khazanchi, Haodi Xing, Xuanli He, Qiongkai Xu, Nicholas Donald Lane
备注: ICLR 2024 Workshop on Secure and Trustworthy Large Language Models

点击查看摘要

Abstract:Large language model (LLM) services have recently begun offering a plugin ecosystem to interact with third-party API services. This innovation enhances the capabilities of LLMs, but it also introduces risks, as these plugins developed by various third parties cannot be easily trusted. This paper proposes a new attacking framework to examine security and safety vulnerabilities within LLM platforms that incorporate third-party services. Applying our framework specifically to widely used LLMs, we identify real-world malicious attacks across various domains on third-party APIs that can imperceptibly modify LLM outputs. The paper discusses the unique challenges posed by third-party API integration and offers strategic possibilities to improve the security and safety of LLM ecosystems moving forward. Our code is released at this https URL.

NLP-36-标题: Adv Prompt er: Fast Adaptive Adversarial Prompting for LLM s

链接: https://arxiv.org/abs/2404.16873
作者: Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
备注: 32 pages, 9 figures, 7 tables

点击查看摘要

Abstract:While recently Large Language Models (LLMs) have achieved remarkable successes, they are vulnerable to certain jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires finding adversarial prompts that cause such jailbreaking, e.g. by appending a suffix to a given instruction, which is inefficient and time-consuming. On the other hand, automatic adversarial prompt generation often leads to semantically meaningless attacks that can easily be detected by perplexity-based filters, may require gradient information from the TargetLLM, or do not scale well due to time-consuming discrete optimization processes over the token space. In this paper, we present a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, \sim800\times faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. This process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter predictions, and (2) low-rank fine-tuning of the AdvPrompter with the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLMs show state-of-the-art results on the AdvBench dataset, that also transfer to closed-source black-box LLM APIs. Further, we demonstrate that by fine-tuning on a synthetic dataset generated by AdvPrompter, LLMs can be made more robust against jailbreaking attacks while maintaining performance, i.e. high MMLU scores.

NLP-37-标题: Rumour Evaluation with Very Large Language Models

链接: https://arxiv.org/abs/2404.16859
作者: Dahlia Shehata, Robin Cohen, Charles Clarke
备注:

点击查看摘要

Abstract:Conversational prompt-engineering-based large language models (LLMs) have enabled targeted control over the output creation, enhancing versatility, adaptability and adhoc retrieval. From another perspective, digital misinformation has reached alarming levels. The anonymity, availability and reach of social media offer fertile ground for rumours to propagate. This work proposes to leverage the advancement of prompting-dependent LLMs to combat misinformation by extending the research efforts of the RumourEval task on its Twitter dataset. To the end, we employ two prompting-based LLM variants (GPT-3.5-turbo and GPT-4) to extend the two RumourEval subtasks: (1) veracity prediction, and (2) stance classification. For veracity prediction, three classifications schemes are experimented per GPT variant. Each scheme is tested in zero-, one- and few-shot settings. Our best results outperform the precedent ones by a substantial margin. For stance classification, prompting-based-approaches show comparable performance to prior results, with no improvement over finetuning methods. Rumour stance subtask is also extended beyond the original setting to allow multiclass classification. All of the generated predictions for both subtasks are equipped with confidence scores determining their trustworthiness degree according to the LLM, and post-hoc justifications for explainability and interpretability purposes. Our primary aim is AI for social good.

NLP-38-标题: A Disease Labeler for Chinese Chest X-Ray Report Generation

链接: https://arxiv.org/abs/2404.16852
作者: Mengwei Wang, Ruixin Yan, Zeyi Hou, Ning Lang, Xiuzhuang Zhou
备注:

点击查看摘要

Abstract:In the field of medical image analysis, the scarcity of Chinese chest X-ray report datasets has hindered the development of technology for generating Chinese chest X-ray reports. On one hand, the construction of a Chinese chest X-ray report dataset is limited by the time-consuming and costly process of accurate expert disease annotation. On the other hand, a single natural language generation metric is commonly used to evaluate the similarity between generated and ground-truth reports, while the clinical accuracy and effectiveness of the generated reports rely on an accurate disease labeler (classifier). To address the issues, this study proposes a disease labeler tailored for the generation of Chinese chest X-ray reports. This labeler leverages a dual BERT architecture to handle diagnostic reports and clinical information separately and constructs a hierarchical label learning algorithm based on the affiliation between diseases and body parts to enhance text classification performance. Utilizing this disease labeler, a Chinese chest X-ray report dataset comprising 51,262 report samples was established. Finally, experiments and analyses were conducted on a subset of expert-annotated Chinese chest X-ray reports, validating the effectiveness of the proposed disease labeler.

NLP-39-标题: A Semi-Automatic Approach to Create Large Gender- and Age-Balanced Speaker Corpora: Usefulness of Speaker Diarization & Identification

链接: https://arxiv.org/abs/2404.17552
作者: Rémi Uro, David Doukhan, Albert Rilliard, Laëtitia Larcher, Anissa-Claire Adgharouamane, Marie Tahon, Antoine Laurent
备注: Keywords:, semi-automatic processing, corpus creation, diarization, speaker identification, gender-balanced, age-balanced, speaker corpus, diachrony

点击查看摘要

Abstract:This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker’s age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges and 4 recording periods). Corpora were selected at French National Institute of Audiovisual (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have be found yet). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background music and overlapped speech removal and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. Evaluation of the quality of the automatic processing and of the final output is provided. It shows the automatic processing compare to up-to-date process, and that the output provides high quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.

NLP-40-标题: Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation

链接: https://arxiv.org/abs/2404.16880
作者: Yikun Zhang, Geyan Ye, Chaohao Yuan, Bo Han, Long-Kai Huang, Jianhua Yao, Wei Liu, Yu Rong
备注:

点击查看摘要

Abstract:Molecule-and-text cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representation, thereby improving performance in various scientific fields, including drug discovery and materials science. Existing studies adopt a global alignment approach to learn the knowledge from different modalities. These global alignment approaches fail to capture fine-grained information, such as molecular fragments and their corresponding textual description, which is crucial for downstream tasks. Furthermore, it is incapable to model such information using a similar global alignment strategy due to data scarcity of paired local part annotated data from existing datasets. In this paper, we propose Atomas, a multi-modal molecular representation learning framework to jointly learn representations from SMILES string and text. We design a Hierarchical Adaptive Alignment model to concurrently learn the fine-grained fragment correspondence between two modalities and align these representations of fragments in three levels. Additionally, Atomas’s end-to-end training framework incorporates the tasks of understanding and generating molecule, thereby supporting a wider range of downstream tasks. In the retrieval task, Atomas exhibits robust generalization ability and outperforms the baseline by 30.8% of recall@1 on average. In the generation task, Atomas achieves state-of-the-art results in both molecule captioning task and molecule generation task. Moreover, the visualization of the Hierarchical Adaptive Alignment model further confirms the chemical significance of our approach. Our codes can be found at https://anonymous.4open.science/r/Atomas-03C3.

机器学习

ML-0-标题: An exactly solvable model for emergence and scaling laws

链接: https://arxiv.org/abs/2404.17563
作者: Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Ard Louis
备注:

点击查看摘要

Abstract:Deep learning models can exhibit what appears to be a sudden ability to solve a new problem as training time ( T ), training data ( D ), or model size ( N ) increases, a phenomenon known as emergence. In this paper, we present a framework where each new ability (a skill) is represented as a basis function. We solve a simple multi-linear model in this skill-basis, finding analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute ( C ). We compare our detailed calculations to direct simulations of a two-layer neural network trained on multitask sparse parity, where the tasks in the dataset are distributed according to a power-law. Our simple model captures, using a single fit parameter, the sigmoidal emergence of multiple new skills as training time, data size or model size increases in the neural network.

ML-1-标题: Federated Transfer Component Analysis Towards Effective VNF Profiling

链接: https://arxiv.org/abs/2404.17553
作者: Xunzheng ZhangB, Shadi Moazzeni, Juan Marcelo Parra-Ullauri, Reza Nejabati, Dimitra Simeonidou
备注:

点击查看摘要

Abstract:The increasing concerns of knowledge transfer and data privacy challenge the traditional gather-and-analyse paradigm in networks. Specifically, the intelligent orchestration of Virtual Network Functions (VNFs) requires understanding and profiling the resource consumption. However, profiling all kinds of VNFs is time-consuming. It is important to consider transferring the well-profiled VNF knowledge to other lack-profiled VNF types while keeping data private. To this end, this paper proposes a Federated Transfer Component Analysis (FTCA) method between the source and target VNFs. FTCA first trains Generative Adversarial Networks (GANs) based on the source VNF profiling data, and the trained GANs model is sent to the target VNF domain. Then, FTCA realizes federated domain adaptation by using the generated source VNF data and less target VNF profiling data, while keeping the raw data locally. Experiments show that the proposed FTCA can effectively predict the required resources for the target VNF. Specifically, the RMSE index of the regression model decreases by 38.5% and the R-squared metric advances up to 68.6%.

ML-2-标题: Using Neural Implicit Flow To Represent Latent Dynamics Of Canonical Systems

链接: https://arxiv.org/abs/2404.17535
作者: Imran Nasim, Joaõ Lucas de Sousa Almeida
备注: Accepted into the International conference on Scientific Computation and Machine Learning 2024 (SCML 2024)

点击查看摘要

Abstract:The recently introduced class of architectures known as Neural Operators has emerged as highly versatile tools applicable to a wide range of tasks in the field of Scientific Machine Learning (SciML), including data representation and forecasting. In this study, we investigate the capabilities of Neural Implicit Flow (NIF), a recently developed mesh-agnostic neural operator, for representing the latent dynamics of canonical systems such as the Kuramoto-Sivashinsky (KS), forced Korteweg-de Vries (fKdV), and Sine-Gordon (SG) equations, as well as for extracting dynamically relevant information from them. Finally we assess the applicability of NIF as a dimensionality reduction algorithm and conduct a comparative analysis with another widely recognized family of neural operators, known as Deep Operator Networks (DeepONets).

ML-3-标题: Bridging the Fairness Divide: Achieving Group and Individual Fairness in Graph Neural Networks

链接: https://arxiv.org/abs/2404.17511
作者: Duna Zhan, Dongliang Guo, Pengsheng Ji, Sheng Li
备注: 16 pages, 3 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) have emerged as a powerful tool for analyzing and learning from complex data structured as graphs, demonstrating remarkable effectiveness in various applications, such as social network analysis, recommendation systems, and drug discovery. However, despite their impressive performance, the fairness problem has increasingly gained attention as a crucial aspect to consider. Existing research in graph learning focuses on either group fairness or individual fairness. However, since each concept provides unique insights into fairness from distinct perspectives, integrating them into a fair graph neural network system is crucial. To the best of our knowledge, no study has yet to comprehensively tackle both individual and group fairness simultaneously. In this paper, we propose a new concept of individual fairness within groups and a novel framework named Fairness for Group and Individual (FairGI), which considers both group fairness and individual fairness within groups in the context of graph learning. FairGI employs the similarity matrix of individuals to achieve individual fairness within groups, while leveraging adversarial learning to address group fairness in terms of both Equal Opportunity and Statistical Parity. The experimental results demonstrate that our approach not only outperforms other state-of-the-art models in terms of group fairness and individual fairness within groups, but also exhibits excellent performance in population-level individual fairness, while maintaining comparable prediction accuracy.

ML-4-标题: Constrained Neural Networks for Interpretable Heuristic Creation to Optimise Computer Algebra Systems

链接: https://arxiv.org/abs/2404.17508
作者: Dorian Florescu, Matthew England
备注: Accepted for presentation at ICMS 2024

点击查看摘要

Abstract:We present a new methodology for utilising machine learning technology in symbolic computation research. We explain how a well known human-designed heuristic to make the choice of variable ordering in cylindrical algebraic decomposition may be represented as a constrained neural network. This allows us to then use machine learning methods to further optimise the heuristic, leading to new networks of similar size, representing new heuristics of similar complexity as the original human-designed one. We present this as a form of ante-hoc explainability for use in computer algebra development.

ML-5-标题: Causally Abstracted Multi-armed Bandits UAI

链接: https://arxiv.org/abs/2404.17493
作者: Fabio Massimo Zennaro, Nicholas Bishop, Joel Dyer, Yorgos Felekis, Anisoara Calinescu, Michael Wooldridge, Theodoros Damoulas
备注: 8 pages, 3 figures (main article); 20 pages, 10 figures (appendix); 40th Conference on Uncertainty in Artificial Intelligence (UAI)

点击查看摘要

Abstract:Multi-armed bandits (MAB) and causal MABs (CMAB) are established frameworks for decision-making problems. The majority of prior work typically studies and solves individual MAB and CMAB in isolation for a given problem and associated data. However, decision-makers are often faced with multiple related problems and multi-scale observations where joint formulations are needed in order to efficiently exploit the problem structures and data dependencies. Transfer learning for CMABs addresses the situation where models are defined on identical variables, although causal connections may differ. In this work, we extend transfer learning to setups involving CMABs defined on potentially different variables, with varying degrees of granularity, and related via an abstraction map. Formally, we introduce the problem of causally abstracted MABs (CAMABs) by relying on the theory of causal abstraction in order to express a rigorous abstraction map. We propose algorithms to learn in a CAMAB, and study their regret. We illustrate the limitations and the strengths of our algorithms on a real-world scenario related to online advertising.

ML-6-标题: Tabular Data Contrastive Learning via Class-Conditioned and Feature-Correlation Based Augmentation

链接: https://arxiv.org/abs/2404.17489
作者: Wei Cui, Rasa Hosseinzadeh, Junwei Ma, Tongzi Wu, Yi Sui, Keyvan Golestan
备注: 14 pages, 4 algorithms, 3 figures, 5 tables

点击查看摘要

Abstract:Contrastive learning is a model pre-training technique by first creating similar views of the original data, and then encouraging the data and its corresponding views to be close in the embedding space. Contrastive learning has witnessed success in image and natural language data, thanks to the domain-specific augmentation techniques that are both intuitive and effective. Nonetheless, in tabular domain, the predominant augmentation technique for creating views is through corrupting tabular entries via swapping values, which is not as sound or effective. We propose a simple yet powerful improvement to this augmentation technique: corrupting tabular data conditioned on class identity. Specifically, when corrupting a specific tabular entry from an anchor row, instead of randomly sampling a value in the same feature column from the entire table uniformly, we only sample from rows that are identified to be within the same class as the anchor row. We assume the semi-supervised learning setting, and adopt the pseudo labeling technique for obtaining class identities over all table rows. We also explore the novel idea of selecting features to be corrupted based on feature correlation structures. Extensive experiments show that the proposed approach consistently outperforms the conventional corruption method for tabular data classification tasks. Our code is available at this https URL.

ML-7-标题: Conformal Prediction with Learned Features

链接: https://arxiv.org/abs/2404.17487
作者: Shayan Kiyani, George Pappas, Hamed Hassani
备注:

点击查看摘要

Abstract:In this paper, we focus on the problem of conformal prediction with conditional guarantees. Prior work has shown that it is impossible to construct nontrivial prediction sets with full conditional coverage guarantees. A wealth of research has considered relaxations of full conditional guarantees, relying on some predefined uncertainty structures. Departing from this line of thinking, we propose Partition Learning Conformal Prediction (PLCP), a framework to improve conditional validity of prediction sets through learning uncertainty-guided features from the calibration data. We implement PLCP efficiently with alternating gradient descent, utilizing off-the-shelf machine learning models. We further analyze PLCP theoretically and provide conditional guarantees for infinite and finite sample sizes. Finally, our experimental results over four real-world and synthetic datasets show the superior performance of PLCP compared to state-of-the-art methods in terms of coverage and length in both classification and regression scenarios.

ML-8-标题: Fast Abstracts and Student Forum Proceedings – EDCC 2024 – 19th European Dependable Computing Conference

链接: https://arxiv.org/abs/2404.17465
作者: Simona Bernardi, Tommaso Zoppi
备注:

点击查看摘要

Abstract:The goal of the Fast Abstracts track is to bring together researchers and practitioners working on dependable computing to discuss work in progress or opinion pieces. Contributions are welcome from academia and industry. Fast Abstracts aim to serve as a rapid and flexible mechanism to: (i) Report on current work that may or may not be complete; (ii) Introduce new ideas to the community; (iii) State positions on controversial issues or open problems; (iv) Share lessons learnt from real-word dependability engineering; and (v) Debunk or question results from other papers based on contra-indications. The Student Forum aims at creating a vibrant and friendly environment where students can present and discuss their work, and exchange ideas and experiences with other students, researchers and industry. One of the key goals of the Forum is to provide students with feedback on their preliminary results that might help with their future research directions.

ML-9-标题: Multi-layer random features and the approximation power of neural networks UAI

链接: https://arxiv.org/abs/2404.17461
作者: Rustem Takhanov
备注: Accepted to Uncertainty in Artificial Intelligence (UAI) 2024

点击查看摘要

Abstract:A neural architecture with randomly initialized weights, in the infinite width limit, is equivalent to a Gaussian Random Field whose covariance function is the so-called Neural Network Gaussian Process kernel (NNGP). We prove that a reproducing kernel Hilbert space (RKHS) defined by the NNGP contains only functions that can be approximated by the architecture. To achieve a certain approximation error the required number of neurons in each layer is defined by the RKHS norm of the target function. Moreover, the approximation can be constructed from a supervised dataset by a random multi-layer representation of an input vector, together with training of the last layer’s weights. For a 2-layer NN and a domain equal to an n-1 -dimensional sphere in \mathbb R^n , we compare the number of neurons required by Barron’s theorem and by the multi-layer features construction. We show that if eigenvalues of the integral operator of the NNGP decay slower than k^-n-\frac23 where k is an order of an eigenvalue, then our theorem guarantees a more succinct neural network approximation than Barron’s theorem. We also make some computational experiments to verify our theoretical findings. Our experiments show that realistic neural networks easily learn target functions even when both theorems do not give any guarantees.

ML-10-标题: Domain Adaptive and Fine-grained Anomaly Detection for Single-cell Sequencing Data and Beyond IJCAI2024

链接: https://arxiv.org/abs/2404.17454
作者: Kaichen Xu, Yueyang Ding, Suyang Hou, Weiqiang Zhan, Nisang Chen, Jun Wang, Xiaobo Sun
备注: 17 pages, 2 figures. Accepted by IJCAI 2024

点击查看摘要

Abstract:Fined-grained anomalous cell detection from affected tissues is critical for clinical diagnosis and pathological research. Single-cell sequencing data provide unprecedented opportunities for this task. However, current anomaly detection methods struggle to handle domain shifts prevalent in multi-sample and multi-domain single-cell sequencing data, leading to suboptimal performance. Moreover, these methods fall short of distinguishing anomalous cells into pathologically distinct subtypes. In response, we propose ACSleuth, a novel, reconstruction deviation-guided generative framework that integrates the detection, domain adaptation, and fine-grained annotating of anomalous cells into a methodologically cohesive workflow. Notably, we present the first theoretical analysis of using reconstruction deviations output by generative models for anomaly detection in lieu of domain shifts. This analysis informs us to develop a novel and superior maximum mean discrepancy-based anomaly scorer in ACSleuth. Extensive benchmarks over various single-cell data and other types of tabular data demonstrate ACSleuth’s superiority over the state-of-the-art methods in identifying and subtyping anomalies in multi-sample and multi-domain contexts. Our code is available at this https URL.

ML-11-标题: A Continuous Relaxation for Discrete Bayesian Optimization

链接: https://arxiv.org/abs/2404.17452
作者: Richard Michael, Simon Bartels, Miguel González-Duque, Yevgen Zainchkovskyy, Jes Frellsen, Søren Hauberg, Wouter Boomsma
备注:

点击查看摘要

Abstract:To optimize efficiently over discrete data and with only few available target observations is a challenge in Bayesian optimization. We propose a continuous relaxation of the objective function and show that inference and optimization can be computationally tractable. We consider in particular the optimization domain where very few observations and strict budgets exist; motivated by optimizing protein sequences for expensive to evaluate bio-chemical properties. The advantages of our approach are two-fold: the problem is treated in the continuous setting, and available prior knowledge over sequences can be incorporated directly. More specifically, we utilize available and learned distributions over the problem domain for a weighting of the Hellinger distance which yields a covariance function. We show that the resulting acquisition function can be optimized with both continuous or discrete optimization algorithms and empirically assess our method on two bio-chemical sequence optimization tasks.

ML-12-标题: Any-Quantile Probabilistic Forecasting of Short-Term Electricity Demand

链接: https://arxiv.org/abs/2404.17451
作者: Slawek Smyl, Boris N. Oreshkin, Paweł Pełka, Grzegorz Dudek
备注:

点击查看摘要

Abstract:Power systems operate under uncertainty originating from multiple factors that are impossible to account for deterministically. Distributional forecasting is used to control and mitigate risks associated with this uncertainty. Recent progress in deep learning has helped to significantly improve the accuracy of point forecasts, while accurate distributional forecasting still presents a significant challenge. In this paper, we propose a novel general approach for distributional forecasting capable of predicting arbitrary quantiles. We show that our general approach can be seamlessly applied to two distinct neural architectures leading to the state-of-the-art distributional forecasting results in the context of short-term electricity demand forecasting task. We empirically validate our method on 35 hourly electricity demand time-series for European countries. Our code is available here: this https URL.

ML-13-标题: Evaluations of Machine Learning Privacy Defenses are Misleading

链接: https://arxiv.org/abs/2404.17399
作者: Michael Aerni, Jie Zhang, Florian Tramèr
备注:

点击查看摘要

Abstract:Empirical defenses for machine learning privacy forgo the provable guarantees of differential privacy in the hope of achieving higher utility while resisting realistic adversaries. We identify severe pitfalls in existing empirical privacy evaluations (based on membership inference attacks) that result in misleading conclusions. In particular, we show that prior evaluations fail to characterize the privacy leakage of the most vulnerable samples, use weak attacks, and avoid comparisons with practical differential privacy baselines. In 5 case studies of empirical privacy defenses, we find that prior evaluations underestimate privacy leakage by an order of magnitude. Under our stronger evaluation, none of the empirical defenses we study are competitive with a properly tuned, high-utility DP-SGD baseline (with vacuous provable guarantees).

ML-14-标题: M3BAT: Unsupervised Domain Adaptation for Multimodal Mobile Sensing with Multi-Branch Adversarial Training

链接: https://arxiv.org/abs/2404.17391
作者: Lakmal Meegahapola, Hamza Hassoune, Daniel Gatica-Perez
备注: Accepted at the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT). Paper will be presented at ACM UbiComp 2024

点击查看摘要

Abstract:Over the years, multimodal mobile sensing has been used extensively for inferences regarding health and well being, behavior, and context. However, a significant challenge hindering the widespread deployment of such models in real world scenarios is the issue of distribution shift. This is the phenomenon where the distribution of data in the training set differs from the distribution of data in the real world, the deployment environment. While extensively explored in computer vision and natural language processing, and while prior research in mobile sensing briefly addresses this concern, current work primarily focuses on models dealing with a single modality of data, such as audio or accelerometer readings, and consequently, there is little research on unsupervised domain adaptation when dealing with multimodal sensor data. To address this gap, we did extensive experiments with domain adversarial neural networks (DANN) showing that they can effectively handle distribution shifts in multimodal sensor data. Moreover, we proposed a novel improvement over DANN, called M3BAT, unsupervised domain adaptation for multimodal mobile sensing with multi-branch adversarial training, to account for the multimodality of sensor data during domain adaptation with multiple branches. Through extensive experiments conducted on two multimodal mobile sensing datasets, three inference tasks, and 14 source-target domain pairs, including both regression and classification, we demonstrate that our approach performs effectively on unseen domains. Compared to directly deploying a model trained in the source domain to the target domain, the model shows performance increases up to 12% AUC (area under the receiver operating characteristics curves) on classification tasks, and up to 0.13 MAE (mean absolute error) on regression tasks.

ML-15-标题: Adversarial Consistency and the Uniqueness of the Adversarial Bayes Classifier

链接: https://arxiv.org/abs/2404.17358
作者: Natalie S. Frank
备注: 17 pages

点击查看摘要

Abstract:Adversarial training is a common technique for learning robust classifiers. Prior work showed that convex surrogate losses are not statistically consistent in the adversarial context – or in other words, a minimizing sequence of the adversarial surrogate risk will not necessarily minimize the adversarial classification error. We connect the consistency of adversarial surrogate losses to properties of minimizers to the adversarial classification risk, known as \emphadversarial Bayes classifiers. Specifically, under reasonable distributional assumptions, a convex loss is statistically consistent for adversarial learning iff the adversarial Bayes classifier satisfies a certain notion of uniqueness.

ML-16-标题: Fast Evaluation of Additive Kernels: Feature Arrangement Fourier Methods and Kernel Derivatives

链接: https://arxiv.org/abs/2404.17344
作者: Theresa Wagner, Franziska Nestler, Martin Stoll
备注: Official code this https URL

点击查看摘要

Abstract:One of the main computational bottlenecks when working with kernel based learning is dealing with the large and typically dense kernel matrix. Techniques dealing with fast approximations of the matrix vector product for these kernel matrices typically deteriorate in their performance if the feature vectors reside in higher-dimensional feature spaces. We here present a technique based on the non-equispaced fast Fourier transform (NFFT) with rigorous error analysis. We show that this approach is also well suited to allow the approximation of the matrix that arises when the kernel is differentiated with respect to the kernel hyperparameters; a problem often found in the training phase of methods such as Gaussian processes. We also provide an error analysis for this case. We illustrate the performance of the additive kernel scheme with fast matrix vector products on a number of data sets. Our code is available at this https URL

ML-17-标题: A Deep Dive into Effects of Structural Bias on CMA-ES Performance along Affine Trajectories PPSN2024

链接: https://arxiv.org/abs/2404.17323
作者: Niki van Stein, Sarah L. Thomson, Anna V. Kononova
备注: 15 pages, 5 figures, submitted to PPSN 2024

点击查看摘要

Abstract:To guide the design of better iterative optimisation heuristics, it is imperative to understand how inherent structural biases within algorithm components affect the performance on a wide variety of search landscapes. This study explores the impact of structural bias in the modular Covariance Matrix Adaptation Evolution Strategy (modCMA), focusing on the roles of various modulars within the algorithm. Through an extensive investigation involving 435,456 configurations of modCMA, we identified key modules that significantly influence structural bias of various classes. Our analysis utilized the Deep-BIAS toolbox for structural bias detection and classification, complemented by SHAP analysis for quantifying module contributions. The performance of these configurations was tested on a sequence of affine-recombined functions, maintaining fixed optimum locations while gradually varying the landscape features. Our results demonstrate an interplay between module-induced structural bias and algorithm performance across different landscape characteristics.

ML-18-标题: Lazy Data Practices Harm Fairness Research

链接: https://arxiv.org/abs/2404.17293
作者: Jan Simson, Alessandro Fabris, Christoph Kern
备注: Accepted for publication at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2024

点击查看摘要

Abstract:Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a \textbflack of representation for certain protected attributes in both data and evaluations; (2) the widespread \textbfexclusion of minorities during data preprocessing; and (3) \textbfopaque data processing threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.

ML-19-标题: Machine Learning based prediction of Vanadium Redox Flow Battery temperature rise under different charge-discharge conditions

链接: https://arxiv.org/abs/2404.17284
作者: Anirudh Narayan D, Akshat Johar, Divye Kalra, Bhavya Ardeshna, Ankur Bhattacharjee
备注: 21 pages, 5 figures

点击查看摘要

Abstract:Accurate prediction of battery temperature rise is very essential for designing an efficient thermal management scheme. In this paper, machine learning (ML) based prediction of Vanadium Redox Flow Battery (VRFB) thermal behavior during charge-discharge operation has been demonstrated for the first time. Considering different currents with a specified electrolyte flow rate, the temperature of a kW scale VRFB system is studied through experiments. Three different ML algorithms; Linear Regression (LR), Support Vector Regression (SVR) and Extreme Gradient Boost (XGBoost) have been used for the prediction work. The training and validation of ML algorithms have been done by the practical dataset of a 1kW 6kWh VRFB storage under 40A, 45A, 50A and 60A charge-discharge currents and 10 L min-1 of flow rate. A comparative analysis among the ML algorithms is done in terms of performance metrics such as correlation coefficient (R2), mean absolute error (MAE) and root mean square error (RMSE). It is observed that XGBoost shows the highest accuracy in prediction of around 99%. The ML based prediction results obtained in this work can be very useful for controlling the VRFB temperature rise during operation and act as indicator for further development of an optimized thermal management system.

ML-20-标题: Efficient Deterministic Renewable Energy Forecasting Guided by Multiple-Location Weather Data

链接: https://arxiv.org/abs/2404.17276
作者: Charalampos Symeonidis, Nikos Nikolaidis
备注:

点击查看摘要

Abstract:Electricity generated from renewable energy sources has been established as an efficient remedy for both energy shortages and the environmental pollution stemming from conventional energy production methods. Solar and wind power are two of the most dominant renewable energy sources. The accurate forecasting of the energy generation of those sources facilitates their integration into electric grids, by minimizing the negative impact of uncertainty regarding their management and operation. This paper proposes a novel methodology for deterministic wind and solar energy generation forecasting for multiple generation sites, utilizing multi-location weather forecasts. The method employs a U-shaped Temporal Convolutional Auto-Encoder (UTCAE) architecture for temporal processing of weather-related and energy-related time-series across each site. The Multi-sized Kernels convolutional Spatio-Temporal Attention (MKST-Attention), inspired by the multi-head scaled-dot product attention mechanism, is also proposed aiming to efficiently transfer temporal patterns from weather data to energy data, without a priori knowledge of the locations of the power stations and the locations of provided weather data. The conducted experimental evaluation on a day-ahead solar and wind energy forecasting scenario on five datasets demonstrated that the proposed method achieves top results, outperforming all competitive time-series forecasting state-of-the-art methods.

ML-21-标题: Making Better Use of Unlabelled Data in Bayesian Active Learning AISTATS2024

链接: https://arxiv.org/abs/2404.17249
作者: Freddie Bickford Smith, Adam Foster, Tom Rainforth
备注: Published at AISTATS 2024

点击查看摘要

Abstract:Fully supervised models are predominant in Bayesian active learning. We argue that their neglect of the information present in unlabelled data harms not just predictive performance but also decisions about what data to acquire. Our proposed solution is a simple framework for semi-supervised Bayesian active learning. We find it produces better-performing models than either conventional Bayesian active learning or semi-supervised learning with randomly acquired data. It is also easier to scale up than the conventional approach. As well as supporting a shift towards semi-supervised models, our findings highlight the importance of studying models and acquisition methods in conjunction.

ML-22-标题: Cycling into the workshop: predictive maintenance for Barcelonas bike-sharing system

链接: https://arxiv.org/abs/2404.17217
作者: Jordi Grau-Escolano, Aleix Bassolas, Julian Vicens
备注: 25 pages, 9 figures, 7 tables

点击查看摘要

Abstract:Bike-sharing systems have emerged as a significant element of urban mobility, providing an environmentally friendly transportation alternative. With the increasing integration of electric bikes alongside mechanical bikes, it is crucial to illuminate distinct usage patterns and their impact on maintenance. Accordingly, this research aims to develop a comprehensive understanding of mobility dynamics, distinguishing between different mobility modes, and introducing a novel predictive maintenance system tailored for bikes. By utilising a combination of trip information and maintenance data from Barcelona’s bike-sharing system, Bicing, this study conducts an extensive analysis of mobility patterns and their relationship to failures of bike components. To accurately predict maintenance needs for essential bike parts, this research delves into various mobility metrics and applies statistical and machine learning survival models, including deep learning models. Due to their complexity, and with the objective of bolstering confidence in the system’s predictions, interpretability techniques explain the main predictors of maintenance needs. The analysis reveals marked differences in the usage patterns of mechanical bikes and electric bikes, with a growing user preference for the latter despite their extra costs. These differences in mobility were found to have a considerable impact on the maintenance needs within the bike-sharing system. Moreover, the predictive maintenance models proved effective in forecasting these maintenance needs, capable of operating across an entire bike fleet. Despite challenges such as approximated bike usage metrics and data imbalances, the study successfully showcases the feasibility of an accurate predictive maintenance system capable of improving operational costs, bike availability, and security.

ML-23-标题: An Explainable Deep Reinforcement Learning Model for Warfarin Maintenance Dosing Using Policy Distillation and Action Forging

链接: https://arxiv.org/abs/2404.17187
作者: Sadjad Anzabi Zadeh, W. Nick Street, Barrett W. Thomas
备注:

点击查看摘要

Abstract:Deep Reinforcement Learning is an effective tool for drug dosing for chronic condition management. However, the final protocol is generally a black box without any justification for its prescribed doses. This paper addresses this issue by proposing an explainable dosing protocol for warfarin using a Proximal Policy Optimization method combined with Policy Distillation. We introduce Action Forging as an effective tool to achieve explainability. Our focus is on the maintenance dosing protocol. Results show that the final model is as easy to understand and deploy as the current dosing protocols and outperforms the baseline dosing algorithms.

ML-24-标题: RE-RFME: Real-Estate RFME Model for customer segmentation

链接: https://arxiv.org/abs/2404.17177
作者: Anurag Kumar Pandey, Anil Goyal, Nikhil Sikka
备注:

点击查看摘要

Abstract:Marketing is one of the high-cost activities for any online platform. With the increase in the number of customers, it is crucial to understand customers based on their dynamic behaviors to design effective marketing strategies. Customer segmentation is a widely used approach to group customers into different categories and design the marketing strategy targeting each group individually. Therefore, in this paper, we propose an end-to-end pipeline RE-RFME for segmenting customers into 4 groups: high value, promising, need attention, and need activation. Concretely, we propose a novel RFME (Recency, Frequency, Monetary and Engagement) model to track behavioral features of customers and segment them into different categories. Finally, we train the K-means clustering algorithm to cluster the user into one of the 4 categories. We show the effectiveness of the proposed approach on real-world this http URL datasets for both website and mobile application users.

ML-25-标题: Optimizing Cycle Life Prediction of Lithium-ion Batteries via a Physics-Informed Model

链接: https://arxiv.org/abs/2404.17174
作者: Constantin-Daniel Nicolae, Sara Sameer, Nathan Sun, Karena Yan
备注:

点击查看摘要

Abstract:Accurately measuring the cycle lifetime of commercial lithium-ion batteries is crucial for performance and technology development. We introduce a novel hybrid approach combining a physics-based equation with a self-attention model to predict the cycle lifetimes of commercial lithium iron phosphate graphite cells via early-cycle data. After fitting capacity loss curves to this physics-based equation, we then use a self-attention layer to reconstruct entire battery capacity loss curves. Our model exhibits comparable performances to existing models while predicting more information: the entire capacity loss curve instead of cycle life. This provides more robustness and interpretability: our model does not need to be retrained for a different notion of end-of-life and is backed by physical intuition.

ML-26-标题: FairGT: A Fairness-aware Graph Transformer

链接: https://arxiv.org/abs/2404.17169
作者: Renqiang Luo, Huafei Huang, Shuo Yu, Xiuzhen Zhang, Feng Xia
备注:

点击查看摘要

Abstract:The design of Graph Transformers (GTs) generally neglects considerations for fairness, resulting in biased outcomes against certain sensitive subgroups. Since GTs encode graph information without relying on message-passing mechanisms, conventional fairness-aware graph learning methods cannot be directly applicable to address these issues. To tackle this challenge, we propose FairGT, a Fairness-aware Graph Transformer explicitly crafted to mitigate fairness concerns inherent in GTs. FairGT incorporates a meticulous structural feature selection strategy and a multi-hop node feature integration method, ensuring independence of sensitive features and bolstering fairness considerations. These fairness-aware graph information encodings seamlessly integrate into the Transformer framework for downstream tasks. We also prove that the proposed fair structural topology encoding with adjacency matrix eigenvector selection and multi-hop integration are theoretically effective. Empirical evaluations conducted across five real-world datasets demonstrate FairGT’s superiority in fairness metrics over existing graph transformers, graph neural networks, and state-of-the-art fairness-aware graph learning approaches.

ML-27-标题: DPGAN: A Dual-Path Generative Adversarial Network for Missing Data Imputation in Graphs

链接: https://arxiv.org/abs/2404.17164
作者: Xindi Zheng, Yuwei Wu, Yu Pan, Wanyu Lin, Lei Ma, Jianjun Zhao
备注: 9 pages

点击查看摘要

Abstract:Missing data imputation poses a paramount challenge when dealing with graph data. Prior works typically are based on feature propagation or graph autoencoders to address this issue. However, these methods usually encounter the over-smoothing issue when dealing with missing data, as the graph neural network (GNN) modules are not explicitly designed for handling missing data. This paper proposes a novel framework, called Dual-Path Generative Adversarial Network (DPGAN), that can deal simultaneously with missing data and avoid over-smoothing problems. The crux of our work is that it admits both global and local representations of the input graph signal, which can capture the long-range dependencies. It is realized via our proposed generator, consisting of two key components, i.e., MLPUNet++ and GraphUNet++. Our generator is trained with a designated discriminator via an adversarial process. In particular, to avoid assessing the entire graph as did in the literature, our discriminator focuses on the local subgraph fidelity, thereby boosting the quality of the local imputation. The subgraph size is adjustable, allowing for control over the intensity of adversarial regularization. Comprehensive experiments across various benchmark datasets substantiate that DPGAN consistently rivals, if not outperforms, existing state-of-the-art imputation algorithms. The code is provided at \urlthis https URL.

ML-28-标题: Online mathrmLnatural-Convex Minimization

链接: https://arxiv.org/abs/2404.17158
作者: Ken Yokoyama, Shinji Ito, Tatsuya Matsuoka, Kei Kimura, Makoto Yokoo
备注:

点击查看摘要

Abstract:An online decision-making problem is a learning problem in which a player repeatedly makes decisions in order to minimize the long-term loss. These problems that emerge in applications often have nonlinear combinatorial objective functions, and developing algorithms for such problems has attracted considerable attention. An existing general framework for dealing with such objective functions is the online submodular minimization. However, practical problems are often out of the scope of this framework, since the domain of a submodular function is limited to a subset of the unit hypercube. To manage this limitation of the existing framework, we in this paper introduce the online \mathrmL^\natural -convex minimization, where an \mathrmL^\natural -convex function generalizes a submodular function so that the domain is a subset of the integer lattice. We propose computationally efficient algorithms for the online \mathrmL^\natural -convex function minimization in two major settings: the full information and the bandit settings. We analyze the regrets of these algorithms and show in particular that our algorithm for the full information setting obtains a tight regret bound up to a constant factor. We also demonstrate several motivating examples that illustrate the usefulness of the online \mathrmL^\natural -convex minimization.

ML-29-标题: Neuro-Symbolic Embedding for Short and Effective Feature Selection via Autoregressive Generation

链接: https://arxiv.org/abs/2404.17157
作者: Nanxu Gong, Wangyang Ying, Dongjie Wang, Yanjie Fu
备注:

点击查看摘要

Abstract:Feature selection aims to identify the optimal feature subset for enhancing downstream models. Effective feature selection can remove redundant features, save computational resources, accelerate the model learning process, and improve the model overall performance. However, existing works are often time-intensive to identify the effective feature subset within high-dimensional feature spaces. Meanwhile, these methods mainly utilize a single downstream task performance as the selection criterion, leading to the selected subsets that are not only redundant but also lack generalizability. To bridge these gaps, we reformulate feature selection through a neuro-symbolic lens and introduce a novel generative framework aimed at identifying short and effective feature subsets. More specifically, we found that feature ID tokens of the selected subset can be formulated as symbols to reflect the intricate correlations among features. Thus, in this framework, we first create a data collector to automatically collect numerous feature selection samples consisting of feature ID tokens, model performance, and the measurement of feature subset redundancy. Building on the collected data, an encoder-decoder-evaluator learning paradigm is developed to preserve the intelligence of feature selection into a continuous embedding space for efficient search. Within the learned embedding space, we leverage a multi-gradient search algorithm to find more robust and generalized embeddings with the objective of improving model performance and reducing feature subset redundancy. These embeddings are then utilized to reconstruct the feature ID tokens for executing the final feature selection. Ultimately, comprehensive experiments and case studies are conducted to validate the effectiveness of the proposed framework.

ML-30-标题: Sensor Response-Time Reduction using Long-Short Term Memory Network Forecasting

链接: https://arxiv.org/abs/2404.17144
作者: Simon J. Ward, Muhamed Baljevic, Sharon M. Weiss
备注: 9 pages, 3 figures

点击查看摘要

Abstract:The response time of a biosensor is a crucial metric in safety-critical applications such as medical diagnostics where an earlier diagnosis can markedly improve patient outcomes. However, the speed at which a biosensor reaches a final equilibrium state can be limited by poor mass transport and long molecular diffusion times that increase the time it takes target molecules to reach the active sensing region of a biosensor. While optimization of system and sensor design can promote molecules reaching the sensing element faster, a simpler and complementary approach for response time reduction that is widely applicable across all sensor platforms is to use time-series forecasting to predict the ultimate steady-state sensor response. In this work, we show that ensembles of long short-term memory (LSTM) networks can accurately predict equilibrium biosensor response from a small quantity of initial time-dependent biosensor measurements, allowing for significant reduction in response time by a mean and median factor of improvement of 18.6 and 5.1, respectively. The ensemble of models also provides simultaneous estimation of uncertainty, which is vital to provide confidence in the predictions and subsequent safety-related decisions that are made. This approach is demonstrated on real-time experimental data collected by exposing porous silicon biosensors to buffered protein solutions using a multi-channel fluidic cell that enables the automated measurement of 100 porous silicon biosensors in parallel. The dramatic improvement in sensor response time achieved using LSTM network ensembles and associated uncertainty quantification opens the door to trustworthy and faster responding biosensors, enabling more rapid medical diagnostics for improved patient outcomes and healthcare access, as well as quicker identification of toxins in food and the environment.

ML-31-标题: Deep Evidential Learning for Dose Prediction

链接: https://arxiv.org/abs/2404.17126
作者: Hai Siong Tan, Kuancheng Wang, Rafe Mcbeth
备注: 24 pages, 8 figures

点击查看摘要

Abstract:In this work, we present a novel application of an uncertainty-quantification framework called Deep Evidential Learning in the domain of radiotherapy dose prediction. Using medical images of the Open Knowledge-Based Planning Challenge dataset, we found that this model can be effectively harnessed to yield uncertainty estimates that inherited correlations with prediction errors upon completion of network training. This was achieved only after reformulating the original loss function for a stable implementation. We found that (i)epistemic uncertainty was highly correlated with prediction errors, with various association indices comparable or stronger than those for Monte-Carlo Dropout and Deep Ensemble methods, (ii)the median error varied with uncertainty threshold much more linearly for epistemic uncertainty in Deep Evidential Learning relative to these other two conventional frameworks, indicative of a more uniformly calibrated sensitivity to model errors, (iii)relative to epistemic uncertainty, aleatoric uncertainty demonstrated a more significant shift in its distribution in response to Gaussian noise added to CT intensity, compatible with its interpretation as reflecting data noise. Collectively, our results suggest that Deep Evidential Learning is a promising approach that can endow deep-learning models in radiotherapy dose prediction with statistical robustness. Towards enhancing its clinical relevance, we demonstrate how we can use such a model to construct the predicted Dose-Volume-Histograms’ confidence intervals.

ML-32-标题: MER 2024: Semi-Supervised Learning Noise Robustness and Open-Vocabulary Multimodal Emotion Recognition

链接: https://arxiv.org/abs/2404.17113
作者: Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, Jinming Zhao, Ziyang Ma, Xie Chen, Jiangyan Yi, Rui Liu, Kele Xu, Bin Liu, Erik Cambria, Guoying Zhao, Björn W. Schuller, Jianhua Tao
备注:

点击查看摘要

Abstract:Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate labels), current systems still cannot meet the demands of practical applications. Therefore, we plan to organize a series of challenges around emotion recognition to further promote the development of this field. Last year, we launched MER2023, focusing on three topics: multi-label learning, noise robustness, and semi-supervised learning. This year, we continue to organize MER2024. In addition to expanding the dataset size, we introduce a new track around open-vocabulary emotion recognition. The main consideration for this track is that existing datasets often fix the label space and use majority voting to enhance annotator consistency, but this process may limit the model’s ability to describe subtle emotions. In this track, we encourage participants to generate any number of labels in any category, aiming to describe the character’s emotional state as accurately as possible. Our baseline is based on MERTools and the code is available at: this https URL.

ML-33-标题: Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT

链接: https://arxiv.org/abs/2404.17110
作者: Triet H. M. Le, M. Ali Babar, Tung Hoang Thai
备注: Accepted in the 4th International Workshop on Software Security co-located with the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE) 2024

点击查看摘要

Abstract:Background: Software Vulnerability (SV) prediction in emerging languages is increasingly important to ensure software security in modern systems. However, these languages usually have limited SV data for developing high-performing prediction models. Aims: We conduct an empirical study to evaluate the impact of SV data scarcity in emerging languages on the state-of-the-art SV prediction model and investigate potential solutions to enhance the performance. Method: We train and test the state-of-the-art model based on CodeBERT with and without data sampling techniques for function-level and line-level SV prediction in three low-resource languages - Kotlin, Swift, and Rust. We also assess the effectiveness of ChatGPT for low-resource SV prediction given its recent success in other domains. Results: Compared to the original work in C/C++ with large data, CodeBERT’s performance of function-level and line-level SV prediction significantly declines in low-resource languages, signifying the negative impact of data scarcity. Regarding remediation, data sampling techniques fail to improve CodeBERT; whereas, ChatGPT showcases promising results, substantially enhancing predictive performance by up to 34.4% for the function level and up to 53.5% for the line level. Conclusion: We have highlighted the challenge and made the first promising step for low-resource SV prediction, paving the way for future research in this direction.

ML-34-标题: Unleashing the Potential of Fractional Calculus in Graph Neural Networks with FROND

链接: https://arxiv.org/abs/2404.17099
作者: Qiyu Kang, Kai Zhao, Qinxu Ding, Feng Ji, Xuhao Li, Wenfei Liang, Yang Song, Wee Peng Tay
备注: The Twelfth International Conference on Learning Representations

点击查看摘要

Abstract:We introduce the FRactional-Order graph Neural Dynamical network (FROND), a new continuous graph neural network (GNN) framework. Unlike traditional continuous GNNs that rely on integer-order differential equations, FROND employs the Caputo fractional derivative to leverage the non-local properties of fractional calculus. This approach enables the capture of long-term dependencies in feature updates, moving beyond the Markovian update mechanisms in conventional integer-order models and offering enhanced capabilities in graph representation learning. We offer an interpretation of the node feature updating process in FROND from a non-Markovian random walk perspective when the feature updating is particularly governed by a diffusion process. We demonstrate analytically that oversmoothing can be mitigated in this setting. Experimentally, we validate the FROND framework by comparing the fractional adaptations of various established integer-order continuous GNNs, demonstrating their consistently improved performance and underscoring the framework’s potential as an effective extension to enhance traditional continuous GNNs. The code is available at \urlthis https URL.

ML-35-标题: Channel Modeling for FR3 Upper Mid-band via Generative Adversarial Networks

链接: https://arxiv.org/abs/2404.17069
作者: Yaqi Hu, Mingsheng Yin, Marco Mezzavilla, Hao Guo, Sundeep Rangan
备注:

点击查看摘要

Abstract:The upper mid-band (FR3) has been recently attracting interest for new generation of mobile networks, as it provides a promising balance between spectrum availability and coverage, which are inherent limitations of the sub 6GHz and millimeter wave bands, respectively. In order to efficiently design and optimize the network, channel modeling plays a key role since FR3 systems are expected to operate at multiple frequency bands. Data-driven methods, especially generative adversarial networks (GANs), can capture the intricate relationships among data samples, and provide an appropriate tool for FR3 channel modeling. In this work, we present the architecture, link state model, and path generative network of GAN-based FR3 channel modeling. The comparison of our model greatly matches the ray-tracing simulated data.

ML-36-标题: Transductive Spiking Graph Neural Networks for Loihi

链接: https://arxiv.org/abs/2404.17048
作者: Shay Snyder (1), Victoria Clerico (1, 2), Guojing Cong (3), Shruti Kulkarni (3), Catherine Schuman (4), Sumedh R. Risbud (5), Maryam Parsa (1) ((1) George Mason University, (2) Universidad Politecnica de Madrid, (3) Oak Ridge National Laboratory, (4) University of Tennessee - Knoxville, (5) Intel Labs)
备注: 6 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Graph neural networks have emerged as a specialized branch of deep learning, designed to address problems where pairwise relations between objects are crucial. Recent advancements utilize graph convolutional neural networks to extract features within graph structures. Despite promising results, these methods face challenges in real-world applications due to sparse features, resulting in inefficient resource utilization. Recent studies draw inspiration from the mammalian brain and employ spiking neural networks to model and learn graph structures. However, these approaches are limited to traditional Von Neumann-based computing systems, which still face hardware inefficiencies. In this study, we present a fully neuromorphic implementation of spiking graph neural networks designed for Loihi 2. We optimize network parameters using Lava Bayesian Optimization, a novel hyperparameter optimization system compatible with neuromorphic computing architectures. We showcase the performance benefits of combining neuromorphic Bayesian optimization with our approach for citation graph classification using fixed-precision spiking neurons. Our results demonstrate the capability of integer-precision, Loihi 2 compatible spiking neural networks in performing citation graph classification with comparable accuracy to existing floating point implementations.

ML-37-标题: Near to Mid-term Risks and Opportunities of Open Source Generative AI

链接: https://arxiv.org/abs/2404.17047
作者: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H.S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster
备注:

点击查看摘要

Abstract:In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation is likely to put at risk the budding field of open source Generative AI. We argue for the responsible open sourcing of generative AI models in the near and medium term. To set the stage, we first introduce an AI openness taxonomy system and apply it to 40 current large language models. We then outline differential benefits and risks of open versus closed source AI and present potential risk mitigation, ranging from best practices to calls for technical and scientific contributions. We hope that this report will add a much needed missing voice to the current public discourse on near to mid-term AI safety and other societal impact.

ML-38-标题: Learning Actionable Counterfactual Explanations in Large State Spaces

链接: https://arxiv.org/abs/2404.17034
作者: Keziah Naggita, Matthew R. Walter, Avrim Blum
备注:

点击查看摘要

Abstract:Counterfactual explanations (CFEs) are sets of actions that an agent with a negative classification could take to achieve a (desired) positive classification, for consequential decisions such as loan applications, hiring, admissions, etc. In this work, we consider settings where optimal CFEs correspond to solutions of weighted set cover problems. In particular, there is a collection of actions that agents can perform that each have their own cost and each provide the agent with different sets of capabilities. The agent wants to perform the cheapest subset of actions that together provide all the needed capabilities to achieve a positive classification. Since this is an NP-hard optimization problem, we are interested in the question: can we, from training data (instances of agents and their optimal CFEs) learn a CFE generator that will quickly provide optimal sets of actions for new agents? In this work, we provide a deep-network learning procedure that we show experimentally is able to achieve strong performance at this task. We consider several problem formulations, including formulations in which the underlying “capabilities” and effects of actions are not explicitly provided, and so there is an informational challenge in addition to the computational challenge. Our problem can also be viewed as one of learning an optimal policy in a family of large but deterministic Markov Decision Processes (MDPs).

ML-39-标题: Out-of-Distribution Detection using Maximum Entropy Coding

链接: https://arxiv.org/abs/2404.17023
作者: Mojtaba Abolfazli, Mohammad Zaeri Amirani, Anders Høst-Madsen, June Zhang, Andras Bratincsak
备注:

点击查看摘要

Abstract:Given a default distribution P and a set of test data x^M=\x_1,x_2,\ldots,x_M\ this paper seeks to answer the question if it was likely that x^M was generated by P . For discrete distributions, the definitive answer is in principle given by Kolmogorov-Martin-Löf randomness. In this paper we seek to generalize this to continuous distributions. We consider a set of statistics T_1(x^M),T_2(x^M),\ldots . To each statistic we associate its maximum entropy distribution and with this a universal source coder. The maximum entropy distributions are subsequently combined to give a total codelength, which is compared with -\log P(x^M) . We show that this approach satisfied a number of theoretical properties. For real world data P usually is unknown. We transform data into a standard distribution in the latent space using a bidirectional generate network and use maximum entropy coding there. We compare the resulting method to other methods that also used generative neural networks to detect anomalies. In most cases, our results show better performance.

ML-40-标题: IDIL: Imitation Learning of Intent-Driven Expert Behavior AAMAS2024

链接: https://arxiv.org/abs/2404.16989
作者: Sangwon Seo, Vaibhav Unhelkar
备注: Extended version of an identically-titled paper accepted at AAMAS 2024

点击查看摘要

Abstract:When faced with accomplishing a task, human experts exhibit intentional behavior. Their unique intents shape their plans and decisions, resulting in experts demonstrating diverse behaviors to accomplish the same task. Due to the uncertainties encountered in the real world and their bounded rationality, experts sometimes adjust their intents, which in turn influences their behaviors during task execution. This paper introduces IDIL, a novel imitation learning algorithm to mimic these diverse intent-driven behaviors of experts. Iteratively, our approach estimates expert intent from heterogeneous demonstrations and then uses it to learn an intent-aware model of their behavior. Unlike contemporary approaches, IDIL is capable of addressing sequential tasks with high-dimensional state representations, while sidestepping the complexities and drawbacks associated with adversarial training (a mainstay of related techniques). Our empirical results suggest that the models generated by IDIL either match or surpass those produced by recent imitation learning benchmarks in metrics of task performance. Moreover, as it creates a generative model, IDIL demonstrates superior performance in intent inference metrics, crucial for human-agent interactions, and aptly captures a broad spectrum of expert behaviors.

ML-41-标题: COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations

链接: https://arxiv.org/abs/2404.16969
作者: Ruben Ciranni, Emilian Postolache, Giorgio Mariani, Michele Mancusi, Luca Cosmo, Emanuele Rodolà
备注: Demo page: this https URL

点击查看摘要

Abstract:We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of stems (or their combinations) composing music tracks and allows the objective evaluation of compositional models for music in the task of accompaniment generation. We also introduce a new baseline for compositional music generation called CompoNet, based on ControlNet \citezhang2023adding, generalizing the tasks of MSDM, and quantify it against the latter using COCOLA. We release all models trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).

ML-42-标题: ML2SC: Deploying Machine Learning Models as Smart Contracts on the Blockchain

链接: https://arxiv.org/abs/2404.16967
作者: Zhikai Li, Steve Vott, Bhaskar Krishnamachar
备注:

点击查看摘要

Abstract:With the growing concern of AI safety, there is a need to trust the computations done by machine learning (ML) models. Blockchain technology, known for recording data and running computations transparently and in a tamper-proof manner, can offer this trust. One significant challenge in deploying ML Classifiers on-chain is that while ML models are typically written in Python using an ML library such as Pytorch, smart contracts deployed on EVM-compatible blockchains are written in Solidity. We introduce Machine Learning to Smart Contract (ML2SC), a PyTorch to Solidity translator that can automatically translate multi-layer perceptron (MLP) models written in Pytorch to Solidity smart contract versions. ML2SC uses a fixed-point math library to approximate floating-point computation. After deploying the generated smart contract, we can train our models off-chain using PyTorch and then further transfer the acquired weights and biases to the smart contract using a function call. Finally, the model inference can also be done with a function call providing the input. We mathematically model the gas costs associated with deploying, updating model parameters, and running inference on these models on-chain, showing that the gas costs increase linearly in various parameters associated with an MLP. We present empirical results matching our modeling. We also evaluate the classification accuracy showing that the outputs obtained by our transparent on-chain implementation are identical to the original off-chain implementation with Pytorch.

ML-43-标题: A Notion of Uniqueness for the Adversarial Bayes Classifier

链接: https://arxiv.org/abs/2404.16956
作者: Natalie S. Frank
备注: 46 pages, 7 figures

点击查看摘要

Abstract:We propose a new notion of uniqueness for the adversarial Bayes classifier in the setting of binary classification. Analyzing this notion of uniqueness produces a simple procedure for computing all adversarial Bayes classifiers for a well-motivated family of one dimensional data distributions. This characterization is then leveraged to show that as the perturbation radius increases, certain notions of regularity improve for adversarial Bayes classifiers. We demonstrate with various examples that the boundary of the adversarial Bayes classifier frequently lies near the boundary of the Bayes classifier.

ML-44-标题: Taming False Positives in Out-of-Distribution Detection with Human Feedback AISTATS2024

链接: https://arxiv.org/abs/2404.16954
作者: Harit Vishwakarma, Heguang Lin, Ramya Korlakai Vinayak
备注: Appeared in the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)

点击查看摘要

Abstract:Robustness to out-of-distribution (OOD) samples is crucial for safely deploying machine learning models in the open world. Recent works have focused on designing scoring functions to quantify OOD uncertainty. Setting appropriate thresholds for these scoring functions for OOD detection is challenging as OOD samples are often unavailable up front. Typically, thresholds are set to achieve a desired true positive rate (TPR), e.g., 95% TPR. However, this can lead to very high false positive rates (FPR), ranging from 60 to 96%, as observed in the Open-OOD benchmark. In safety-critical real-life applications, e.g., medical diagnosis, controlling the FPR is essential when dealing with various OOD samples dynamically. To address these challenges, we propose a mathematically grounded OOD detection framework that leverages expert feedback to \emphsafely update the threshold on the fly. We provide theoretical results showing that it is guaranteed to meet the FPR constraint at all times while minimizing the use of human feedback. Another key feature of our framework is that it can work with any scoring function for OOD uncertainty quantification. Empirical evaluation of our system on synthetic and benchmark OOD datasets shows that our method can maintain FPR at most 5% while maximizing TPR.

ML-45-标题: Structured Reinforcement Learning for Delay-Optimal Data Transmission in Dense mmWave Networks

链接: https://arxiv.org/abs/2404.16920
作者: Shufan Wang, Guojun Xiong, Shichen Zhang, Huacheng Zeng, Jian Li, Shivendra Panwar
备注: IEEE Transactions on Wireless Communications

点击查看摘要

Abstract:We study the data packet transmission problem (mmDPT) in dense cell-free millimeter wave (mmWave) networks, i.e., users sending data packet requests to access points (APs) via uplinks and APs transmitting requested data packets to users via downlinks. Our objective is to minimize the average delay in the system due to APs’ limited service capacity and unreliable wireless channels between APs and users. This problem can be formulated as a restless multi-armed bandits problem with fairness constraint (RMAB-F). Since finding the optimal policy for RMAB-F is intractable, existing learning algorithms are computationally expensive and not suitable for practical dynamic dense mmWave networks. In this paper, we propose a structured reinforcement learning (RL) solution for mmDPT by exploiting the inherent structure encoded in RMAB-F. To achieve this, we first design a low-complexity and provably asymptotically optimal index policy for RMAB-F. Then, we leverage this structure information to develop a structured RL algorithm called mmDPT-TS, which provably achieves an \tildeO(\sqrtT) Bayesian regret. More importantly, mmDPT-TS is computation-efficient and thus amenable to practical implementation, as it fully exploits the structure of index policy for making decisions. Extensive emulation based on data collected in realistic mmWave networks demonstrate significant gains of mmDPT-TS over existing approaches.

ML-46-标题: On-the-fly Data Augmentation for Forecasting with Deep Learning

链接: https://arxiv.org/abs/2404.16918
作者: Vitor Cerqueira, Moisés Santos, Yassine Baghoussi, Carlos Soares
备注:

点击查看摘要

Abstract:Deep learning approaches are increasingly used to tackle forecasting tasks. A key factor in the successful application of these methods is a large enough training sample size, which is not always available. In these scenarios, synthetic data generation techniques are usually applied to augment the dataset. Data augmentation is typically applied before fitting a model. However, these approaches create a single augmented dataset, potentially limiting their effectiveness. This work introduces OnDAT (On-the-fly Data Augmentation for Time series) to address this issue by applying data augmentation during training and validation. Contrary to traditional methods that create a single, static augmented dataset beforehand, OnDAT performs augmentation on-the-fly. By generating a new augmented dataset on each iteration, the model is exposed to a constantly changing augmented data variations. We hypothesize this process enables a better exploration of the data space, which reduces the potential for overfitting and improves forecasting performance. We validated the proposed approach using a state-of-the-art deep learning forecasting method and 8 benchmark datasets containing a total of 75797 time series. The experiments suggest that OnDAT leads to better forecasting performance than a strategy that applies data augmentation before training as well as a strategy that does not involve data augmentation. The method and experiments are publicly available.

ML-47-标题: DE-CGAN: Boosting rTMS Treatment Prediction with Diversity Enhancing Conditional Generative Adversarial Networks

链接: https://arxiv.org/abs/2404.16913
作者: Matthew Squires, Xiaohui Tao, Soman Elangovan, Raj Gururajan, Haoran Xie, Xujuan Zhou, Yuefeng Li, U Rajendra Acharya
备注:

点击查看摘要

Abstract:Repetitive Transcranial Magnetic Stimulation (rTMS) is a well-supported, evidence-based treatment for depression. However, patterns of response to this treatment are inconsistent. Emerging evidence suggests that artificial intelligence can predict rTMS treatment outcomes for most patients using fMRI connectivity features. While these models can reliably predict treatment outcomes for many patients for some underrepresented fMRI connectivity measures DNN models are unable to reliably predict treatment outcomes. As such we propose a novel method, Diversity Enhancing Conditional General Adversarial Network (DE-CGAN) for oversampling these underrepresented examples. DE-CGAN creates synthetic examples in difficult-to-classify regions by first identifying these data points and then creating conditioned synthetic examples to enhance data diversity. Through empirical experiments we show that a classification model trained using a diversity enhanced training set outperforms traditional data augmentation techniques and existing benchmark results. This work shows that increasing the diversity of a training dataset can improve classification model performance. Furthermore, this work provides evidence for the utility of synthetic patients providing larger more robust datasets for both AI researchers and psychiatrists to explore variable relationships.

ML-48-标题: Closing the gap: Optimizing Guidance and Control Networks through Neural ODEs

链接: https://arxiv.org/abs/2404.16908
作者: Sebastien Origer, Dario Izzo
备注:

点击查看摘要

Abstract:We improve the accuracy of Guidance & Control Networks (G&CNETs), trained to represent the optimal control policies of a time-optimal transfer and a mass-optimal landing, respectively. In both cases we leverage the dynamics of the spacecraft, described by Ordinary Differential Equations which incorporate a neural network on their right-hand side (Neural ODEs). Since the neural dynamics is differentiable, the ODEs sensitivities to the network parameters can be computed using the variational equations, thereby allowing to update the G&CNET parameters based on the observed dynamics. We start with a straightforward regression task, training the G&CNETs on datasets of optimal trajectories using behavioural cloning. These networks are then refined using the Neural ODE sensitivities by minimizing the error between the final states and the target states. We demonstrate that for the orbital transfer, the final error to the target can be reduced by 99% on a single trajectory and by 70% on a batch of 500 trajectories. For the landing problem the reduction in error is around 98-99% (position) and 40-44% (velocity). This step significantly enhances the accuracy of G&CNETs, which instills greater confidence in their reliability for operational use. We also compare our results to the popular Dataset Aggregation method (DaGGER) and allude to the strengths and weaknesses of both methods.

ML-49-标题: mlr3summary: Concise and interpretable summaries for machine learning models

链接: https://arxiv.org/abs/2404.16899
作者: Susanne Dandl, Marc Becker, Bernd Bischl, Giuseppe Casalicchio, Ludwig Bothmann
备注: 9 pages

点击查看摘要

Abstract:This work introduces a novel R package for concise, informative summaries of machine learning models. We take inspiration from the summary function for (generalized) linear models in R, but extend it in several directions: First, our summary function is model-agnostic and provides a unified summary output also for non-parametric machine learning models; Second, the summary output is more extensive and customizable – it comprises information on the dataset, model performance, model complexity, model’s estimated feature importances, feature effects, and fairness metrics; Third, models are evaluated based on resampling strategies for unbiased estimates of model performances, feature importances, etc. Overall, the clear, structured output should help to enhance and expedite the model selection process, making it a helpful tool for practitioners and researchers alike.

ML-50-标题: How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training

链接: https://arxiv.org/abs/2404.16898
作者: Jaeseong You, Minseop Park, Kyunggeun Lee, Seokjun An, Chirag Patel, Markus Nage
备注:

点击查看摘要

Abstract:This paper investigates three different parameterizations of asymmetric uniform quantization for quantization-aware training: (1) scale and offset, (2) minimum and maximum, and (3) beta and gamma. We perform a comprehensive comparative analysis of these parameterizations’ influence on quantization-aware training, using both controlled experiments and real-world large language models. Our particular focus is on their changing behavior in response to critical training hyperparameters, bit width and learning rate. Based on our investigation, we propose best practices to stabilize and accelerate quantization-aware training with learnable asymmetric quantization ranges.

ML-51-标题: A Neural-Network-Based Approach for Loose-Fitting Clothing

链接: https://arxiv.org/abs/2404.16896
作者: Yongxu Jin, Dalton Omens, Zhenglin Geng, Joseph Teran, Abishek Kumar, Kenji Tashiro, Ronald Fedkiw
备注:

点击查看摘要

Abstract:Since loose-fitting clothing contains dynamic modes that have proven to be difficult to predict via neural networks, we first illustrate how to coarsely approximate these modes with a real-time numerical algorithm specifically designed to mimic the most important ballistic features of a classical numerical simulation. Although there is some flexibility in the choice of the numerical algorithm used as a proxy for full simulation, it is essential that the stability and accuracy be independent from any time step restriction or similar requirements in order to facilitate real-time performance. In order to reduce the number of degrees of freedom that require approximations to their dynamics, we simulate rigid frames and use skinning to reconstruct a rough approximation to a desirable mesh; as one might expect, neural-network-based skinning seems to perform better than linear blend skinning in this scenario. Improved high frequency deformations are subsequently added to the skinned mesh via a quasistatic neural network (QNN). In contrast to recurrent neural networks that require a plethora of training data in order to adequately generalize to new examples, QNNs perform well with significantly less training data.

ML-52-标题: On TinyML and Cybersecurity: Electric Vehicle Charging Infrastructure Use Case

链接: https://arxiv.org/abs/2404.16894
作者: Fatemeh Dehrouyeh, Li Yang, Firouz Badrkhani Ajaei, Abdallah Shami
备注:

点击查看摘要

Abstract:As technology advances, the use of Machine Learning (ML) in cybersecurity is becoming increasingly crucial to tackle the growing complexity of cyber threats. While traditional ML models can enhance cybersecurity, their high energy and resource demands limit their applications, leading to the emergence of Tiny Machine Learning (TinyML) as a more suitable solution for resource-constrained environments. TinyML is widely applied in areas such as smart homes, healthcare, and industrial automation. TinyML focuses on optimizing ML algorithms for small, low-power devices, enabling intelligent data processing directly on edge devices. This paper provides a comprehensive review of common challenges of TinyML techniques, such as power consumption, limited memory, and computational constraints; it also explores potential solutions to these challenges, such as energy harvesting, computational optimization techniques, and transfer learning for privacy preservation. On the other hand, this paper discusses TinyML’s applications in advancing cybersecurity for Electric Vehicle Charging Infrastructures (EVCIs) as a representative use case. It presents an experimental case study that enhances cybersecurity in EVCI using TinyML, evaluated against traditional ML in terms of reduced delay and memory usage, with a slight trade-off in accuracy. Additionally, the study includes a practical setup using the ESP32 microcontroller in the PlatformIO environment, which provides a hands-on assessment of TinyML’s application in cybersecurity for EVCI.

ML-53-标题: Automatic AI controller that can drive with confidence: steering vehicle with uncertainty knowledge

链接: https://arxiv.org/abs/2404.16893
作者: Neha Kumari, Sumit Kumar. Sneha Priya, Ayush Kumar, Akash Fogla
备注: arXiv admin note: substantial text overlap with arXiv:2303.08187

点击查看摘要

Abstract:In safety-critical systems that interface with the real world, the role of uncertainty in decision-making is pivotal, particularly in the context of machine learning models. For the secure functioning of Cyber-Physical Systems (CPS), it is imperative to manage such uncertainty adeptly. In this research, we focus on the development of a vehicle’s lateral control system using a machine learning framework. Specifically, we employ a Bayesian Neural Network (BNN), a probabilistic learning model, to address uncertainty quantification. This capability allows us to gauge the level of confidence or uncertainty in the model’s predictions. The BNN based controller is trained using simulated data gathered from the vehicle traversing a single track and subsequently tested on various other tracks. We want to share two significant results: firstly, the trained model demonstrates the ability to adapt and effectively control the vehicle on multiple similar tracks. Secondly, the quantification of prediction confidence integrated into the controller serves as an early-warning system, signaling when the algorithm lacks confidence in its predictions and is therefore susceptible to failure. By establishing a confidence threshold, we can trigger manual intervention, ensuring that control is relinquished from the algorithm when it operates outside of safe parameters.

ML-54-标题: NEPENTHE: Entropy-Based Pruning as a Neural Network Depths Reducer

链接: https://arxiv.org/abs/2404.16890
作者: Zhu Liao, Victor Quétu, Van-Tam Nguyen, Enzo Tartaglione
备注:

点击查看摘要

Abstract:While deep neural networks are highly effective at solving complex tasks, their computational demands can hinder their usefulness in real-time applications and with limited-resources systems. Besides, for many tasks it is known that these models are over-parametrized: neoteric works have broadly focused on reducing the width of these networks, rather than their depth. In this paper, we aim to reduce the depth of over-parametrized deep neural networks: we propose an eNtropy-basEd Pruning as a nEural Network depTH’s rEducer (NEPENTHE) to alleviate deep neural networks’ computational burden. Based on our theoretical finding, NEPENTHE focuses on un-structurally pruning connections in layers with low entropy to remove them entirely. We validate our approach on popular architectures such as MobileNet and Swin-T, showing that when encountering an over-parametrization regime, it can effectively linearize some layers (hence reducing the model’s depth) with little to no performance loss. The code will be publicly available upon acceptance of the article.

ML-55-标题: Anomaly Detection for Incident Response at Scale ASPLOS2024

链接: https://arxiv.org/abs/2404.16887
作者: Hanzhang Wang, Gowtham Kumar Tangirala, Gilkara Pranav Naidu, Charles Mayville, Arighna Roy, Joanne Sun, Ramesh Babu Mandava
备注: ASPLOS 2024 AIOps workshop

点击查看摘要

Abstract:We present a machine learning-based anomaly detection product, AI Detect and Respond (AIDR), that monitors Walmart’s business and system health in real-time. During the validation over 3 months, the product served predictions from over 3000 models to more than 25 application, platform, and operation teams, covering 63% of major incidents and reducing the mean-time-to-detect (MTTD) by more than 7 minutes. Unlike previous anomaly detection methods, our solution leverages statistical, ML and deep learning models while continuing to incorporate rule-based static thresholds to incorporate domain-specific knowledge. Both univariate and multivariate ML models are deployed and maintained through distributed services for scalability and high availability. AIDR has a feedback loop that assesses model quality with a combination of drift detection algorithms and customer feedback. It also offers self-onboarding capabilities and customizability. AIDR has achieved success with various internal teams with lower time to detection and fewer false positives than previous methods. As we move forward, we aim to expand incident coverage and prevention, reduce noise, and integrate further with root cause recommendation (RCR) to enable an end-to-end AIDR experience.

ML-56-标题: Review of Data-centric Time Series Analysis from Sample Feature and Period

链接: https://arxiv.org/abs/2404.16886
作者: Chenxi Sun, Hongyan Li, Yaliang Li, Shenda Hong
备注: 9 pages, 1 figure

点击查看摘要

Abstract:Data is essential to performing time series analysis utilizing machine learning approaches, whether for classic models or today’s large language models. A good time-series dataset is advantageous for the model’s accuracy, robustness, and convergence, as well as task outcomes and costs. The emergence of data-centric AI represents a shift in the landscape from model refinement to prioritizing data quality. Even though time-series data processing methods frequently come up in a wide range of research fields, it hasn’t been well investigated as a specific topic. To fill the gap, in this paper, we systematically review different data-centric methods in time series analysis, covering a wide range of research topics. Based on the time-series data characteristics at sample, feature, and period, we propose a taxonomy for the reviewed data selection methods. In addition to discussing and summarizing their characteristics, benefits, and drawbacks targeting time-series data, we also introduce the challenges and opportunities by proposing recommendations, open problems, and possible research topics.

ML-57-标题: Aligning Knowledge Graphs Provided by Humans and Generated from Neural Networks in Specific Tasks

链接: https://arxiv.org/abs/2404.16884
作者: Tangrui Li, Jun Zhou
备注:

点击查看摘要

Abstract:This paper develops an innovative method that enables neural networks to generate and utilize knowledge graphs, which describe their concept-level knowledge and optimize network parameters through alignment with human-provided knowledge. This research addresses a gap where traditionally, network-generated knowledge has been limited to applications in downstream symbolic analysis or enhancing network transparency. By integrating a novel autoencoder design with the Vector Symbolic Architecture (VSA), we have introduced auxiliary tasks that support end-to-end training. Our approach eschews traditional dependencies on ontologies or word embedding models, mining concepts from neural networks and directly aligning them with human knowledge. Experiments show that our method consistently captures network-generated concepts that align closely with human knowledge and can even uncover new, useful concepts not previously identified by humans. This plug-and-play strategy not only enhances the interpretability of neural networks but also facilitates the integration of symbolic logical reasoning within these systems.

ML-58-标题: Myopically Verifiable Probabilistic Certificates for Safe Control and Learning

链接: https://arxiv.org/abs/2404.16883
作者: Zhuoyuan Wang, Haoming Jing, Christian Kurniawan, Albert Chern, Yorie Nakahira
备注: arXiv admin note: substantial text overlap with arXiv:2110.13380

点击查看摘要

Abstract:This paper addresses the design of safety certificates for stochastic systems, with a focus on ensuring long-term safety through fast real-time control. In stochastic environments, set invariance-based methods that restrict the probability of risk events in infinitesimal time intervals may exhibit significant long-term risks due to cumulative uncertainties/risks. On the other hand, reachability-based approaches that account for the long-term future may require prohibitive computation in real-time decision making. To overcome this challenge involving stringent long-term safety vs. computation tradeoffs, we first introduce a novel technique termed `probabilistic invariance’. This technique characterizes the invariance conditions of the probability of interest. When the target probability is defined using long-term trajectories, this technique can be used to design myopic conditions/controllers with assured long-term safe probability. Then, we integrate this technique into safe control and learning. The proposed control methods efficiently assure long-term safety using neural networks or model predictive controllers with short outlook horizons. The proposed learning methods can be used to guarantee long-term safety during and after training. Finally, we demonstrate the performance of the proposed techniques in numerical simulations.

ML-59-标题: On uncertainty-penalized Bayesian information criterion

链接: https://arxiv.org/abs/2404.16881
作者: Pongpisit Thanasutives, Ken-ichi Fukui
备注: 4 pages, 2 figures

点击查看摘要

Abstract:The uncertainty-penalized information criterion (UBIC) has been proposed as a new model-selection criterion for data-driven partial differential equation (PDE) discovery. In this paper, we show that using the UBIC is equivalent to employing the conventional BIC to a set of overparameterized models derived from the potential regression models of different complexity measures. The result indicates that the asymptotic property of the UBIC and BIC holds indifferently.

ML-60-标题: Learning Control Barrier Functions and their application in Reinforcement Learning: A Survey

链接: https://arxiv.org/abs/2404.16879
作者: Maeva Guerrier, Hassan Fouad, Giovanni Beltrame
备注:

点击查看摘要

Abstract:Reinforcement learning is a powerful technique for developing new robot behaviors. However, typical lack of safety guarantees constitutes a hurdle for its practical application on real robots. To address this issue, safe reinforcement learning aims to incorporate safety considerations, enabling faster transfer to real robots and facilitating lifelong learning. One promising approach within safe reinforcement learning is the use of control barrier functions. These functions provide a framework to ensure that the system remains in a safe state during the learning process. However, synthesizing control barrier functions is not straightforward and often requires ample domain knowledge. This challenge motivates the exploration of data-driven methods for automatically defining control barrier functions, which is highly appealing. We conduct a comprehensive review of the existing literature on safe reinforcement learning using control barrier functions. Additionally, we investigate various techniques for automatically learning the Control Barrier Functions, aiming to enhance the safety and efficacy of Reinforcement Learning in practical robot applications.

ML-61-标题: Rapid Deployment of DNNs for Edge Computing via Structured Pruning at Initialization

链接: https://arxiv.org/abs/2404.16877
作者: Bailey J. Eccles, Leon Wong, Blesson Varghese
备注: The 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing

点击查看摘要

Abstract:Edge machine learning (ML) enables localized processing of data on devices and is underpinned by deep neural networks (DNNs). However, DNNs cannot be easily run on devices due to their substantial computing, memory and energy requirements for delivering performance that is comparable to cloud-based ML. Therefore, model compression techniques, such as pruning, have been considered. Existing pruning methods are problematic for edge ML since they: (1) Create compressed models that have limited runtime performance benefits (using unstructured pruning) or compromise the final model accuracy (using structured pruning), and (2) Require substantial compute resources and time for identifying a suitable compressed DNN model (using neural architecture search). In this paper, we explore a new avenue, referred to as Pruning-at-Initialization (PaI), using structured pruning to mitigate the above problems. We develop Reconvene, a system for rapidly generating pruned models suited for edge deployments using structured PaI. Reconvene systematically identifies and prunes DNN convolution layers that are least sensitive to structured pruning. Reconvene rapidly creates pruned DNNs within seconds that are up to 16.21x smaller and 2x faster while maintaining the same accuracy as an unstructured PaI counterpart.

ML-62-标题: AdaQAT: Adaptive Bit-Width Quantization-Aware Training

链接: https://arxiv.org/abs/2404.16876
作者: Cédric Gernigon (TARAN), Silviu-Ioan Filip (TARAN), Olivier Sentieys (TARAN), Clément Coggiola (CNES), Mickael Bruno (CNES)
备注:

点击查看摘要

Abstract:Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios. However, high computational complexity and energy costs of modern DNNs make their deployment on edge devices challenging. Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging. In this work, we present Adaptive Bit-Width Quantization Aware Training (AdaQAT), a learning-based method that automatically optimizes weight and activation signal bit-widths during training for more efficient DNN inference. We use relaxed real-valued bit-widths that are updated using a gradient descent rule, but are otherwise discretized for all quantization operations. The result is a simple and flexible QAT approach for mixed-precision uniform quantization problems. Compared to other methods that are generally designed to be run on a pretrained network, AdaQAT works well in both training from scratch and fine-tuning scenarios.Initial results on the CIFAR-10 and ImageNet datasets using ResNet20 and ResNet18 models, respectively, indicate that our method is competitive with other state-of-the-art mixed-precision quantization approaches.

ML-63-标题: LEMDA: A Novel Feature Engineering Method for Intrusion Detection in IoT Systems

链接: https://arxiv.org/abs/2404.16870
作者: Ali Ghubaish, Zebo Yang, Aiman Erbad, Raj Jain
备注:

点击查看摘要

Abstract:Intrusion detection systems (IDS) for the Internet of Things (IoT) systems can use AI-based models to ensure secure communications. IoT systems tend to have many connected devices producing massive amounts of data with high dimensionality, which requires complex models. Complex models have notorious problems such as overfitting, low interpretability, and high computational complexity. Adding model complexity penalty (i.e., regularization) can ease overfitting, but it barely helps interpretability and computational efficiency. Feature engineering can solve these issues; hence, it has become critical for IDS in large-scale IoT systems to reduce the size and dimensionality of data, resulting in less complex models with excellent performance, smaller data storage, and fast detection. This paper proposes a new feature engineering method called LEMDA (Light feature Engineering based on the Mean Decrease in Accuracy). LEMDA applies exponential decay and an optional sensitivity factor to select and create the most informative features. The proposed method has been evaluated and compared to other feature engineering methods using three IoT datasets and four AI/ML models. The results show that LEMDA improves the F1 score performance of all the IDS models by an average of 34% and reduces the average training and detection times in most cases.

ML-64-标题: Predicting SSH keys in Open SSH Memory dumps

链接: https://arxiv.org/abs/2404.16838
作者: Florian Rascoussier
备注: The report contains 148 pages, 22 figures, 17 tables and 34 listings. The GitHub of the project can be accessed here: this https URL This work is part of an ongoing effort for publication

点击查看摘要

Abstract:As the digital landscape evolves, cybersecurity has become an indispensable focus of IT systems. Its ever-escalating challenges have amplified the importance of digital forensics, particularly in the analysis of heap dumps from main memory. In this context, the Secure Shell protocol (SSH) designed for encrypted communications, serves as both a safeguard and a potential veil for malicious activities. This research project focuses on predicting SSH keys in OpenSSH memory dumps, aiming to enhance protective measures against illicit access and enable the development of advanced security frameworks or tools like honeypots. This Masterarbeit is situated within the broader SmartVMI project, and seeks to build upon existing research on key prediction in OpenSSH heap dumps. Utilizing machine learning (ML) and deep learning models, the study aims to refine features for embedding techniques and explore innovative methods for effective key detection based on recent advancements in Knowledge Graph and ML. The objective is to accurately predict the presence and location of SSH keys within memory dumps. This work builds upon, and aims to enhance, the foundations laid by SSHkex and SmartKex, enriching both the methodology and the results of the original research while exploring the untapped potential of newly proposed approaches. The current thesis dives into memory graph modelization from raw binary heap dump files. Each memory graph can support a range of embeddings that can be used directly for model training, through the use of classic ML models and graph neural network. It offers an in-depth discussion on the current state-of-the-art in key prediction for OpenSSH memory dumps, research questions, experimental setups, programs development, results as well as discussing potential future directions.

ML-65-标题: Q-Learning to navigate turbulence without a map

链接: https://arxiv.org/abs/2404.17495
作者: Marco Rando, Martin James, Alessandro Verri, Lorenzo Rosasco, Agnese Seminara
备注: 18 pages, 8 figures

点击查看摘要

Abstract:We consider the problem of olfactory searches in a turbulent environment. We focus on agents that respond solely to odor stimuli, with no access to spatial perception nor prior information about the odor location. We ask whether navigation strategies to a target can be learned robustly within a sequential decision making framework. We develop a reinforcement learning algorithm using a small set of interpretable olfactory states and train it with realistic turbulent odor cues. By introducing a temporal memory, we demonstrate that two salient features of odor traces, discretized in few olfactory states, are sufficient to learn navigation in a realistic odor plume. Performance is dictated by the sparse nature of turbulent plumes. An optimal memory exists which ignores blanks within the plume and activates a recovery strategy outside the plume. We obtain the best performance by letting agents learn their recovery strategy and show that it is mostly casting cross wind, similar to behavior observed in flying insects. The optimal strategy is robust to substantial changes in the odor plumes, suggesting minor parameter tuning may be sufficient to adapt to different environments.

ML-66-标题: Differentiable Pareto-Smoothed Weighting for High-Dimensional Heterogeneous Treatment Effect Estimation UAI2024

链接: https://arxiv.org/abs/2404.17483
作者: Yoichi Chikahara, Kansei Ushiyama
备注: Accepted to UAI2024. 14 pages, 4 figures

点击查看摘要

Abstract:There is a growing interest in estimating heterogeneous treatment effects across individuals using their high-dimensional feature attributes. Achieving high performance in such high-dimensional heterogeneous treatment effect estimation is challenging because in this setup, it is usual that some features induce sample selection bias while others do not but are predictive of potential outcomes. To avoid losing such predictive feature information, existing methods learn separate feature representations using the inverse of probability weighting (IPW). However, due to the numerically unstable IPW weights, they suffer from estimation bias under a finite sample setup. To develop a numerically robust estimator via weighted representation learning, we propose a differentiable Pareto-smoothed weighting framework that replaces extreme weight values in an end-to-end fashion. Experimental results show that by effectively correcting the weight values, our method outperforms the existing ones, including traditional weighting schemes.

ML-67-标题: FTL: Transfer Learning Nonlinear Plasma Dynamic Transitions in Low Dimensional Embeddings via Deep Neural Networks

链接: https://arxiv.org/abs/2404.17466
作者: Zhe Bai, Xishuo Wei, William Tang, Leonid Oliker, Zhihong Lin, Samuel Williams
备注: 18 pages, 10 figures

点击查看摘要

Abstract:Deep learning algorithms provide a new paradigm to study high-dimensional dynamical behaviors, such as those in fusion plasma systems. Development of novel model reduction methods, coupled with detection of abnormal modes with plasma physics, opens a unique opportunity for building efficient models to identify plasma instabilities for real-time control. Our Fusion Transfer Learning (FTL) model demonstrates success in reconstructing nonlinear kink mode structures by learning from a limited amount of nonlinear simulation data. The knowledge transfer process leverages a pre-trained neural encoder-decoder network, initially trained on linear simulations, to effectively capture nonlinear dynamics. The low-dimensional embeddings extract the coherent structures of interest, while preserving the inherent dynamics of the complex system. Experimental results highlight FTL’s capacity to capture transitional behaviors and dynamical features in plasma dynamics – a task often challenging for conventional methods. The model developed in this study is generalizable and can be extended broadly through transfer learning to address various magnetohydrodynamics (MHD) modes.

ML-68-标题: Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets

链接: https://arxiv.org/abs/2404.17442
作者: Benjamin Dupuis, Paul Viallard, George Deligiannidis, Umut Simsekli
备注:

点击查看摘要

Abstract:We propose data-dependent uniform generalization bounds by approaching the problem from a PAC-Bayesian perspective. We first apply the PAC-Bayesian framework on `random sets’ in a rigorous way, where the training algorithm is assumed to output a data-dependent hypothesis set after observing the training data. This approach allows us to prove data-dependent bounds, which can be applicable in numerous contexts. To highlight the power of our approach, we consider two main applications. First, we propose a PAC-Bayesian formulation of the recently developed fractal-dimension-based generalization bounds. The derived results are shown to be tighter and they unify the existing results around one simple proof technique. Second, we prove uniform bounds over the trajectories of continuous Langevin dynamics and stochastic gradient Langevin dynamics. These results provide novel information about the generalization properties of noisy algorithms.

ML-69-标题: Separation capacity of linear reservoirs with random connectivity matrix

链接: https://arxiv.org/abs/2404.17429
作者: Youness Boutaib
备注:

点击查看摘要

Abstract:We argue that the success of reservoir computing lies within the separation capacity of the reservoirs and show that the expected separation capacity of random linear reservoirs is fully characterised by the spectral decomposition of an associated generalised matrix of moments. Of particular interest are reservoirs with Gaussian matrices that are either symmetric or whose entries are all independent. In the symmetric case, we prove that the separation capacity always deteriorates with time; while for short inputs, separation with large reservoirs is best achieved when the entries of the matrix are scaled with a factor \rho_T/\sqrtN , where N is the dimension of the reservoir and \rho_T depends on the maximum length of the input time series. In the i.i.d. case, we establish that optimal separation with large reservoirs is consistently achieved when the entries of the reservoir matrix are scaled with the exact factor 1/\sqrtN . We further give upper bounds on the quality of separation in function of the length of the time series. We complement this analysis with an investigation of the likelihood of this separation and the impact of the chosen architecture on separation consistency.

ML-70-标题: Online Policy Learning and Inference by Matrix Completion

链接: https://arxiv.org/abs/2404.17398
作者: Congyuan Duan, Jingyang Li, Dong Xia
备注:

点击查看摘要

Abstract:Making online decisions can be challenging when features are sparse and orthogonal to historical ones, especially when the optimal policy is learned through collaborative filtering. We formulate the problem as a matrix completion bandit (MCB), where the expected reward under each arm is characterized by an unknown low-rank matrix. The \epsilon -greedy bandit and the online gradient descent algorithm are explored. Policy learning and regret performance are studied under a specific schedule for exploration probabilities and step sizes. A faster decaying exploration probability yields smaller regret but learns the optimal policy less accurately. We investigate an online debiasing method based on inverse propensity weighting (IPW) and a general framework for online policy inference. The IPW-based estimators are asymptotically normal under mild arm-optimality conditions. Numerical simulations corroborate our theoretical findings. Our methods are applied to the San Francisco parking pricing project data, revealing intriguing discoveries and outperforming the benchmark policy.

ML-71-标题: Similarity Equivariant Graph Neural Networks for Homogenization of Metamaterials

链接: https://arxiv.org/abs/2404.17365
作者: Fleur Hendriks (1), Vlado Menkovski (1), Martin Doškář (2), Marc G. D. Geers (1), Ondřej Rokoš (1) ((1) Eindhoven University of Technology, (2) Czech Technical University in Prague)
备注: 54 pages, 20 figures submitted to CMAME (Computer Methods in Applied Mechanics and Engineering)

点击查看摘要

Abstract:Soft, porous mechanical metamaterials exhibit pattern transformations that may have important applications in soft robotics, sound reduction and biomedicine. To design these innovative materials, it is important to be able to simulate them accurately and quickly, in order to tune their mechanical properties. Since conventional simulations using the finite element method entail a high computational cost, in this article we aim to develop a machine learning-based approach that scales favorably to serve as a surrogate model. To ensure that the model is also able to handle various microstructures, including those not encountered during training, we include the microstructure as part of the network input. Therefore, we introduce a graph neural network that predicts global quantities (energy, stress stiffness) as well as the pattern transformations that occur (the kinematics). To make our model as accurate and data-efficient as possible, various symmetries are incorporated into the model. The starting point is an E(n)-equivariant graph neural network (which respects translation, rotation and reflection) that has periodic boundary conditions (i.e., it is in-/equivariant with respect to the choice of RVE), is scale in-/equivariant, can simulate large deformations, and can predict scalars, vectors as well as second and fourth order tensors (specifically energy, stress and stiffness). The incorporation of scale equivariance makes the model equivariant with respect to the similarities group, of which the Euclidean group E(n) is a subgroup. We show that this network is more accurate and data-efficient than graph neural networks with fewer symmetries. To create an efficient graph representation of the finite element discretization, we use only the internal geometrical hole boundaries from the finite element mesh to achieve a better speed-up and scaling with the mesh size.

ML-72-标题: Neyman Meets Causal Machine Learning: Experimental Evaluation of Individualized Treatment Rules

链接: https://arxiv.org/abs/2404.17019
作者: Michael Lingzhi Li, Kosuke Imai
备注:

点击查看摘要

Abstract:A century ago, Neyman showed how to evaluate the efficacy of treatment using a randomized experiment under a minimal set of assumptions. This classical repeated sampling framework serves as a basis of routine experimental analyses conducted by today’s scientists across disciplines. In this paper, we demonstrate that Neyman’s methodology can also be used to experimentally evaluate the efficacy of individualized treatment rules (ITRs), which are derived by modern causal machine learning algorithms. In particular, we show how to account for additional uncertainty resulting from a training process based on cross-fitting. The primary advantage of Neyman’s approach is that it can be applied to any ITR regardless of the properties of machine learning algorithms that are used to derive the ITR. We also show, somewhat surprisingly, that for certain metrics, it is more efficient to conduct this ex-post experimental evaluation of an ITR than to conduct an ex-ante experimental evaluation that randomly assigns some units to the ITR. Our analysis demonstrates that Neyman’s repeated sampling framework is as relevant for causal inference today as it has been since its inception.

ML-73-标题: HEroBM: a deep equivariant graph neural network for universal backmapping from coarse-grained to all-atom representations

链接: https://arxiv.org/abs/2404.16911
作者: Daniele Angioletti, Stefano Raniolo, Vittorio Limongelli
备注:

点击查看摘要

Abstract:Molecular simulations have assumed a paramount role in the fields of chemistry, biology, and material sciences, being able to capture the intricate dynamic properties of systems. Within this realm, coarse-grained (CG) techniques have emerged as invaluable tools to sample large-scale systems and reach extended timescales by simplifying system representation. However, CG approaches come with a trade-off: they sacrifice atomistic details that might hold significant relevance in deciphering the investigated process. Therefore, a recommended approach is to identify key CG conformations and process them using backmapping methods, which retrieve atomistic coordinates. Currently, rule-based methods yield subpar geometries and rely on energy relaxation, resulting in less-than-optimal outcomes. Conversely, machine learning techniques offer higher accuracy but are either limited in transferability between systems or tied to specific CG mappings. In this work, we introduce HEroBM, a dynamic and scalable method that employs deep equivariant graph neural networks and a hierarchical approach to achieve high-resolution backmapping. HEroBM handles any type of CG mapping, offering a versatile and efficient protocol for reconstructing atomistic structures with high accuracy. Focused on local principles, HEroBM spans the entire chemical space and is transferable to systems of varying sizes. We illustrate the versatility of our framework through diverse biological systems, including a complex real-case scenario. Here, our end-to-end backmapping approach accurately generates the atomistic coordinates of a G protein-coupled receptor bound to an organic small molecule within a cholesterol/phospholipid bilayer.

ML-74-标题: Season combinatorial intervention predictions with Salt & Peper

链接: https://arxiv.org/abs/2404.16907
作者: Thomas Gaudelet, Alice Del Vecchio, Eli M Carrami, Juliana Cudini, Chantriolnt-Andreas Kapourani, Caroline Uhler, Lindsay Edwards
备注:

点击查看摘要

Abstract:Interventions play a pivotal role in the study of complex biological systems. In drug discovery, genetic interventions (such as CRISPR base editing) have become central to both identifying potential therapeutic targets and understanding a drug’s mechanism of action. With the advancement of CRISPR and the proliferation of genome-scale analyses such as transcriptomics, a new challenge is to navigate the vast combinatorial space of concurrent genetic interventions. Addressing this, our work concentrates on estimating the effects of pairwise genetic combinations on the cellular transcriptome. We introduce two novel contributions: Salt, a biologically-inspired baseline that posits the mostly additive nature of combination effects, and Peper, a deep learning model that extends Salt’s additive assumption to achieve unprecedented accuracy. Our comprehensive comparison against existing state-of-the-art methods, grounded in diverse metrics, and our out-of-distribution analysis highlight the limitations of current models in realistic settings. This analysis underscores the necessity for improved modelling techniques and data acquisition strategies, paving the way for more effective exploration of genetic intervention effects.

ML-75-标题: Space-Variant Total Variation boosted by learning techniques in few-view tomographic imaging

链接: https://arxiv.org/abs/2404.16900
作者: Elena Morotti, Davide Evangelista, Andrea Sebastiani, Elena Loli Piccolomini
备注:

点击查看摘要

Abstract:This paper focuses on the development of a space-variant regularization model for solving an under-determined linear inverse problem. The case study is a medical image reconstruction from few-view tomographic noisy data. The primary objective of the proposed optimization model is to achieve a good balance between denoising and the preservation of fine details and edges, overcoming the performance of the popular and largely used Total Variation (TV) regularization through the application of appropriate pixel-dependent weights. The proposed strategy leverages the role of gradient approximations for the computation of the space-variant TV weights. For this reason, a convolutional neural network is designed, to approximate both the ground truth image and its gradient using an elastic loss function in its training. Additionally, the paper provides a theoretical analysis of the proposed model, showing the uniqueness of its solution, and illustrates a Chambolle-Pock algorithm tailored to address the specific problem at hand. This comprehensive framework integrates innovative regularization techniques with advanced neural network capabilities, demonstrating promising results in achieving high-quality reconstructions from low-sampled tomographic data.

ML-76-标题: Functional Protein Design with Local Domain Alignment

链接: https://arxiv.org/abs/2404.16866
作者: Chaohao Yuan, Songyou Li, Geyan Ye, Yikun Zhang, Long-Kai Huang, Wenbing Huang, Wei Liu, Jianhua Yao, Yu Rong
备注:

点击查看摘要

Abstract:The core challenge of de novo protein design lies in creating proteins with specific functions or properties, guided by certain conditions. Current models explore to generate protein using structural and evolutionary guidance, which only provide indirect conditions concerning functions and properties. However, textual annotations of proteins, especially the annotations for protein domains, which directly describe the protein’s high-level functionalities, properties, and their correlation with target amino acid sequences, remain unexplored in the context of protein design tasks. In this paper, we propose Protein-Annotation Alignment Generation (PAAG), a multi-modality protein design framework that integrates the textual annotations extracted from protein database for controllable generation in sequence space. Specifically, within a multi-level alignment module, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations, and can even design novel proteins with flexible combinations of different kinds of annotations. Our experimental results underscore the superiority of the aligned protein representations from PAAG over 7 prediction tasks. Furthermore, PAAG demonstrates a nearly sixfold increase in generation success rate (24.7% vs 4.7% in zinc finger, and 54.3% vs 8.7% in the immunoglobulin domain) in comparison to the existing model.

计算机视觉

CV-0-标题: Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

链接: https://arxiv.org/abs/2404.17571
作者: Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao
备注: Project Page: this https URL

点击查看摘要

Abstract:Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named “Tunnel Try-on.” The core idea is excavating a “focus tunnel” in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.

CV-1-标题: MaPa: Text-driven Photorealistic Material Painting for 3D Shapes SIGGRAPH2024

链接: https://arxiv.org/abs/2404.17569
作者: Shangzhan Zhang, Sida Peng, Tao Xu, Yuanbo Yang, Tianrun Chen, Nan Xue, Yujun Shen, Hujun Bao, Ruizhen Hu, Xiaowei Zhou
备注: SIGGRAPH 2024. Project page: this https URL

点击查看摘要

Abstract:This paper aims to generate materials for 3D meshes from text descriptions. Unlike existing methods that synthesize texture maps, we propose to generate segment-wise procedural material graphs as the appearance representation, which supports high-quality rendering and provides substantial flexibility in editing. Instead of relying on extensive paired data, i.e., 3D meshes with material graphs and corresponding text descriptions, to train a material graph generative model, we propose to leverage the pre-trained 2D diffusion model as a bridge to connect the text and material graphs. Specifically, our approach decomposes a shape into a set of segments and designs a segment-controlled diffusion model to synthesize 2D images that are aligned with mesh parts. Based on generated images, we initialize parameters of material graphs and fine-tune them through the differentiable rendering module to produce materials in accordance with the textual description. Extensive experiments demonstrate the superior performance of our framework in photorealism, resolution, and editability over existing methods. Project page: this https URL

CV-2-标题: ChangeBind: A Hybrid Change Encoder for Remote Sensing Change Detection

链接: https://arxiv.org/abs/2404.17565
作者: Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal
备注: accepted at IGARSS 2024

点击查看摘要

Abstract:Change detection (CD) is a fundamental task in remote sensing (RS) which aims to detect the semantic changes between the same geographical regions at different time stamps. Existing convolutional neural networks (CNNs) based approaches often struggle to capture long-range dependencies. Whereas recent transformer-based methods are prone to the dominant global representation and may limit their capabilities to capture the subtle change regions due to the complexity of the objects in the scene. To address these limitations, we propose an effective Siamese-based framework to encode the semantic changes occurring in the bi-temporal RS images. The main focus of our design is to introduce a change encoder that leverages local and global feature representations to capture both subtle and large change feature information from multi-scale features to precisely estimate the change regions. Our experimental study on two challenging CD datasets reveals the merits of our approach and obtains state-of-the-art performance.

CV-3-标题: Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models

链接: https://arxiv.org/abs/2404.17534
作者: Yuhang Huang, Zihan Wu, Chongyang Gao, Jiawei Peng, Xu Yang
备注: 11 pages, 9 figures, 6 tables. For associated code, see this https URL

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are gaining traction for their remarkable ability to process and integrate visual and textual data. Despite their popularity, the capacity of LVLMs to generate precise, fine-grained textual descriptions has not been fully explored. This study addresses this gap by focusing on \textitdistinctiveness and \textitfidelity, assessing how models like Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between similar objects and accurately describe visual features. We proposed the Textual Retrieval-Augmented Classification (TRAC) framework, which, by leveraging its generative capabilities, allows us to delve deeper into analyzing fine-grained visual description generation. This research provides valuable insights into the generation quality of LVLMs, enhancing the understanding of multimodal language models. Notably, MiniGPT-4 stands out for its better ability to generate fine-grained descriptions, outperforming the other two models in this aspect. The code is provided at \urlhttps://anonymous.4open.science/r/Explore_FGVDs-E277.

CV-4-标题: Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields CVPR2024

链接: https://arxiv.org/abs/2404.17528
作者: Tianqi Liu, Xinyi Ye, Min Shi, Zihao Huang, Zhiyu Pan, Zhan Peng, Zhiguo Cao
备注: Accepted by CVPR 2024. Project page: this https URL

点击查看摘要

Abstract:Generalizable NeRF aims to synthesize novel views for unseen scenes. Common practices involve constructing variance-based cost volumes for geometry reconstruction and encoding 3D descriptors for decoding novel views. However, existing methods show limited generalization ability in challenging conditions due to inaccurate geometry, sub-optimal descriptors, and decoding strategies. We address these issues point by point. First, we find the variance-based cost volume exhibits failure patterns as the features of pixels corresponding to the same point can be inconsistent across different views due to occlusions or reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to amplify the contribution of consistent pixel pairs and suppress inconsistent ones. Unlike previous methods that solely fuse 2D features into descriptors, our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D context into descriptors through spatial and inter-view interaction. When decoding the descriptors, we observe the two existing decoding strategies excel in different areas, which are complementary. A Consistency-Aware Fusion (CAF) strategy is proposed to leverage the advantages of both. We incorporate the above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains state-of-the-art performance across multiple datasets. Code is available at this https URL .

CV-5-标题: Ag2Manip: Learning Novel Manipulation Skills with Agent -Agnostic Visual and Action Representations

链接: https://arxiv.org/abs/2404.17521
作者: Puhao Li, Tengyu Liu, Yuyang Li, Muzhi Han, Haoran Geng, Shu Wang, Yixin Zhu, Song-Chun Zhu, Siyuan Huang
备注: Project website and open-source code: this https URL

点击查看摘要

Abstract:Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations: a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and an agent-agnostic action representation abstracting a robot’s kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object. Ag2Manip’s empirical validation across simulated benchmarks like FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of the visual and action representations to this success. Extending our evaluations to the real world, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments.

CV-6-标题: HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

链接: https://arxiv.org/abs/2404.17507
作者: Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, Sangdoo Yun
备注: 28pages, 4.5MB

点击查看摘要

Abstract:In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. Our approach leverages hyperbolic embeddings and the concept of entailment cones to evaluate and filter out samples with meaningless or underspecified semantics, focusing on enhancing the specificity of each data sample. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark when combined with existing filtering techniques. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models. Additionally, the image specificity \epsilon_i can be independently applied to induce an image-only dataset from an image-text or image-only data pool for training image-only self-supervised models and showed superior performance when compared to the dataset induced by CLIP score.

CV-7-标题: Inhomogeneous illuminated image enhancement under extremely low visibility condition

链接: https://arxiv.org/abs/2404.17503
作者: Libang Chen, Yikun Liu, Jianying Zhou
备注:

点击查看摘要

Abstract:Imaging through fog significantly impacts fields such as object detection and recognition. In conditions of extremely low visibility, essential image information can be obscured, rendering standard extraction methods ineffective. Traditional digital processing techniques, such as histogram stretching, aim to mitigate fog effects by enhancing object light contrast diminished by atmospheric scattering. However, these methods often experience reduce effectiveness under inhomogeneous illumination. This paper introduces a novel approach that adaptively filters background illumination under extremely low visibility and preserve only the essential signal information. Additionally, we employ a visual optimization strategy based on image gradients to eliminate grayscale banding. Finally, the image is transformed to achieve high contrast and maintain fidelity to the original information through maximum histogram equalization. Our proposed method significantly enhances signal clarity in conditions of extremely low visibility and outperforms existing algorithms.

CV-8-标题: Learning text-to-video retrieval from image captioning CVPR2023

链接: https://arxiv.org/abs/2404.17498
作者: Lucas Ventura, Cordelia Schmid, Gül Varol
备注: A short version of this work appeared at CVPR 2023 Workshops. Project page: this https URL

点击查看摘要

Abstract:We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper therefore scalable, in contrast to expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide supervision signal into unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.

CV-9-标题: Low Cost Machine Vision for Insect Classification

链接: https://arxiv.org/abs/2404.17488
作者: Danja Brandt, Martin Tschaikner, Teodor Chiaburu, Henning Schmidt, Ilona Schrimpf, Alexandra Stadel, Ingeborg E. Beckers, Frank Haußer
备注:

点击查看摘要

Abstract:Preserving the number and diversity of insects is one of our society’s most important goals in the area of environmental sustainability. A prerequisite for this is a systematic and up-scaled monitoring in order to detect correlations and identify countermeasures. Therefore, automatized monitoring using live traps is important, but so far there is no system that provides image data of sufficient detailed information for entomological classification. In this work, we present an imaging method as part of a multisensor system developed as a low-cost, scalable, open-source system that is adaptable to classical trap types. The image quality meets the requirements needed for classification in the taxonomic tree. Therefore, illumination and resolution have been optimized and motion artefacts have been suppressed. The system is evaluated exemplarily on a dataset consisting of 16 insect species of the same as well as different genus, family and order. We demonstrate that standard CNN-architectures like ResNet50 (pretrained on iNaturalist data) or MobileNet perform very well for the prediction task after re-training. Smaller custom made CNNs also lead to promising results. Classification accuracy of >96% has been achieved. Moreover, it was proved that image cropping of insects is necessary for classification of species with high inter-class similarity.

CV-10-标题: TextGaze: Gaze-Controllable Face Generation with Natural Language

链接: https://arxiv.org/abs/2404.17486
作者: Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang
备注: Under review

点击查看摘要

Abstract:Generating face image with specific gaze information has attracted considerable attention. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting its application. In this paper, we present a novel gaze-controllable face generation task. Our approach inputs textual descriptions that describe human gaze and head behavior and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method. Our method contains a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and eye segmentation map. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate face sketch from text description. Experiments on the FFHQ dataset show the effectiveness of our method. We will release our dataset and code for future research.

CV-11-标题: Sparse Reconstruction of Optical Doppler Tomography Based on State Space Model

链接: https://arxiv.org/abs/2404.17484
作者: Zhenghong Li, Jiaxiang Ren, Wensheng Cheng, Congwu Du, Yingtian Pan, Haibin Ling
备注: 19 pages, 5 figures

点击查看摘要

Abstract:Optical Doppler Tomography (ODT) is a blood flow imaging technique popularly used in bioengineering applications. The fundamental unit of ODT is the 1D frequency response along the A-line (depth), named raw A-scan. A 2D ODT image (B-scan) is obtained by first sensing raw A-scans along the B-line (width), and then constructing the B-scan from these raw A-scans via magnitude-phase analysis and post-processing. To obtain a high-resolution B-scan with a precise flow map, densely sampled A-scans are required in current methods, causing both computational and storage burdens. To address this issue, in this paper we propose a novel sparse reconstruction framework with four main sequential steps: 1) early magnitude-phase fusion that encourages rich interaction of the complementary information in magnitude and phase, 2) State Space Model (SSM)-based representation learning, inspired by recent successes in Mamba and VMamba, to naturally capture both the intra-A-scan sequential information and between-A-scan interactions, 3) an Inception-based Feedforward Network module (IncFFN) to further boost the SSM-module, and 4) a B-line Pixel Shuffle (BPS) layer to effectively reconstruct the final results. In the experiments on real-world animal data, our method shows clear effectiveness in reconstruction accuracy. As the first application of SSM for image reconstruction tasks, we expect our work to inspire related explorations in not only efficient ODT imaging techniques but also generic image enhancement.

CV-12-标题: Prompt CIR: Blind Compressed Image Restoration with Prompt Learning

链接: https://arxiv.org/abs/2404.17433
作者: Bingchen Li, Xin Li, Yiting Lu, Ruoyu Feng, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen
备注: Winner of NTIRE 2024 Blind Compressed Image Enhancement Challenge

点击查看摘要

Abstract:Blind Compressed Image Restoration (CIR) has garnered significant attention due to its practical applications. It aims to mitigate compression artifacts caused by unknown quality factors, particularly with JPEG codecs. Existing works on blind CIR often seek assistance from a quality factor prediction network to facilitate their network to restore compressed images. However, the predicted numerical quality factor lacks spatial information, preventing network adaptability toward image contents. Recent studies in prompt-learning-based image restoration have showcased the potential of prompts to generalize across varied degradation types and degrees. This motivated us to design a prompt-learning-based compressed image restoration network, dubbed PromptCIR, which can effectively restore images from various compress levels. Specifically, PromptCIR exploits prompts to encode compression information implicitly, where prompts directly interact with soft weights generated from image features, thus providing dynamic content-aware and distortion-aware guidance for the restoration process. The light-weight prompts enable our method to adapt to different compression levels, while introducing minimal parameter overhead. Overall, PromptCIR leverages the powerful transformer-based backbone with the dynamic prompt module to proficiently handle blind CIR tasks, winning first place in the NTIRE 2024 challenge of blind compressed image enhancement track. Extensive experiments have validated the effectiveness of our proposed PromptCIR. The code is available at this https URL.

CV-13-标题: Cost-Sensitive Uncertainty-Based Failure Recognition for Object Detection UAI2024

链接: https://arxiv.org/abs/2404.17427
作者: Moussa Kassem Sbeyti, Michelle Karg, Christian Wirth, Nadja Klein, Sahin Albayrak
备注: Accepted with an oral presentation at UAI 2024

点击查看摘要

Abstract:Object detectors in real-world applications often fail to detect objects due to varying factors such as weather conditions and noisy input. Therefore, a process that mitigates false detections is crucial for both safety and accuracy. While uncertainty-based thresholding shows promise, previous works demonstrate an imperfect correlation between uncertainty and detection errors. This hinders ideal thresholding, prompting us to further investigate the correlation and associated cost with different types of uncertainty. We therefore propose a cost-sensitive framework for object detection tailored to user-defined budgets on the two types of errors, missing and false detections. We derive minimum thresholding requirements to prevent performance degradation and define metrics to assess the applicability of uncertainty for failure recognition. Furthermore, we automate and optimize the thresholding process to maximize the failure recognition rate w.r.t. the specified budget. Evaluation on three autonomous driving datasets demonstrates that our approach significantly enhances safety, particularly in challenging scenarios. Leveraging localization aleatoric uncertainty and softmax-based entropy only, our method boosts the failure recognition rate by 36-60% compared to conventional approaches. Code is available at this https URL.

CV-14-标题: Multi-view Image Prompt ed Multi-view Diffusion for Improved 3D Generation

链接: https://arxiv.org/abs/2404.17419
作者: Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang
备注: 5 pages including references, 2 figures, 2 tables

点击查看摘要

Abstract:Using image as prompts for 3D generation demonstrate particularly strong performances compared to using text prompts alone, for images provide a more intuitive guidance for the 3D generation process. In this work, we delve into the potential of using multiple image prompts, instead of a single image prompt, for 3D generation. Specifically, we build on ImageDream, a novel image-prompt multi-view diffusion model, to support multi-view images as the input prompt. Our method, dubbed MultiImageDream, reveals that transitioning from a single-image prompt to multiple-image prompts enhances the performance of multi-view and 3D object generation according to various quantitative evaluation metrics and qualitative assessments. This advancement is achieved without the necessity of fine-tuning the pre-trained ImageDream multi-view diffusion model.

CV-15-标题: Spatial-frequency Dual-Domain Feature Fusion Network for Low-Light Remote Sensing Image Enhancement

链接: https://arxiv.org/abs/2404.17400
作者: Zishu Yao, Guodong Fan, Jinfu Fan, Min Gan, C.L. Philip Chen
备注: 14 page

点击查看摘要

Abstract:Low-light remote sensing images generally feature high resolution and high spatial complexity, with continuously distributed surface features in space. This continuity in scenes leads to extensive long-range correlations in spatial domains within remote sensing images. Convolutional Neural Networks, which rely on local correlations for long-distance modeling, struggle to establish long-range correlations in such images. On the other hand, transformer-based methods that focus on global information face high computational complexities when processing high-resolution remote sensing images. From another perspective, Fourier transform can compute global information without introducing a large number of parameters, enabling the network to more efficiently capture the overall image structure and establish long-range correlations. Therefore, we propose a Dual-Domain Feature Fusion Network (DFFN) for low-light remote sensing image enhancement. Specifically, this challenging task of low-light enhancement is divided into two more manageable sub-tasks: the first phase learns amplitude information to restore image brightness, and the second phase learns phase information to refine details. To facilitate information exchange between the two phases, we designed an information fusion affine block that combines data from different phases and scales. Additionally, we have constructed two dark light remote sensing datasets to address the current lack of datasets in dark light remote sensing image enhancement. Extensive evaluations show that our method outperforms existing state-of-the-art methods. The code is available at this https URL.

CV-16-标题: Frequency-Guided Multi-Level Human Action Anomaly Detection with Normalizing Flows

链接: https://arxiv.org/abs/2404.17381
作者: Shun Maeda, Chunzhi Gu, Jun Yu, Shogo Tokai, Shangce Gao, Chao Zhang
备注:

点击查看摘要

Abstract:We introduce the task of human action anomaly detection (HAAD), which aims to identify anomalous motions in an unsupervised manner given only the pre-determined normal category of training action samples. Compared to prior human-related anomaly detection tasks which primarily focus on unusual events from videos, HAAD involves the learning of specific action labels to recognize semantically anomalous human behaviors. To address this task, we propose a normalizing flow (NF)-based detection framework where the sample likelihood is effectively leveraged to indicate anomalies. As action anomalies often occur in some specific body parts, in addition to the full-body action feature learning, we incorporate extra encoding streams into our framework for a finer modeling of body subsets. Our framework is thus multi-level to jointly discover global and local motion anomalies. Furthermore, to show awareness of the potentially jittery data during recording, we resort to discrete cosine transformation by converting the action samples from the temporal to the frequency domain to mitigate the issue of data instability. Extensive experimental results on two human action datasets demonstrate that our method outperforms the baselines formed by adapting state-of-the-art human activity AD approaches to our task of HAAD.

CV-17-标题: Estimating the Robustness Radius for Randomized Smoothing with 100times Sample Efficiency

链接: https://arxiv.org/abs/2404.17371
作者: Emmanouil Seferis, Stefanos Kollias, Chih-Hong Cheng
备注:

点击查看摘要

Abstract:Randomized smoothing (RS) has successfully been used to improve the robustness of predictions for deep neural networks (DNNs) by adding random noise to create multiple variations of an input, followed by deciding the consensus. To understand if an RS-enabled DNN is effective in the sampled input domains, it is mandatory to sample data points within the operational design domain, acquire the point-wise certificate regarding robustness radius, and compare it with pre-defined acceptance criteria. Consequently, ensuring that a point-wise robustness certificate for any given data point is obtained relatively cost-effectively is crucial. This work demonstrates that reducing the number of samples by one or two orders of magnitude can still enable the computation of a slightly smaller robustness radius (commonly ~20% radius reduction) with the same confidence. We provide the mathematical foundation for explaining the phenomenon while experimentally showing promising results on the standard CIFAR-10 and ImageNet datasets.

CV-18-标题: MV-VTON: Multi-View Virtual Try-On with Diffusion Models

链接: https://arxiv.org/abs/2404.17364
作者: Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo
备注: 15 pages

点击查看摘要

Abstract:The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, most existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person’s view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results of a person from multiple views using the given clothes. On the one hand, given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. On the other hand, the diffusion models that have demonstrated superior abilities are adopted to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person’s view. Subsequently, we suggest a joint attention block to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset, i.e., Multi-View Garment (MVG), in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets will be publicly released at this https URL .

CV-19-标题: UniRGB-IR: A Unified Framework for Visible-Infrared Downstream Tasks via Adapter Tuning

链接: https://arxiv.org/abs/2404.17360
作者: Maoxun Yuan, Bo Cui, Tianyi Zhao, Xingxing Wei
备注:

点击查看摘要

Abstract:Semantic analysis on visible (RGB) and infrared (IR) images has gained attention for its ability to be more accurate and robust under low-illumination and complex weather conditions. Due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. In this work, we propose a scalable and efficient framework called UniRGB-IR to unify RGB-IR downstream tasks, in which a novel adapter is developed to efficiently introduce richer RGB-IR features into the pre-trained RGB-based foundation model. Specifically, our framework consists of a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adpater to effectively complement the ViT features with the contextual multi-scale features. During training process, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize the ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR downstream tasks demonstrate that our method can achieve state-of-the-art performance. The source code and results are available at this https URL.

CV-20-标题: On the Road to Clarity: Exploring Explainable AI for World Models in a Driver Assistance System

链接: https://arxiv.org/abs/2404.17350
作者: Mohamed Roshdi, Julian Petzold, Mostafa Wahby, Hussein Ebrahim, Mladen Berekovic, Heiko Hamann
备注: 8 pages, 6 figures, to be published in IEEE CAI 2024

点击查看摘要

Abstract:In Autonomous Driving (AD) transparency and safety are paramount, as mistakes are costly. However, neural networks used in AD systems are generally considered black boxes. As a countermeasure, we have methods of explainable AI (XAI), such as feature relevance estimation and dimensionality reduction. Coarse graining techniques can also help reduce dimensionality and find interpretable global patterns. A specific coarse graining method is Renormalization Groups from statistical physics. It has previously been applied to Restricted Boltzmann Machines (RBMs) to interpret unsupervised learning. We refine this technique by building a transparent backbone model for convolutional variational autoencoders (VAE) that allows mapping latent values to input features and has performance comparable to trained black box VAEs. Moreover, we propose a custom feature map visualization technique to analyze the internal convolutional layers in the VAE to explain internal causes of poor reconstruction that may lead to dangerous traffic scenarios in AD applications. In a second key contribution, we propose explanation and evaluation techniques for the internal dynamics and feature relevance of prediction networks. We test a long short-term memory (LSTM) network in the computer vision domain to evaluate the predictability and in future applications potentially safety of prediction models. We showcase our methods by analyzing a VAE-LSTM world model that predicts pedestrian perception in an urban traffic situation.

CV-21-标题: Masked Two-channel Decoupling Framework for Incomplete Multi-view Weak Multi-label Learning NEURIPS2023

链接: https://arxiv.org/abs/2404.17340
作者: Chengliang Liu, Jie Wen, Yabo Liu, Chao Huang, Zhihao Wu, Xiaoling Luo, Yong Xu
备注: Accepted at NeurIPS 2023. Email: liucl1996@163.com

点击查看摘要

Abstract:Multi-view learning has become a popular research topic in recent years, but research on the cross-application of classic multi-label classification and multi-view learning is still in its early stages. In this paper, we focus on the complex yet highly realistic task of incomplete multi-view weak multi-label learning and propose a masked two-channel decoupling framework based on deep neural networks to solve this problem. The core innovation of our method lies in decoupling the single-channel view-level representation, which is common in deep multi-view learning methods, into a shared representation and a view-proprietary representation. We also design a cross-channel contrastive loss to enhance the semantic property of the two channels. Additionally, we exploit supervised information to design a label-guided graph regularization loss, helping the extracted embedding features preserve the geometric structure among samples. Inspired by the success of masking mechanisms in image and text analysis, we develop a random fragment masking strategy for vector features to improve the learning ability of encoders. Finally, it is important to emphasize that our model is fully adaptable to arbitrary view and label absences while also performing well on the ideal full data. We have conducted sufficient and convincing experiments to confirm the effectiveness and advancement of our model.

CV-22-标题: A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

链接: https://arxiv.org/abs/2404.17335
作者: Xin Zhang, Liangxiu Han, Tam Sobeih, Lianghao Han, Darren Dancey
备注: 10 pages

点击查看摘要

Abstract:Depth estimation is crucial for interpreting complex environments, especially in areas such as autonomous vehicle navigation and robotics. Nonetheless, obtaining accurate depth readings from event camera data remains a formidable challenge. Event cameras operate differently from traditional digital cameras, continuously capturing data and generating asynchronous binary spikes that encode time, location, and light intensity. Yet, the unique sampling mechanisms of event cameras render standard image based algorithms inadequate for processing spike data. This necessitates the development of innovative, spike-aware algorithms tailored for event cameras, a task compounded by the irregularity, continuity, noise, and spatial and temporal characteristics inherent in spiking data.Harnessing the strong generalization capabilities of transformer neural networks for spatiotemporal data, we propose a purely spike-driven spike transformer network for depth estimation from spiking camera data. To address performance limitations with Spiking Neural Networks (SNN), we introduce a novel single-stage cross-modality knowledge transfer framework leveraging knowledge from a large vision foundational model of artificial neural networks (ANN) (DINOv2) to enhance the performance of SNNs with limited data. Our experimental results on both synthetic and real datasets show substantial improvements over existing models, with notable gains in Absolute Relative and Square Relative errors (49% and 39.77% improvements over the benchmark model Spike-T, respectively). Besides accuracy, the proposed model also demonstrates reduced power consumptions, a critical factor for practical applications.

CV-23-标题: Dense Road Surface Grip Map Prediction from Multimodal Image Data ICPR2024

链接: https://arxiv.org/abs/2404.17324
作者: Jyri Maanpää, Julius Pesonen, Heikki Hyyti, Iaroslav Melekhov, Juho Kannala, Petri Manninen, Antero Kukko, Juha Hyyppä
备注: 17 pages, 7 figures (supplementary material 1 page, 1 figure). Submitted to 27th International Conference of Pattern Recognition (ICPR 2024)

点击查看摘要

Abstract:Slippery road weather conditions are prevalent in many regions and cause a regular risk for traffic. Still, there has been less research on how autonomous vehicles could detect slippery driving conditions on the road to drive safely. In this work, we propose a method to predict a dense grip map from the area in front of the car, based on postprocessed multimodal sensor data. We trained a convolutional neural network to predict pixelwise grip values from fused RGB camera, thermal camera, and LiDAR reflectance images, based on weakly supervised ground truth from an optical road weather sensor. The experiments show that it is possible to predict dense grip values with good accuracy from the used data modalities as the produced grip map follows both ground truth measurements and local weather conditions, such as snowy areas on the road. The model using only the RGB camera or LiDAR reflectance modality provided good baseline results for grip prediction accuracy while using models fusing the RGB camera, thermal camera, and LiDAR modalities improved the grip predictions significantly.

CV-24-标题: Image Copy-Move Forgery Detection via Deep PatchMatch and Pairwise Ranking Learning

链接: https://arxiv.org/abs/2404.17310
作者: Yuanman Li, Yingjie He, Changsheng Chen, Li Dong, Bin Li, Jiantao Zhou, Xia Li
备注: 16 pages, 14figures

点击查看摘要

Abstract:Recent advances in deep learning algorithms have shown impressive progress in image copy-move forgery detection (CMFD). However, these algorithms lack generalizability in practical scenarios where the copied regions are not present in the training images, or the cloned regions are part of the background. Additionally, these algorithms utilize convolution operations to distinguish source and target regions, leading to unsatisfactory results when the target regions blend well with the background. To address these limitations, this study proposes a novel end-to-end CMFD framework that integrates the strengths of conventional and deep learning methods. Specifically, the study develops a deep cross-scale PatchMatch (PM) method that is customized for CMFD to locate copy-move regions. Unlike existing deep models, our approach utilizes features extracted from high-resolution scales to seek explicit and reliable point-to-point matching between source and target regions. Furthermore, we propose a novel pairwise rank learning framework to separate source and target regions. By leveraging the strong prior of point-to-point matches, the framework can identify subtle differences and effectively discriminate between source and target regions, even when the target regions blend well with the background. Our framework is fully differentiable and can be trained end-to-end. Comprehensive experimental results highlight the remarkable generalizability of our scheme across various copy-move scenarios, significantly outperforming existing methods.

CV-25-标题: Part-Guided 3D RL for Sim2Real Articulated Object Manipulation

链接: https://arxiv.org/abs/2404.17302
作者: Pengwei Xie, Rui Chen, Siang Chen, Yuzhe Qin, Fanbo Xiang, Tianyu Sun, Jing Xu, Guijin Wang, Hao Su
备注: 9 pages

点击查看摘要

Abstract:Manipulating unseen articulated objects through visual feedback is a critical but challenging task for real robots. Existing learning-based solutions mainly focus on visual affordance learning or other pre-trained visual models to guide manipulation policies, which face challenges for novel instances in real-world scenarios. In this paper, we propose a novel part-guided 3D RL framework, which can learn to manipulate articulated objects without demonstrations. We combine the strengths of 2D segmentation and 3D RL to improve the efficiency of RL policy training. To improve the stability of the policy on real robots, we design a Frame-consistent Uncertainty-aware Sampling (FUS) strategy to get a condensed and hierarchical 3D representation. In addition, a single versatile RL policy can be trained on multiple articulated object manipulation tasks simultaneously in simulation and shows great generalizability to novel categories and instances. Experimental results demonstrate the effectiveness of our framework in both simulation and real-world settings. Our code is available at this https URL.

CV-26-标题: Adversarial Reweighting with α-Power Maximization for Domain Adaptation

链接: https://arxiv.org/abs/2404.17275
作者: Xiang Gu, Xi Yu, Yan Yang, Jian Sun, Zongben Xu
备注: To appear in IJCV

点击查看摘要

Abstract:The practical Domain Adaptation (DA) tasks, e.g., Partial DA (PDA), open-set DA, universal DA, and test-time adaptation, have gained increasing attention in the machine learning community. In this paper, we propose a novel approach, dubbed Adversarial Reweighting with \alpha -Power Maximization (ARPM), for PDA where the source domain contains private classes absent in target domain. In ARPM, we propose a novel adversarial reweighting model that adversarially learns to reweight source domain data to identify source-private class samples by assigning smaller weights to them, for mitigating potential negative transfer. Based on the adversarial reweighting, we train the transferable recognition model on the reweighted source distribution to be able to classify common class data. To reduce the prediction uncertainty of the recognition model on the target domain for PDA, we present an \alpha -power maximization mechanism in ARPM, which enriches the family of losses for reducing the prediction uncertainty for PDA. Extensive experimental results on five PDA benchmarks, i.e., Office-31, Office-Home, VisDA-2017, ImageNet-Caltech, and DomainNet, show that our method is superior to recent PDA methods. Ablation studies also confirm the effectiveness of components in our approach. To theoretically analyze our method, we deduce an upper bound of target domain expected error for PDA, which is approximately minimized in our approach. We further extend ARPM to open-set DA, universal DA, and test time adaptation, and verify the usefulness through experiments.

CV-27-标题: 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting

链接: https://arxiv.org/abs/2404.17273
作者: Xuri Ge, Songpei Xu, Fuhai Chen, Jie Wang, Guoxin Wang, Shan An, Joemon M. Jose
备注: Accepted Information Processing and Management (IP&M), 10 pages, 9 figures and 8 tables

点击查看摘要

Abstract:In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights the salient identification of prominent objects and their spatial locations within the visual modality, thus allowing the integration of visual semantics-spatial interactions and maintaining independence between two modalities. This integration effectively combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation. And the modality-independence guarantees efficiency and generalization. Additionally, 3SHNet utilizes the structured contextual visual scene information from segmentation to conduct the local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments conducted on MS-COCO and Flickr30K benchmarks substantiate the superior performances, inference efficiency and generalization of the proposed 3SHNet when juxtaposed with contemporary state-of-the-art methodologies. Specifically, on the larger MS-COCO 5K test set, we achieve 16.3%, 24.8%, and 18.3% improvements in terms of rSum score, respectively, compared with the state-of-the-art methods using different image representations, while maintaining optimal retrieval efficiency. Moreover, our performance on cross-dataset generalization improves by 18.6%. Data and code are available at this https URL.

CV-28-标题: SDFD: Building a Versatile Synthetic Face Image Dataset with Diverse Attributes

链接: https://arxiv.org/abs/2404.17255
作者: Georgia Baltsou, Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos
备注: 2024 18th International Conference on Automatic Face and Gesture Recognition (FG)

点击查看摘要

Abstract:AI systems rely on extensive training on large datasets to address various tasks. However, image-based systems, particularly those used for demographic attribute prediction, face significant challenges. Many current face image datasets primarily focus on demographic factors such as age, gender, and skin tone, overlooking other crucial facial attributes like hairstyle and accessories. This narrow focus limits the diversity of the data and consequently the robustness of AI systems trained on them. This work aims to address this limitation by proposing a methodology for generating synthetic face image datasets that capture a broader spectrum of facial diversity. Specifically, our approach integrates a systematic prompt formulation strategy, encompassing not only demographics and biometrics but also non-permanent traits like make-up, hairstyle, and accessories. These prompts guide a state-of-the-art text-to-image model in generating a comprehensive dataset of high-quality realistic images and can be used as an evaluation set in face analysis systems. Compared to existing datasets, our proposed dataset proves equally or more challenging in image classification tasks while being much smaller in size.

CV-29-标题: Trinity Detector:text-assisted and attention mechanisms based spectral fusion for diffusion generation image detection

链接: https://arxiv.org/abs/2404.17254
作者: Jiawei Song, Dengpan Ye, Yunming Zhang
备注:

点击查看摘要

Abstract:Artificial Intelligence Generated Content (AIGC) techniques, represented by text-to-image generation, have led to a malicious use of deep forgeries, raising concerns about the trustworthiness of multimedia content. Adapting traditional forgery detection methods to diffusion models proves challenging. Thus, this paper proposes a forgery detection method explicitly designed for diffusion models called Trinity Detector. Trinity Detector incorporates coarse-grained text features through a CLIP encoder, coherently integrating them with fine-grained artifacts in the pixel domain for comprehensive multimodal detection. To heighten sensitivity to diffusion-generated image features, a Multi-spectral Channel Attention Fusion Unit (MCAF) is designed, extracting spectral inconsistencies through adaptive fusion of diverse frequency bands and further integrating spatial co-occurrence of the two modalities. Extensive experimentation validates that our Trinity Detector method outperforms several state-of-the-art methods, our performance is competitive across all datasets and up to 17.6% improvement in transferability in the diffusion datasets.

CV-30-标题: Weakly Supervised Training for Hologram Verification in Identity Documents ICDAR2024

链接: https://arxiv.org/abs/2404.17253
作者: Glen Pouliquen (1 and 2), Guillaume Chiron (1), Joseph Chazalon (2), Thierry Géraud (2), Ahmad Montaser Awal (1) ((1) IDnow AI & ML Center of Excellence, France, (2) EPITA Research Lab. (LRE), EPITA, France)
备注: Accepted at the International Conference on Document Analysis and Recognition (ICDAR 2024)

点击查看摘要

Abstract:We propose a method to remotely verify the authenticity of Optically Variable Devices (OVDs), often referred to as ``holograms’', in identity documents. Our method processes video clips captured with smartphones under common lighting conditions, and is evaluated on two public datasets: MIDV-HOLO and MIDV-2020. Thanks to a weakly-supervised training, we optimize a feature extraction and decision pipeline which achieves a new leading performance on MIDV-HOLO, while maintaining a high recall on documents from MIDV-2020 used as attack samples. It is also the first method, to date, to effectively address the photo replacement attack task, and can be trained on either genuine samples, attack samples, or both for increased performance. By enabling to verify OVD shapes and dynamics with very little supervision, this work opens the way towards the use of massive amounts of unlabeled data to build robust remote identity document verification systems on commodity smartphones. Code is available at this https URL

CV-31-标题: Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition

链接: https://arxiv.org/abs/2404.17252
作者: Houtan Ghaffari, Paul Devos
备注:

点击查看摘要

Abstract:Transferring the weights of a pre-trained model to assist another task has become a crucial part of modern deep learning, particularly in data-scarce scenarios. Pre-training refers to the initial step of training models outside the current task of interest, typically on another dataset. It can be done via supervised models using human-annotated datasets or self-supervised models trained on unlabeled datasets. In both cases, many pre-trained models are available to fine-tune for the task of interest. Interestingly, research has shown that pre-trained models from ImageNet can be helpful for audio tasks despite being trained on image datasets. Hence, it’s unclear whether in-domain models would be advantageous compared to competent out-domain models, such as convolutional neural networks from ImageNet. Our experiments will demonstrate the usefulness of in-domain models and datasets for bird species recognition by leveraging VICReg, a recent and powerful self-supervised method.

CV-32-标题: Camera Motion Estimation from RGB-D-Inertial Scene Flow CVPR2024

链接: https://arxiv.org/abs/2404.17251
作者: Samuel Cerezo, Javier Civera
备注: Accepted to CVPR2024 Workshop on Visual Odometry and Computer Vision Applications

点击查看摘要

Abstract:In this paper, we introduce a novel formulation for camera motion estimation that integrates RGB-D images and inertial data through scene flow. Our goal is to accurately estimate the camera motion in a rigid 3D environment, along with the state of the inertial measurement unit (IMU). Our proposed method offers the flexibility to operate as a multi-frame optimization or to marginalize older data, thus effectively utilizing past measurements. To assess the performance of our method, we conducted evaluations using both synthetic data from the ICL-NUIM dataset and real data sequences from the OpenLORIS-Scene dataset. Our results show that the fusion of these two sensors enhances the accuracy of camera motion estimation when compared to using only visual data.

CV-33-标题: Parameter Efficient Fine-tuning of Self-supervised ViTs without Catastrophic Forgetting CVPR

链接: https://arxiv.org/abs/2404.17245
作者: Reza Akbarian Bafghi, Nidhin Harilal, Claire Monteleoni, Maziar Raissi
备注: Accepted at eLVM Workshop, CVPR, 2024

点击查看摘要

Abstract:Artificial neural networks often suffer from catastrophic forgetting, where learning new concepts leads to a complete loss of previously acquired knowledge. We observe that this issue is particularly magnified in vision transformers (ViTs), where post-pre-training and fine-tuning on new tasks can significantly degrade the model’s original general abilities. For instance, a DINO ViT-Base/16 pre-trained on ImageNet-1k loses over 70% accuracy on ImageNet-1k after just 10 iterations of fine-tuning on CIFAR-100. Overcoming this stability-plasticity dilemma is crucial for enabling ViTs to continuously learn and adapt to new domains while preserving their initial knowledge. In this work, we study two new parameter-efficient fine-tuning strategies: (1)~Block Expansion, and (2) Low-rank adaptation (LoRA). Our experiments reveal that using either Block Expansion or LoRA on self-supervised pre-trained ViTs surpass fully fine-tuned ViTs in new domains while offering significantly greater parameter efficiency. Notably, we find that Block Expansion experiences only a minimal performance drop in the pre-training domain, thereby effectively mitigating catastrophic forgetting in pre-trained ViTs.

CV-34-标题: Binarizing Documents by Leveraging both Space and Frequency ICDAR2024

链接: https://arxiv.org/abs/2404.17243
作者: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
备注: Accepted at ICDAR2024

点击查看摘要

Abstract:Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have been proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.

CV-35-标题: ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion ECCV2024

链接: https://arxiv.org/abs/2404.17230
作者: Ziyue Zhang, Mingbao Lin, Rongrong Ji
备注: 12 pages, submitted to ECCV2024

点击查看摘要

Abstract:We introduce ObjectAdd, a training-free diffusion modification method to add user-expected objects into user-specified area. The motive of ObjectAdd stems from: first, describing everything in one prompt can be difficult, and second, users often need to add objects into the generated image. To accommodate with real world, our ObjectAdd maintains accurate image consistency after adding objects with technical innovations in: (1) embedding-level concatenation to ensure correct text embedding coalesce; (2) object-driven layout control with latent and attention injection to ensure objects accessing user-specified area; (3) prompted image inpainting in an attention refocusing & object expansion fashion to ensure rest of the image stays the same. With a text-prompted image, our ObjectAdd allows users to specify a box and an object, and achieves: (1) adding object inside the box area; (2) exact content outside the box area; (3) flawless fusion between the two areas

CV-36-标题: SAGHOG: Self-Supervised Autoencoder for Generating HOG Features for Writer Retrieval ICDAR2024

链接: https://arxiv.org/abs/2404.17221
作者: Marco Peer, Florian Kleber, Robert Sablatnig
备注: accepted for ICDAR2024

点击查看摘要

Abstract:This paper introduces SAGHOG, a self-supervised pretraining strategy for writer retrieval using HOG features of the binarized input image. Our preprocessing involves the application of the Segment Anything technique to extract handwriting from various datasets, ending up with about 24k documents, followed by training a vision transformer on reconstructing masked patches of the handwriting. SAGHOG is then finetuned by appending NetRVLAD as an encoding layer to the pretrained encoder. Evaluation of our approach on three historical datasets, Historical-WI, HisFrag20, and GRK-Papyri, demonstrates the effectiveness of SAGHOG for writer retrieval. Additionally, we provide ablation studies on our architecture and evaluate un- and supervised finetuning. Notably, on HisFrag20, SAGHOG outperforms related work with a mAP of 57.2 % - a margin of 11.6 % to the current state of the art, showcasing its robustness on challenging data, and is competitive on even small datasets, e.g. GRK-Papyri, where we achieve a Top-1 accuracy of 58.0%.

CV-37-标题: SLAM for Indoor Mapping of Wide Area Construction Environments

链接: https://arxiv.org/abs/2404.17215
作者: Vincent Ress, Wei Zhang, David Skuddis, Norbert Haala, Uwe Soergel
备注:

点击查看摘要

Abstract:Simultaneous localization and mapping (SLAM), i.e., the reconstruction of the environment represented by a (3D) map and the concurrent pose estimation, has made astonishing progress. Meanwhile, large scale applications aiming at the data collection in complex environments like factory halls or construction sites are becoming feasible. However, in contrast to small scale scenarios with building interiors separated to single rooms, shop floors or construction areas require measures at larger distances in potentially texture less areas under difficult illumination. Pose estimation is further aggravated since no GNSS measures are available as it is usual for such indoor applications. In our work, we realize data collection in a large factory hall by a robot system equipped with four stereo cameras as well as a 3D laser scanner. We apply our state-of-the-art LiDAR and visual SLAM approaches and discuss the respective pros and cons of the different sensor types for trajectory estimation and dense map generation in such an environment. Additionally, dense and accurate depth maps are generated by 3D Gaussian splatting, which we plan to use in the context of our project aiming on the automatic construction and site monitoring.

CV-38-标题: Scrutinizing Data from Sky: An Examination of Its Veracity in Area Based Traffic Contexts

链接: https://arxiv.org/abs/2404.17212
作者: Yawar Ali (1), Krishnan K N (1), Debashis Ray Sarkar (1), K. Ramachandra Rao (1), Niladri Chatterjee (1), Ashish Bhaskar (2) ((1) Indian Institute of Technology Delhi, New Delhi, India (2) Queensland University of Technology, Brisbane, Australia)
备注:

点击查看摘要

Abstract:Traffic data collection has been an overwhelming task for researchers as well as authorities over the years. With the advancement in technology and introduction of various tools for processing and extracting traffic data the task has been made significantly convenient. Data from Sky (DFS) is one such tool, based on image processing and artificial intelligence (AI), that provides output for macroscopic as well as microscopic variables of the traffic streams. The company claims to provide 98 to 100 percent accuracy on the data exported using DFS tool. The tool is widely used in developed countries where the traffic is homogenous and has lane-based movements. In this study, authors have checked the veracity of DFS tool in heterogenous and area-based traffic movement that is prevailing in most developing countries. The validation is done using various methods using Classified Volume Count (CVC), Space Mean Speeds (SMS) of individual vehicle classes and microscopic trajectory of probe vehicle to verify DFS claim. The error for CVCs for each vehicle class present in the traffic stream is estimated. Mean Absolute Percentage Error (MAPE) values are calculated for average speeds of each vehicle class between manually and DFS extracted space mean speeds (SMSs), and the microscopic trajectories are validated using a GPS based tracker put on probe vehicles. The results are fairly accurate in the case of data taken from a bird eye view with least errors. The other configurations of data collection have some significant errors, that are majorly caused by the varied traffic composition, the view of camera angle, and the direction of traffic.

CV-39-标题: Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer

链接: https://arxiv.org/abs/2404.17205
作者: Xinpeng Li, Teng Wang, Jian Zhao, Shuyi Mao, Jinbao Wang, Feng Zheng, Xiaojiang Peng, Xuelong Li
备注:

点击查看摘要

Abstract:Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localize subjects by off-the-shelf detectors, then perform emotion classification through the late fusion of subject and context features. However, the complicated paradigm suffers from disjoint training stages and limited interaction between fine-grained subject-context elements. To address the challenge, we present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT), for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly leverage box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, we introduce DSCT to facilitate interactions between fine-grained subject-context cues in a decouple-then-fuse manner. The decoupled query token–subject queries and context queries–gradually intertwine across layers within DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameter numbers, achieving a 3.39% accuracy improvement and a 6.46% average precision gain on CAER-S and EMOTIC datasets, respectively.

CV-40-标题: Self-supervised visual learning in the low-data regime: a comparative evaluation

链接: https://arxiv.org/abs/2404.17202
作者: Sotirios Konstantakos, Despina Ioanna Chalkiadaki, Ioannis Mademlis, Yuki M. Asano, Efstratios Gavves, Georgios Th. Papadopoulos
备注:

点击查看摘要

Abstract:Self-Supervised Learning (SSL) is a valuable and robust training methodology for contemporary Deep Neural Networks (DNNs), enabling unsupervised pretraining on a pretext task' that does not require ground-truth labels/annotation. This allows efficient representation learning from massive amounts of unlabeled training data, which in turn leads to increased accuracy in a downstream task’ by exploiting supervised transfer learning. Despite the relatively straightforward conceptualization and applicability of SSL, it is not always feasible to collect and/or to utilize very large pretraining datasets, especially when it comes to real-world application settings. In particular, in cases of specialized and domain-specific application scenarios, it may not be achievable or practical to assemble a relevant image pretraining dataset in the order of millions of instances or it could be computationally infeasible to pretrain at this scale. This motivates an investigation on the effectiveness of common SSL pretext tasks, when the pretraining dataset is of relatively limited/constrained size. In this context, this work introduces a taxonomy of modern visual SSL methods, accompanied by detailed explanations and insights regarding the main categories of approaches, and, subsequently, conducts a thorough comparative experimental evaluation in the low-data regime, targeting to identify: a) what is learnt via low-data SSL pretraining, and b) how do different SSL categories behave in such training scenarios. Interestingly, for domain-specific downstream tasks, in-domain low-data SSL pretraining outperforms the common approach of large-scale pretraining on general datasets. Grounded on the obtained results, valuable insights are highlighted regarding the performance of each category of SSL methods, which in turn suggest straightforward future research directions in the field.

CV-41-标题: Few-shot Calligraphy Style Learning

链接: https://arxiv.org/abs/2404.17199
作者: Fangda Chen, Jiacheng Nie, Lichuan Jiang, Zhuoer Zeng
备注:

点击查看摘要

Abstract:We introduced “Presidifussion,” a novel approach to learning and replicating the unique style of calligraphy of President Xu, using a pretrained diffusion model adapted through a two-stage training process. Initially, our model is pretrained on a diverse dataset containing works from various calligraphers. This is followed by fine-tuning on a smaller, specialized dataset of President Xu’s calligraphy, comprising just under 200 images. Our method introduces innovative techniques of font image conditioning and stroke information conditioning, enabling the model to capture the intricate structural elements of Chinese characters. The effectiveness of our approach is demonstrated through a comparison with traditional methods like zi2zi and CalliGAN, with our model achieving comparable performance using significantly smaller datasets and reduced computational resources. This work not only presents a breakthrough in the digital preservation of calligraphic art but also sets a new standard for data-efficient generative modeling in the domain of cultural heritage digitization.

CV-42-标题: MCSDNet: Mesoscale Convective System Detection Network via Multi-scale Spatiotemporal Information

链接: https://arxiv.org/abs/2404.17186
作者: Jiajun Liang, Baoquan Zhang, Yunming Ye, Xutao Li, Chuyao Luo, Xukai Fu
备注:

点击查看摘要

Abstract:The accurate detection of Mesoscale Convective Systems (MCS) is crucial for meteorological monitoring due to their potential to cause significant destruction through severe weather phenomena such as hail, thunderstorms, and heavy rainfall. However, the existing methods for MCS detection mostly targets on single-frame detection, which just considers the static characteristics and ignores the temporal evolution in the life cycle of MCS. In this paper, we propose a novel encoder-decoder neural network for MCS detection(MCSDNet). MCSDNet has a simple architecture and is easy to expand. Different from the previous models, MCSDNet targets on multi-frames detection and leverages multi-scale spatiotemporal information for the detection of MCS regions in remote sensing imagery(RSI). As far as we know, it is the first work to utilize multi-scale spatiotemporal information to detect MCS regions. Firstly, we design a multi-scale spatiotemporal information module to extract multi-level semantic from different encoder levels, which makes our models can extract more detail spatiotemporal features. Secondly, a Spatiotemporal Mix Unit(STMU) is introduced to MCSDNet to capture both intra-frame features and inter-frame correlations, which is a scalable module and can be replaced by other spatiotemporal module, e.g., CNN, RNN, Transformer and our proposed Dual Spatiotemporal Attention(DSTA). This means that the future works about spatiotemporal modules can be easily integrated to our model. Finally, we present MCSRSI, the first publicly available dataset for multi-frames MCS detection based on visible channel images from the FY-4A satellite. We also conduct several experiments on MCSRSI and find that our proposed MCSDNet achieve the best performance on MCS detection task when comparing to other baseline methods.

CV-43-标题: Low-Rank Knowledge Decomposition for Medical Foundation Models CVPR2024

链接: https://arxiv.org/abs/2404.17184
作者: Yuhang Zhou, Haolin Li, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
备注: CVPR 2024

点击查看摘要

Abstract:The popularity of large-scale pre-training has promoted the development of medical foundation models. However, some studies have shown that although foundation models exhibit strong general feature extraction capabilities, their performance on specific tasks is still inferior to task-specific methods. In this paper, we explore a new perspective called ``Knowledge Decomposition’’ to improve the performance on specific medical tasks, which deconstruct the foundation model into multiple lightweight expert models, each dedicated to a particular task, with the goal of improving specialization while concurrently mitigating resource expenditure. To accomplish the above objective, we design a novel framework named Low-Rank Knowledge Decomposition (LoRKD), which explicitly separates graidents by incorporating low-rank expert modules and the efficient knowledge separation convolution. Extensive experimental results demonstrate that the decomposed models perform well in terms of performance and transferability, even surpassing the original foundation models.

CV-44-标题: MovieChat: Question-aware Sparse Memory for Long Video Question Answering

链接: https://arxiv.org/abs/2404.17176
作者: Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, Gaoang Wang
备注:

点击查看摘要

Abstract:Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, and they only perform well on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections are significantly increased, posing additional challenges.Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. We lift pre-trained multi-modal large language models for understanding long videos without incorporating additional trainable temporal modules, employing a zero-shot approach. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long video, 2K temporal grounding labels, and 14K manual annotations for validation of the effectiveness of our method. The code along with the dataset can be accessed via the following this https URL.

CV-45-标题: Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification

链接: https://arxiv.org/abs/2404.17173
作者: Yanbiao Ma, Licheng Jiao, Fang Liu, Lingling Li, Shuyuan Yang, Xu Liu
备注:

点击查看摘要

Abstract:In semi-supervised learning, methods that rely on confidence learning to generate pseudo-labels have been widely proposed. However, increasing research finds that when faced with noisy and biased data, the model’s representation network is more reliable than the classification network. Additionally, label generation methods based on model predictions often show poor adaptability across different datasets, necessitating customization of the classification network. Therefore, we propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels. We also introduce an adaptive method for selecting hyperparameters in HDL, enhancing its versatility. Moreover, HDL can be combined with general image encoders (e.g., CLIP) to serve as a fundamental data processing module. We extract embeddings from datasets with class-balanced and long-tailed distributions using pre-trained semi-supervised models. Subsequently, samples are re-labeled using HDL, and the re-labeled samples are used to further train the semi-supervised models. Experiments demonstrate improved model performance, validating the motivation that representation networks are more reliable than classifiers or predictors. Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.

CV-46-标题: S-IQA Image Quality Assessment With Compressive Sampling

链接: https://arxiv.org/abs/2404.17170
作者: Ronghua Liao, Chen Hui, Lang Yuan, Feng Jiang
备注:

点击查看摘要

Abstract:No-Reference Image Quality Assessment (IQA) aims at estimating image quality in accordance with subjective human perception. However, most existing NR-IQA methods focus on exploring increasingly complex networks or components to improve the final performance. Such practice imposes great limitations and complexity on IQA methods, especially when they are applied to high-resolution (HR) images in the real world. Actually, most images own high spatial redundancy, especially for those HR data. To further exploit the characteristic and alleviate the issue above, we propose a new framework for Image Quality Assessment with compressive Sampling (dubbed S-IQA), which consists of three components: (1) The Flexible Sampling Module (FSM) samples the image to obtain measurements at an arbitrary ratio. (2) Vision Transformer with the Adaptive Embedding Module (AEM) makes measurements of uniform size and extracts deep features (3) Dual Branch (DB) allocates weight for every patch and predicts the final quality score. Experiments show that our proposed S-IQA achieves state-of-the-art result on various datasets with less data usage.

CV-47-标题: Phase-aggregated Dual-branch Network for Efficient Fingerprint Dense Registration

链接: https://arxiv.org/abs/2404.17159
作者: Xiongjun Guan, Jianjiang Feng, Jie Zhou
备注:

点击查看摘要

Abstract:Fingerprint dense registration aims to finely align fingerprint pairs at the pixel level, thereby reducing intra-class differences caused by distortion. Unfortunately, traditional methods exhibited subpar performance when dealing with low-quality fingerprints while suffering from slow inference speed. Although deep learning based approaches shows significant improvement in these aspects, their registration accuracy is still unsatisfactory. In this paper, we propose a Phase-aggregated Dual-branch Registration Network (PDRNet) to aggregate the advantages of both types of methods. A dual-branch structure with multi-stage interactions is introduced between correlation information at high resolution and texture feature at low resolution, to perceive local fine differences while ensuring global stability. Extensive experiments are conducted on more comprehensive databases compared to previous works. Experimental results demonstrate that our method reaches the state-of-the-art registration performance in terms of accuracy and robustness, while maintaining considerable competitiveness in efficiency.

CV-48-标题: CSCO: Connectivity Search of Convolutional Operators CVPR

链接: https://arxiv.org/abs/2404.17152
作者: Tunhou Zhang, Shiyu Li, Hsin-Pai Cheng, Feng Yan, Hai Li, Yiran Chen
备注: To appear on Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2024)

点击查看摘要

Abstract:Exploring dense connectivity of convolutional operators establishes critical “synapses” to communicate feature vectors from different levels and enriches the set of transformations on Computer Vision applications. Yet, even with heavy-machinery approaches such as Neural Architecture Search (NAS), discovering effective connectivity patterns requires tremendous efforts due to either constrained connectivity design space or a sub-optimal exploration process induced by an unconstrained search space. In this paper, we propose CSCO, a novel paradigm that fabricates effective connectivity of convolutional operators with minimal utilization of existing design motifs and further utilizes the discovered wiring to construct high-performing ConvNets. CSCO guides the exploration via a neural predictor as a surrogate of the ground-truth performance. We introduce Graph Isomorphism as data augmentation to improve sample efficiency and propose a Metropolis-Hastings Evolutionary Search (MH-ES) to evade locally optimal architectures and advance search quality. Results on ImageNet show ~0.6% performance improvement over hand-crafted and NAS-crafted dense connectivity. Our code is publicly available.

CV-49-标题: MorphText: Deep Morphology Regularized Arbitrary-shape Scene Text Detection

链接: https://arxiv.org/abs/2404.17151
作者: Chengpei Xu, Wenjing Jia, Ruomei Wang, Xiaonan Luo, Xiangjian He
备注: Accepted by Transaction on Multimedia

点击查看摘要

Abstract:Bottom-up text detection methods play an important role in arbitrary-shape scene text detection but there are two restrictions preventing them from achieving their great potential, i.e., 1) the accumulation of false text segment detections, which affects subsequent processing, and 2) the difficulty of building reliable connections between text segments. Targeting these two problems, we propose a novel approach, named ``MorphText", to capture the regularity of texts by embedding deep morphology for arbitrary-shape text detection. Towards this end, two deep morphological modules are designed to regularize text segments and determine the linkage between them. First, a Deep Morphological Opening (DMOP) module is constructed to remove false text segment detections generated in the feature extraction process. Then, a Deep Morphological Closing (DMCL) module is proposed to allow text instances of various shapes to stretch their morphology along their most significant orientation while deriving their connections. Extensive experiments conducted on four challenging benchmark datasets (CTW1500, Total-Text, MSRA-TD500 and ICDAR2017) demonstrate that our proposed MorphText outperforms both top-down and bottom-up state-of-the-art arbitrary-shape scene text detection approaches.

CV-50-标题: Pose-Specific 3D Fingerprint Unfolding

链接: https://arxiv.org/abs/2404.17149
作者: Xiongjun Guan, Jianjiang Feng, Jie Zhou
备注:

点击查看摘要

Abstract:In order to make 3D fingerprints compatible with traditional 2D flat fingerprints, a common practice is to unfold the 3D fingerprint into a 2D rolled fingerprint, which is then matched with the flat fingerprints by traditional 2D fingerprint recognition algorithms. The problem with this method is that there may be large elastic deformation between the unfolded rolled fingerprint and flat fingerprint, which affects the recognition rate. In this paper, we propose a pose-specific 3D fingerprint unfolding algorithm to unfold the 3D fingerprint using the same pose as the flat fingerprint. Our experiments show that the proposed unfolding algorithm improves the compatibility between 3D fingerprint and flat fingerprint and thus leads to higher genuine matching scores.

CV-51-标题: Direct Regression of Distortion Field from a Single Fingerprint Image

链接: https://arxiv.org/abs/2404.17148
作者: Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou
备注:

点击查看摘要

Abstract:Skin distortion is a long standing challenge in fingerprint matching, which causes false non-matches. Previous studies have shown that the recognition rate can be improved by estimating the distortion field from a distorted fingerprint and then rectifying it into a normal fingerprint. However, existing rectification methods are based on principal component representation of distortion fields, which is not accurate and are very sensitive to finger pose. In this paper, we propose a rectification method where a self-reference based network is utilized to directly estimate the dense distortion field of distorted fingerprint instead of its low dimensional representation. This method can output accurate distortion fields of distorted fingerprints with various finger poses. Considering the limited number and variety of distorted fingerprints in the existing public dataset, we collected more distorted fingerprints with diverse finger poses and distortion patterns as a new database. Experimental results demonstrate that our proposed method achieves the state-of-the-art rectification performance in terms of distortion field estimation and rectified fingerprint matching.

CV-52-标题: On the Federated Learning Framework for Cooperative Perception

链接: https://arxiv.org/abs/2404.17147
作者: Zhenrong Zhang, Jianan Liu, Xi Zhou, Tao Huang, Qing-Long Han, Jingxin Liu, Hongbin Liu
备注:

点击查看摘要

Abstract:Cooperative perception is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for CP, termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.

CV-53-标题: Localization of Pallets on Shelves Using Horizontal Plane Projection of a 360-degree Image

链接: https://arxiv.org/abs/2404.17118
作者: Yasuyo Kita, Yudai Fujieda, Ichiro Matsuda, Nobuyuki Kita
备注:

点击查看摘要

Abstract:In this paper, we propose a method for calculating the three-dimensional (3D) position and orientation of a pallet placed on a shelf on the side of a forklift truck using a 360-degree camera. By using a 360-degree camera mounted on the forklift truck, it is possible to observe both the pallet at the side of the forklift and one several meters ahead. However, the pallet on the obtained image is observed with different distortion depending on its 3D position, so that it is difficult to extract the pallet from the image. To solve this problem, a method [1] has been proposed for detecting a pallet by projecting a 360-degree image on a vertical plane that coincides with the front of the shelf to calculate an image similar to the image seen from the front of the shelf. At the same time as the detection, the approximate position and orientation of the detected pallet can be obtained, but the accuracy is not sufficient for automatic control of the forklift truck. In this paper, we propose a method for accurately detecting the yaw angle, which is the angle of the front surface of the pallet in the horizontal plane, by projecting the 360-degree image on a horizontal plane including the boundary line of the front surface of the detected pallet. The position of the pallet is also determined by moving the vertical plane having the detected yaw angle back and forth, and finding the position at which the degree of coincidence between the projection image on the vertical plane and the actual size of the front surface of the pallet is maximized. Experiments using real images taken in a laboratory and an actual warehouse have confirmed that the proposed method can calculate the position and orientation of a pallet within a reasonable calculation time and with the accuracy necessary for inserting the fork into the hole in the front of the pallet.

CV-54-标题: Synthesizing Iris Images using Generative Adversarial Networks: Survey and Comparative Analysis

链接: https://arxiv.org/abs/2404.17105
作者: Shivangi Yadav, Arun Ross
备注:

点击查看摘要

Abstract:Biometric systems based on iris recognition are currently being used in border control applications and mobile devices. However, research in iris recognition is stymied by various factors such as limited datasets of bonafide irides and presentation attack instruments; restricted intra-class variations; and privacy concerns. Some of these issues can be mitigated by the use of synthetic iris data. In this paper, we present a comprehensive review of state-of-the-art GAN-based synthetic iris image generation techniques, evaluating their strengths and limitations in producing realistic and useful iris images that can be used for both training and testing iris recognition systems and presentation attack detectors. In this regard, we first survey the various methods that have been used for synthetic iris generation and specifically consider generators based on StyleGAN, RaSGAN, CIT-GAN, iWarpGAN, StarGAN, etc. We then analyze the images generated by these models for realism, uniqueness, and biometric utility. This comprehensive analysis highlights the pros and cons of various GANs in the context of developing robust iris matchers and presentation attack detectors.

CV-55-标题: Dont Look at the Camera: Achieving Perceived Eye Contact

链接: https://arxiv.org/abs/2404.17104
作者: Alice Gao, Samyukta Jayakumar, Marcello Maniglia, Brian Curless, Ira Kemelmacher-Shlizerman, Aaron R. Seitz, Steven M. Seitz
备注:

点击查看摘要

Abstract:We consider the question of how to best achieve the perception of eye contact when a person is captured by camera and then rendered on a 2D display. For single subjects photographed by a camera, conventional wisdom tells us that looking directly into the camera achieves eye contact. Through empirical user studies, we show that it is instead preferable to \em look just below the camera lens. We quantitatively assess where subjects should direct their gaze relative to a camera lens to optimize the perception that they are making eye contact.

CV-56-标题: Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

链接: https://arxiv.org/abs/2404.17100
作者: Yuanyuan Liu, Yuxuan Huang, Shuyang Liu, Yibing Zhan, Zijing Chen, Zhe Chen
备注:

点击查看摘要

Abstract:In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these V-FER models cannot deal with unknown classes that are prevalent in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming at identifying not only known classes but also new, unknown human facial expressions not encountered during training. While existing approaches address open-set recognition by leveraging large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the nuanced and subtle human expression patterns required by the OV-FER task. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP’s ability to model video-based facial expression details effectively, thereby presenting a new CLIP-based OV-FER approach. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompt representations to complement the original CLIP textual prompts and enhance the textual representations of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, 3) a delicately designed open-set multi-task learning scheme that facilitates prompt learning and encourages interactions between the textual and visual prompting modules. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP’s performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin.

CV-57-标题: Defending Spiking Neural Networks against Adversarial Attacks through Image Purification ECAI2024

链接: https://arxiv.org/abs/2404.17092
作者: Weiran Chen, Qi Sun, Qi Xu
备注: 8 pages, 5 figures, ECAI2024 under review

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) aim to bridge the gap between neuroscience and machine learning by emulating the structure of the human nervous system. However, like convolutional neural networks, SNNs are vulnerable to adversarial attacks. To tackle the challenge, we propose a biologically inspired methodology to enhance the robustness of SNNs, drawing insights from the visual masking effect and filtering theory. First, an end-to-end SNN-based image purification model is proposed to defend against adversarial attacks, including a noise extraction network and a non-blind denoising network. The former network extracts noise features from noisy images, while the latter component employs a residual U-Net structure to reconstruct high-quality noisy images and generate clean images. Simultaneously, a multi-level firing SNN based on Squeeze-and-Excitation Network is introduced to improve the robustness of the classifier. Crucially, the proposed image purification network serves as a pre-processing module, avoiding modifications to classifiers. Unlike adversarial training, our method is highly flexible and can be seamlessly integrated with other defense strategies. Experimental results on various datasets demonstrate that the proposed methodology outperforms state-of-the-art baselines in terms of defense effectiveness, training time, and resource consumption.

CV-58-标题: WheelPose: Data Synthesis Techniques to Improve Pose Estimation Performance on Wheelchair Users

链接: https://arxiv.org/abs/2404.17063
作者: William Huang, Sam Ghahremani, Siyou Pei, Yang Zhang
备注: Published for ACM CHI 2024. For source files, see this https URL

点击查看摘要

Abstract:Existing pose estimation models perform poorly on wheelchair users due to a lack of representation in training data. We present a data synthesis pipeline to address this disparity in data collection and subsequently improve pose estimation performance for wheelchair users. Our configurable pipeline generates synthetic data of wheelchair users using motion capture data and motion generation outputs simulated in the Unity game engine. We validated our pipeline by conducting a human evaluation, investigating perceived realism, diversity, and an AI performance evaluation on a set of synthetic datasets from our pipeline that synthesized different backgrounds, models, and postures. We found our generated datasets were perceived as realistic by human evaluators, had more diversity than existing image datasets, and had improved person detection and pose estimation performance when fine-tuned on existing pose estimation models. Through this work, we hope to create a foothold for future efforts in tackling the inclusiveness of AI in a data-centric and human-centric manner with the data synthesis techniques demonstrated in this work. Finally, for future works to extend upon, we open source all code in this research and provide a fully configurable Unity Environment used to generate our datasets. In the case of any models we are unable to share due to redistribution and licensing policies, we provide detailed instructions on how to source and replace said models.

CV-59-标题: Nuclei-Location Based Point Set Registration of Multi-Stained Whole Slide Images

链接: https://arxiv.org/abs/2404.17041
作者: Adith Jeyasangar, Abdullah Alsalemi, Shan E Ahmed Raza
备注: 15 pages, 5 figures, Submitted to Medical Image Understanding and Analysis Conference 2024

点击查看摘要

Abstract:Whole Slide Images (WSIs) provide exceptional detail for studying tissue architecture at the cell level. To study tumour microenvironment (TME) with the context of various protein biomarkers and cell sub-types, analysis and registration of features using multi-stained WSIs is often required. Multi-stained WSI pairs normally suffer from rigid and non-rigid deformities in addition to slide artefacts and control tissue which present challenges at precise registration. Traditional registration methods mainly focus on global rigid/non-rigid registration but struggle with aligning slides with complex tissue deformations at the nuclei level. However, nuclei level non-rigid registration is essential for downstream tasks such as cell sub-type analysis in the context of protein biomarker signatures. This paper focuses on local level non-rigid registration using a nuclei-location based point set registration approach for aligning multi-stained WSIs. We exploit the spatial distribution of nuclei that is prominent and consistent (to a large level) across different stains to establish a spatial correspondence. We evaluate our approach using the HYRECO dataset consisting of 54 re-stained images of H&E and PHH3 image pairs. The approach can be extended to other IHC and IF stained WSIs considering a good nuclei detection algorithm is accessible. The performance of the model is tested against established registration algorithms and is shown to outperform the model for nuclei level registration.

CV-60-标题: Auto-Generating Weak Labels for Real & Synthetic Data to Improve Label-Scarce Medical Image Segmentation

链接: https://arxiv.org/abs/2404.17033
作者: Tanvi Deshpande, Eva Prakash, Elsie Gyang Ross, Curtis Langlotz, Andrew Ng, Jeya Maria Jose Valanarasu
备注: Accepted at MIDL 2024

点击查看摘要

Abstract:The high cost of creating pixel-by-pixel gold-standard labels, limited expert availability, and presence of diverse tasks make it challenging to generate segmentation labels to train deep learning models for medical imaging tasks. In this work, we present a new approach to overcome the hurdle of costly medical image labeling by leveraging foundation models like Segment Anything Model (SAM) and its medical alternate MedSAM. Our pipeline has the ability to generate weak labels for any unlabeled medical image and subsequently use it to augment label-scarce datasets. We perform this by leveraging a model trained on a few gold-standard labels and using it to intelligently prompt MedSAM for weak label generation. This automation eliminates the manual prompting step in MedSAM, creating a streamlined process for generating labels for both real and synthetic images, regardless of quantity. We conduct experiments on label-scarce settings for multiple tasks pertaining to modalities ranging from ultrasound, dermatology, and X-rays to demonstrate the usefulness of our pipeline. The code is available at this https URL.

CV-61-标题: Motor Focus: Ego-Motion Prediction with All-Pixel Matching

链接: https://arxiv.org/abs/2404.17031
作者: Hao Wang, Jiayou Qin, Xiwen Chen, Ashish Bastola, John Suchanek, Zihao Gong, Abolfazl Razi
备注:

点击查看摘要

Abstract:Motion analysis plays a critical role in various applications, from virtual reality and augmented reality to assistive visual navigation. Traditional self-driving technologies, while advanced, typically do not translate directly to pedestrian applications due to their reliance on extensive sensor arrays and non-feasible computational frameworks. This highlights a significant gap in applying these solutions to human users since human navigation introduces unique challenges, including the unpredictable nature of human movement, limited processing capabilities of portable devices, and the need for directional responsiveness due to the limited perception range of humans. In this project, we introduce an image-only method that applies motion analysis using optical flow with ego-motion compensation to predict Motor Focus-where and how humans or machines focus their movement intentions. Meanwhile, this paper addresses the camera shaking issue in handheld and body-mounted devices which can severely degrade performance and accuracy, by applying a Gaussian aggregation to stabilize the predicted motor focus area and enhance the prediction accuracy of movement direction. This also provides a robust, real-time solution that adapts to the user’s immediate environment. Furthermore, in the experiments part, we show the qualitative analysis of motor focus estimation between the conventional dense optical flow-based method and the proposed method. In quantitative tests, we show the performance of the proposed method on a collected small dataset that is specialized for motor focus estimation tasks.

CV-62-标题: Dr-SAM: An End-to-End Framework for Vascular Segmentation Diameter Estimation and Anomaly Detection on Angiography Images

链接: https://arxiv.org/abs/2404.17029
作者: Vazgen Zohranyan, Vagner Navasardyan, Hayk Navasardyan, Jan Borggrefe, Shant Navasardyan
备注:

点击查看摘要

Abstract:Recent advancements in AI have significantly transformed medical imaging, particularly in angiography, by enhancing diagnostic precision and patient care. However existing works are limited in analyzing the aorta and iliac arteries, above all for vascular anomaly detection and characterization. To close this gap, we propose Dr-SAM, a comprehensive multi-stage framework for vessel segmentation, diameter estimation, and anomaly analysis aiming to examine the peripheral vessels through angiography images. For segmentation we introduce a customized positive/negative point selection mechanism applied on top of the Segment Anything Model (SAM), specifically for medical (Angiography) images. Then we propose a morphological approach to determine the vessel diameters followed by our histogram-driven anomaly detection approach. Moreover, we introduce a new benchmark dataset for the comprehensive analysis of peripheral vessel angiography images which we hope can boost the upcoming research in this direction leading to enhanced diagnostic precision and ultimately better health outcomes for individuals facing vascular issues.

CV-63-标题: PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

链接: https://arxiv.org/abs/2404.16994
作者: Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng
备注:

点击查看摘要

Abstract:Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that it is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or \nameofmethod in short. \nameofmethod achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at \urlthis https URL.

CV-64-标题: CriSp: Leveraging Tread Depth Maps for Enhanced Crime-Scene Shoeprint Matching

链接: https://arxiv.org/abs/2404.16972
作者: Samia Shafique, Shu Kong, Charless Fowlkes
备注:

点击查看摘要

Abstract:Shoeprints are a common type of evidence found at crime scenes and are used regularly in forensic investigations. However, existing methods cannot effectively employ deep learning techniques to match noisy and occluded crime-scene shoeprints to a shoe database due to a lack of training data. Moreover, all existing methods match crime-scene shoeprints to clean reference prints, yet our analysis shows matching to more informative tread depth maps yields better retrieval results. The matching task is further complicated by the necessity to identify similarities only in corresponding regions (heels, toes, etc) of prints and shoe treads. To overcome these challenges, we leverage shoe tread images from online retailers and utilize an off-the-shelf predictor to estimate depth maps and clean prints. Our method, named CriSp, matches crime-scene shoeprints to tread depth maps by training on this data. CriSp incorporates data augmentation to simulate crime-scene shoeprints, an encoder to learn spatially-aware features, and a masking module to ensure only visible regions of crime-scene prints affect retrieval results. To validate our approach, we introduce two validation sets by reprocessing existing datasets of crime-scene shoeprints and establish a benchmarking protocol for comparison. On this benchmark, CriSp significantly outperforms state-of-the-art methods in both automated shoeprint matching and image retrieval tailored to this task.

CV-65-标题: Constellation Dataset : Benchmarking High-Altitude Object Detection for an Urban Intersection

链接: https://arxiv.org/abs/2404.16944
作者: Mehmet Kerem Turkcan, Sanjeev Narasimhan, Chengbo Zang, Gyung Hyun Je, Bo Yu, Mahshid Ghasemi, Javad Ghaderi, Gil Zussman, Zoran Kostic
备注:

点击查看摘要

Abstract:We introduce Constellation, a dataset of 13K images suitable for research on detection of objects in dense urban streetscapes observed from high-elevation cameras, collected for a variety of temporal conditions. The dataset addresses the need for curated data to explore problems in small object detection exemplified by the limited pixel footprint of pedestrians observed tens of meters from above. It enables the testing of object detection models for variations in lighting, building shadows, weather, and scene dynamics. We evaluate contemporary object detection architectures on the dataset, observing that state-of-the-art methods have lower performance in detecting small pedestrians compared to vehicles, corresponding to a 10% difference in average precision (AP). Using structurally similar datasets for pretraining the models results in an increase of 1.8% mean AP (mAP). We further find that incorporating domain-specific data augmentations helps improve model performance. Using pseudo-labeled data, obtained from inference outcomes of the best-performing models, improves the performance of the models. Finally, comparing the models trained using the data collected in two different time intervals, we find a performance drift in models due to the changes in intersection conditions over time. The best-performing model achieves a pedestrian AP of 92.0% with 11.5 ms inference time on NVIDIA A100 GPUs, and an mAP of 95.4%.

CV-66-标题: Grad Queue : A probabilistic framework to reinforce sparse gradients

链接: https://arxiv.org/abs/2404.16917
作者: Irfan Mohammad Al Hasib
备注: 15 pages, 6 figures

点击查看摘要

Abstract:Informative gradients are often lost in large batch updates. We propose a robust mechanism to reinforce the sparse components within a random batch of data points. A finite queue of online gradients is used to determine their expected instantaneous statistics. We propose a function to measure the scarcity of incoming gradients using these statistics and establish the theoretical ground of this mechanism. To minimize conflicting components within large mini-batches, samples are grouped with aligned objectives by clustering based on inherent feature space. Sparsity is measured for each centroid and weighted accordingly. A strong intuitive criterion to squeeze out redundant information from each cluster is the backbone of the system. It makes rare information indifferent to aggressive momentum also exhibits superior performance with larger mini-batch horizon. The effective length of the queue kept variable to follow the local loss pattern. The contribution of our method is to restore intra-mini-batch diversity at the same time widening the optimal batch boundary. Both of these collectively drive it deeper towards the minima. Our method has shown superior performance for CIFAR10, MNIST, and Reuters News category dataset compared to mini-batch gradient descent.

CV-67-标题: Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models

链接: https://arxiv.org/abs/2404.16897
作者: Shi-Yu Xia, Wenxuan Zhu, Xu Yang, Xin Geng
备注:

点击查看摘要

Abstract:In practice, we usually need to build variable-sized models adapting for diverse resource constraints in different application scenarios, where weight initialization is an important step prior to training. The Learngene framework, introduced recently, firstly learns one compact part termed as learngene from a large well-trained model, after which learngene is expanded to initialize variable-sized models. In this paper, we start from analysing the importance of guidance for the expansion of well-trained learngene layers, inspiring the design of a simple but highly effective Learngene approach termed SWS (Stage-wise Weight Sharing), where both learngene layers and their learning process critically contribute to providing knowledge and guidance for initializing models at varying scales. Specifically, to learn learngene layers, we build an auxiliary model comprising multiple stages where the layer weights in each stage are shared, after which we train it through distillation. Subsequently, we expand these learngene layers containing stage information at their corresponding stage to initialize models of variable depths. Extensive experiments on ImageNet-1K demonstrate that SWS achieves consistent better performance compared to many models trained from scratch, while reducing around 6.6x total training costs. In some cases, SWS performs better only after 1 epoch tuning. When initializing variable-sized models adapting for different resource constraints, SWS achieves better results while reducing around 20x parameters stored to initialize these models and around 10x pre-training costs, in contrast to the pre-training and fine-tuning approach.

CV-68-标题: Adapting an Artificial Intelligence Sexually Transmitted Diseases Symptom Checker Tool for Mpox Detection: The HeHealth Experience

链接: https://arxiv.org/abs/2404.16885
作者: Rayner Kay Jin Tan, Dilruk Perera, Salomi Arasaratnam, Yudara Kularathne
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Artificial Intelligence applications have shown promise in the management of pandemics and have been widely used to assist the identification, classification, and diagnosis of medical images. In response to the global outbreak of Monkeypox (Mpox), the this http URL team leveraged an existing tool to screen for sexually transmitted diseases to develop a digital screening test for symptomatic Mpox through AI approaches. Prior to the global outbreak of Mpox, the team developed a smartphone app, where app users can use their own smartphone cameras to take pictures of their own penises to screen for symptomatic STD. The AI model was initially developed using 5000 cases and use a modified convolutional neural network to output prediction scores across visually diagnosable penis pathologies including Syphilis, Herpes Simplex Virus, and Human Papilloma Virus. From June 2022 to October 2022, a total of about 22,000 users downloaded the HeHealth app, and about 21,000 images have been analyzed using HeHealth AI technology. We then engaged in formative research, stakeholder engagement, rapid consolidation images, a validation study, and implementation of the tool from July 2022. From July 2022 to October 2022, a total of 1000 Mpox related images had been used to train the Mpox symptom checker tool. Our digital symptom checker tool showed accuracy of 87% to rule in Mpox and 90% to rule out symptomatic Mpox. Several hurdles identified included issues of data privacy and security for app users, initial lack of data to train the AI tool, and the potential generalizability of input data. We offer several suggestions to help others get started on similar projects in emergency situations, including engaging a wide range of stakeholders, having a multidisciplinary team, prioritizing pragmatism, as well as the concept that big data in fact is made up of small data.

CV-69-标题: ThermoPore: Predicting Part Porosity Based on Thermal Images Using Deep Learning

链接: https://arxiv.org/abs/2404.16882
作者: Peter Myung-Won Pak, Francis Ogoke, Andrew Polonsky, Anthony Garland, Dan S. Bolintineanu, Dan R. Moser, Michael J. Heiden, Amir Barati Farimani
备注:

点击查看摘要

Abstract:We present a deep learning approach for quantifying and localizing ex-situ porosity within Laser Powder Bed Fusion fabricated samples utilizing in-situ thermal image monitoring data. Our goal is to build the real time porosity map of parts based on thermal images acquired during the build. The quantification task builds upon the established Convolutional Neural Network model architecture to predict pore count and the localization task leverages the spatial and temporal attention mechanisms of the novel Video Vision Transformer model to indicate areas of expected porosity. Our model for porosity quantification achieved a R^2 score of 0.57 and our model for porosity localization produced an average IoU score of 0.32 and a maximum of 1.0. This work is setting the foundations of part porosity “Digital Twins” based on additive manufacturing monitoring data and can be applied downstream to reduce time-intensive post-inspection and testing activities during part qualification and certification. In addition, we seek to accelerate the acquisition of crucial insights normally only available through ex-situ part evaluation by means of machine learning analysis of in-situ process monitoring data.

CV-70-标题: HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections

链接: https://arxiv.org/abs/2404.16845
作者: Chen Dudai, Morris Alper, Hana Bezalel, Rana Hanocka, Itai Lang, Hadar Averbuch-Elor
备注: Eurographics 2024. Project page: this https URL

点击查看摘要

Abstract:Internet image collections containing photos captured by crowds of photographers show promise for enabling digital exploration of large-scale tourist landmarks. However, prior works focus primarily on geometric reconstruction and visualization, neglecting the key role of language in providing a semantic interface for navigation and fine-grained understanding. In constrained 3D domains, recent methods have leveraged vision-and-language models as a strong prior of 2D visual semantics. While these models display an excellent understanding of broad visual semantics, they struggle with unconstrained photo collections depicting such tourist landmarks, as they lack expert knowledge of the architectural domain. In this work, we present a localization system that connects neural representations of scenes depicting large-scale landmarks with text describing a semantic region within the scene, by harnessing the power of SOTA vision-and-language models with adaptations for understanding landmark scene semantics. To bolster such models with fine-grained knowledge, we leverage large-scale Internet data containing images of similar landmarks along with weakly-related textual information. Our approach is built upon the premise that images physically grounded in space can provide a powerful supervision signal for localizing new concepts, whose semantics may be unlocked from Internet textual metadata with large language models. We use correspondences between views of scenes to bootstrap spatial understanding of these semantics, providing guidance for 3D-compatible segmentation that ultimately lifts to a volumetric scene representation. Our results show that HaLo-NeRF can accurately localize a variety of semantic concepts related to architectural landmarks, surpassing the results of other 3D models as well as strong 2D segmentation baselines. Our project page is at this https URL.

CV-71-标题: Sugarcane Health Monitoring With Satellite Spectroscopy and Machine Learning: A Review

链接: https://arxiv.org/abs/2404.16844
作者: Ethan Kane Waters, Carla Chia-Ming Chen, Mostafa Rahimi Azghadi
备注: 22 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Research into large-scale crop monitoring has flourished due to increased accessibility to satellite imagery. This review delves into previously unexplored and under-explored areas in sugarcane health monitoring and disease/pest detection using satellite-based spectroscopy and Machine Learning (ML). It discusses key considerations in system development, including relevant satellites, vegetation indices, ML methods, factors influencing sugarcane reflectance, optimal growth conditions, common diseases, and traditional detection methods. Many studies highlight how factors like crop age, soil type, viewing angle, water content, recent weather patterns, and sugarcane variety can impact spectral reflectance, affecting the accuracy of health assessments via spectroscopy. However, these variables have not been fully considered in the literature. In addition, the current literature lacks comprehensive comparisons between ML techniques and vegetation indices. We address these gaps in this review. We discuss that, while current findings suggest the potential for an ML-driven satellite spectroscopy system for monitoring sugarcane health, further research is essential. This paper offers a comprehensive analysis of previous research to aid in unlocking this potential and advancing the development of an effective sugarcane health monitoring system using satellite technology.

CV-72-标题: Leaf-Based Plant Disease Detection and Explainable AI

链接: https://arxiv.org/abs/2404.16833
作者: Saurav Sagar, Mohammed Javed, David S Doermann
备注: To appear in a Journal/Conference

点击查看摘要

Abstract:The agricultural sector plays an essential role in the economic growth of a country. Specifically, in an Indian context, it is the critical source of livelihood for millions of people living in rural areas. Plant Disease is one of the significant factors affecting the agricultural sector. Plants get infected with diseases for various reasons, including synthetic fertilizers, archaic practices, environmental conditions, etc., which impact the farm yield and subsequently hinder the economy. To address this issue, researchers have explored many applications based on AI and Machine Learning techniques to detect plant diseases. This research survey provides a comprehensive understanding of common plant leaf diseases, evaluates traditional and deep learning techniques for disease detection, and summarizes available datasets. It also explores Explainable AI (XAI) to enhance the interpretability of deep learning models’ decisions for end-users. By consolidating this knowledge, the survey offers valuable insights to researchers, practitioners, and stakeholders in the agricultural sector, fostering the development of efficient and transparent solutions for combating plant diseases and promoting sustainable agricultural practices.

CV-73-标题: One-Shot Image Restoration

链接: https://arxiv.org/abs/2404.17426
作者: Deborah Pereg
备注: arXiv admin note: text overlap with arXiv:2209.14267

点击查看摘要

Abstract:Image restoration, or inverse problems in image processing, has long been an extensively studied topic. In recent years supervised learning approaches have become a popular strategy attempting to tackle this task. Unfortunately, most supervised learning-based methods are highly demanding in terms of computational resources and training data (sample complexity). In addition, trained models are sensitive to domain changes, such as varying acquisition systems, signal sampling rates, resolution and contrast. In this work, we try to answer a fundamental question: Can supervised learning models generalize well solely by learning from one image or even part of an image? If so, then what is the minimal amount of patches required to achieve acceptable generalization? To this end, we focus on an efficient patch-based learning framework that requires a single image input-output pair for training. Experimental results demonstrate the applicability, robustness and computational efficiency of the proposed approach for supervised image deblurring and super-resolution. Our results showcase significant improvement of learning models’ sample efficiency, generalization and time complexity, that can hopefully be leveraged for future real-time applications, and applied to other signals and modalities.

CV-74-标题: Simultaneous Tri-Modal Medical Image Fusion and Super-Resolution using Conditional Diffusion Model

链接: https://arxiv.org/abs/2404.17357
作者: Yushen Xu, Xiaosong Li, Yuchan Jie, Haishu Tan
备注:

点击查看摘要

Abstract:In clinical practice, tri-modal medical image fusion, compared to the existing dual-modal technique, can provide a more comprehensive view of the lesions, aiding physicians in evaluating the disease’s shape, location, and biological activity. However, due to the limitations of imaging equipment and considerations for patient safety, the quality of medical images is usually limited, leading to sub-optimal fusion performance, and affecting the depth of image analysis by the physician. Thus, there is an urgent need for a technology that can both enhance image resolution and integrate multi-modal information. Although current image processing methods can effectively address image fusion and super-resolution individually, solving both problems synchronously remains extremely challenging. In this paper, we propose TFS-Diff, a simultaneously realize tri-modal medical image fusion and super-resolution model. Specially, TFS-Diff is based on the diffusion model generation of a random iterative denoising process. We also develop a simple objective function and the proposed fusion super-resolution loss, effectively evaluates the uncertainty in the fusion and ensures the stability of the optimization process. And the channel attention module is proposed to effectively integrate key information from different modalities for clinical diagnosis, avoiding information loss caused by multiple image processing. Extensive experiments on public Harvard datasets show that TFS-Diff significantly surpass the existing state-of-the-art methods in both quantitative and visual evaluations. The source code will be available at GitHub.

CV-75-标题: Optimizing Universal Lesion Segmentation: State Space Model-Guided Hierarchical Networks with Feature Importance Adjustment

链接: https://arxiv.org/abs/2404.17235
作者: Kazi Shahriar Sanjid, Md. Tanzim Hossain, Md. Shakib Shahariar Junayed, M. Monir Uddin
备注:

点击查看摘要

Abstract:Deep learning has revolutionized medical imaging by providing innovative solutions to complex healthcare challenges. Traditional models often struggle to dynamically adjust feature importance, resulting in suboptimal representation, particularly in tasks like semantic segmentation crucial for accurate structure delineation. Moreover, their static nature incurs high computational costs. To tackle these issues, we introduce Mamba-Ahnet, a novel integration of State Space Model (SSM) and Advanced Hierarchical Network (AHNet) within the MAMBA framework, specifically tailored for semantic segmentation in medical imaging.Mamba-Ahnet combines SSM’s feature extraction and comprehension with AHNet’s attention mechanisms and image reconstruction, aiming to enhance segmentation accuracy and robustness. By dissecting images into patches and refining feature comprehension through self-attention mechanisms, the approach significantly improves feature resolution. Integration of AHNet into the MAMBA framework further enhances segmentation performance by selectively amplifying informative regions and facilitating the learning of rich hierarchical representations. Evaluation on the Universal Lesion Segmentation dataset demonstrates superior performance compared to state-of-the-art techniques, with notable metrics such as a Dice similarity coefficient of approximately 98% and an Intersection over Union of about 83%. These results underscore the potential of our methodology to enhance diagnostic accuracy, treatment planning, and ultimately, patient outcomes in clinical practice. By addressing the limitations of traditional models and leveraging the power of deep learning, our approach represents a significant step forward in advancing medical imaging technology.

CV-76-标题: Calculation of Femur Caput Collum Diaphyseal angle for X-Rays images using Semantic Segmentation

链接: https://arxiv.org/abs/2404.17083
作者: Deepak Bhatia, Muhammad Abdullah, Anne Querfurth, Mahdi Mantash
备注:

点击查看摘要

Abstract:This paper investigates the use of deep learning approaches to estimate the femur caput-collum-diaphyseal (CCD) angle from X-ray images. The CCD angle is an important measurement in the diagnosis of hip problems, and correct prediction can help in the planning of surgical procedures. Manual measurement of this angle, on the other hand, can be time-intensive and vulnerable to inter-observer variability. In this paper, we present a deep-learning algorithm that can reliably estimate the femur CCD angle from X-ray images. To train and test the performance of our model, we employed an X-ray image dataset with associated femur CCD angle measurements. Furthermore, we built a prototype to display the resulting predictions and to allow the user to interact with the predictions. As this is happening in a sterile setting during surgery, we expanded our interface to the possibility of being used only by voice commands. Our results show that our deep learning model predicts the femur CCD angle on X-ray images with great accuracy, with a mean absolute error of 4.3 degrees on the left femur and 4.9 degrees on the right femur on the test dataset. Our results suggest that deep learning has the potential to give a more efficient and accurate technique for predicting the femur CCD angle, which might have substantial therapeutic implications for the diagnosis and management of hip problems.

CV-77-标题: Detection of Peri-Pancreatic Edema using Deep Learning and Radiomics Techniques

链接: https://arxiv.org/abs/2404.17064
作者: Ziliang Hong, Debesh Jha, Koushik Biswas, Zheyuan Zhang, Yury Velichko, Cemal Yazici, Temel Tirkes, Amir Borhani, Baris Turkbey, Alpay Medetalibeyoglu, Gorkem Durak, Ulas Bagci
备注:

点击查看摘要

Abstract:Identifying peri-pancreatic edema is a pivotal indicator for identifying disease progression and prognosis, emphasizing the critical need for accurate detection and assessment in pancreatitis diagnosis and management. This study \textitintroduces a novel CT dataset sourced from 255 patients with pancreatic diseases, featuring annotated pancreas segmentation masks and corresponding diagnostic labels for peri-pancreatic edema condition. With the novel dataset, we first evaluate the efficacy of the \textitLinTransUNet model, a linear Transformer based segmentation algorithm, to segment the pancreas accurately from CT imaging data. Then, we use segmented pancreas regions with two distinctive machine learning classifiers to identify existence of peri-pancreatic edema: deep learning-based models and a radiomics-based eXtreme Gradient Boosting (XGBoost). The LinTransUNet achieved promising results, with a dice coefficient of 80.85%, and mIoU of 68.73%. Among the nine benchmarked classification models for peri-pancreatic edema detection, \textitSwin-Tiny transformer model demonstrated the highest recall of 98.85 \pm 0.42 and precision of 98.38\pm 0.17 . Comparatively, the radiomics-based XGBoost model achieved an accuracy of 79.61\pm4.04 and recall of 91.05\pm3.28 , showcasing its potential as a supplementary diagnostic tool given its rapid processing speed and reduced training time. Our code is available \urlthis https URL.

信息检索

IR-0-标题: Towards Group-aware Search Success

链接: https://arxiv.org/abs/2404.17313
作者: Haolun Wu, Bhaskar Mitra, Nick Craswell
备注:

点击查看摘要

Abstract:Traditional measures of search success often overlook the varying information needs of different demographic groups. To address this gap, we introduce a novel metric, named Group-aware Search Success (GA-SS). GA-SS redefines search success to ensure that all demographic groups achieve satisfaction from search outcomes. We introduce a comprehensive mathematical framework to calculate GA-SS, incorporating both static and stochastic ranking policies and integrating user browsing models for a more accurate assessment. In addition, we have proposed Group-aware Most Popular Completion (gMPC) ranking model to account for demographic variances in user intent, aligning more closely with the diverse needs of all user groups. We empirically validate our metric and approach with two real-world datasets: one focusing on query auto-completion and the other on movie recommendations, where the results highlight the impact of stochasticity and the complex interplay among various search success metrics. Our findings advocate for a more inclusive approach in measuring search success, as well as inspiring future investigations into the quality of service of search.

IR-1-标题: ExcluIR: Exclusionary Neural Information Retrieval

链接: https://arxiv.org/abs/2404.17288
作者: Wenhao Zhang, Mengqi Zhang, Shiguang Wu, Jiahuan Pei, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, Pengjie Ren
备注:

点击查看摘要

Abstract:Exclusion is an important and universal linguistic skill that humans use to express what they do not want. However, in information retrieval community, there is little research on exclusionary retrieval, where users express what they do not want in their queries. In this work, we investigate the scenario of exclusionary retrieval in document retrieval for the first time. We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The evaluation benchmark includes 3,452 high-quality exclusionary queries, each of which has been manually annotated. The training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document. We conduct detailed experiments and analyses, obtaining three main observations: (1) Existing retrieval models with different architectures struggle to effectively comprehend exclusionary queries; (2) Although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; (3) Generative retrieval models have a natural advantage in handling exclusionary queries. To facilitate future research on exclusionary retrieval, we share the benchmark and evaluation scripts on \urlthis https URL.

IR-2-标题: TruthSR: Trustworthy Sequential Recommender Systems via User-generated Multimodal Content

链接: https://arxiv.org/abs/2404.17238
作者: Meng Yan, Haibin Huang, Ying Liu, Juan Zhao, Xiyue Gao, Cai Xu, Ziyu Guan, Wei Zhao
备注:

点击查看摘要

Abstract:Sequential recommender systems explore users’ preferences and behavioral patterns from their historically generated data. Recently, researchers aim to improve sequential recommendation by utilizing massive user-generated multi-modal content, such as reviews, images, etc. This content often contains inevitable noise. Some studies attempt to reduce noise interference by suppressing cross-modal inconsistent information. However, they could potentially constrain the capturing of personalized user preferences. In addition, it is almost impossible to entirely eliminate noise in diverse user-generated multi-modal content. To solve these problems, we propose a trustworthy sequential recommendation method via noisy user-generated multi-modal content. Specifically, we explicitly capture the consistency and complementarity of user-generated multi-modal content to mitigate noise interference. We also achieve the modeling of the user’s multi-modal sequential preferences. In addition, we design a trustworthy decision mechanism that integrates subjective user perspective and objective item perspective to dynamically evaluate the uncertainty of prediction results. Experimental evaluation on four widely-used datasets demonstrates the superior performance of our model compared to state-of-the-art methods. The code is released at this https URL.

IR-3-标题: Rank-Preference Consistency as the Appropriate Metric for Recommender Systems

链接: https://arxiv.org/abs/2404.17097
作者: Tung Nguyen, Jeffrey Uhlmann
备注:

点击查看摘要

Abstract:In this paper we argue that conventional unitary-invariant measures of recommender system (RS) performance based on measuring differences between predicted ratings and actual user ratings fail to assess fundamental RS properties. More specifically, posing the optimization problem as one of predicting exact user ratings provides only an indirect suboptimal approximation for what RS applications typically need, which is an ability to accurately predict user preferences. We argue that scalar measures such as RMSE and MAE with respect to differences between actual and predicted ratings are only proxies for measuring RS ability to accurately estimate user preferences. We propose what we consider to be a measure that is more fundamentally appropriate for assessing RS performance, rank-preference consistency, which simply counts the number of prediction pairs that are inconsistent with the user’s expressed product preferences. For example, if an RS predicts the user will prefer product A over product B, but the user’s withheld ratings indicate s/he prefers product B over A, then rank-preference consistency has been violated. Our test results conclusively demonstrate that methods tailored to optimize arbitrary measures such as RMSE are not generally effective at accurately predicting user preferences. Thus, we conclude that conventional methods used for assessing RS performance are arbitrary and misleading.

人工智能

AI-0-标题: Enhancing Legal Compliance and Regulation Analysis with Large Language Models

链接: https://arxiv.org/abs/2404.17522
作者: Shabnam Hassani
备注: to be published in 32nd IEEE International Requirements Engineering 2024 Conference (RE’24) - Doctoral Symposium. arXiv admin note: text overlap with arXiv:2404.14356

点击查看摘要

Abstract:This research explores the application of Large Language Models (LLMs) for automating the extraction of requirement-related legal content in the food safety domain and checking legal compliance of regulatory artifacts. With Industry 4.0 revolutionizing the food industry and with the General Data Protection Regulation (GDPR) reshaping privacy policies and data processing agreements, there is a growing gap between regulatory analysis and recent technological advancements. This study aims to bridge this gap by leveraging LLMs, namely BERT and GPT models, to accurately classify legal provisions and automate compliance checks. Our findings demonstrate promising results, indicating LLMs’ significant potential to enhance legal compliance and regulatory analysis efficiency, notably by reducing manual workload and improving accuracy within reasonable time and financial constraints.

AI-1-标题: " ChatGPT Is Here to Help Not to Replace Anybody" – An Evaluation of Students Opinions On Integrating ChatGPT In CS Courses

链接: https://arxiv.org/abs/2404.17443
作者: Bruno Pereira Cipriano, Pedro Alves
备注: Author’s version: this is a paper under revision

点击查看摘要

Abstract:Large Language Models (LLMs) like GPT and Bard are capable of producing code based on textual descriptions, with remarkable efficacy. Such technology will have profound implications for computing education, raising concerns about cheating, excessive dependence, and a decline in computational thinking skills, among others. There has been extensive research on how teachers should handle this challenge but it is also important to understand how students feel about this paradigm shift. In this research, 52 first-year CS students were surveyed in order to assess their views on technologies with code-generation capabilities, both from academic and professional perspectives. Our findings indicate that while students generally favor the academic use of GPT, they don’t over rely on it, only mildly asking for its help. Although most students benefit from GPT, some struggle to use it effectively, urging the need for specific GPT training. Opinions on GPT’s impact on their professional lives vary, but there is a consensus on its importance in academic practice.

AI-2-标题: Real-World Deployment of a Hierarchical Uncertainty-Aware Collaborative Multi agent Planning System ICRA

链接: https://arxiv.org/abs/2404.17438
作者: Martina Stadler Kurtz, Samuel Prentice, Yasmin Veys, Long Quang, Carlos Nieto-Granda, Michael Novitzky, Ethan Stump, Nicholas Roy
备注: Accepted to the IEEE ICRA Workshop on Field Robotics 2024

点击查看摘要

Abstract:We would like to enable a collaborative multiagent team to navigate at long length scales and under uncertainty in real-world environments. In practice, planning complexity scales with the number of agents in the team, with the length scale of the environment, and with environmental uncertainty. Enabling tractable planning requires developing abstract models that can represent complex, high-quality plans. However, such models often abstract away information needed to generate directly-executable plans for real-world agents in real-world environments, as planning in such detail, especially in the presence of real-world uncertainty, would be computationally intractable. In this paper, we describe the deployment of a planning system that used a hierarchy of planners to execute collaborative multiagent navigation tasks in real-world, unknown environments. By developing a planning system that was robust to failures at every level of the planning hierarchy, we enabled the team to complete collaborative navigation tasks, even in the presence of imperfect planning abstractions and real-world uncertainty. We deployed our approach on a Clearpath Husky-Jackal team navigating in a structured outdoor environment, and demonstrated that the system enabled the agents to successfully execute collaborative plans.

AI-3-标题: How Could AI Support Design Education? A Study Across Fields Fuels Situating Analytics

链接: https://arxiv.org/abs/2404.17390
作者: Ajit Jain, Andruid Kerne, Hannah Fowler, Jinsil Seo, Galen Newman, Nic Lupfer, Aaron Perrine
备注: 31 pages, 3 figures, Submitted to ACM

点击查看摘要

Abstract:We use the process and findings from a case study of design educators’ practices of assessment and feedback to fuel theorizing about how to make AI useful in service of human experience. We build on Suchman’s theory of situated actions. We perform a qualitative study of 11 educators in 5 fields, who teach design processes situated in project-based learning contexts. Through qualitative data gathering and analysis, we derive codes: design process; assessment and feedback challenges; and computational support. We twice invoke creative cognition’s family resemblance principle. First, to explain how design instructors already use assessment rubrics and second, to explain the analogous role for design creativity analytics: no particular trait is necessary or sufficient; each only tends to indicate good design work. Human teachers remain essential. We develop a set of situated design creativity analytics–Fluency, Flexibility, Visual Consistency, Multiscale Organization, and Legible Contrast–to support instructors’ efforts, by providing on-demand, learning objectives-based assessment and feedback to students. We theorize a methodology, which we call situating analytics, firstly because making AI support living human activity depends on aligning what analytics measure with situated practices. Further, we realize that analytics can become most significant to users by situating them through interfaces that integrate them into the material contexts of their use. Here, this means situating design creativity analytics into actual design environments. Through the case study, we identify situating analytics as a methodology for explaining analytics to users, because the iterative process of alignment with practice has the potential to enable data scientists to derive analytics that make sense as part of and support situated human experiences.

AI-4-标题: Certified MaxSAT Preprocessing

链接: https://arxiv.org/abs/2404.17316
作者: Hannes Ihalainen, Andy Oertel, Yong Kiam Tan, Jeremias Berg, Matti Järvisalo, Jakob Nordström
备注:

点击查看摘要

Abstract:Building on the progress in Boolean satisfiability (SAT) solving over the last decades, maximum satisfiability (MaxSAT) has become a viable approach for solving NP-hard optimization problems, but ensuring correctness of MaxSAT solvers has remained an important concern. For SAT, this is largely a solved problem thanks to the use of proof logging, meaning that solvers emit machine-verifiable proofs of (un)satisfiability to certify correctness. However, for MaxSAT, proof logging solvers have started being developed only very recently. Moreover, these nascent efforts have only targeted the core solving process, ignoring the preprocessing phase where input problem instances can be substantially reformulated before being passed on to the solver proper. In this work, we demonstrate how pseudo-Boolean proof logging can be used to certify the correctness of a wide range of modern MaxSAT preprocessing techniques. By combining and extending the VeriPB and CakePB tools, we provide formally verified, end-to-end proof checking that the input and preprocessed output MaxSAT problem instances have the same optimal value. An extensive evaluation on applied MaxSAT benchmarks shows that our approach is feasible in practice.

AI-5-标题: Enhancing Privacy and Security of Autonomous UAV Navigation

链接: https://arxiv.org/abs/2404.17225
作者: Vatsal Aggarwal, Arjun Ramesh Kaushik, Charanjit Jutla, Nalini Ratha
备注:

点击查看摘要

Abstract:Autonomous Unmanned Aerial Vehicles (UAVs) have become essential tools in defense, law enforcement, disaster response, and product delivery. These autonomous navigation systems require a wireless communication network, and of late are deep learning based. In critical scenarios such as border protection or disaster response, ensuring the secure navigation of autonomous UAVs is paramount. But, these autonomous UAVs are susceptible to adversarial attacks through the communication network or the deep learning models - eavesdropping / man-in-the-middle / membership inference / reconstruction. To address this susceptibility, we propose an innovative approach that combines Reinforcement Learning (RL) and Fully Homomorphic Encryption (FHE) for secure autonomous UAV navigation. This end-to-end secure framework is designed for real-time video feeds captured by UAV cameras and utilizes FHE to perform inference on encrypted input images. While FHE allows computations on encrypted data, certain computational operators are yet to be implemented. Convolutional neural networks, fully connected neural networks, activation functions and OpenAI Gym Library are meticulously adapted to the FHE domain to enable encrypted data processing. We demonstrate the efficacy of our proposed approach through extensive experimentation. Our proposed approach ensures security and privacy in autonomous UAV navigation with negligible loss in performance.

AI-6-标题: Human-Imperceptible Retrieval Poisoning Attacks in LLM -Powered Applications

链接: https://arxiv.org/abs/2404.17196
作者: Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, Yu Jiang
备注:

点击查看摘要

Abstract:Presently, with the assistance of advanced LLM application development frameworks, more and more LLM-powered applications can effortlessly augment the LLMs’ knowledge with external content using the retrieval augmented generation (RAG) technique. However, these frameworks’ designs do not have sufficient consideration of the risk of external content, thereby allowing attackers to undermine the applications developed with these frameworks. In this paper, we reveal a new threat to LLM-powered applications, termed retrieval poisoning, where attackers can guide the application to yield malicious responses during the RAG process. Specifically, through the analysis of LLM application frameworks, attackers can craft documents visually indistinguishable from benign ones. Despite the documents providing correct information, once they are used as reference sources for RAG, the application is misled into generating incorrect responses. Our preliminary experiments indicate that attackers can mislead LLMs with an 88.33% success rate, and achieve a 66.67% success rate in the real-world application, demonstrating the potential impact of retrieval poisoning.

AI-7-标题: Process Mining Embeddings: Learning Vector Representations for Petri Nets

链接: https://arxiv.org/abs/2404.17129
作者: Juan G. Colonna, Ahmed A. Fares, Márcio Duarte, Ricardo Sousa
备注:

点击查看摘要

Abstract:Process mining offers powerful techniques for discovering, analyzing, and enhancing real-world business processes. In this context, Petri nets provide an expressive means of modeling process behavior. However, directly analyzing and comparing intricate Petri net presents challenges. This study introduces PetriNet2Vec, a novel unsupervised methodology based on Natural Language Processing concepts inspired by Doc2Vec and designed to facilitate the effective comparison, clustering, and classification of process models represented as embedding vectors. These embedding vectors allow us to quantify similarities and relationships between different process models. Our methodology was experimentally validated using the PDC Dataset, featuring 96 diverse Petri net models. We performed cluster analysis, created UMAP visualizations, and trained a decision tree to provide compelling evidence for the capability of PetriNet2Vec to discern meaningful patterns and relationships among process models and their constituent tasks. Through a series of experiments, we demonstrated that PetriNet2Vec was capable of learning the structure of Petri nets, as well as the main properties used to simulate the process models of our dataset. Furthermore, our results showcase the utility of the learned embeddings in two crucial downstream tasks within process mining enhancement: process classification and process retrieval.

AI-8-标题: CLARE: Cognitive Load Assessment in REaltime with Multimodal Data

链接: https://arxiv.org/abs/2404.17098
作者: Anubhav Bhatti, Prithila Angkan, Behnam Behinaein, Zunayed Mahmud, Dirk Rodenburg, Heather Braund, P.James Mclellan, Aaron Ruberto, Geoffery Harrison, Daryl Wilson, Adam Szulewski, Dan Howes, Ali Etemad, Paul Hungler
备注: 12 pages, 10 figures, 6 tables

点击查看摘要

Abstract:We present a novel multimodal dataset for Cognitive Load Assessment in REaltime (CLARE). The dataset contains physiological and gaze data from 24 participants with self-reported cognitive load scores as ground-truth labels. The dataset consists of four modalities, namely, Electrocardiography (ECG), Electrodermal Activity (EDA), Electroencephalogram (EEG), and Gaze tracking. To map diverse levels of mental load on participants during experiments, each participant completed four nine-minutes sessions on a computer-based operator performance and mental workload task (the MATB-II software) with varying levels of complexity in one minute segments. During the experiment, participants reported their cognitive load every 10 seconds. For the dataset, we also provide benchmark binary classification results with machine learning and deep learning models on two different evaluation schemes, namely, 10-fold and leave-one-subject-out (LOSO) cross-validation. Benchmark results show that for 10-fold evaluation, the convolutional neural network (CNN) based deep learning model achieves the best classification performance with ECG, EDA, and Gaze. In contrast, for LOSO, the best performance is achieved by the deep learning model with ECG, EDA, and EEG.

AI-9-标题: CyNetDiff – A Python Library for Accelerated Implementation of Network Diffusion Models

链接: https://arxiv.org/abs/2404.17059
作者: Eliot W. Robson, Dhemath Reddy, Abhishek K. Umrawal
备注: 4 pages, 3 figures, and 2 tables

点击查看摘要

Abstract:In recent years, there has been increasing interest in network diffusion models and related problems. The most popular of these are the independent cascade and linear threshold models. Much of the recent experimental work done on these models requires a large number of simulations conducted on large graphs, a computationally expensive task suited for low-level languages. However, many researchers prefer the use of higher-level languages (such as Python) for their flexibility and shorter development times. Moreover, in many research tasks, these simulations are the most computationally intensive task, so it would be desirable to have a library for these with an interface to a high-level language with the performance of a low-level language. To fill this niche, we introduce CyNetDiff, a Python library with components written in Cython to provide improved performance for these computationally intensive diffusion tasks.

AI-10-标题: Agent ive Permissions in Multi agent Systems IJCAI-24

链接: https://arxiv.org/abs/2404.17053
作者: Qi Shi
备注: The 33rd International Joint Conference on Artificial Intelligence (IJCAI-24)

点击查看摘要

Abstract:This paper proposes to distinguish four forms of agentive permissions in multiagent settings. The main technical results are the complexity analysis of model checking, the semantic undefinability of modalities that capture these forms of permissions through each other, and a complete logical system capturing the interplay between these modalities.

AI-11-标题: Unraveling Code Clone Dynamics in Deep Learning Frameworks

链接: https://arxiv.org/abs/2404.17046
作者: Maram Assi, Safwat Hassan, Ying Zou
备注: 37 pages

点击查看摘要

Abstract:Deep Learning (DL) frameworks play a critical role in advancing artificial intelligence, and their rapid growth underscores the need for a comprehensive understanding of software quality and maintainability. DL frameworks, like other systems, are prone to code clones. Code clones refer to identical or highly similar source code fragments within the same project or even across different projects. Code cloning can have positive and negative implications for software development, influencing maintenance, readability, and bug propagation. In this paper, we aim to address the knowledge gap concerning the evolutionary dimension of code clones in DL frameworks and the extent of code reuse across these frameworks. We empirically analyze code clones in nine popular DL frameworks, i.e., TensorFlow, Paddle, PyTorch, Aesara, Ray, MXNet, Keras, Jax and BentoML, to investigate (1) the characteristics of the long-term code cloning evolution over releases in each framework, (2) the short-term, i.e., within-release, code cloning patterns and their influence on the long-term trends, and (3) the file-level code clones within the DL frameworks. Our findings reveal that DL frameworks adopt four distinct cloning trends and that these trends present some common and distinct characteristics. For instance, bug-fixing activities persistently happen in clones irrespective of the clone evolutionary trend but occur more in the “Serpentine” trend. Moreover, the within release level investigation demonstrates that short-term code cloning practices impact long-term cloning trends. The cross-framework code clone investigation reveals the presence of functional and architectural adaptation file-level cross-framework code clones across the nine studied frameworks. We provide insights that foster robust clone practices and collaborative maintenance in the development of DL frameworks.

AI-12-标题: Generative AI in Color-Changing Systems: Re-Programmable 3D Object Textures with Material and Design Constraints

链接: https://arxiv.org/abs/2404.17028
作者: Yunyi Zhu, Faraz Faruqi, Stefanie Mueller
备注:

点击查看摘要

Abstract:Advances in Generative AI tools have allowed designers to manipulate existing 3D models using text or image-based prompts, enabling creators to explore different design goals. Photochromic color-changing systems, on the other hand, allow for the reprogramming of surface texture of 3D models, enabling easy customization of physical objects and opening up the possibility of using object surfaces for data display. However, existing photochromic systems require the user to manually design the desired texture, inspect the simulation of the pattern on the object, and verify the efficacy of the generated pattern. These manual design, inspection, and verification steps prevent the user from efficiently exploring the design space of possible patterns. Thus, by designing an automated workflow desired for an end-to-end texture application process, we can allow rapid iteration on different practicable patterns. In this workshop paper, we discuss the possibilities of extending generative AI systems, with material and design constraints for reprogrammable surfaces with photochromic materials. By constraining generative AI systems to colors and materials possible to be physically realized with photochromic dyes, we can create tools that would allow users to explore different viable patterns, with text and image-based prompts. We identify two focus areas in this topic: photochromic material constraints and design constraints for data-encoded textures. We highlight the current limitations of using generative AI tools to create viable textures using photochromic material. Finally, we present possible approaches to augment generative AI methods to take into account the photochromic material constraints, allowing for the creation of viable photochromic textures rapidly and easily.

AI-13-标题: Generating Minimalist Adversarial Perturbations to Test Object-Detection Models: An Adaptive Multi-Metric Evolutionary Search Approach

链接: https://arxiv.org/abs/2404.17020
作者: Cristopher McIntyre-Garcia, Adrien Heymans, Beril Borali, Won-Sook Lee, Shiva Nejati
备注:

点击查看摘要

Abstract:Deep Learning (DL) models excel in computer vision tasks but can be susceptible to adversarial examples. This paper introduces Triple-Metric EvoAttack (TM-EVO), an efficient algorithm for evaluating the robustness of object-detection DL models against adversarial attacks. TM-EVO utilizes a multi-metric fitness function to guide an evolutionary search efficiently in creating effective adversarial test inputs with minimal perturbations. We evaluate TM-EVO on widely-used object-detection DL models, DETR and Faster R-CNN, and open-source datasets, COCO and KITTI. Our findings reveal that TM-EVO outperforms the state-of-the-art EvoAttack baseline, leading to adversarial tests with less noise while maintaining efficiency.

AI-14-标题: Leveraging AI to Generate Audio for User-generated Content in Video Games

链接: https://arxiv.org/abs/2404.17018
作者: Thomas Marrinan, Pakeeza Akram, Oli Gurmessa, Anthony Shishkin
备注:

点击查看摘要

Abstract:In video game design, audio (both environmental background music and object sound effects) play a critical role. Sounds are typically pre-created assets designed for specific locations or objects in a game. However, user-generated content is becoming increasingly popular in modern games (e.g. building custom environments or crafting unique objects). Since the possibilities are virtually limitless, it is impossible for game creators to pre-create audio for user-generated content. We explore the use of generative artificial intelligence to create music and sound effects on-the-fly based on user-generated content. We investigate two avenues for audio generation: 1) text-to-audio: using a text description of user-generated content as input to the audio generator, and 2) image-to-audio: using a rendering of the created environment or object as input to an image-to-text generator, then piping the resulting text description into the audio generator. In this paper we discuss ethical implications of using generative artificial intelligence for user-generated content and highlight two prototype games where audio is generated for user-created environments and objects.

AI-15-标题: Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability

链接: https://arxiv.org/abs/2404.16957
作者: Yunfei Ge, Quanyan Zhu
备注:

点击查看摘要

Abstract:The pervasive integration of Artificial Intelligence (AI) has introduced complex challenges in the responsibility and accountability in the event of incidents involving AI-enabled systems. The interconnectivity of these systems, ethical concerns of AI-induced incidents, coupled with uncertainties in AI technology and the absence of corresponding regulations, have made traditional responsibility attribution challenging. To this end, this work proposes a Computational Reflective Equilibrium (CRE) approach to establish a coherent and ethically acceptable responsibility attribution framework for all stakeholders. The computational approach provides a structured analysis that overcomes the limitations of conceptual approaches in dealing with dynamic and multifaceted scenarios, showcasing the framework’s explainability, coherence, and adaptivity properties in the responsibility attribution process. We examine the pivotal role of the initial activation level associated with claims in equilibrium computation. Using an AI-assisted medical decision-support system as a case study, we illustrate how different initializations lead to diverse responsibility distributions. The framework offers valuable insights into accountability in AI-induced incidents, facilitating the development of a sustainable and resilient system through continuous monitoring, revision, and reflection.

AI-16-标题: Evolve Cost-aware Acquisition Functions Using Large Language Models

链接: https://arxiv.org/abs/2404.16906
作者: Yiming Yao, Fei Liu, Ji Cheng, Qingfu Zhang
备注:

点击查看摘要

Abstract:Many real-world optimization scenarios involve expensive evaluation with unknown and heterogeneous costs. Cost-aware Bayesian optimization stands out as a prominent solution in addressing these challenges. To approach the global optimum within a limited budget in a cost-efficient manner, the design of cost-aware acquisition functions (AFs) becomes a crucial step. However, traditional manual design paradigm typically requires extensive domain knowledge and involves a labor-intensive trial-and-error process. This paper introduces EvolCAF, a novel framework that integrates large language models (LLMs) with evolutionary computation (EC) to automatically design cost-aware AFs. Leveraging the crossover and mutation in the algorithm space, EvolCAF offers a novel design paradigm, significantly reduces the reliance on domain expertise and model training. The designed cost-aware AF maximizes the utilization of available information from historical data, surrogate models and budget details. It introduces novel ideas not previously explored in the existing literature on acquisition function design, allowing for clear interpretations to provide insights into its behavior and decision-making process. In comparison to the well-known EIpu and EI-cool methods designed by human experts, our approach showcases remarkable efficiency and generalization across various tasks, including 12 synthetic problems and 3 real-world hyperparameter tuning test sets.

AI-17-标题: Fiper: a Visual-based Explanation Combining Rules and Feature Importance KDD ECML

链接: https://arxiv.org/abs/2404.16903
作者: Eleonora Cappuccio, Daniele Fadda, Rosa Lanzilotti, Salvatore Rinzivillo
备注: 15 pages, 4 figures, to be published in ECML PKDD International Workshop on eXplainable Knowledge Discovery in Data Mining

点击查看摘要

Abstract:Artificial Intelligence algorithms have now become pervasive in multiple high-stakes domains. However, their internal logic can be obscure to humans. Explainable Artificial Intelligence aims to design tools and techniques to illustrate the predictions of the so-called black-box algorithms. The Human-Computer Interaction community has long stressed the need for a more user-centered approach to Explainable AI. This approach can benefit from research in user interface, user experience, and visual analytics. This paper proposes a visual-based method to illustrate rules paired with feature importance. A user study with 15 participants was conducted comparing our visual method with the original output of the algorithm and textual representation to test its effectiveness with users.

AI-18-标题: State-of-the-Art Approaches to Enhancing Privacy Preservation of Machine Learning Dataset s: A Survey

链接: https://arxiv.org/abs/2404.16847
作者: Chaoyu Zhang
备注:

点击查看摘要

Abstract:This paper examines the evolving landscape of machine learning (ML) and its profound impact across various sectors, with a special focus on the emerging field of Privacy-preserving Machine Learning (PPML). As ML applications become increasingly integral to industries like telecommunications, financial technology, and surveillance, they raise significant privacy concerns, necessitating the development of PPML strategies. The paper highlights the unique challenges in safeguarding privacy within ML frameworks, which stem from the diverse capabilities of potential adversaries, including their ability to infer sensitive information from model outputs or training data. We delve into the spectrum of threat models that characterize adversarial intentions, ranging from membership and attribute inference to data reconstruction. The paper emphasizes the importance of maintaining the confidentiality and integrity of training data, outlining current research efforts that focus on refining training data to minimize privacy-sensitive information and enhancing data processing techniques to uphold privacy. Through a comprehensive analysis of privacy leakage risks and countermeasures in both centralized and collaborative learning settings, this paper aims to provide a thorough understanding of effective strategies for protecting ML training data against privacy intrusions. It explores the balance between data privacy and model utility, shedding light on privacy-preserving techniques that leverage cryptographic methods, Differential Privacy, and Trusted Execution Environments. The discussion extends to the application of these techniques in sensitive domains, underscoring the critical role of PPML in ensuring the privacy and security of ML systems.

AI-19-标题: Biometrics Employing Neural Network

链接: https://arxiv.org/abs/2404.16840
作者: Sajjad Bhuiyan
备注: 14 Pages, 10 figures, Survey Paper

点击查看摘要

Abstract:Biometrics involves using unique human traits, both physical and behavioral, for the digital identification of individuals to provide access to systems, devices, or information. Within the field of computer science, it acts as a method for identifying and verifying individuals and controlling access. While the conventional method for personal authentication involves passwords, the vulnerability arises when passwords are compromised, allowing unauthorized access to sensitive actions. Biometric authentication presents a viable answer to this problem and is the most secure and user-friendly authentication method. Today, fingerprints, iris and retina patterns, facial recognition, hand shapes, palm prints, and voice recognition are frequently used forms of biometrics. Despite the diverse nature of these biometric identifiers, the core objective remains consistent ensuring security, recognizing authorized users, and rejecting impostors. Hence, it is crucial to determine accurately whether the characteristics belong to the rightful person. For systems to be effective and widely accepted, the error rate in recognition and verification must approach zero. It is acknowledged that current biometric techniques, while advanced, are not infallible and require continuous improvement. A more refined classifier is deemed necessary to classify patterns accurately. Artificial Neural Networks, which simulate the human brain’s operations, present themselves as a promising approach. The survey presented herein explores various biometric techniques based on neural networks, emphasizing the ongoing quest for enhanced accuracy and reliability. It concludes that The utilization of neural networks along with biometric features not only enhances accuracy but also contributes to overall better security.

AI-20-标题: Quantum Adjoint Convolutional Layers for Effective Data Representation

链接: https://arxiv.org/abs/2404.17378
作者: Ren-Xin Zhao, Shi Wang, Yaonan Wang
备注:

点击查看摘要

Abstract:Quantum Convolutional Layer (QCL) is considered as one of the core of Quantum Convolutional Neural Networks (QCNNs) due to its efficient data feature extraction capability. However, the current principle of QCL is not as mathematically understandable as Classical Convolutional Layer (CCL) due to its black-box structure. Moreover, classical data mapping in many QCLs is inefficient. To this end, firstly, the Quantum Adjoint Convolution Operation (QACO) consisting of a quantum amplitude encoding and its inverse is theoretically shown to be equivalent to the quantum normalization of the convolution operation based on the Frobenius inner product while achieving an efficient characterization of the data. Subsequently, QACO is extended into a Quantum Adjoint Convolutional Layer (QACL) by Quantum Phase Estimation (QPE) to compute all Frobenius inner products in parallel. At last, comparative simulation experiments are carried out on PennyLane and TensorFlow platforms, mainly for the two cases of kernel fixed and unfixed in QACL. The results demonstrate that QACL with the insight of special quantum properties for the same images, provides higher training accuracy in MNIST and Fashion MNIST classification experiments, but sacrifices the learning performance to some extent. Predictably, our research lays the foundation for the development of efficient and interpretable quantum convolutional networks and also advances the field of quantum machine vision.

AI-21-标题: Assessing the Potential of AI for Spatially Sensitive Nature-Related Financial Risks KR

链接: https://arxiv.org/abs/2404.17369
作者: Steven Reece, Emma O donnell, Felicia Liu, Joanna Wolstenholme, Frida Arriaga, Giacomo Ascenzi, Richard Pywell
备注: 67 pages, 10 figures, UKRI (NERC) Integrated Finance and Biodiversity for a Nature Positive Future Programme

点击查看摘要

Abstract:There is growing recognition among financial institutions, financial regulators and policy makers of the importance of addressing nature-related risks and opportunities. Evaluating and assessing nature-related risks for financial institutions is challenging due to the large volume of heterogeneous data available on nature and the complexity of investment value chains and the various components’ relationship to nature. The dual problem of scaling data analytics and analysing complex systems can be addressed using Artificial Intelligence (AI). We address issues such as plugging existing data gaps with discovered data, data estimation under uncertainty, time series analysis and (near) real-time updates. This report presents potential AI solutions for models of two distinct use cases, the Brazil Beef Supply Use Case and the Water Utility Use Case. Our two use cases cover a broad perspective within sustainable finance. The Brazilian cattle farming use case is an example of greening finance - integrating nature-related considerations into mainstream financial decision-making to transition investments away from sectors with poor historical track records and unsustainable operations. The deployment of nature-based solutions in the UK water utility use case is an example of financing green - driving investment to nature-positive outcomes. The two use cases also cover different sectors, geographies, financial assets and AI modelling techniques, providing an overview on how AI could be applied to different challenges relating to nature’s integration into finance. This report is primarily aimed at financial institutions but is also of interest to ESG data providers, TNFD, systems modellers, and, of course, AI practitioners.

附件下载

点击下载今日全部论文列表