Face mask recognition from audio: The MASC database and an overview on the mask challenge
Published in: Pattern Recognition, 2022-02, Vol. 122, Article 108361
Format: Article
Language: English
Online access: Full text
Highlights:
• Introduction of the Mask Augsburg Speech Corpus (MASC) database.
• Summary of the Mask Sub-Challenge (MSC) and its baseline approaches.
• Explanation and comparison of the approaches of the top participants in the challenge.
• Summary of the results of the Mask Sub-Challenge from ComParE 2020.
• Introduction of novel fusion results, obtained by fusing the approaches of the best participants.
• Discussion of the approaches and the results from several perspectives.
• Introduction of a proof-of-concept demonstration Android app.
• Benchmarking of the serving run-time of the top models.
Abstract:
The sudden outbreak of COVID-19 has resulted in tough challenges for the field of biometrics due to its spread via physical contact, and the regulations of wearing face masks. Given these constraints, voice biometrics can offer a suitable contact-less biometric solution; they can benefit from models that classify whether a speaker is wearing a mask or not. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, achieving a performance of 71.8% Unweighted Average Recall (UAR). We then summarise the methodologies explored in the submitted and accepted papers, which mainly followed two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) typically used in image processing. Most approaches enhance their models by adopting ensembles of different models and by attempting to increase the size of the training data using various techniques. We review and discuss the results of the participants of this sub-challenge, where the winner scored a UAR of 80.1%. Moreover, we present the results of fusing the approaches, leading to a UAR of 82.6%. Finally, we present a smartphone app that can be used as a proof-of-concept demonstration to detect in real time whether users are wearing a face mask; we also benchmark the run-time of the best models.
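All results above are reported as Unweighted Average Recall (UAR), the standard ComParE measure: the mean of the per-class recalls, so the "mask" and "no mask" classes count equally even if the test set is imbalanced. The snippet below is a minimal, self-contained sketch of the metric; the labels are toy values, not MASC data, and the class names are illustrative.

```python
# Minimal sketch: Unweighted Average Recall (UAR), the mean of per-class recalls.
# Toy labels only; not MASC data and not the official evaluation code.

from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy example with illustrative class names:
y_true = ["mask", "mask", "mask", "clear", "clear"]
y_pred = ["mask", "mask", "clear", "clear", "mask"]
print(unweighted_average_recall(y_true, y_pred))  # (2/3 + 1/2) / 2 ≈ 0.583
# Equivalent to sklearn.metrics.recall_score(y_true, y_pred, average="macro").
```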
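To make pattern (ii) concrete, the following is a minimal sketch, not any participant's actual system: a log-Mel spectrogram computed with torchaudio is fed to a deliberately small CNN that outputs mask/no-mask logits. The sample rate, Mel settings, and layer sizes are illustrative assumptions; the competitive entries used far larger CNNs of the kind typically applied in image processing.

```python
# Sketch of pattern (ii): log-Mel spectrogram + small CNN for mask/no-mask
# classification. Illustrative assumptions only (16 kHz audio, 64 Mel bands,
# tiny network); this is not a participant's model. Requires torch, torchaudio.

import torch
import torch.nn as nn
import torchaudio

class MaskCNN(nn.Module):
    def __init__(self, n_mels: int = 64, sample_rate: int = 16000):
        super().__init__()
        # Waveform -> Mel spectrogram -> log (dB) compression.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Deliberately small convolutional stack; real entries were much larger.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)  # logits for {no mask, mask}

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples)
        x = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (batch, 1, mels, frames)
        x = self.features(x).flatten(1)                      # (batch, 32)
        return self.classifier(x)                            # (batch, 2)

# Toy forward pass on one second of random audio (stands in for a speech chunk).
model = MaskCNN()
logits = model(torch.randn(4, 16000))
print(logits.shape)  # torch.Size([4, 2])
```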
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2021.108361