Voice gender recognition under unconstrained environments using self-attention

Bibliographic Details
Published in: Applied Acoustics, 2021-04, Vol. 175, Article 107823
Main Authors: Nasef, Mohammed M.; Sauber, Amr M.; Nabil, Mohammed M.
Format: Article
Language: English
Online Access: Full text
Description

Highlights:
•This paper presents two Self-Attention-based models that deliver an end-to-end voice gender recognition system under unconstrained environments.
•The first model consists of a stack of six self-attention layers and a dense layer.
•The second model adds a set of convolution layers and six inception-residual blocks to the first model, before the self-attention layers.
•These models achieved superior performance on all criteria and are believed to be state-of-the-art for Voice Gender Recognition under unconstrained environments.

Abstract: Voice Gender Recognition is a non-trivial task that has been studied extensively in the literature; however, when the voice is surrounded by noise in unconstrained environments, the task becomes more challenging. This paper presents two Self-Attention-based models that deliver an end-to-end voice gender recognition system under unconstrained environments. The first model consists of a stack of six self-attention layers and a dense layer. The second model adds a set of convolution layers and six inception-residual blocks to the first model, before the self-attention layers. Both models use Mel-frequency cepstral coefficients (MFCC) as the representation of the audio data, and Logistic Regression for classification. The experiments were conducted under unconstrained conditions such as background noise and varying languages, accents, ages, and emotional states of the speakers. The results demonstrate that the proposed models achieved accuracies of 95.11% and 96.23%, respectively. These models achieved superior performance on all criteria and are believed to be state-of-the-art for Voice Gender Recognition under unconstrained environments.
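To make the first model's pipeline concrete (MFCC features fed through six stacked self-attention layers, then a dense layer producing the final male/female decision), below is a minimal PyTorch sketch. It is an illustration only, not the authors' code: the hyperparameters (n_mfcc, d_model, n_heads), the 16 kHz sample rate, the residual connections, and the mean pooling over time are all assumptions, and the second model's convolution and inception-residual front end is omitted.

```python
# Illustrative sketch of an MFCC + stacked-self-attention gender classifier.
# Hyperparameter values are assumptions, not values reported in the paper.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(path, n_mfcc=40):
    """Load an audio file and return MFCCs shaped (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=16000)          # assumed sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.tensor(mfcc.T, dtype=torch.float32)

class SelfAttentionGenderNet(nn.Module):
    def __init__(self, n_mfcc=40, d_model=128, n_heads=4, n_layers=6):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, d_model)    # lift MFCCs to model dim
        # Six self-attention layers, as described in the abstract.
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # Dense layer producing one logit; the sigmoid yields the
        # probability used for the binary (logistic-regression-style) decision.
        self.dense = nn.Linear(d_model, 1)

    def forward(self, x):                         # x: (batch, frames, n_mfcc)
        h = self.proj(x)
        for attn in self.attn_layers:
            out, _ = attn(h, h, h)                # self-attention: q = k = v
            h = h + out                           # residual link (assumption)
        pooled = h.mean(dim=1)                    # average over time frames
        return torch.sigmoid(self.dense(pooled))  # P(class) in [0, 1]
```

Under these assumptions, a single clip could be classified with `SelfAttentionGenderNet()(extract_mfcc("clip.wav").unsqueeze(0))`, thresholding the returned probability at 0.5.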
ISSN: 0003-682X, 1872-910X
DOI: 10.1016/j.apacoust.2020.107823