Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Joint sound event localization and detection (SELD) is an emerging audio
signal processing task adding spatial dimensions to acoustic scene analysis and
sound event detection. A popular approach to modeling SELD jointly is using
convolutional recurrent neural network (CRNN) models, where CNNs learn
high-level features from multi-channel audio input and the RNNs learn temporal
relationships from these high-level features. However, RNNs have some
drawbacks, such as a limited capability to model long temporal dependencies and
slow training and inference times due to their sequential processing nature.
Recently, a few SELD studies used multi-head self-attention (MHSA), among other
innovations in their models. MHSA and the related transformer networks have
shown state-of-the-art performance in various domains. They can model long
temporal dependencies and can also be parallelized efficiently. In this
paper, we study in detail the effect of MHSA on the SELD task. Specifically, we
examine the effect of replacing the RNN blocks with self-attention layers. We
also study the influence of stacking multiple self-attention blocks, of using
multiple attention heads in each self-attention block, and of
position embeddings and layer normalization. Evaluation on the DCASE 2021 SELD
(task 3) development data set shows a significant improvement in all employed
metrics compared to the baseline CRNN accompanying the task. |
---|---|
DOI: | 10.48550/arxiv.2107.09388 |
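
The summary describes replacing the recurrent blocks of a CRNN with stacked multi-head self-attention (MHSA) blocks, combined with position embeddings and layer normalization. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: the sizes (`T`, `D`, `num_heads`, `num_blocks`), the random weights, the residual connection, and the helper names `layer_norm`, `softmax`, `mhsa_block`, and `init_params` are all assumptions made for illustration, and random arrays stand in for the CNN features and the learned position embeddings.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each frame's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    """Softmax along one axis, shifted by the max for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mhsa_block(x, params, num_heads):
    """One block: multi-head self-attention, residual add, layer normalization.

    x has shape (T, D): T time frames with D features each.
    Returns the block output (T, D) and the attention weights (heads, T, T).
    """
    T, D = x.shape
    d_head = D // num_heads

    def split(w):
        # Linear projection, then split features into heads: (heads, T, d_head).
        return (x @ w).reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(params["wq"]), split(params["wk"]), split(params["wv"])
    # Scaled dot-product attention: each frame attends to all frames at once,
    # so there is no sequential dependency across time steps as in an RNN.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                                  # (heads, T, d_head)
    merged = heads.transpose(1, 0, 2).reshape(T, D)      # concatenate heads
    out = merged @ params["wo"]
    return layer_norm(x + out), weights


def init_params(rng, D):
    """Random projection matrices (illustrative; a real model learns these)."""
    scale = 1.0 / np.sqrt(D)
    return {n: rng.normal(0.0, scale, (D, D)) for n in ("wq", "wk", "wv", "wo")}


rng = np.random.default_rng(0)
T, D, num_heads, num_blocks = 60, 64, 8, 2      # assumed, illustrative sizes
features = rng.normal(size=(T, D))              # stand-in for CNN output
pos_emb = rng.normal(0.0, 0.02, size=(T, D))    # stand-in for position embeddings

h = features + pos_emb
for params in [init_params(rng, D) for _ in range(num_blocks)]:
    h, attn = mhsa_block(h, params, num_heads)

print(h.shape, attn.shape)
```

Varying `num_blocks` and `num_heads`, or dropping `pos_emb` and `layer_norm`, corresponds to the kinds of configurations the summary says were compared; the sketch makes no claim about which setting performs best.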