STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | This report presents the Sony-TAu Realistic Spatial Soundscapes 2022
(STARSS22) dataset for sound event localization and detection, consisting of
spatial recordings of real scenes collected in various interiors of two
different sites. The dataset is captured with a high-resolution spherical
microphone array and delivered in two 4-channel formats: first-order Ambisonics
and a tetrahedral microphone array. Sound events in the dataset belonging to 13
target sound classes are annotated both temporally and spatially through a
combination of human annotation and optical tracking. The dataset serves as the
development and evaluation dataset for Task 3 of the DCASE2022 Challenge on
Sound Event Localization and Detection and introduces significant new
challenges for the task compared to the previous iterations, which were based
on synthetic spatialized sound scene recordings. Dataset specifications are
detailed including recording and annotation process, target classes and their
presence, and details on the development and evaluation splits. Additionally,
the report presents the baseline system that accompanies the dataset in the
challenge, with emphasis on the differences from the baseline of the previous
iterations: namely, the introduction of the multi-ACCDOA representation to handle
multiple simultaneous occurrences of events of the same class, and support for
additional improved input features for the microphone array format. Results of
the baseline indicate that with a suitable training strategy a reasonable
detection and localization performance can be achieved on real sound scene
recordings. The dataset is available at https://zenodo.org/record/6387880. |
---|---|
DOI: | 10.48550/arxiv.2206.01948 |