SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams
Main author(s): | |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: We present SpeakingFaces, a publicly available, large-scale multimodal dataset developed to support machine learning research in contexts that combine thermal, visual, and audio data streams; examples include human-computer interaction, biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces comprises aligned high-resolution thermal and visual-spectrum image streams of fully framed faces, synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data were collected from 142 subjects, yielding over 13,000 instances of synchronized data (~3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline performs gender classification using different combinations of the three data streams in both clean and noisy environments. The second is thermal-to-visual facial image translation, an instance of domain transfer.
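The record does not describe the on-disk layout of the released data, so the sketch below only illustrates how one synchronized instance (an aligned thermal video, its visual-spectrum counterpart, and the audio recording of the spoken phrase) might be loaded. The directory structure and file names are hypothetical, and only standard OpenCV and SciPy calls are used.

```python
import cv2
import numpy as np
from scipy.io import wavfile

def load_instance(thermal_path, visual_path, audio_path):
    """Load one synchronized SpeakingFaces-style instance:
    a thermal video, its aligned visual-spectrum video, and
    the audio recording of the spoken phrase."""
    def read_frames(path):
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
        cap.release()
        return np.stack(frames)  # shape (T, H, W, C)

    thermal = read_frames(thermal_path)      # thermal stream
    visual = read_frames(visual_path)        # aligned visual stream
    rate, audio = wavfile.read(audio_path)   # synchronized audio
    return thermal, visual, rate, audio

# Hypothetical paths; the actual dataset layout may differ.
thermal, visual, rate, audio = load_instance(
    "sub_001/trial_1/thermal.avi",
    "sub_001/trial_1/visual.avi",
    "sub_001/trial_1/phrase.wav",
)
```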
DOI: 10.48550/arxiv.2012.02961
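The summary states that the first baseline classifies gender from different combinations of the three streams, but not how the modalities are combined. The sketch below assumes a simple late-fusion design, with an independent encoder per modality and concatenated embeddings feeding a linear head; it is an illustration under that assumption, not the authors' implementation, and the input shapes are invented for the example.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Binary classifier over thermal, visual, and audio inputs.
    Late fusion of per-modality embeddings is an assumed design,
    not necessarily the baseline architecture used in the paper."""
    def __init__(self, emb_dim=128):
        super().__init__()
        def cnn():  # small CNN encoder for one image modality
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, emb_dim),
            )
        self.thermal_enc = cnn()
        self.visual_enc = cnn()
        self.audio_enc = nn.Sequential(  # encoder over a spectrogram-like input
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, emb_dim),
        )
        self.head = nn.Linear(3 * emb_dim, 2)  # two classes

    def forward(self, thermal, visual, audio_spec):
        z = torch.cat([
            self.thermal_enc(thermal),
            self.visual_enc(visual),
            self.audio_enc(audio_spec),
        ], dim=1)
        return self.head(z)

# Dummy batch with assumed shapes: 112x112 frames, 64x64 spectrograms.
model = LateFusionClassifier()
logits = model(torch.randn(4, 3, 112, 112),
               torch.randn(4, 3, 112, 112),
               torch.randn(4, 1, 64, 64))
```

One reason a late-fusion layout is a plausible fit here: because each modality has its own encoder, individual streams can be dropped or zeroed out, which matches the paper's evaluation of "different combinations of the three data streams".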