Large scale weakly and semi-supervised learning for low-resource video ASR
| Field | Value |
| --- | --- |
| Main authors | , , , , , , , , , |
| Format | Article |
| Language | English |
| Online access | Order full text |
Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large-scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian, using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
DOI: 10.48550/arxiv.2005.07850
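The abstract contrasts frame-level distillation (matching a teacher's per-frame posteriors) with sequence-level distillation (training on the teacher's decoded pseudo-transcripts). The sketch below illustrates those two objectives on toy tensors. It is a minimal illustration, not the paper's implementation: the tensor shapes, the greedy pseudo-decoding, and the use of CTC as the sequence-level training criterion are all assumptions made for this example.

```python
# Minimal, illustrative sketch of the two distillation variants named in
# the abstract -- NOT the authors' implementation. All tensor shapes, the
# greedy pseudo-decoding, and the choice of CTC for the sequence-level
# target are assumptions made for this toy example.
import torch
import torch.nn.functional as F

T, B, C = 100, 4, 32                     # frames, utterances, output units (assumed)
teacher_logits = torch.randn(T, B, C)    # stand-in for a trained teacher's outputs
student_logits = torch.randn(T, B, C, requires_grad=True)

# Frame-level distillation: the student matches the teacher's per-frame
# posterior distribution via a KL divergence.
frame_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),   # student log-probabilities
    F.softmax(teacher_logits, dim=-1),       # teacher probabilities (soft targets)
    reduction="batchmean",
)

# Sequence-level distillation: decode the teacher into one pseudo-transcript
# per utterance, then train the student on it as if it were ground truth.
# A real pipeline would use beam search and collapse blanks/repeats; here we
# fake a (B, S) label tensor from a greedy argmax for brevity.
S = 20
pseudo = teacher_logits.argmax(dim=-1).transpose(0, 1)   # (B, T) greedy frame labels
targets = pseudo[:, :S].clamp(min=1)                     # keep labels off blank=0
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)
seq_loss = F.ctc_loss(
    F.log_softmax(student_logits, dim=-1),  # (T, B, C) log-probabilities
    targets, input_lengths, target_lengths, blank=0,
)

(frame_loss + seq_loss).backward()          # both objectives are differentiable
```

The key design difference the sketch exposes: the frame-level objective transfers the teacher's full output distribution at every time step, while the sequence-level objective discards that uncertainty and commits to a single hypothesis per utterance, which is what the abstract reports working best for encoder-decoder models.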