Weakly Supervised Construction of ASR Systems with Massive Video Data
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Building Automatic Speech Recognition (ASR) systems from scratch is
challenging, mostly because annotating a large amount of audio data with
transcripts is time-consuming and expensive. Although several unsupervised
pre-training models have been proposed, applying such models directly might
still be sub-optimal if more labeled training data could be obtained at low
cost. In this paper, we present a weakly supervised framework for constructing
ASR systems with massive video data. As videos often contain human speech
aligned with subtitles, we consider videos an important knowledge source, and
propose an effective approach to extract high-quality audio aligned with
transcripts from videos based on Optical Character Recognition (OCR). After
weakly supervised pre-training, the underlying ASR model can be fine-tuned to
fit any domain-specific target training dataset. Extensive experiments show
that our framework easily produces state-of-the-art results on six public
datasets for Mandarin speech recognition. |
---|---|
DOI: | 10.48550/arxiv.2008.01300 |
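
The summary says that transcripts are extracted from subtitles via OCR and aligned with the audio, but this record gives no procedural detail. As a rough illustration only, and not the authors' method, the sketch below assumes that per-frame OCR text is already available as `(timestamp, text)` pairs sampled at a fixed step, and merges consecutive frames showing the same subtitle into timed segments that could then be used to cut the audio track. The names `Segment`, `merge_frames`, and `frame_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Segment:
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive
    text: str     # subtitle text read by OCR (assumed input)


def merge_frames(frames: Iterable[Tuple[float, str]],
                 frame_step: float) -> List[Segment]:
    """Merge consecutive frames with identical non-empty OCR text.

    `frames` is assumed to be sorted by timestamp and sampled every
    `frame_step` seconds; frames with empty text end the open segment.
    """
    segments: List[Segment] = []
    current = None
    for timestamp, raw_text in frames:
        text = raw_text.strip()
        if current is not None and text and text == current.text:
            # Same subtitle still on screen: extend the open segment.
            current.end = timestamp + frame_step
            continue
        if current is not None:
            # Subtitle changed or disappeared: close the open segment.
            segments.append(current)
            current = None
        if text:
            current = Segment(timestamp, timestamp + frame_step, text)
    if current is not None:
        segments.append(current)
    return segments
```

Each resulting segment pairs a time span with a candidate transcript. A real pipeline would also need to handle OCR noise, such as near-identical readings of the same subtitle, which this sketch deliberately ignores.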