Updated Corpora and Benchmarks for Long-Form Speech Recognition
Format: Article
Language: English
Abstract: The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en - with updated transcriptions and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training approach for these models, showing its efficacy for model robustness under this domain shift.
DOI: 10.48550/arxiv.2309.15013