Leveraging pre-trained representations to improve access to untranscribed speech from endangered languages
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Pre-trained speech representations like wav2vec 2.0 are a powerful tool for
automatic speech recognition (ASR). Yet many endangered languages lack
sufficient data for pre-training such models, or are predominantly oral
vernaculars without a standardised writing system, precluding fine-tuning.
Query-by-example spoken term detection (QbE-STD) offers an alternative for
iteratively indexing untranscribed speech corpora by locating spoken query
terms. Using data from 7 Australian Aboriginal languages and a regional variety
of Dutch, all of which are endangered or vulnerable, we show that QbE-STD can
be improved by leveraging representations developed for ASR (wav2vec 2.0: the
English monolingual model and XLSR53 multilingual model). Surprisingly, the
English model outperformed the multilingual model on 4 Australian language
datasets, raising questions around how to optimally leverage self-supervised
speech representations for QbE-STD. Nevertheless, we find that wav2vec 2.0
representations (either English or XLSR53) offer large improvements (56-86%
relative) over state-of-the-art approaches on our endangered language datasets.
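The QbE-STD setup the abstract describes pairs frame-level speech representations with an alignment-based detection score: a spoken query is matched against every utterance in an untranscribed corpus. The sketch below illustrates that idea with subsequence dynamic time warping over frame-level features. The cosine distance, the DTW step pattern, the length normalisation, and the random arrays standing in for wav2vec 2.0 features are all illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def cosine_distance_matrix(query, doc):
    """Pairwise cosine distances between query frames (Tq, D) and doc frames (Td, D)."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-9)
    d = doc / (np.linalg.norm(doc, axis=1, keepdims=True) + 1e-9)
    return 1.0 - q @ d.T  # (Tq, Td); 0 means identical direction

def subsequence_dtw_score(query, doc):
    """Detection score: cost of aligning the whole query against the
    best-matching contiguous region of the document (lower = better)."""
    dist = cosine_distance_matrix(query, doc)
    tq, td = dist.shape
    acc = np.full((tq, td), np.inf)
    acc[0, :] = dist[0, :]                    # query may start at any doc frame
    for i in range(1, tq):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, td):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # doc frame repeats
                                         acc[i, j - 1],      # query frame repeats
                                         acc[i - 1, j - 1])  # both advance
    return acc[-1, :].min() / tq              # length-normalised best endpoint

# Toy usage with random stand-ins for frame-level wav2vec 2.0 features:
rng = np.random.default_rng(0)
query_feats = rng.normal(size=(40, 768))   # e.g. 40 frames x 768-dim features
doc_feats = rng.normal(size=(500, 768))
print(subsequence_dtw_score(query_feats, doc_feats))
```

In a QbE-STD pipeline, this score would be computed for one query against each utterance in the corpus, with utterances then ranked by ascending cost so a linguist can review the most likely hits first.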
DOI: 10.48550/arxiv.2103.14583