Investigation of Speaker-adaptation methods in Transformer based ASR
| Main authors | , , |
| --- | --- |
| Format | Article |
| Language | English |
| Subjects | |
| Online access | Order full text |
Abstract: End-to-end models are fast replacing conventional hybrid models in automatic speech recognition. The Transformer, a self-attention-based sequence-to-sequence model widely used in machine translation, has given promising results when applied to automatic speech recognition. This paper explores different ways of incorporating speaker information at the encoder input while training a Transformer-based model to improve its speech recognition performance. We present speaker information in the form of speaker embeddings for each of the speakers. We experiment with two types of speaker embeddings: x-vectors and the novel s-vectors proposed in our previous work. We report results on two datasets: a) the NPTEL lecture database and b) the Librispeech 500-hour split. NPTEL is an open-source e-learning portal providing lectures from top Indian universities. We obtain improvements in word error rate over the baseline through our approach of integrating speaker embeddings into the model.
DOI: 10.48550/arxiv.2008.03247
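The abstract describes supplying a per-speaker embedding (an x-vector or s-vector) at the encoder input of a Transformer ASR model. The sketch below shows one plausible way to do this, assuming the embedding is simply tiled across time and concatenated with each acoustic frame before the encoder; the module name, dimensions, and projection layer are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: concatenating a per-utterance speaker embedding
# (e.g. an x-vector or s-vector) with each acoustic frame before a
# Transformer encoder. Names and dimensions are assumptions for
# illustration, not the paper's code.
import torch
import torch.nn as nn


class SpeakerConditionedFrontend(nn.Module):
    def __init__(self, feat_dim=80, spk_dim=512, d_model=256):
        super().__init__()
        # Project the concatenated [frame ; speaker embedding] to the encoder width.
        self.proj = nn.Linear(feat_dim + spk_dim, d_model)

    def forward(self, feats, spk_emb):
        # feats:   (batch, time, feat_dim)  filterbank frames
        # spk_emb: (batch, spk_dim)         one embedding per utterance/speaker
        spk = spk_emb.unsqueeze(1).expand(-1, feats.size(1), -1)  # tile across time
        x = torch.cat([feats, spk], dim=-1)
        return self.proj(x)  # (batch, time, d_model)


# Usage: feed the speaker-conditioned frames to a standard Transformer encoder.
frontend = SpeakerConditionedFrontend()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=6,
)
feats = torch.randn(2, 100, 80)    # two utterances, 100 frames each
spk_emb = torch.randn(2, 512)      # e.g. x-vectors for the two speakers
out = encoder(frontend(feats, spk_emb))  # (2, 100, 256)
```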