Asynchronous Decentralized Distributed Training of Acoustic Models
Format: Article
Language: English
Abstract: Large-scale distributed training of deep acoustic models plays an important
role in today's high-performance automatic speech recognition (ASR). In this
paper we investigate a variety of asynchronous decentralized distributed
training strategies based on data parallel stochastic gradient descent (SGD) to
show their superior performance over the commonly-used synchronous distributed
training via allreduce, especially when dealing with large batch sizes.
Specifically, we study three variants of asynchronous decentralized parallel
SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as
well as a delay-by-one scheme. We introduce a mathematical model of ADPSGD,
give its theoretical convergence rate, and compare the empirical convergence
behavior and straggler resilience properties of the three variants. Experiments
are carried out on an IBM supercomputer for training deep long short-term
memory (LSTM) acoustic models on the 2000-hour Switchboard dataset. Recognition
and speedup performance of the proposed strategies are evaluated under various
training configurations. We show that ADPSGD with fixed and randomized
communication patterns cope well with slow learners. When learners are equally
fast, ADPSGD with the delay-by-one strategy has the fastest convergence with
large batches. In particular, using the delay-by-one strategy, we can train the
acoustic model in less than 2 hours using 128 V100 GPUs with competitive word
error rates.
DOI: 10.48550/arxiv.2110.11199
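To make the decentralized training idea in the abstract concrete, below is a minimal sketch of one ADPSGD-style update with a fixed ring communication pattern. It is not the paper's implementation: it assumes PyTorch with torch.distributed already initialized (e.g. an NCCL or Gloo backend and an even world size), the function names (`ring_partner`, `adpsgd_step`) are hypothetical, and the partner exchange is done synchronously here for readability, whereas the actual algorithm overlaps it asynchronously with computation.

```python
# Illustrative sketch of one ADPSGD-style update on a fixed ring (not the
# authors' code). Assumes torch.distributed is initialized and every process
# holds a replica of the model; names below are hypothetical.
import torch
import torch.distributed as dist


def ring_partner(rank: int, world_size: int, step: int) -> int:
    """Fixed ring pairing that alternates between the two neighbours each step."""
    if step % 2 == 0:
        partner = rank + 1 if rank % 2 == 0 else rank - 1
    else:
        partner = rank - 1 if rank % 2 == 0 else rank + 1
    return partner % world_size


def adpsgd_step(model, optimizer, loss_fn, batch, step):
    rank, world = dist.get_rank(), dist.get_world_size()

    # 1) Local SGD step on this learner's own mini-batch.
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # 2) Average model weights (not gradients) with the current ring partner.
    partner = ring_partner(rank, world, step)
    for p in model.parameters():
        remote = torch.empty_like(p.data)
        send_req = dist.isend(p.data, dst=partner)
        recv_req = dist.irecv(remote, src=partner)
        send_req.wait()
        recv_req.wait()
        p.data.mul_(0.5).add_(remote, alpha=0.5)  # x <- (x + x_partner) / 2

    return loss.item()
```

Under this reading of the abstract, the randomized variant would draw the partner at random each step instead of alternating neighbours, and the delay-by-one scheme would apply the averaged weights from the previous step while the current local computation proceeds, which is what allows communication to hide behind computation when all learners run at the same speed.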