Efficient Parallel Audio Generation Using Group Masked Language Modeling

Bibliographic Details
Published in:	IEEE Signal Processing Letters, 2024, Vol. 31, pp. 979-983
Main authors:	Jeong, Myeonghun; Kim, Minchan; Lee, Joun Yeop; Kim, Nam Soo
Format:	Article
Language:	English
Subjects:	Acoustics; Autoregressive models; Codec; Computational modeling; Inference; Iterative decoding; Iterative methods; Modelling; Neural audio codec; Parallel audio generation; Sampling; Semantics; Tokenization; Training

Description
Abstract:	We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Both the training and sampling schemes enable the model to synthesize high-quality audio with a small number of iterations by effectively modeling the group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and to improve computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation.
ISSN:	1070-9908, 1558-2361
DOI:	10.1109/LSP.2024.3381910