Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network
Published in: IEEE Signal Processing Letters, 2023-01, Vol. 30, pp. 1-5
Main authors: , ,
Format: Article
Language: English
Keywords:
Abstract: Mandarin text-to-speech (TTS) systems heavily depend on front-end processing, such as grapheme-to-phoneme conversion and prosodic boundary prediction, to produce expressive, human-like speech. Utilizing a pre-trained language model, such as the bidirectional encoder representations from Transformers (BERT), could significantly improve the TTS front-end's performance. However, the original BERT is too large for edge TTS applications with tight limits on memory cost and inference latency. Although a distilled BERT can alleviate this problem, a considerable efficiency barrier may still exist due to the self-attention module's quadratic complexity and the feed-forward module's heavy computation. To this end, we propose a lightweight distilled convolution network as an alternative to the distilled BERT. Unlike previous knowledge distillation methods, which commonly used the same self-attention network for the teacher and student models, we transfer knowledge from the self-attention network to a convolution network. Experiments on two major Mandarin TTS front-end tasks have shown that our distilled convolution model can achieve comparable results to various distilled BERT variants while drastically reducing the model size and inference latency.
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2023.3256319
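The abstract describes distilling knowledge from a self-attention (BERT-style) teacher into a lightweight convolution student for token-level front-end tasks such as prosodic boundary prediction. The record does not give the paper's actual architecture, losses, or hyper-parameters, so the following is only a minimal PyTorch sketch of the general cross-architecture distillation idea: a depthwise-separable 1-D convolution student trained with a weighted sum of a soft-target KL term against pre-computed teacher logits and a cross-entropy term on gold labels. All names, sizes, and constants here (ConvStudent, distill_step, VOCAB, HID, tau, alpha, etc.) are hypothetical assumptions, not the authors' implementation.

```python
# Illustrative sketch only: cross-architecture knowledge distillation from a
# self-attention teacher to a lightweight 1-D convolution student for
# token-level labels (e.g. prosodic boundary tags). Shapes and
# hyper-parameters are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, LABELS, T = 6000, 256, 5, 32  # assumed vocabulary/model sizes


class ConvStudent(nn.Module):
    """Depthwise-separable 1-D conv encoder with a per-token classifier."""

    def __init__(self, vocab=VOCAB, hid=HID, labels=LABELS, layers=4, k=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, hid)
        blocks = []
        for _ in range(layers):
            blocks += [
                nn.Conv1d(hid, hid, k, padding=k // 2, groups=hid),  # depthwise
                nn.Conv1d(hid, hid, 1),                              # pointwise
                nn.GELU(),
            ]
        self.encoder = nn.Sequential(*blocks)
        self.classifier = nn.Linear(hid, labels)

    def forward(self, token_ids):                  # (B, T)
        x = self.emb(token_ids).transpose(1, 2)    # (B, H, T) for Conv1d
        x = self.encoder(x).transpose(1, 2)        # (B, T, H)
        return self.classifier(x)                  # (B, T, labels)


def distill_step(student, teacher_logits, token_ids, hard_labels,
                 optimizer, tau=2.0, alpha=0.5):
    """One update: soft-target KL against the teacher plus CE on gold labels."""
    student_logits = student(token_ids)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    ce = F.cross_entropy(student_logits.reshape(-1, LABELS),
                         hard_labels.reshape(-1))
    loss = alpha * kd + (1 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    student = ConvStudent()
    opt = torch.optim.AdamW(student.parameters(), lr=3e-4)
    # Stand-ins for a real batch: teacher logits would normally come from a
    # (distilled) BERT-style model run offline or in parallel.
    ids = torch.randint(0, VOCAB, (8, T))
    teacher_logits = torch.randn(8, T, LABELS)
    labels = torch.randint(0, LABELS, (8, T))
    print("loss:", distill_step(student, teacher_logits, ids, labels, opt))
```

The design intent of such a student follows the abstract's efficiency argument: depthwise-separable convolutions keep per-token cost linear in sequence length and small in parameter count, in contrast to the quadratic complexity of self-attention and the heavy feed-forward layers it cites, while the soft-target term lets the compact model inherit the teacher's knowledge.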