Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network

Bibliographic Details
Published in: IEEE Signal Processing Letters, 2023-01, Vol. 30, pp. 1-5
Main Authors: Zhao, Wei; Wang, Zuyi; Xu, Li
Format: Article
Language: English
Description
Abstract: Mandarin text-to-speech (TTS) systems heavily depend on front-end processing, such as grapheme-to-phoneme conversion and prosodic boundary prediction, to produce expressive, human-like speech. Utilizing a pre-trained language model, such as the bidirectional encoder representations from Transformers (BERT), could significantly improve the TTS front-end's performance. However, the original BERT is too large for edge TTS applications with tight limits on memory cost and inference latency. Although a distilled BERT can alleviate this problem, a considerable efficiency barrier may still exist due to the self-attention module's quadratic complexity and the feed-forward module's heavy computation. To this end, we propose a lightweight distilled convolution network as an alternative to the distilled BERT. Unlike previous knowledge distillation methods, which commonly used the same self-attention network for the teacher and student models, we transfer knowledge from the self-attention network to a convolution network. Experiments on two major Mandarin TTS front-end tasks have shown that our distilled convolution model can achieve comparable results to various distilled BERT variants while drastically reducing the model size and inference latency.
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2023.3256319
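
The abstract describes transferring knowledge from a self-attention teacher (a distilled BERT) into a lightweight convolutional student for front-end tagging tasks such as prosodic boundary prediction. The following is a minimal, illustrative PyTorch sketch of such attention-to-convolution distillation; the layer sizes, tag set, temperature, loss weighting, and toy data are assumptions made for illustration and do not reflect the authors' actual configuration.

# Minimal sketch of attention-to-convolution knowledge distillation (assumed PyTorch setup).
# The teacher stands in for a (distilled) BERT-style encoder; the student is a lightweight
# 1-D convolution network whose cost grows linearly with sequence length.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, N_TAGS = 5000, 256, 5   # hypothetical vocabulary, hidden size, and tag count

class AttentionTeacher(nn.Module):
    """Stand-in for a self-attention (BERT-like) teacher encoder."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, N_TAGS)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        return self.head(self.enc(self.emb(tokens)))   # per-token logits

class ConvStudent(nn.Module):
    """Lightweight convolutional student."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.convs = nn.Sequential(
            nn.Conv1d(DIM, DIM, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(DIM, DIM, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(DIM, N_TAGS)

    def forward(self, tokens):
        h = self.emb(tokens).transpose(1, 2)      # (batch, dim, seq_len) for Conv1d
        h = self.convs(h).transpose(1, 2)
        return self.head(h)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL term (teacher -> student) plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits.reshape(-1, N_TAGS), labels.reshape(-1))
    return alpha * soft + (1 - alpha) * hard

# Toy training step on random data, just to show the wiring.
teacher, student = AttentionTeacher().eval(), ConvStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
tokens = torch.randint(0, VOCAB, (8, 32))
labels = torch.randint(0, N_TAGS, (8, 32))
with torch.no_grad():
    t_logits = teacher(tokens)
loss = distillation_loss(student(tokens), t_logits, labels)
loss.backward()
opt.step()

The key design choice this sketch mirrors is that the teacher and student need not share an architecture: the student only has to match the teacher's output distribution (and, in richer setups, intermediate representations), so a convolution network can replace the quadratic-cost self-attention stack at inference time.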