Mandarin Text-to-Speech Front-End with Lightweight Distilled Convolution Network
Published in: IEEE Signal Processing Letters, 2023-01, Vol. 30, pp. 1-5
Main authors: , ,
Format: Article
Language: English
Keywords:
Abstract: Mandarin text-to-speech (TTS) systems heavily depend on front-end processing, such as grapheme-to-phoneme conversion and prosodic boundary prediction, to produce expressive, human-like speech. Utilizing a pre-trained language model, such as the bidirectional encoder representations from Transformers (BERT), could significantly improve the TTS front-end's performance. However, the original BERT is too large for edge TTS applications with tight limits on memory cost and inference latency. Although a distilled BERT can alleviate this problem, a considerable efficiency barrier may still exist due to the self-attention module's quadratic complexity and the feed-forward module's heavy computation. To this end, we propose a lightweight distilled convolution network as an alternative to the distilled BERT. Unlike previous knowledge distillation methods, which commonly used the same self-attention network for the teacher and student models, we transfer knowledge from the self-attention network to a convolution network. Experiments on two major Mandarin TTS front-end tasks have shown that our distilled convolution model can achieve comparable results to various distilled BERT variants while drastically reducing the model size and inference latency.
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2023.3256319
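The abstract describes distilling knowledge from a self-attention (BERT-style) teacher into a lightweight convolution student for token-level front-end tasks such as prosodic boundary prediction. The record does not give the paper's actual architecture, losses, or hyper-parameters, so the following is only a minimal PyTorch sketch of the general cross-architecture distillation idea: a depthwise-separable 1-D convolution student trained with a weighted sum of a soft-target KL term against pre-computed teacher logits and a cross-entropy term on gold labels. All names, sizes, and constants here (ConvStudent, distill_step, VOCAB, HID, tau, alpha, etc.) are hypothetical assumptions, not the authors' implementation.

```python
# Illustrative sketch only: cross-architecture knowledge distillation from a
# self-attention teacher to a lightweight 1-D convolution student for
# token-level labels (e.g. prosodic boundary tags). Shapes and
# hyper-parameters are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, LABELS, T = 6000, 256, 5, 32  # assumed vocabulary/model sizes


class ConvStudent(nn.Module):
    """Depthwise-separable 1-D conv encoder with a per-token classifier."""

    def __init__(self, vocab=VOCAB, hid=HID, labels=LABELS, layers=4, k=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, hid)
        blocks = []
        for _ in range(layers):
            blocks += [
                nn.Conv1d(hid, hid, k, padding=k // 2, groups=hid),  # depthwise
                nn.Conv1d(hid, hid, 1),                              # pointwise
                nn.GELU(),
            ]
        self.encoder = nn.Sequential(*blocks)
        self.classifier = nn.Linear(hid, labels)

    def forward(self, token_ids):                  # (B, T)
        x = self.emb(token_ids).transpose(1, 2)    # (B, H, T) for Conv1d
        x = self.encoder(x).transpose(1, 2)        # (B, T, H)
        return self.classifier(x)                  # (B, T, labels)


def distill_step(student, teacher_logits, token_ids, hard_labels,
                 optimizer, tau=2.0, alpha=0.5):
    """One update: soft-target KL against the teacher plus CE on gold labels."""
    student_logits = student(token_ids)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    ce = F.cross_entropy(student_logits.reshape(-1, LABELS),
                         hard_labels.reshape(-1))
    loss = alpha * kd + (1 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    student = ConvStudent()
    opt = torch.optim.AdamW(student.parameters(), lr=3e-4)
    # Stand-ins for a real batch: teacher logits would normally come from a
    # (distilled) BERT-style model run offline or in parallel.
    ids = torch.randint(0, VOCAB, (8, T))
    teacher_logits = torch.randn(8, T, LABELS)
    labels = torch.randint(0, LABELS, (8, T))
    print("loss:", distill_step(student, teacher_logits, ids, labels, opt))
```

The design intent of such a student follows the abstract's efficiency argument: depthwise-separable convolutions keep per-token cost linear in sequence length and small in parameter count, in contrast to the quadratic complexity of self-attention and the heavy feed-forward layers it cites, while the soft-target term lets the compact model inherit the teacher's knowledge.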