Analysis and solution to aliasing artifacts in neural waveform generation models
Published in: Applied Acoustics, 2023-02, Vol. 203, p. 109183, Article 109183
Format: Article
Language: English
Online access: Full text
Abstract:
•The aliasing artifacts produced by upsampling-based waveform generators are analyzed.
•A method to suppress aliasing while improving high-frequency details is proposed.
•Aliasing suppression performance is assessed using an artifact detection technique.

In recent years, with the application of deep learning in speech synthesis, waveform generation models based on generative adversarial networks have achieved quality comparable to natural speech. In most waveform generators, a neural upsampling unit plays an essential role, as it upsamples acoustic features to the sample-point level. However, aliasing artifacts are observed in the generated speech regardless of whether transposed convolution, subpixel convolution, or nearest-neighbor interpolation is used as the temporal upsampling layer. According to the Shannon-Nyquist sampling theorem, non-ideal upsampling filters produce aliasing. This paper systematically analyzes how aliasing artifacts arise in waveform generators built on non-ideal upsampling. We investigate the generation processes of HiFi-GAN and VITS and find that high-frequency spectral details are generated from low-frequency structures through nonlinear transformations. However, these nonlinear transformations cannot completely remove the low-frequency spectral imprint, which eventually manifests as spectral artifacts in the generated waveforms. Applying a low-pass filter after the upsampling layer suppresses aliasing artifacts but causes significant performance drops. The experimental results also show that aliasing speeds up the training process by filling high-frequency vacancies. We therefore propose mixing high-frequency components into the low-pass filtered features, allowing models to converge faster while naturally avoiding artifacts. In addition, to assess the efficacy of our method, we develop an artifact detection algorithm based on structural similarity.
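The strategy summarized above (low-pass filtering the upsampled features, then mixing high-frequency content back in) can be illustrated with a short sketch. This is not the authors' implementation: it is a minimal PyTorch example that assumes the anti-aliasing filter is a fixed windowed-sinc FIR and that the re-injected high-frequency content comes from a small learned branch; all names (`AntiAliasedUpsample`, `lowpass_fir`, `hf_mix`) are hypothetical.

```python
# Hypothetical sketch of "low-pass filter after upsampling + high-frequency mix-in".
# Not the paper's exact method; illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def lowpass_fir(num_taps: int, cutoff: float) -> torch.Tensor:
    """Windowed-sinc low-pass FIR taps; cutoff is a fraction of Nyquist (0..1)."""
    n = torch.arange(num_taps) - (num_taps - 1) / 2
    h = cutoff * torch.special.sinc(cutoff * n)            # ideal low-pass impulse response
    h = h * torch.hann_window(num_taps, periodic=False)    # Hann window to limit ripple
    return h / h.sum()                                      # normalize to unit DC gain


class AntiAliasedUpsample(nn.Module):
    """Transposed-conv upsampling -> fixed low-pass filter -> learned high-frequency mix-in."""

    def __init__(self, channels: int, stride: int, num_taps: int = 33):
        super().__init__()
        self.up = nn.ConvTranspose1d(
            channels, channels, kernel_size=2 * stride, stride=stride, padding=stride // 2
        )
        # Fixed anti-aliasing filter with cutoff at the pre-upsampling Nyquist (1/stride).
        taps = lowpass_fir(num_taps, cutoff=1.0 / stride)
        self.register_buffer("lp_taps", taps.view(1, 1, -1).repeat(channels, 1, 1))
        self.channels = channels
        self.num_taps = num_taps
        # Small learned branch that regenerates high-frequency content to be mixed back in.
        self.hf_branch = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.hf_mix = nn.Parameter(torch.tensor(0.1))        # learned mixing weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.up(x)                                        # (B, C, T*stride), contains imaging/aliasing
        pad = (self.num_taps - 1) // 2
        y_lp = F.conv1d(F.pad(y, (pad, pad), mode="reflect"),
                        self.lp_taps, groups=self.channels)  # aliasing suppressed
        y_hf = y - y_lp                                       # residual high-frequency part
        # Mix a transformed high-frequency component back into the filtered features.
        return y_lp + self.hf_mix * self.hf_branch(y_hf)


if __name__ == "__main__":
    x = torch.randn(2, 128, 50)                               # (batch, channels, frames)
    block = AntiAliasedUpsample(channels=128, stride=8)
    print(block(x).shape)                                     # torch.Size([2, 128, 400])
```

The cutoff at 1/stride of Nyquist removes the imaging components introduced by the upsampling step, while the learned mixing weight lets the model reintroduce the high-frequency energy that, per the abstract, otherwise accelerates training when supplied by aliasing.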
ISSN: 0003-682X, 1872-910X
DOI: 10.1016/j.apacoust.2022.109183