vTTS: visual-text to speech
Saved in:
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | This paper proposes visual-text to speech (vTTS), a method for synthesizing
speech from visual text (i.e., text as an image). Conventional TTS converts
phonemes or characters into discrete symbols and synthesizes a speech waveform
from them, thus losing the visual features that the characters essentially
have. Therefore, our method synthesizes speech not from discrete symbols but
from visual text. The proposed vTTS extracts visual features with a
convolutional neural network and then generates acoustic features with a
non-autoregressive model inspired by FastSpeech2. Experimental results show
that 1) vTTS is capable of generating speech with naturalness comparable to or
better than a conventional TTS, 2) it can transfer emphasis and emotion
attributes in visual text to speech without additional labels and
architectures, and 3) it can synthesize more natural and intelligible speech
from unseen and rare characters than conventional TTS. |
---|---|
DOI: | 10.48550/arxiv.2203.14725 |
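The abstract describes replacing discrete symbol embeddings with visual features extracted from an image of the text by a convolutional network. As a toy illustration of that idea, the NumPy sketch below convolves a rendered "visual text" strip with a small kernel bank and collapses the height axis, so each image column yields one feature vector in place of a per-token embedding. Every name, shape, and kernel here is a hypothetical stand-in, not the authors' architecture (which pairs a CNN extractor with a FastSpeech2-style non-autoregressive acoustic model).

```python
import numpy as np

def conv2d_valid(img, kernels):
    """Valid 2-D convolution of a single-channel image with a kernel bank.
    img: (H, W); kernels: (K, kh, kw). Returns (K, H-kh+1, W-kw+1)."""
    K, kh, kw = kernels.shape
    H, W = img.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(img[i:i + kh, j:j + kw] * kernels[k])
    return out

def visual_text_features(text_image, kernels):
    """ReLU the feature maps, then average over the height axis so each
    image column yields one feature vector, analogous to per-character
    embeddings in a symbol-based TTS front end."""
    fmaps = conv2d_valid(text_image, kernels)       # (K, H', W')
    return np.maximum(fmaps, 0.0).mean(axis=1).T    # (W', K): one vector per column

rng = np.random.default_rng(0)
text_image = rng.random((32, 128))        # stand-in for a rendered text-line image
kernels = rng.standard_normal((8, 3, 3))  # 8 hypothetical learned filters
feats = visual_text_features(text_image, kernels)
print(feats.shape)  # (126, 8): 126 column positions, 8 feature channels
```

In the actual system such column-wise features would be consumed by a non-autoregressive acoustic model; here they are simply returned so their shape can be inspected.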