Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them together. Experimental results on three dataset...
Gespeichert in:
Hauptverfasser: | , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | We present Translatotron 2, a neural direct speech-to-speech translation
model that can be trained end-to-end. Translatotron 2 consists of a speech
encoder, a linguistic decoder, an acoustic synthesizer, and a single attention
module that connects them together. Experimental results on three datasets
consistently show that Translatotron 2 outperforms the original Translatotron
by a large margin on both translation quality (up to +15.5 BLEU) and speech
generation quality, and approaches the same of cascade systems. In addition, we
propose a simple method for preserving speakers' voices from the source speech
to the translation speech in a different language. Unlike existing approaches,
the proposed method is able to preserve each speaker's voice on speaker turns
without requiring for speaker segmentation. Furthermore, compared to existing
approaches, it better preserves speaker's privacy and mitigates potential
misuse of voice cloning for creating spoofing audio artifacts. |
---|---|
DOI: | 10.48550/arxiv.2107.08661 |