DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
Format: | Article |
Language: | English |
Online access: | Order full text |
Abstract: | Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose a novel method, DubWise Multi-modal Large Language Model (LLM)-based Text-to-Speech (TTS), which can control the duration of the synthesized speech so that it aligns well with the speaker's lip movements in the reference video, even when the spoken text is different or in a different language. To accomplish this, we propose to utilize cross-modal attention techniques in a pre-trained GPT-based TTS. We combine linguistic tokens from text, speaker identity tokens via a voice cloning network, and video tokens via a proposed duration controller network. We demonstrate the effectiveness of our system on the Lip2Wav-Chemistry and LRS2 datasets. The proposed method also achieves improved lip sync and naturalness compared to state-of-the-art methods in both the same-language, different-text (i.e., non-parallel) and the different-language, different-text (i.e., cross-lingual) scenarios. |
DOI: | 10.48550/arxiv.2406.08802 |
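
As an illustration of the fusion described in the abstract, the sketch below shows one way linguistic, speaker, and video token streams could be combined through cross-modal attention before a GPT-style TTS decoder. This is not the authors' implementation: the module name `CrossModalDurationFusion`, the dimensions (`d_model=512`, 768-dimensional video features), and the use of a single cross-attention layer are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code): fusing text, speaker, and video tokens
# with cross-modal attention to form a conditioning prefix for a GPT-style TTS decoder.
import torch
import torch.nn as nn

class CrossModalDurationFusion(nn.Module):
    """Text tokens attend over video-derived tokens (lip/duration cues), then a
    speaker token from a voice-cloning network is prepended as a prefix."""
    def __init__(self, d_model=512, n_heads=8, text_vocab=256, video_dim=768):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)        # linguistic tokens
        self.video_proj = nn.Linear(video_dim, d_model)          # video features -> video tokens
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_ids, video_feats, speaker_emb):
        # text_ids: (B, T_text) int64; video_feats: (B, T_video, video_dim); speaker_emb: (B, d_model)
        txt = self.text_emb(text_ids)                            # (B, T_text, d_model)
        vid = self.video_proj(video_feats)                       # (B, T_video, d_model)
        # Cross-modal attention: each linguistic token queries the video tokens,
        # so timing cues from the reference video condition the text sequence.
        fused, _ = self.cross_attn(query=txt, key=vid, value=vid)
        fused = self.norm(txt + fused)
        # Prepend the speaker identity token as part of the conditioning prefix.
        prefix = torch.cat([speaker_emb.unsqueeze(1), fused], dim=1)
        return prefix                                            # fed to a pre-trained GPT-based TTS decoder

# Shape check with random inputs; a real system would use the paper's tokenizers and encoders.
fusion = CrossModalDurationFusion()
prefix = fusion(torch.randint(0, 256, (2, 40)), torch.randn(2, 75, 768), torch.randn(2, 512))
print(prefix.shape)  # torch.Size([2, 41, 512])
```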