VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
Format: Article
Language: English
Abstract: We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.
DOI: 10.48550/arxiv.2412.19259
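
The abstract describes a diffusion transformer conditioned on two signals at once: aligned text/speech content and an environmental condition derived from a visual prompt. The sketch below is a minimal, speculative illustration of how one such block could combine the two conditions, using cross-attention for the aligned text stream and AdaLN-style modulation for the environment embedding. The class name `DualDiTBlock`, the conditioning scheme, and all tensor shapes are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch of a dual-condition DiT block (PyTorch).
# Not the paper's implementation; names and conditioning choices are assumptions.
import torch
import torch.nn as nn

class DualDiTBlock(nn.Module):
    """One transformer block conditioned on (1) aligned text/phoneme features
    via cross-attention and (2) an environment embedding via adaptive LayerNorm."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        # Environment embedding -> per-block scale/shift pairs (AdaLN-style).
        self.adaln = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_ctx, env_emb):
        # x:        (B, T, dim)  noisy latent audio tokens
        # text_ctx: (B, S, dim)  aligned text/speech content features
        # env_emb:  (B, dim)     environment embedding (e.g., from an image-to-audio translator)
        s1, b1, s2, b2, s3, b3 = self.adaln(env_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        x = x + self.cross_attn(h, text_ctx, text_ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + s3) + b3
        return x + self.mlp(h)
```

In this reading, the cross-attention path carries the intelligibility-critical text alignment while the AdaLN path injects the global acoustic environment; whether VoiceDiT uses this particular split is not stated in the abstract.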