Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
Format: Article
Language: English
Abstract: Recent years have witnessed remarkable progress in multi-view diffusion
models for 3D content creation. However, there remains a significant gap in
image quality and prompt-following ability compared to 2D diffusion models. A
critical bottleneck is the scarcity of high-quality 3D objects with detailed
captions. To address this challenge, we propose Bootstrap3D, a novel framework
that automatically generates an arbitrary quantity of multi-view images to
assist in training multi-view diffusion models. Specifically, we introduce a
data generation pipeline that employs (1) 2D and video diffusion models to
generate multi-view images based on constructed text prompts, and (2) our
fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting
inaccurate captions. Leveraging this pipeline, we have generated 1 million
high-quality synthetic multi-view images with dense descriptive captions to
address the shortage of high-quality 3D data. Furthermore, we present a
Training Timestep Reschedule (TTR) strategy that leverages the denoising
process to learn multi-view consistency while maintaining the original 2D
diffusion prior. Extensive experiments demonstrate that Bootstrap3D can
generate high-quality multi-view images with superior aesthetic quality,
image-text alignment, and maintained view consistency.
DOI: 10.48550/arxiv.2406.00093
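
The abstract describes a two-stage data pipeline: generate multi-view images from text prompts with 2D/video diffusion models, then use the fine-tuned 3D-aware MV-LLaVA to filter low-quality sets and rewrite inaccurate captions. The following is a minimal sketch of that flow, assuming hypothetical helper functions (`generate_views`, `score_and_recaption`) and an illustrative score threshold; none of these names, signatures, or values come from the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MultiViewSample:
    prompt: str
    views: List[str]      # paths to the generated view images
    caption: str          # dense descriptive caption (possibly rewritten)
    score: float          # quality score assigned by the filter

def generate_views(prompt: str, n_views: int = 4) -> List[str]:
    # Placeholder for the 2D/video-diffusion stage; a real implementation
    # would synthesize and save multi-view images for the prompt.
    return [f"view_{i}.png" for i in range(n_views)]

def score_and_recaption(views: List[str], prompt: str) -> Tuple[float, str]:
    # Placeholder for the MV-LLaVA stage: score multi-view quality and
    # rewrite the caption. Returned values here are dummies.
    return 1.0, f"Dense descriptive caption for: {prompt}"

def build_dataset(prompts: List[str], min_score: float = 0.8) -> List[MultiViewSample]:
    kept = []
    for prompt in prompts:
        views = generate_views(prompt)
        score, caption = score_and_recaption(views, prompt)
        if score >= min_score:  # keep only high-quality multi-view sets
            kept.append(MultiViewSample(prompt, views, caption, score))
    return kept

if __name__ == "__main__":
    print(build_dataset(["a ceramic teapot shaped like a cat"]))
```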