Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views
Format: Article
Language: English
Abstract: Synthesizing multi-view 3D from a single image is a significant but challenging task. Zero-1-to-3 methods have achieved great success by lifting a 2D latent diffusion model to the 3D scope: the target view image is generated from a single-view source image with the camera pose as condition information. However, because a single input image is highly sparse, Zero-1-to-3 tends to produce geometry and appearance inconsistencies across views, especially for complex objects. To tackle this issue, we propose to supply more condition information to the generation model, but in a self-prompted way. We construct a cascade framework with two Zero-1-to-3 models, named Cascade-Zero123, which progressively extracts 3D information from the source image. Specifically, several nearby views are first generated by the first-stage model and then fed into the second-stage model, along with the source image, as generation conditions. With these amplified self-prompted condition images, our Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. Experimental results demonstrate remarkable improvements, especially for complex and challenging scenes involving insects, humans, transparent objects, and stacks of multiple objects. More demos and code are available at https://cascadezero123.github.io.
DOI: 10.48550/arxiv.2312.04424
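
To make the cascade concrete, the sketch below outlines the two-stage inference flow the abstract describes: a first Zero-1-to-3 model synthesizes a few nearby views ("self-prompts"), and a second model conditions on the source image plus those views to render the target pose. All names here (`Zero123Model`, `generate_view`, `CameraPose`) and the specific nearby poses are illustrative assumptions, not the authors' actual interface.

```python
# A minimal sketch of the Cascade-Zero123 two-stage pipeline described in the
# abstract. Every class, method, and pose value below is a hypothetical
# placeholder chosen for illustration, not the paper's real API.

from dataclasses import dataclass


@dataclass
class CameraPose:
    elevation: float  # degrees, relative to the source view
    azimuth: float    # degrees, relative to the source view


class Zero123Model:
    """Stand-in for a Zero-1-to-3-style pose-conditioned diffusion model."""

    def generate_view(self, condition_images, condition_poses, target_pose):
        # Would run the diffusion sampler conditioned on one or more
        # (image, relative pose) pairs and return the synthesized view.
        ...


def cascade_zero123(source_image, target_pose,
                    stage1: Zero123Model, stage2: Zero123Model):
    identity = CameraPose(elevation=0.0, azimuth=0.0)

    # Stage 1: self-prompt by synthesizing several nearby views around the
    # source image with the first model. The pose offsets are assumptions.
    nearby_poses = [CameraPose(elevation=0.0, azimuth=a) for a in (-30.0, 30.0)]
    nearby_views = [
        stage1.generate_view([source_image], [identity], pose)
        for pose in nearby_poses
    ]

    # Stage 2: condition the second model on the source image plus the
    # self-prompted nearby views, so the target view is generated from a
    # denser set of conditions than the single input image alone.
    condition_images = [source_image] + nearby_views
    condition_poses = [identity] + nearby_poses
    return stage2.generate_view(condition_images, condition_poses, target_pose)
```

The key design point this sketch captures is that the extra conditions are not additional captured photographs but views the first-stage model generates itself, which is why the abstract calls the approach "self-prompted".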