LaVin-DiT: Large Vision Diffusion Transformer
| Main Authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online Access: | Order full text |
| Abstract: | This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks within a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt the spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, we implement in-context learning: input-target pairs serve as task context, guiding the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data provided as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models will be open-sourced. |
| DOI: | 10.48550/arxiv.2411.11505 |
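
The abstract describes an inference procedure in which a task-specific context set of input-target pairs, plus the query, is encoded into a shared latent space, a diffusion transformer denoises a target latent conditioned on that context, and the result is decoded back to pixels. The sketch below is only a toy illustration of that flow under those stated assumptions, not the authors' implementation: `ToyLatentEncoder`, `ToyDiffusionTransformer`, and `in_context_inference` are hypothetical placeholders, and the denoising loop is a simplified stand-in for a real diffusion sampler.

```python
# Toy sketch (NOT the authors' code) of in-context inference as described in the
# abstract: encode context pairs and the query into latents, denoise a target
# latent with a transformer conditioned on them, then (in a real system) decode.
import torch
import torch.nn as nn


class ToyLatentEncoder(nn.Module):
    """Hypothetical stand-in for the spatial-temporal VAE encoder: image -> latent."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class ToyDiffusionTransformer(nn.Module):
    """Hypothetical stand-in for the joint diffusion transformer: predicts noise
    for the target latent, conditioned on context latents and the query latent."""
    def __init__(self, latent_dim: int = 64, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_target: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # Append the noisy target token to the conditioning tokens along the sequence axis.
        tokens = torch.cat([condition, noisy_target.unsqueeze(1)], dim=1)
        out = self.backbone(tokens)
        return self.head(out[:, -1])  # noise estimate for the target token


@torch.no_grad()
def in_context_inference(encoder, denoiser, context_pairs, query, steps: int = 50):
    """Generate a target latent for `query`, guided by (input, target) context pairs."""
    cond_tokens = []
    for ctx_input, ctx_target in context_pairs:
        cond_tokens.append(encoder(ctx_input))
        cond_tokens.append(encoder(ctx_target))
    cond_tokens.append(encoder(query))
    condition = torch.stack(cond_tokens, dim=1)  # (batch, tokens, latent_dim)

    # Simplified reverse-diffusion loop: start from noise, repeatedly remove
    # a fraction of the predicted noise (a placeholder for a real sampler).
    z = torch.randn(query.shape[0], condition.shape[-1])
    for _ in range(steps):
        z = z - denoiser(z, condition) / steps
    return z  # a full pipeline would decode z with the VAE decoder


if __name__ == "__main__":
    enc, dit = ToyLatentEncoder(), ToyDiffusionTransformer()
    rand_image = lambda: torch.randn(1, 3, 32, 32)
    context = [(rand_image(), rand_image()), (rand_image(), rand_image())]  # task demonstrations
    latent = in_context_inference(enc, dit, context, query=rand_image())
    print(latent.shape)  # torch.Size([1, 64])
```

Because the task is specified entirely by the context pairs rather than by task-specific heads, the same frozen model can, in principle, be pointed at a different task simply by swapping the context set, which is the fine-tuning-free generalization the abstract claims.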