Scalable, Tokenization-Free Diffusion Model Architectures with Efficient Initial Convolution and Fixed-Size Reusable Structures for On-Device Image Generation
Saved in:
Main authors: , , ,
Format: Article
Language: eng
Subjects:
Online access: Order full text
Summary: Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges when realized on-device. Vision Transformers require positional embeddings to maintain correspondence between the tokens processed by the transformer, although they offer the advantage of using fixed-size, reusable repetitive blocks following tokenization. The U-Net architecture lacks these attributes, as it uses variable-sized intermediate blocks for down-convolution and up-convolution in the noise-estimation backbone of the diffusion process. To address these issues, we propose an architecture that uses a fixed-size, reusable transformer block as its core structure, making it more suitable for hardware implementation. Our architecture is characterized by low complexity, a token-free design, the absence of positional embeddings, uniformity, and scalability, making it highly suitable for deployment on mobile and resource-constrained devices. The proposed model exhibits competitive and consistent performance across both unconditional and conditional image generation tasks, and achieves a state-of-the-art FID score of 1.6 for unconditional image generation on the CelebA dataset.
DOI: 10.48550/arxiv.2411.06119
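The abstract describes, at a high level, an initial convolution feeding a single fixed-size, reusable transformer-style block, with no patch tokenization and no positional embeddings, so that every intermediate tensor keeps the same shape. The sketch below is a minimal, hypothetical PyTorch illustration of that general pattern, not the paper's actual block design: the class names (`TokenFreeBackbone`, `ReusableTransformerBlock`), channel width, depth, and head count are assumptions made for illustration, and the timestep/class conditioning a diffusion backbone needs is omitted for brevity.

```python
import torch
import torch.nn as nn


class ReusableTransformerBlock(nn.Module):
    # A fixed-size, pre-norm transformer block applied to a flattened feature map.
    # No positional embedding is added; the convolutional stem is assumed to
    # supply whatever spatial locality the model needs.
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 4 * channels),
            nn.GELU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, positions, channels); same shape in and out, so one block
        # definition can be reused at every depth.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class TokenFreeBackbone(nn.Module):
    # An initial convolution lifts the image to a fixed channel width; the same
    # fixed-size block is then reused `depth` times. Unlike a U-Net, no stage
    # changes resolution or channel count. Conditioning inputs are omitted.
    def __init__(self, in_channels: int = 3, channels: int = 64, depth: int = 8):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList(
            [ReusableTransformerBlock(channels) for _ in range(depth)]
        )
        self.head = nn.Conv2d(channels, in_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        feat = self.stem(x)                              # (B, C, H, W)
        seq = feat.flatten(2).transpose(1, 2)            # (B, H*W, C), no patch tokens
        for block in self.blocks:
            seq = block(seq)
        feat = seq.transpose(1, 2).reshape(b, -1, h, w)  # back to (B, C, H, W)
        return self.head(feat)                           # noise estimate, same shape as input


if __name__ == "__main__":
    model = TokenFreeBackbone()
    noisy = torch.randn(1, 3, 32, 32)
    print(model(noisy).shape)  # torch.Size([1, 3, 32, 32])
```

Because every intermediate tensor has the same shape, a hardware implementation only needs to support one block geometry, which is the property the abstract emphasizes; the paper's reported efficiency and FID results depend on its actual block design, which the abstract does not detail.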