Faster Diffusion via Temporal Attention Decomposition
Saved in:
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
Summary: | We explore the role of the attention mechanism during inference in
text-conditional diffusion models. Empirical observations suggest that
cross-attention outputs converge to a fixed point after several inference
steps. This convergence time naturally divides the inference process into two
phases: an initial phase for planning text-oriented visual semantics, which
are then translated into images in a subsequent fidelity-improving phase.
Cross-attention is essential in the initial phase but almost irrelevant
thereafter, whereas self-attention initially plays a minor role but becomes
crucial in the second phase. These findings yield a simple, training-free
method, temporally gating the attention (TGATE), which efficiently generates
images by caching and reusing attention outputs at scheduled time steps.
Experimental results show that, when applied to various existing
text-conditional diffusion models, TGATE accelerates them by 10%-50%.
The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE. |
---|---|
DOI: | 10.48550/arxiv.2404.02747 |
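The caching-and-reuse idea described in the summary can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' released implementation: the wrapper class `TGateCache`, its `gate_step` parameter, and the stand-in attention function are all hypothetical names chosen for this sketch.

```python
import numpy as np

class TGateCache:
    """Toy sketch of temporally gating an attention call: run the real
    (expensive) attention during the first `gate_step` inference steps,
    cache the last output, then reuse that cached tensor for all remaining
    steps, skipping the attention computation entirely."""

    def __init__(self, attn_fn, gate_step):
        self.attn_fn = attn_fn      # the real attention callable
        self.gate_step = gate_step  # step index after which output is reused
        self.cached = None
        self.calls = 0              # counts how often attn_fn actually ran

    def __call__(self, step, hidden, context):
        if step < self.gate_step or self.cached is None:
            self.cached = self.attn_fn(hidden, context)
            self.calls += 1
        return self.cached


def toy_cross_attention(hidden, context):
    # Stand-in for scaled dot-product cross-attention (softmax over scores).
    scores = hidden @ context.T / np.sqrt(hidden.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context


rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))    # toy image-token features
context = rng.standard_normal((6, 8))   # toy text-token features

gated = TGateCache(toy_cross_attention, gate_step=10)
outputs = [gated(step, hidden, context) for step in range(50)]
print(gated.calls)  # attention ran only during the first 10 "planning" steps
```

In a real diffusion pipeline the gate step would be chosen near the observed convergence point of the cross-attention outputs, so the planning phase still sees fresh attention while the fidelity-improving phase reuses the cache.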