LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution
Format: Article
Language: English
Online access: Order full text
Abstract: Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our codes are available at https://github.com/jwgdmkj/LMLT.
DOI: 10.48550/arxiv.2409.03516
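The abstract describes the core mechanism: features are split along the channel dimension, lower heads are pooled to coarser spatial sizes, self-attention runs on each head, and each lower head's result is passed up into the next, finer head. Below is a minimal PyTorch-style sketch of that low-to-high idea for illustration only; the class and parameter names (LowToHighMultiLevelAttention, SimpleSelfAttention, num_levels) and the pooling/upsampling choices are assumptions, not the authors' implementation, which additionally uses window self-attention and is available in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSelfAttention(nn.Module):
    """Plain self-attention over flattened spatial positions.

    Stand-in for the attention used in LMLT (which operates on windows);
    kept global here to keep the sketch short.
    """

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v                 # (B, H*W, C)
        return self.proj(out).transpose(1, 2).reshape(b, c, h, w)


class LowToHighMultiLevelAttention(nn.Module):
    """Illustrative multi-level attention: channels are split into heads,
    lower heads see spatially pooled (coarser) features, and each head's
    output is upsampled and added into the next, finer head."""

    def __init__(self, dim, num_levels=4):
        super().__init__()
        assert dim % num_levels == 0
        self.num_levels = num_levels
        head_dim = dim // num_levels
        self.heads = nn.ModuleList(
            [SimpleSelfAttention(head_dim) for _ in range(num_levels)]
        )

    def forward(self, x):                              # x: (B, C, H, W)
        chunks = x.chunk(self.num_levels, dim=1)       # split along channels
        prev, outs = None, []
        for level, (head, feat) in enumerate(zip(self.heads, chunks)):
            # lowest head (level 0) is pooled the most; highest keeps full size
            factor = 2 ** (self.num_levels - 1 - level)
            pooled = F.avg_pool2d(feat, factor) if factor > 1 else feat
            if prev is not None:
                # integrate the previous (coarser) head's result into this head
                pooled = pooled + F.interpolate(
                    prev, size=pooled.shape[-2:], mode="nearest"
                )
            prev = head(pooled)
            outs.append(F.interpolate(prev, size=x.shape[-2:], mode="nearest"))
        return torch.cat(outs, dim=1)                  # (B, C, H, W)


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)                  # B, C, H, W
    mla = LowToHighMultiLevelAttention(dim=64, num_levels=4)
    print(mla(feat).shape)                             # torch.Size([1, 64, 32, 32])
```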