LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition
Format: Article
Language: English
Abstract: The Vision Transformer (ViT) excels in accuracy when handling high-resolution images, yet it confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements. To address this, we present the Localization and Focus Vision Transformer (LF-ViT). This model operates by strategically curtailing computational demands without impinging on performance. In the Localization phase, a reduced-resolution image is processed; if a definitive prediction remains elusive, our Neighborhood Global Class Attention (NGCA) mechanism is triggered, identifying class-discriminative regions based on the initial findings. Subsequently, in the Focus phase, the designated region is cropped from the original image to enhance recognition. Uniquely, LF-ViT employs consistent parameters across both phases, ensuring seamless end-to-end optimization. Our empirical tests affirm LF-ViT's effectiveness: it reduces DeiT-S's FLOPs by 63% and doubles its throughput. The code for this project is available at https://github.com/edgeai1/LF-ViT.git.
DOI: 10.48550/arxiv.2402.00033
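
To make the two-phase procedure described in the abstract concrete, the following Python sketch outlines one way a Localization/Focus inference loop could look. It is an illustrative assumption rather than the authors' implementation: the model interface (`return_attention`), the confidence threshold, the square patch grid, and the `locate_discriminative_region` helper are hypothetical stand-ins for the actual NGCA mechanism; the official code lives at https://github.com/edgeai1/LF-ViT.git.

```python
# Illustrative sketch of a two-phase (Localization -> Focus) inference loop.
# All interfaces here are assumptions, not the LF-ViT API; see the official
# repository at https://github.com/edgeai1/LF-ViT.git for the real method.
import torch
import torch.nn.functional as F


def locate_discriminative_region(cls_attn, orig_hw, window=4):
    """Map the highest-attention patch to a square window in original-image
    pixel coordinates (a simple stand-in for NGCA's region selection)."""
    H, W = orig_hw
    grid = int(cls_attn.numel() ** 0.5)      # assume a square patch grid
    attn = cls_attn.reshape(grid, grid)
    idx = int(attn.flatten().argmax())
    gy, gx = idx // grid, idx % grid
    # Center a window x window neighborhood on the peak patch, clamped to the grid.
    y0 = max(0, min(gy - window // 2, grid - window))
    x0 = max(0, min(gx - window // 2, grid - window))
    return (y0 * H // grid, x0 * W // grid,
            (y0 + window) * H // grid, (x0 + window) * W // grid)


@torch.no_grad()
def two_phase_inference(model, image, threshold=0.5, low_res=112, full_res=224):
    """Classify a single image (shape 1 x 3 x H x W) with a shared-parameter
    ViT: exit early if the low-resolution pass is confident, otherwise re-run
    the same model on a class-discriminative crop of the original image."""
    # Localization phase: classify a down-sampled copy of the input.
    small = F.interpolate(image, size=(low_res, low_res),
                          mode="bilinear", align_corners=False)
    logits, cls_attn = model(small, return_attention=True)  # assumed signature
    conf, pred = logits.softmax(dim=-1).max(dim=-1)
    if conf.item() >= threshold:
        return pred  # confident early exit, no Focus phase needed

    # Focus phase: crop the most discriminative region from the *original*
    # image and classify it with the same model (shared parameters).
    y0, x0, y1, x1 = locate_discriminative_region(cls_attn, image.shape[-2:])
    crop = F.interpolate(image[..., y0:y1, x0:x1], size=(full_res, full_res),
                         mode="bilinear", align_corners=False)
    focus_logits, _ = model(crop, return_attention=True)
    return focus_logits.argmax(dim=-1)
```

In such a scheme the confidence threshold trades accuracy for compute: a lower threshold lets more images exit after the cheap Localization pass, while a higher one routes more of them through the Focus phase.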