PatchSkip: A lightweight technique for effectively alleviating over-smoothing in vision transformers

Bibliographic Details
Published in: Neurocomputing (Amsterdam) 2024-10, Vol. 600, p. 128112, Article 128112
Authors: Zhao, Jiafeng; Ye, Xiang; Li, Bohan; Li, Yong
Format: Article
Language: English
Online access: Full text
Description
Abstract: Recently, vision transformers (ViTs) have encountered the over-smoothing problem, which reduces their capacity by mapping input patches to similar latent representations. Existing methods introduce regularization terms to alleviate over-smoothing but often increase computational costs. To address this, this paper proposes PatchSkip, a novel and flexible dropout variant that alleviates the over-smoothing problem of ViTs in a lightweight manner. Specifically, PatchSkip draws inspiration from the fact that the analogous over-smoothing problem in GNNs is primarily caused by static adjacency matrices, which impose a solitary message-passing mode between nodes. PatchSkip constructs graphs from patch embeddings and analyzes the corresponding adjacency matrices in ViTs. By randomly selecting specific patch embeddings to bypass transformer blocks, PatchSkip is proven to generate varied adjacency matrices and thus acts as a multi-mode message-passing engine, providing diverse modes of message passing between patches. The effectiveness of PatchSkip in preventing over-smoothing is demonstrated through theoretical proofs and empirical visualizations. Furthermore, PatchSkip is evaluated on various datasets and backbones, showing significant performance improvements while reducing computational costs. For example, when trained on Tiny-ImageNet from scratch, PatchSkip improves the performance of the vanilla CrossViT by 3.85% while reducing computational costs by over 20%.
ISSN: 0925-2312
DOI: 10.1016/j.neucom.2024.128112
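
The abstract describes PatchSkip as a dropout variant in which randomly selected patch embeddings bypass a transformer block entirely, so each forward pass sees a different effective adjacency matrix. Below is a minimal PyTorch sketch of that idea; the class name PatchSkipBlock, the skip_prob parameter, and the per-patch masking details are illustrative assumptions rather than the authors' published implementation.

import torch
import torch.nn as nn


class PatchSkipBlock(nn.Module):
    # Wraps any transformer block mapping (B, N, D) -> (B, N, D) so that, during
    # training, each patch embedding independently bypasses the block with
    # probability skip_prob (hypothetical parameter name), keeping its input value.
    def __init__(self, block: nn.Module, skip_prob: float = 0.1):
        super().__init__()
        self.block = block
        self.skip_prob = skip_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        if not self.training or self.skip_prob == 0.0:
            return out
        # Per-patch Bernoulli mask: 1 keeps the block's output, 0 skips the block
        # (that patch passes through unchanged), giving a different message-passing
        # pattern on every forward pass.
        keep = (torch.rand(x.shape[:2], device=x.device) >= self.skip_prob)
        keep = keep.unsqueeze(-1).to(x.dtype)
        return keep * out + (1.0 - keep) * x


# Usage sketch: wrap a standard encoder layer and run a dummy batch of patch tokens.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
block = PatchSkipBlock(layer, skip_prob=0.2)
tokens = torch.randn(2, 197, 64)   # (batch, patches + class token, embedding dim)
print(block(tokens).shape)         # torch.Size([2, 197, 64])

Note that this sketch still runs the block on every patch and only masks the result, so it illustrates the message-passing diversity but not the cost savings; the reduction in computation reported in the abstract would require actually excluding the skipped patches from the block's computation.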