An efficient Transformer with neighborhood contrastive tokenization for hyperspectral images classification

Bibliographic details
Published in: International journal of applied earth observation and geoinformation, 2024-07, Vol. 131, p. 103979, Article 103979
Main authors: Liang, Miaomiao; Zhang, Xianhao; Yu, Xiangchun; Yu, Lingjuan; Meng, Zhe; Zhang, Xiaohong; Jiao, Licheng
Format: Article
Language: English
Description
Abstract: The success of vision Transformers (ViTs) relies heavily on the self-attention mechanism, which in turn depends on appropriate patch tokenization. However, hyperspectral images (HSIs) often suffer from significant noise distortions and spectral uncertainty, which lead to unstable attention patterns and overfitting caused by equivocal tokenization. In this paper, we propose a neighborhood contrastive tokenization task (NeiCoT) to learn compact, semantically meaningful, and context-sensitive tokens for efficient Transformer encoding. Specifically, we employ a predictor on the patch embedding to maximize the mutual information between local individuals and their global average anchor. This encourages neighboring tokens to be relevant to one another and to participate actively in feature learning. Next, we revise a token-level contrastive loss to align the predictions with their local individuals and distinguish them from other samples in a mini-batch, which enriches the tokens with contextual semantics. Furthermore, we apply a Gaussian weighting to the tokens' contrastive loss to balance the neighborhood contribution. Finally, we propose a sequence-specific MAE framework with NeiCoT for HSI representation learning, and additionally validate NeiCoT on a supervised Transformer backbone. The results demonstrate that NeiCoT consistently enhances the robustness and generalization of the Transformer, achieving accurate object recognition and boundary localization even with limited training samples. Our code will be available at https://github.com/zoegnov07/NeiCoT.

• Efficient Transformers require compact, semantically meaningful, and context-sensitive tokens.
• A global-to-local predictor promotes neighboring context-dependent tokenization.
• Weighted token-level contrastive loss enriches tokens with instance discriminability.
• NeiCoT is effective for both self-supervised and supervised Transformer backbones.
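
The abstract's description of the loss can be made concrete with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that). It assumes an InfoNCE-style objective, which is a common way to maximize a lower bound on mutual information, a per-position linear predictor, cosine similarity with a temperature `tau`, and Gaussian weights centered on the middle of a square neighborhood. The function names, the predictor shape, and the parameters `tau` and `sigma` are all illustrative assumptions. The paper's exact formulation may differ.

```python
import numpy as np

def gaussian_weights(side, sigma=1.0):
    """Gaussian weights over a side x side neighborhood, peaked at its center.

    Assumption: weights depend on spatial distance to the center and sum to 1.
    """
    coords = np.arange(side) - (side - 1) / 2.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    w = np.exp(-(yy ** 2 + xx ** 2) / (2.0 * sigma ** 2))
    return (w / w.sum()).reshape(-1)

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (almost exactly) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def neighborhood_contrastive_loss(tokens, predictor, side, tau=0.1, sigma=1.0):
    """Gaussian-weighted, token-level InfoNCE loss (illustrative sketch).

    tokens:    (B, N, D) patch embeddings, N = side * side tokens per sample.
    predictor: (N, D, D) hypothetical linear predictor, one matrix per position,
               mapping the global average anchor to a prediction of each token.
    """
    B, N, D = tokens.shape
    assert N == side * side, "expected a square neighborhood"

    # Global average anchor of each sample, then one prediction per position.
    anchor = tokens.mean(axis=1)                          # (B, D)
    preds = np.einsum("bd,nde->bne", anchor, predictor)   # (B, N, D)

    # Cosine similarity between every prediction and the token at the same
    # position in every sample of the mini-batch.
    p = l2_normalize(preds)
    t = l2_normalize(tokens)
    logits = np.einsum("bnd,cnd->nbc", p, t) / tau        # (N, B, B)

    # Numerically stable log-softmax over the batch axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    # Positive pair: prediction and token from the same sample (the diagonal).
    # Negatives: tokens at the same position in the other samples.
    idx = np.arange(B)
    per_token = -log_prob[:, idx, idx]                    # (N, B)

    # Weight each position's loss, sum over positions, average over the batch.
    w = gaussian_weights(side, sigma)                     # (N,)
    return float((w[:, None] * per_token).sum(axis=0).mean())
```

In this sketch the negatives for a token are the tokens at the same neighborhood position in other samples of the mini-batch. That is one simple reading of "distinguish them from other samples in a mini-batch"; comparing against all positions of the other samples would be an equally plausible variant.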
ISSN: 1569-8432, 1872-826X
DOI: 10.1016/j.jag.2024.103979