Inductive Biased Swin-Transformer With Cyclic Regressor for Remote Sensing Scene Classification


Bibliographic details
Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023-01, Vol. 16, p. 1-14
Main authors: Hao, Siyuan; Li, Nan; Ye, Yuanxin
Format: Article
Language: English
Online access: Full text
Description
Summary: Convolutional neural networks (CNNs) have been widely used in remote sensing scene classification. However, CNNs cannot take the long-range dependencies of local features into account. By contrast, a visual transformer (ViT) is good at capturing long-range dependencies because it models the global relationships among local features through a self-attention mechanism. Although the ViT can obtain good results when trained on large-scale datasets (e.g., ImageNet), it is hard to adapt to small-scale datasets (e.g., remote sensing image datasets). This is attributed to the fact that the ViT lacks the typical inductive bias capability. Therefore, we propose the inductive biased swin transformer with cyclic regressor used with random dense sampler (IBSwin-CR) to improve the training effect of the swin transformer on remote sensing image datasets. It builds upon three modules: an inductive biased shifted window multihead self-attention (IBSW-MSA) module, a random dense sampler, and a regressor with cyclic regression loss. The IBSW-MSA module captures the inductive bias information and the long-range dependencies of the attention map. Moreover, the final feature map goes through a random dense sampler, in which additional spatial information is learned. Finally, the network is normalized by a cross-entropy loss function and a cyclic regression loss function. The proposed IBSwin-CR model is evaluated on public datasets such as the NWPU-RESISC45 dataset and the Aerial Image Dataset, and the experimental results show that the proposed network achieves better performance than other classification models, especially when only a small number of samples is available.
ISSN: 1939-1404, 2151-1535
DOI: 10.1109/JSTARS.2023.3290676