An Explainable Spatial–Frequency Multiscale Transformer for Remote Sensing Scene Classification

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2023, Vol. 61, pp. 1-15
Authors: Yang, Yuting; Jiao, Licheng; Liu, Fang; Liu, Xu; Li, Lingling; Chen, Puhua; Yang, Shuyuan
Format: Article
Language: English
Abstract: Deep convolutional neural networks (CNNs) play a significant role in remote sensing. Owing to their strong local representation learning ability, CNNs perform well in remote sensing scene classification. However, CNNs focus on location-sensitive representations in the spatial domain and lack the capability to mine contextual information. Meanwhile, remote sensing scene classification still faces challenges such as complex scenes and large differences in target sizes. To address these problems and challenges, more robust feature representation learning networks are necessary. In this article, a novel and explainable spatial–frequency multiscale Transformer framework, SF-MSFormer, is proposed for remote sensing scene classification. It mainly comprises spatial-domain and frequency-domain multiscale Transformer branches, which capture spatial–frequency global multiscale representation features. In addition, a texture-enhanced encoder is designed in the frequency-domain multiscale Transformer branch to adaptively capture global texture features, and an adaptive feature aggregation module is designed to integrate the spatial–frequency multiscale features for final recognition. The experimental results verify the effectiveness of SF-MSFormer and show better convergence. It achieves state-of-the-art results [98.72%, 98.6%, 99.72%, and 94.83% overall accuracies (OAs), respectively] on the AID, UCM, WHU-RS19, and NWPU-RESISC45 datasets. Moreover, feature visualizations demonstrate the explainability of the texture-enhanced encoder. The code implementation of this article will be available at https://github.com/yutinyang/SF-MSFormer.
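
The abstract describes a two-branch architecture: a spatial-domain multiscale Transformer branch, a frequency-domain branch with a texture-enhanced encoder, and an adaptive feature aggregation module. The following is a minimal, hypothetical PyTorch sketch of that overall idea; the class names (SpatialBranch, FrequencyBranch, SFMSFormerSketch), layer sizes, the FFT-amplitude texture cue, and the gated fusion are illustrative assumptions, not the authors' released implementation (see the GitHub link above).

# Hypothetical sketch of the dual-branch idea from the abstract; not the authors' code.
import torch
import torch.nn as nn


class SpatialBranch(nn.Module):
    """Spatial-domain branch: patch embedding followed by a Transformer encoder."""

    def __init__(self, in_ch=3, dim=128, patch=16, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens).mean(dim=1)            # global spatial feature


class FrequencyBranch(nn.Module):
    """Frequency-domain branch: FFT amplitude as a texture cue, then a Transformer encoder."""

    def __init__(self, in_ch=3, dim=128, patch=16, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        # Amplitude spectrum as a simple stand-in for the paper's texture-enhanced encoding.
        amp = torch.fft.fft2(x, norm="ortho").abs()
        tokens = self.embed(amp).flatten(2).transpose(1, 2)
        return self.encoder(tokens).mean(dim=1)            # global texture feature


class SFMSFormerSketch(nn.Module):
    """Two branches fused by a learned gate, then a linear classifier."""

    def __init__(self, num_classes=45, dim=128):
        super().__init__()
        self.spatial = SpatialBranch(dim=dim)
        self.frequency = FrequencyBranch(dim=dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        fs, ff = self.spatial(x), self.frequency(x)
        w = self.gate(torch.cat([fs, ff], dim=-1))          # adaptive per-branch weights
        fused = w[:, :1] * fs + w[:, 1:] * ff
        return self.head(fused)


if __name__ == "__main__":
    model = SFMSFormerSketch(num_classes=45)                # e.g. NWPU-RESISC45 has 45 classes
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                                     # torch.Size([2, 45])

The learned gate produces per-branch weights so the fusion can lean on whichever domain is more informative for a given scene, which is one plausible reading of the abstract's "adaptive feature aggregation module".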
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2023.3265361