CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2023, Vol. 61, pp. 1-12
Main Authors: Wu, Honglin; Huang, Peng; Zhang, Min; Tang, Wenlong; Yu, Xinyu
Format: Article
Language: English
Description
Summary: Convolutional neural networks (CNNs) are powerful in extracting local information but lack the ability to model long-range dependencies. In contrast, the transformer relies on multihead self-attention mechanisms to effectively extract global contextual information and thus model long-range dependencies. In this article, we propose a novel encoder-decoder structured semantic segmentation network, named CNN and multiscale transformer fusion network (CMTFNet), to extract and fuse local information and multiscale global contextual information from high-resolution remote-sensing images. Specifically, to further process the output features from the CNN encoder, we build a transformer decoder based on the multiscale multihead self-attention (M2SA) module for extracting rich multiscale global contextual information and channel information. Additionally, the transformer block introduces an efficient feed-forward network (E-FFN) to enhance information interaction across feature channels. Finally, the multiscale attention fusion (MAF) module fully fuses feature information from different levels. We have conducted extensive comparison and ablation experiments on the International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and Potsdam datasets. The experimental results demonstrate that the proposed CMTFNet achieves superior performance compared with currently popular methods. The code will be available at https://github.com/DrWuHonglin/CMTFNet.
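To make the architecture described above concrete, the following is a minimal, hypothetical sketch of a multiscale multihead self-attention block in the spirit of M2SA: queries come from the full-resolution encoder feature map, while keys and values are built from the same map pooled at several scales. The pooling scales, head count, and all layer names here are assumptions for illustration, not the paper's actual M2SA implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleSelfAttention(nn.Module):
    """Self-attention whose keys/values come from multiscale pooled maps.

    A rough sketch only; the real M2SA module in CMTFNet may differ.
    """

    def __init__(self, dim, num_heads=8, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.pool_sizes = pool_sizes
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, C, H, W) feature map from the CNN encoder
        B, C, H, W = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))  # (B, HW, C)
        # Build a compact multiscale token set by pooling at several scales.
        tokens = [F.adaptive_avg_pool2d(x, s).flatten(2).transpose(1, 2)
                  for s in self.pool_sizes]        # each (B, s*s, C)
        k, v = self.kv(torch.cat(tokens, dim=1)).chunk(2, dim=-1)

        def split_heads(t):
            return t.reshape(B, -1, self.num_heads,
                             C // self.num_heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, H * W, C)
        out = self.proj(out)
        return out.transpose(1, 2).reshape(B, C, H, W)

# Quick shape check on a dummy encoder feature map.
if __name__ == "__main__":
    block = MultiscaleSelfAttention(dim=64)
    print(block(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])

Pooling the key/value tokens keeps attention cost linear in the number of pooled tokens rather than quadratic in H*W, which is one plausible reason a multiscale token design suits high-resolution remote-sensing imagery.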
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2023.3314641