CTMFNet: CNN and Transformer Multiscale Fusion Network of Remote Sensing Urban Scene Imagery


Bibliographic details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2023, Vol. 61, p. 1-14
Main authors: Song, Pengfei; Li, Jinjiang; An, Zhiyong; Fan, Hui; Fan, Linwei
Format: Article
Language: English
Description
Abstract: Semantic segmentation of remotely sensed urban scene images is in wide demand in areas such as land cover mapping, urban change detection, and environmental protection. With the development of deep learning, methods based on convolutional neural networks (CNNs) have become dominant because of their powerful ability to represent hierarchical feature information. However, the inherent limitations of the convolution operation restrict the network's ability to extract global contextual information. Transformers, used successfully in computer vision in recent years, have shown great potential for modeling global contextual information, but they are not sufficiently capable of capturing local detail. In this article, to explore the potential of a joint CNN and transformer mechanism for semantic segmentation of remotely sensed urban scenes, we propose a CNN and transformer multiscale fusion network (CTMFNet), based on an encoder-decoder structure, for urban scene understanding. To couple the local and global context information of the dual-branch encoder more efficiently, we design a dual backbone attention fusion module (DAFM). In addition, to bridge the semantic gap between scales, we build a multi-layer dense connectivity network (MDCN) as the decoder. The MDCN lets semantic information flow fully between multiple scales and fuses it through upsampling and residual connections. We conducted extensive subjective and objective comparison experiments and ablation experiments on the International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and Potsdam datasets. The experimental results show that our method outperforms currently popular methods.
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2022.3232143
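
Illustrative note: the abstract describes fusing the features of a CNN branch and a transformer branch with an attention module, but it does not give the module's internals. The sketch below is therefore not the paper's DAFM. It is a generic, assumed form of attention-based fusion: each channel of the two branches is blended with softmax weights derived from global average pooling. The names `global_avg_pool` and `fuse_branches` are hypothetical, and a real network would learn the weights with trainable layers rather than read them directly from the pooled values.

```python
import math


def global_avg_pool(feat):
    """Reduce each channel (a flat list of spatial values) to its mean."""
    return [sum(channel) / len(channel) for channel in feat]


def fuse_branches(cnn_feat, trans_feat):
    """Blend two same-shaped feature maps channel by channel.

    Assumption: the per-channel attention weights come from a softmax over
    the two pooled descriptors. This is a stand-in for a learned attention
    module, not the design used in the paper.
    """
    if len(cnn_feat) != len(trans_feat):
        raise ValueError("both branches must have the same number of channels")

    cnn_desc = global_avg_pool(cnn_feat)
    trans_desc = global_avg_pool(trans_feat)

    fused, weights = [], []
    for ch_c, ch_t, a, b in zip(cnn_feat, trans_feat, cnn_desc, trans_desc):
        m = max(a, b)  # subtract the max for numerical stability
        exp_a, exp_b = math.exp(a - m), math.exp(b - m)
        w_cnn = exp_a / (exp_a + exp_b)
        w_trans = 1.0 - w_cnn
        fused.append([w_cnn * x + w_trans * y for x, y in zip(ch_c, ch_t)])
        weights.append((w_cnn, w_trans))
    return fused, weights


# Two channels, two spatial positions each (toy values).
cnn_features = [[1.0, 1.0], [0.0, 2.0]]
transformer_features = [[1.0, 1.0], [4.0, 4.0]]
fused_features, branch_weights = fuse_branches(cnn_features, transformer_features)
```

In this toy input, the first channel has equal pooled descriptors in both branches, so the two branches are weighted equally; in the second channel the transformer branch has the larger descriptor and receives the larger weight.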