Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification

In this study, we develop a novel deep hierarchical vision transformer (DHViT) architecture for hyperspectral and light detection and ranging (LiDAR) data joint classification. Current classification methods have limitations in heterogeneous feature representation and information fusion of multi-mod...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on image processing 2022-01, Vol.31, p.1-1
Hauptverfasser: Xue, Zhixiang, Tan, Xiong, Yu, Xuchu, Liu, Bing, Yu, Anzhu, Zhang, Pengqiang
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:In this study, we develop a novel deep hierarchical vision transformer (DHViT) architecture for hyperspectral and light detection and ranging (LiDAR) data joint classification. Current classification methods have limitations in heterogeneous feature representation and information fusion of multi-modality remote sensing data (e.g., hyperspectral and LiDAR data), these shortcomings restrict the collaborative classification accuracy of remote sensing data. The proposed deep hierarchical vision transformer architecture utilizes both the powerful modeling capability of long-range dependencies and strong generalization ability across different domains of the transformer network, which is based exclusively on the self-attention mechanism. Specifically, the spectral sequence transformer is exploited to handle the long-range dependencies along the spectral dimension from hyperspectral images, because all diagnostic spectral bands contribute to the land cover classification. Thereafter, we utilize the spatial hierarchical transformer structure to extract hierarchical spatial features from hyperspectral and LiDAR data, which are also crucial for classification. Furthermore, the cross attention (CA) feature fusion pattern could adaptively and dynamically fuse heterogeneous features from multi-modality data, and this contextual aware fusion mode further improves the collaborative classification performance. Comparative experiments and ablation studies are conducted on three benchmark hyperspectral and LiDAR datasets, and the DHViT model could yield an average overall classification accuracy of 99.58%, 99.55%, and 96.40% on three datasets, respectively, which sufficiently certify the effectiveness and superior performance of the proposed method.
ISSN:1057-7149
1941-0042
DOI:10.1109/TIP.2022.3162964