Rethinking Monocular Height Estimation from a Classification Task Perspective Leveraging the Vision Transformer

Bibliographic details
Published in:	IEEE geoscience and remote sensing letters 2022, Vol.19, p.1-1
Main authors:	Sun, Wenbo; Zhang, Yichen; Liao, Yifan; Yang, Biao; Lin, Mingchun; Zhai, Ruifang; Gao, Zhi
Format:	Article
Language:	English
Subjects:	Artificial neural networks; Classification; Decoding; Digital imaging; digital surface models (DSM); Earth surface; Estimation; Height; Ill posed problems; Methods; Modelling; Monocular height estimation; Neural networks; Predictive models; regression to classification; Remote sensing; Task analysis; Training; Transformers

Description
Summary:	Height estimation from a single remote sensing image has great potential in generating digital surface models (DSM) efficiently for a quick earth surface reconstruction. Recently, convolutional neural networks (CNN) have emerged as a powerful method to deal with this ill-posed problem. Most existing methods formulate height estimation as a regression problem due to the continuity of object height. However, it is difficult for the model to regress the object heights exactly to the ground-truth values with a wide range. In this letter, we reformulate the height estimation task as a classification task to improve the model performance. Specifically, we discretize the continuous ground-truth height into bins and assign each pixel to a single label according to the bin subdivision. In addition, we propose to generate a unique bin subdivision for each input image adaptively by viewing the bin generation as a set-to-set problem. Compared with the fixed bin subdivision method, a specific bin subdivision for each input image makes the model adaptively focus on the height range that is more probable to occur in the scene of the input image. In our experiments, we qualitatively and quantitatively demonstrate that the proposed method outperforms the state-of-the-art approaches on both Vaihingen and Potsdam datasets.
ISSN:	1545-598X 1558-0571
DOI:	10.1109/LGRS.2022.3222457
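
Illustration:	The summary describes discretizing continuous ground-truth heights into bins and giving each pixel a single class label. The sketch below shows one way that step might look with NumPy. It is not the authors' implementation: all function names are hypothetical, the fixed subdivision is assumed to be uniform, and the per-image subdivision uses ground-truth quantiles purely as a stand-in for the adaptive bins that the letter generates with a learned set-to-set model.

```python
import numpy as np


def uniform_bin_edges(h_min, h_max, n_bins):
    """Fixed subdivision (assumed uniform): n_bins + 1 equally spaced edges."""
    return np.linspace(h_min, h_max, n_bins + 1)


def per_image_quantile_edges(height_map, n_bins):
    """Stand-in for an image-specific subdivision.

    Assumption: edges are placed at height quantiles of this one image,
    so bins are narrower where heights are more frequent. The letter
    instead predicts the bins with a network; this is only an analogy.
    """
    return np.quantile(height_map, np.linspace(0.0, 1.0, n_bins + 1))


def heights_to_labels(height_map, edges):
    """Assign each pixel the index of the bin that contains its height.

    Only the interior edges are passed to np.digitize, so heights at or
    beyond the outer edges fall into the first or last bin.
    """
    return np.digitize(height_map, edges[1:-1])


def labels_to_heights(labels, edges):
    """Map class labels back to heights using the bin centers."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[labels]


if __name__ == "__main__":
    # Toy 2x2 "height map" in meters (made-up values for illustration).
    heights = np.array([[0.5, 12.0], [25.0, 39.0]])
    edges = uniform_bin_edges(0.0, 40.0, 4)
    labels = heights_to_labels(heights, edges)
    print(labels)
    print(labels_to_heights(labels, edges))
```

With uniform edges every image shares the same classes, whereas the quantile variant changes the classes per image, which mirrors the contrast between fixed and image-specific bin subdivisions drawn in the summary.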