A self-training semi-supervised machine learning method for predictive mapping of soil classes with limited sample data


Bibliographic details
Published in: Geoderma 2021-02, Vol.384, p.114809, Article 114809
Main authors: Zhang, Lei; Yang, Lin; Ma, Tianwu; Shen, Feixue; Cai, Yanyan; Zhou, Chenghu
Format: Article
Language: English
Description
Summary:
•We proposed a self-training semi-supervised machine learning method for DSM.
•The proposed method additionally used unlabeled data to improve the prediction accuracy.
•MLR, KNN, and RF were used as the base models for comparing supervised learning (SL) and semi-supervised learning (SSL).
•The self-training SSL method obtained higher accuracy than the SL method.
•RF-SSL was the most accurate model for soil class prediction in the case study.

Numerous machine learning models have been developed to model the relationship between soil classes or properties and their environmental covariates in digital soil mapping (DSM). Most machine learning models are trained with a supervised learning (SL) method based on training samples. However, the collected sample data is often limited in practice because field sampling is expensive and time-consuming. Insufficient samples may substantially limit the learning ability of the model. Semi-supervised machine learning, a new machine learning paradigm that makes use of both unsampled data and a small amount of sampled data in the learning process, can be a potentially effective method for DSM.

In this study, we present a self-training semi-supervised learning (SSL) method for DSM. Unlike the SL method, the SSL method utilizes not only the sampled locations but also the abundant environmental covariate information at the unvisited locations. Its basic idea is to iteratively enlarge the training data set by adding unsampled points with high prediction confidence from the unvisited locations until a stopping criterion is reached. The proposed SSL method was applied to machine learning models for predicting soil classes in Heshan Farm of Nenjiang County in Heilongjiang Province, China. Three machine learning models, multinomial logistic regression (MLR), k-nearest neighbor (KNN) and random forest (RF), were selected to evaluate the efficiency of the SSL method.

The entropy threshold is an important parameter in the SSL method, and a sensitivity analysis of this parameter was conducted using a series of entropy thresholds. The SSL method was compared with the SL method for the three machine learning models for soil prediction. Cross-validation was employed to evaluate the accuracy of the predicted soil class maps generated by each method. The results showed that the prediction accuracies (the proportion of the correctly predicted samples over the total number of va
ISSN: 0016-7061, 1872-6259
DOI: 10.1016/j.geoderma.2020.114809
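
The self-training idea described in the summary (iteratively adding unvisited points whose predictions have low entropy until a stopping criterion is reached) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a pure-Python KNN base model, Shannon entropy of the predicted class shares as the confidence measure, and "no new points added, or a maximum number of iterations" as the stopping criterion. All function names, the threshold value, and the toy covariate data are hypothetical.

```python
import math
from collections import Counter


def knn_proba(train_X, train_y, x, k=3):
    """Estimate class membership as each class's share among the k nearest labeled points."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in counts.items()}


def entropy(proba):
    """Shannon entropy of a class-probability dict; 0 means a fully confident prediction."""
    return -sum(p * math.log(p) for p in proba.values() if p > 0)


def self_train(X_lab, y_lab, X_unlab, threshold=0.1, k=3, max_iter=10):
    """Iteratively move confidently predicted unlabeled points into the training set."""
    train_X, train_y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(max_iter):
        confident, remaining = [], []
        for x in pool:
            proba = knn_proba(train_X, train_y, x, k)
            if entropy(proba) <= threshold:
                # Pseudo-label with the most likely class.
                confident.append((x, max(proba, key=proba.get)))
            else:
                remaining.append(x)
        if not confident:
            # Assumed stopping criterion: no point passed the entropy threshold.
            break
        # Enlarge the training set only after the whole pool has been scored.
        for x, label in confident:
            train_X.append(x)
            train_y.append(label)
        pool = remaining
    return train_X, train_y, pool


# Hypothetical toy data: two environmental covariates per location, two soil classes.
X_lab = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
y_lab = ["A", "A", "A", "B", "B", "B"]
X_unlab = [(0.3, 0.3), (4.8, 4.9), (2.6, 2.6)]

train_X, train_y, pool = self_train(X_lab, y_lab, X_unlab, threshold=0.1, k=3)
print(len(train_X), train_y[-2:], pool)
```

In this toy run the two unlabeled points that sit inside a cluster are pseudo-labeled and added, while the ambiguous midpoint stays in the unlabeled pool. Raising the threshold admits less certain points, which is the trade-off a sensitivity analysis over entropy thresholds would explore.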