Predictive performance of machine learning model with varying sampling designs, sample sizes, and spatial extents

Bibliographic details
Published in:	Ecological Informatics 2023-12, Vol. 78, p. 102294, Article 102294
Main authors:	Bouasria, Abdelkrim; Bouslihim, Yassine; Gupta, Surya; Taghizadeh-Mehrjardi, Ruhollah; Hengl, Tomislav
Format:	Article
Language:	English
Subjects:	Conditioned Latin Hypercube; Model complexity; Random forest; Sampling design; Simple random sampling; Spatial extents; Spatial predictive mapping
Online access:	Full text

Description
Summary:	Using machine learning and earth observation data to capture real-world variability in spatial predictive mapping depends on sample size, sampling design, and spatial extent. Nonetheless, some basic questions remain unsettled: a) How many samples are necessary for fitting the model? b) Which sampling techniques are suitable for modeling? c) Do results vary with changes in spatial extent? These questions are crucial for spatial modeling projects and require proper investigation. In the present study, we evaluated two sampling designs, conditioned Latin Hypercube Sampling (cLHS) and Simple Random Sampling (SRS), at different sample sizes and across three nested spatial extents. A Random Forest model was used to predict Above-Ground forest Biomass at local, regional, and national spatial extents, comparing six sample sizes (n = 25, 50, 100, 200, 300, and 500). We defined one national extent, five regional extents within the national extent, and a local extent inside each regional extent. Each combination of sampling design and sample size was tested over 100 iterations. The results showed no significant difference between the two sampling designs. The accuracy metrics differed marginally at sample sizes of 25 and 50; at larger sizes the differences became minimal and the two designs gave similar results. However, a deeper analysis of all 100 repetitions exposed a noteworthy pattern: cLHS outperformed SRS in terms of RMSE and variability. Regarding sample size, R² increased with increasing sample size. Nevertheless, beyond 300 to 500 samples the improvement in accuracy became insignificant, showing diminishing returns for excessively large sample sizes. Moreover, increasing the spatial extent reduced the accuracy of the model, possibly due to the effect of environmental factors or the nature of the landscape. Therefore, this study demonstrates the potential impact of sample size, sampling design, and spatial extent on model accuracy and emphasizes the importance of reducing the sample size to reduce the model's complexity.
• Above-Ground Biomass was predicted using the Random Forest method.
• Two sampling designs and three spatial extents were compared.
• No significant differences were observed between the two sampling designs.
• The gain in prediction accuracy is minimal beyond a sample size of 300.
• Increasing the size of the spatial extent reduced model accuracy.
ISSN:	1574-9541
DOI:	10.1016/j.ecoinf.2023.102294
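
The repeated-sampling comparison described in the summary (two designs, several sample sizes, 100 repetitions each, scored by RMSE and R²) can be illustrated with a small sketch. This is not the authors' code, and several parts are assumptions made so that it runs with the Python standard library only: the "map" is synthetic with a single covariate, an ordinary least-squares line is swapped in for the Random Forest model, and one-draw-per-quantile-stratum sampling is swapped in as a one-dimensional stand-in for cLHS. All function names are hypothetical.

```python
import math
import random
import statistics


def make_population(n_cells=2000, seed=0):
    """Synthetic 'map' (assumption): one covariate x and a noisy response y per cell."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n_cells)]
    ys = [3.0 * x + 5.0 * math.sin(x) + rng.gauss(0.0, 2.0) for x in xs]
    return xs, ys


def simple_random_sample(xs, n, rng):
    """SRS: draw n cell indices uniformly without replacement."""
    return rng.sample(range(len(xs)), n)


def stratified_sample(xs, n, rng):
    """One cell from each of n equal-count strata of x (1-D stand-in for cLHS)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    bounds = [round(k * len(order) / n) for k in range(n + 1)]
    return [rng.choice(order[bounds[k]:bounds[k + 1]]) for k in range(n)]


def fit_line(x, y):
    """Least-squares line (stand-in for Random Forest); returns (slope, intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def evaluate(xs, ys, idx):
    """Fit on the sampled cells, then return (RMSE, R2) on all remaining cells."""
    slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
    chosen = set(idx)
    held = [i for i in range(len(xs)) if i not in chosen]
    mean_y = statistics.fmean([ys[i] for i in held])
    ss_res = sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held)
    ss_tot = sum((ys[i] - mean_y) ** 2 for i in held)
    return math.sqrt(ss_res / len(held)), 1.0 - ss_res / ss_tot


def run_experiment(sizes=(25, 50, 100, 200, 300, 500), repeats=100, seed=1):
    """Return {(design, n): (mean RMSE, SD of RMSE, mean R2)}; repeats must be >= 2."""
    xs, ys = make_population()
    rng = random.Random(seed)
    designs = {"SRS": simple_random_sample, "stratified": stratified_sample}
    results = {}
    for name, draw in designs.items():
        for n in sizes:
            scores = [evaluate(xs, ys, draw(xs, n, rng)) for _ in range(repeats)]
            rmses = [s[0] for s in scores]
            r2s = [s[1] for s in scores]
            results[(name, n)] = (
                statistics.fmean(rmses),
                statistics.stdev(rmses),
                statistics.fmean(r2s),
            )
    return results
```

Calling `run_experiment()` and printing each entry gives, per design and sample size, the mean RMSE, the spread of RMSE across repetitions, and the mean R². Comparing the spread between the two designs is the kind of per-repetition analysis the summary refers to, although numbers from this toy setup say nothing about the paper's actual results.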