Parsimonious Random-Forest-Based Land-Use Regression Model Using Particulate Matter Sensors in Berlin, Germany

Bibliographic details
Published in:	Sensors (Basel, Switzerland), 2024-06, Vol.24 (13), p.4193
Main authors:	Venkatraman Jagatha, Janani, Schneider, Christoph, Sauter, Tobias
Format:	Article
Language:	eng
Subjects:	Accuracy; Air pollution; Cities; Decision trees; Emissions; Health risk assessment; Health risks; land use; Neural networks; Nitrogen dioxide; Outdoor air quality; particulate matter; random forest; regression modelling; sensitivity analysis; Sensors; VOCs; Volatile organic compounds
Online access:	Full text

Description
Abstract:	Machine learning (ML) methods are widely used in particulate matter prediction modelling, especially through the use of air quality sensor data. Despite their advantages, these methods' black-box nature obscures how a prediction has been made. Major issues with these types of models include data quality and computational intensity. In this study, we employed feature selection methods using recursive feature elimination and global sensitivity analysis for a random-forest (RF)-based land-use regression model developed for the city of Berlin, Germany. Land-use-based predictors, including local climate zones, leaf area index, daily traffic volume, population density, building types, building heights, and street types, were used to create a baseline RF model. Five additional models, three using the recursive feature elimination method and two using a Sobol-based global sensitivity analysis (GSA), were implemented, and their performance was compared against that of the baseline RF model. The predictors that had a large effect on the prediction, as determined using both methods, are discussed. Through feature elimination, the number of predictors was reduced from 220 in the baseline model to eight in the parsimonious models without sacrificing model performance. A comparison of the model metrics showed that the parsimonious_GSA-based model performs better than the baseline model, reducing the mean absolute error (MAE) from 8.69 µg/m³ to 3.6 µg/m³ and the root mean squared error (RMSE) from 9.86 µg/m³ to 4.23 µg/m³ when the trained model is applied to reference station data. The better performance of the GSA_parsimonious model is made possible by curtailing the uncertainties propagated through the model via the reduction of multicollinear and redundant predictors. The parsimonious model validated against reference stations was able to predict the PM concentrations with an MAE of less than 5 µg/m³ for 10 out of 12 locations. The GSA_parsimonious model performed best in all model metrics and improved the R² from 3% in the baseline model to 17%. However, the predictions exhibited a degree of uncertainty, making the model unreliable for regional-scale modelling. The GSA_parsimonious model can nevertheless be adapted to local scales to highlight the land-use parameters that are indicative of PM concentrations in Berlin. Overall, population density, leaf area index, and traffic volume are the major predictors of PM, while building type and local clima
ISSN:	1424-8220
DOI:	10.3390/s24134193
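
The abstract compares models by mean absolute error (MAE) and root mean squared error (RMSE). As a reading aid, the following minimal Python sketch shows how these two metrics are conventionally computed. The function names and the sample values are hypothetical and chosen only for illustration; they are not taken from the article or its data.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average magnitude of the errors, in the same units as the data."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the mean squared error; weights large errors more heavily."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mean_sq)


# Hypothetical concentrations (illustrative only, not from the study).
observed = [10.0, 12.0, 15.0, 9.0]
predicted = [12.0, 11.0, 18.0, 9.0]

mae = mean_absolute_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE for the same predictions, which matches the ordering of the values reported in the abstract.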