Water quality estimates using machine learning techniques in an experimental watershed

This study aims to identify the best machine learning (ML) approach to predict concentrations of biochemical oxygen demand (BOD), nitrate, and phosphate. Four ML techniques including Decision tree, Random Forest, Gradient Boosting and XGBoost were compared to estimate the water quality parameters ba...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Journal of hydroinformatics 2024-11, Vol.26 (11), p.2798
Hauptverfasser: Costa, David, Bayissa, Yared, Barbosa, Kargean Vianna, Villas-Boas, Mariana Dias, Bawa, Arun, Junior, Jader Lugon, Neto, Antônio J. Silva, Srinivasan, Raghavan
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:This study aims to identify the best machine learning (ML) approach to predict concentrations of biochemical oxygen demand (BOD), nitrate, and phosphate. Four ML techniques including Decision tree, Random Forest, Gradient Boosting and XGBoost were compared to estimate the water quality parameters based on biophysical (i.e., population, basin area, river slope, water level, and stream flow), and physicochemical properties (i.e., conductivity, turbidity, pH, temperature, and dissolved oxygen) input parameters. The innovation lies in the combination of on-the-spot variables with additional characteristics of the watershed. The model performances were evaluated using coefficient of determination (R2), Nash-Sutcliffe efficiency coefficient (NSE), Root Mean Squared Error (RMSE) and Kling-Gupta Efficiency (KGE) coefficient. The robust five-fold cross-validation, along with hyperparameter tuning, achieved R2 values of 0.71, 0.66, and 0.69 for phosphate, nitrate, and BOD; NSE values of 0.67, 0.65, and 0.62, and KGE values of 0.64, 0.75, and 0.60, respectively. XGBoost yielded good results, showcasing superior performance when considering all analysis performed, but his performance was closely match by other algorithms. The overall modeling design and approach, which includes careful consideration of data preprocessing, dataset splitting, statistical evaluation metrics, feature analysis, and learning curve analysis, are just as important as algorithm selection.
ISSN:1464-7141
1465-1734
DOI:10.2166/hydro.2024.132