Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters

Bibliographic details
Published in: Water research (Oxford), 2021-09, Vol. 202, p. 117450, Article 117450
Main authors: Bourel, Mathias; Segura, Angel M.; Crisci, Carolina; López, Guzmán; Sampognaro, Lia; Vidal, Victoria; Kruk, Carla; Piccini, Claudia; Perera, Gonzalo
Format: Article
Language: English
Description
Summary:
• Predicting water contamination by statistical models.
• Evaluation of several machine learning techniques and metrics to model imbalanced data.
• Imbalanced data sets require modified machine learning algorithms and evaluation metrics.
• Combining modeling strategies is necessary to anticipate water contamination.

Predicting water contamination with statistical models is a useful tool for managing health risk at recreational beaches. Extreme contamination events, i.e. those exceeding the regulatory threshold, are generally rare with respect to bathing conditions, and thus the data are said to be imbalanced. Modeling and predicting those rare events presents unique challenges. Here we introduce and evaluate several machine learning techniques and metrics to model imbalanced data and evaluate model performance. We do so using a) simulated data sets and b) a real database with records of faecal coliform abundance monitored for 10 years at 21 recreational beaches in Uruguay (N ≈ 19000), using in situ and meteorological variables. We discuss the advantages and disadvantages of the methods and provide a simple guide to building models, aimed at a general audience. We also provide R code to reproduce model fitting and testing. We found that most machine learning techniques are sensitive to imbalance and require specific data pre-treatment (e.g. upsampling) to improve performance. Accuracy (i.e. correctly classified cases over total cases) is not adequate for evaluating model performance on imbalanced data sets. Instead, true positive rates (TPR) and false positive rates (FPR) are recommended. Among the 52 candidate algorithms tested, the stratified random forest performed best, improving TPR by 50% with respect to the baseline (0.4) and outperforming the baseline on the evaluated metrics. Support vector machines combined with upsampling or the synthetic minority oversampling technique (SMOTE) performed well, similar to AdaBoost with SMOTE. These results suggest that combining modeling strategies is necessary to improve our capacity to anticipate water contamination and avoid health risk.
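
The point about accuracy versus TPR and FPR, and the idea of upsampling, can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names `rates` and `upsample`, the 95/5 class split, and the toy labels are all hypothetical and chosen only for illustration. The upsampling shown is plain random duplication of minority-class rows, not SMOTE.

```python
import random


def rates(y_true, y_pred):
    """Return (accuracy, TPR, FPR) for binary labels, where 1 = contamination event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of real events detected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of clean samples flagged
    return accuracy, tpr, fpr


def upsample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present to upsample")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical toy data: 95 clean samples (0) and 5 contamination events (1).
y_true = [0] * 95 + [1] * 5
always_clean = [0] * 100  # a trivial model that never predicts contamination

acc, tpr, fpr = rates(y_true, always_clean)
print(f"accuracy={acc:.2f} TPR={tpr:.2f} FPR={fpr:.2f}")

# Balance the training labels before fitting a classifier (rows are just ids here).
rows_up, labels_up = upsample(list(range(100)), y_true)
print(labels_up.count(0), labels_up.count(1))
```

In this toy case the trivial model reaches 95% accuracy while detecting none of the contamination events (TPR of 0), which is why accuracy alone is misleading on imbalanced data. Upsampling should be applied to the training split only, so that duplicated rows do not leak into the test set.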
ISSN: 0043-1354, 1879-2448
DOI: 10.1016/j.watres.2021.117450