Comparing Regression Models with Count Data to Artificial Neural Network and Ensemble Models for Prediction of Generic Escherichia coli Population in Agricultural Ponds Based on Weather Station Measurements
Published in: | Microbial risk analysis 2021-12, Vol.19, p.100171, Article 100171 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | • The performance of statistical and machine learning models is compared. • Statistical models are more sensitive to outliers than machine learning models. • Machine learning models performed better than statistical models. • AdaBoost was determined to be the most suitable model for predicting generic E. coli.
Indicator microorganisms are monitored in agricultural waters to foster produce safety. Various prediction models are used to estimate the populations of indicator microorganisms and pathogens when no observations are available. The purpose of this study was to compare the performance of regression models with count data (zero-inflated Poisson and hurdle negative binomial) with that of artificial neural network and ensemble models (random forest and AdaBoost) in predicting the generic Escherichia coli population in agricultural surface waters from weather station measurements. Two-part count data models were built on E. coli population count frequencies (0, [1,10), [10,100), [100,1000), [1000,10000), >=10000) based on the data structure. The artificial neural network, AdaBoost, and random forest models were selected from six pre-tested models based on their mean absolute error (MAE) values. The MAE was also used to compare the performance of the two-part count data models with that of the artificial neural network and ensemble models. Over-dispersed E. coli population count frequencies ranged from 2.2 to 52.2% across all ponds. Across all ponds, observed and predicted zero E. coli population counts matched at rates from 82 to 100% for the zero-inflated Poisson model and at 100% for the hurdle negative binomial regression model. Overdispersion reduced the performance of the tested models. AdaBoost-Twelve Estimators had the best performance, with the lowest MAE values for all ponds (from 0.87 to 46.60). The ensemble models used in this study showed more promising performance than the tested regression models with count data. |
---|---|
ISSN: | 2352-3522 2352-3530 |
DOI: | 10.1016/j.mran.2021.100171 |
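
The summary refers to three quantities that are easy to illustrate: the count classes used for the E. coli population frequencies, the mean absolute error (MAE) used to compare models, and the zero-inflated Poisson distribution. The following is a minimal Python sketch of these, not the authors' implementation. The function names, the sample counts, and the parameter values are all hypothetical and chosen only for illustration; only the class boundaries come from the summary.

```python
import math
from bisect import bisect_right

# Class boundaries taken from the summary:
# 0, [1,10), [10,100), [100,1000), [1000,10000), >=10000
CLASS_EDGES = [1, 10, 100, 1000, 10000]
CLASS_LABELS = ["0", "[1,10)", "[10,100)", "[100,1000)", "[1000,10000)", ">=10000"]


def count_class(count):
    """Return the label of the class a non-negative count falls into."""
    if count < 0:
        raise ValueError("counts must be non-negative")
    return CLASS_LABELS[bisect_right(CLASS_EDGES, count)]


def class_frequencies(counts):
    """Tally how many counts fall into each class."""
    freq = {label: 0 for label in CLASS_LABELS}
    for c in counts:
        freq[count_class(c)] += 1
    return freq


def mean_absolute_error(observed, predicted):
    """MAE: the average of |observed - predicted| over paired values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson probability of count k.

    With probability pi the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi + (1 - pi) * poisson if k == 0 else (1 - pi) * poisson


# Hypothetical observed and predicted counts (illustration only).
observed = [0, 3, 12, 0, 250]
predicted = [0, 5, 10, 1, 240]

print(class_frequencies(observed))
print(mean_absolute_error(observed, predicted))
print(zip_pmf(0, lam=2.0, pi=0.3))
```

A lower MAE means predictions lie closer to the observations on average, which is the sense in which the summary ranks the models. The zero-inflated form assigns extra probability to zero counts, which is why such models suit data with many zero observations.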