Hybrid decision tree-based machine learning models for short-term water quality prediction

Water resources are the foundation of people’s life and economic development, and are closely related to health and the environment. Accurate prediction of water quality is the key to improving water management and pollution control. In this paper, two novel hybrid decision tree-based machine learni...

Detailed description

Bibliographic details
Published in: Chemosphere (Oxford) 2020-06, Vol.249, p.126169-126169, Article 126169
Main authors: Lu, Hongfang; Ma, Xin
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 126169
container_issue
container_start_page 126169
container_title Chemosphere (Oxford)
container_volume 249
creator Lu, Hongfang
Ma, Xin
description Water resources are the foundation of people’s lives and economic development, and are closely tied to health and the environment. Accurate water quality prediction is key to improving water management and pollution control. This paper proposes two novel hybrid decision tree-based machine learning models to obtain more accurate short-term water quality predictions. The base models of the two hybrids are extreme gradient boosting (XGBoost) and random forest (RF), each combined with an advanced data denoising technique, complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN). The case study uses the Gales Creek site in the Tualatin River Basin (the Tualatin River is described as one of the most polluted rivers in the world), where a total of 1875 hourly records were collected from May 1, 2019 to July 20, 2019. The two hybrid models are used to predict six water quality indicators: water temperature, dissolved oxygen, pH value, specific conductance, turbidity, and fluorescent dissolved organic matter. Six error metrics serve as the basis of performance evaluation, and the results of the two hybrid models are compared with those of four conventional models. The results reveal that: (1) CEEMDAN-RF performs best in predicting temperature, dissolved oxygen, and specific conductance, with mean absolute percentage errors (MAPEs) of 0.69%, 1.05%, and 0.90%, respectively. CEEMDAN-XGBoost performs best in predicting pH value, turbidity, and fluorescent dissolved organic matter, with MAPEs of 0.27%, 14.94%, and 1.59%, respectively. (2) The average MAPEs of the CEEMDAN-RF and CEEMDAN-XGBoost models are the smallest, at 3.90% and 3.71% respectively, indicating that their overall prediction performance is the best. The paper also discusses the stability of the prediction models; the analysis shows that the prediction stability of CEEMDAN-RF and CEEMDAN-XGBoost is higher than that of the other benchmark models.
• Two hybrid decision tree-based models are proposed to predict water quality.
• An advanced denoising method is used to preprocess the raw data.
• The case study was conducted on the Tualatin River in Oregon, USA, one of the most polluted rivers.
• The prediction stability of the models is analyzed.
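The MAPE figures quoted in the description can be read against the standard definition of mean absolute percentage error. The following is a minimal sketch of that definition only; the function name `mape` and the sample values are hypothetical and are not taken from the paper, whose exact formula and data are not reproduced in this record.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return the mean absolute percentage error, expressed in percent.

    Assumes the standard definition: 100/n * sum(|(a_i - p_i) / a_i|).
    """
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        # The percentage error is undefined when an observed value is zero.
        raise ValueError("actual values must be non-zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical readings, for illustration only (not data from the study).
observed = [10.0, 20.0, 40.0]
forecast = [11.0, 19.0, 40.0]
print(f"MAPE: {mape(observed, forecast):.2f}%")
```

Because each error is divided by the observed value, indicators whose readings lie close to zero tend to produce larger MAPEs than indicators with readings far from zero, which is worth keeping in mind when comparing percentages across indicators.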
doi_str_mv 10.1016/j.chemosphere.2020.126169
format Article
pmid 32078849
publisher England: Elsevier Ltd
rights Copyright © 2020 Elsevier Ltd. All rights reserved.
toplevel peer_reviewed; online_resources
pqid 2363067658
els_id S0045653520303623
collection Medline; MEDLINE; MEDLINE (Ovid); PubMed; CrossRef; MEDLINE - Academic
linktohtml https://www.sciencedirect.com/science/article/pii/S0045653520303623
backlink https://www.ncbi.nlm.nih.gov/pubmed/32078849 (View this record in MEDLINE/PubMed)
fulltext fulltext
identifier ISSN: 0045-6535
ispartof Chemosphere (Oxford), 2020-06, Vol.249, p.126169-126169, Article 126169
issn 0045-6535
1879-1298
language eng
recordid cdi_proquest_miscellaneous_2363067658
source MEDLINE; Elsevier ScienceDirect Journals Complete
subjects Data denoising
Decision tree-based model
Decision Trees
Environmental Monitoring - methods
Extreme gradient boosting
Humans
Machine Learning
Models, Statistical
Oxygen
Random forest
Rivers
Short-term
Temperature
Water
Water Pollution - statistics & numerical data
Water Quality
Water quality prediction
title Hybrid decision tree-based machine learning models for short-term water quality prediction
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T19%3A54%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Hybrid%20decision%20tree-based%20machine%20learning%20models%20for%20short-term%20water%20quality%20prediction&rft.jtitle=Chemosphere%20(Oxford)&rft.au=Lu,%20Hongfang&rft.date=2020-06&rft.volume=249&rft.spage=126169&rft.epage=126169&rft.pages=126169-126169&rft.artnum=126169&rft.issn=0045-6535&rft.eissn=1879-1298&rft_id=info:doi/10.1016/j.chemosphere.2020.126169&rft_dat=%3Cproquest_cross%3E2363067658%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2363067658&rft_id=info:pmid/32078849&rft_els_id=S0045653520303623&rfr_iscdi=true