Investigating the relationship between time and predictive model maintenance
Published in: Journal of Big Data, 2020-06, Vol. 7 (1), p. 1-19, Article 36
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: A majority of predictive models should be updated regularly, since the most recent data associated with a model may have a different distribution from that of the original training data. This difference may be critical enough to impact the effectiveness of the machine learning model. In our paper, we investigate the relationship between time and predictive model maintenance. Our work incorporates severely imbalanced big data from three Medicare datasets, namely Part D, DMEPOS, and Combined, that have been used in several fraud detection studies. We build training datasets from year-groupings of 2013, 2014, 2015, 2013–2014, 2014–2015, and 2013–2015; our test datasets are built from the 2016 data. To mitigate some of the adverse effects of the severe class imbalance in these datasets, we evaluate five class ratios obtained by Random Undersampling across five learners, using the Area Under the Receiver Operating Characteristic Curve (AUC) metric. The models producing the best values are as follows: Logistic Regression with the 2015 year-grouping at a 99:1 class ratio (Part D); Random Forest with the 2014–2015 year-grouping at a 75:25 class ratio (DMEPOS); and Logistic Regression with the full 2015 year-grouping (Combined). Our experimental results show that the largest training dataset (the 2013–2015 year-grouping) was not among the selected choices, which indicates that the 2013 data may be outdated. Moreover, because the best model differs across Part D, DMEPOS, and Combined, these three datasets may actually be sub-domains requiring unique models within the Medicare fraud detection domain.
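
As a rough illustration of the evaluation described in the abstract, the sketch below pairs Random Undersampling with AUC scoring: the majority (non-fraud) class of a training year-grouping is undersampled to a target class ratio, a learner is fit, and AUC is computed on held-out test data. This is a minimal sketch under assumptions, not the authors' pipeline: the synthetic data stands in for the Medicare year-groupings and the 2016 test set, the class ratios other than 99:1 and 75:25 are assumed, and only two of the five learners are shown.

```python
# Minimal sketch (assumed scikit-learn / imbalanced-learn stack): undersample the
# majority class to a fixed class ratio, train a learner, and score AUC on test data.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for one Medicare training year-grouping and the 2016 test set
# (~0.5% positive class to mimic severe imbalance; flip_y=0 keeps that share fixed).
X_train, y_train = make_classification(n_samples=20000, weights=[0.995], flip_y=0.0, random_state=0)
X_test, y_test = make_classification(n_samples=5000, weights=[0.995], flip_y=0.0, random_state=1)

def rus_auc(X_tr, y_tr, X_te, y_te, majority_share, learner):
    """Undersample to a majority_share:(100 - majority_share) split, fit, return AUC."""
    # sampling_strategy is the minority/majority ratio, e.g. 99:1 -> 1/99, 75:25 -> 25/75.
    ratio = (100 - majority_share) / majority_share
    rus = RandomUnderSampler(sampling_strategy=ratio, random_state=42)
    X_rus, y_rus = rus.fit_resample(X_tr, y_tr)
    scores = learner.fit(X_rus, y_rus).predict_proba(X_te)[:, 1]  # positive-class scores
    return roc_auc_score(y_te, scores)

# Two of the five learners named in the abstract; the intermediate class ratios
# between 99:1 and 75:25 are assumptions for illustration.
learners = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for majority_share in (99, 95, 90, 80, 75):
    for name, learner in learners.items():
        auc = rus_auc(X_train, y_train, X_test, y_test, majority_share, learner)
        print(f"{name} at {majority_share}:{100 - majority_share} -> AUC {auc:.3f}")
```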
ISSN: 2196-1115
DOI: 10.1186/s40537-020-00312-x