Investigating the relationship between time and predictive model maintenance
Published in: Journal of Big Data, 2020-06, Vol. 7 (1), p. 1-19, Article 36
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: A majority of predictive models should be updated regularly, since the most recent data associated with a model may have a different distribution from that of the original training data. This difference may be critical enough to impact the effectiveness of the machine learning model. In our paper, we investigate the relationship between time and predictive model maintenance. Our work incorporates severely imbalanced big data from three Medicare datasets, namely Part D, DMEPOS, and Combined, that have been used in several fraud detection studies. We build training datasets from year-groupings of 2013, 2014, 2015, 2013–2014, 2014–2015, and 2013–2015; our test datasets are built from the 2016 data. To mitigate some of the adverse effects of the severe class imbalance in these datasets, we evaluate five class ratios obtained by Random Undersampling across five learners, using the Area Under the Receiver Operating Characteristic Curve (AUC) metric. The models producing the best values are as follows: Logistic Regression with the 2015 year-grouping at a 99:1 class ratio (Part D); Random Forest with the 2014–2015 year-grouping at a 75:25 class ratio (DMEPOS); and Logistic Regression with the full 2015 year-grouping (Combined). Our experimental results show that the largest training dataset (the 2013–2015 year-grouping) was not among the selected choices, which indicates that the 2013 data may be outdated. Moreover, because the best model differs across Part D, DMEPOS, and Combined, these three datasets may actually be sub-domains requiring unique models within the Medicare fraud detection domain.
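
As a rough illustration of the evaluation described in the abstract, the sketch below pairs Random Undersampling with AUC scoring: the majority (non-fraud) class of a training year-grouping is undersampled to a target class ratio, a learner is fit, and AUC is computed on held-out test data. This is a minimal sketch under assumptions, not the authors' pipeline: the synthetic data stands in for the Medicare year-groupings and the 2016 test set, the class ratios other than 99:1 and 75:25 are assumed, and only two of the five learners are shown.

```python
# Minimal sketch (assumed scikit-learn / imbalanced-learn stack): undersample the
# majority class to a fixed class ratio, train a learner, and score AUC on test data.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for one Medicare training year-grouping and the 2016 test set
# (~0.5% positive class to mimic severe imbalance; flip_y=0 keeps that share fixed).
X_train, y_train = make_classification(n_samples=20000, weights=[0.995], flip_y=0.0, random_state=0)
X_test, y_test = make_classification(n_samples=5000, weights=[0.995], flip_y=0.0, random_state=1)

def rus_auc(X_tr, y_tr, X_te, y_te, majority_share, learner):
    """Undersample to a majority_share:(100 - majority_share) split, fit, return AUC."""
    # sampling_strategy is the minority/majority ratio, e.g. 99:1 -> 1/99, 75:25 -> 25/75.
    ratio = (100 - majority_share) / majority_share
    rus = RandomUnderSampler(sampling_strategy=ratio, random_state=42)
    X_rus, y_rus = rus.fit_resample(X_tr, y_tr)
    scores = learner.fit(X_rus, y_rus).predict_proba(X_te)[:, 1]  # positive-class scores
    return roc_auc_score(y_te, scores)

# Two of the five learners named in the abstract; the intermediate class ratios
# between 99:1 and 75:25 are assumptions for illustration.
learners = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for majority_share in (99, 95, 90, 80, 75):
    for name, learner in learners.items():
        auc = rus_auc(X_train, y_train, X_test, y_test, majority_share, learner)
        print(f"{name} at {majority_share}:{100 - majority_share} -> AUC {auc:.3f}")
```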
ISSN: 2196-1115
DOI: 10.1186/s40537-020-00312-x