Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis

Detailed description

Bibliographic details
Published in: Journal of medical Internet research, 2021-06, Vol. 23 (6), p. e27344
Main authors: Nam, Sang Min; Peterson, Thomas A; Seo, Kyoung Yul; Han, Hyun Wook; Kang, Jee In
Format: Article
Language: English
Online access: Full text
Description
Summary:
BACKGROUND: In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large.
OBJECTIVE: Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis.
METHODS: An XGBoost model was trained and tested to classify "current depression" and "no lifetime depression" for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network.
RESULTS: The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P
ISSN: 1438-8871, 1439-4456
DOI: 10.2196/27344