Outlier detection for questionnaire data in biobanks

Bibliographic details
Published in: International journal of epidemiology 2019-08, Vol.48 (4), p.1305-1315
Main authors: Sakurai, Rieko; Ueki, Masao; Makino, Satoshi; Hozawa, Atsushi; Kuriyama, Shinichi; Takai-Igarashi, Takako; Kinoshita, Kengo; Yamamoto, Masayuki; Tamiya, Gen
Format: Article
Language: English
Online access: Full text
Description
Summary:
Background: Biobanks increasingly collect, process and store omics with more conventional epidemiologic information, necessitating considerable effort in data cleaning. An efficient outlier detection method that reduces manual labour is highly desirable.
Method: We develop an unsupervised machine-learning method for outlier detection, namely kurPCA, that uses principal component analysis combined with kurtosis to ascertain the existence of outliers. In addition, we propose a novel regression adjustment approach to improve detection, namely the regression adjustment for data by systematic missing patterns (RAMP).
Result: Application to epidemiological record data in a large-scale biobank (Tohoku Medical Megabank Organization, Japan) shows that a combination of kurPCA and RAMP effectively detects known errors or inconsistent patterns.
Conclusions: The results of the simulation and the application confirm that our methods perform well. The proposed methods are useful for many practical analysis scenarios.
ISSN: 0300-5771, 1464-3685
DOI: 10.1093/ije/dyz012
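
The summary describes kurPCA only in outline: principal component analysis combined with kurtosis to ascertain the existence of outliers. As a rough illustration of that general idea, and not the authors' published algorithm, the following Python sketch computes principal component scores, keeps the components whose scores show high excess kurtosis, and flags rows with extreme robust z-scores on those components. The function name `kurtosis_pca_flags`, both threshold defaults, and the median/MAD flagging rule are assumptions introduced here for illustration; RAMP is not covered.

```python
import numpy as np


def kurtosis_pca_flags(X, kurt_threshold=3.0, score_threshold=6.0):
    """Return a boolean array marking rows flagged as candidate outliers.

    Both thresholds are hypothetical defaults chosen for illustration.
    Assumes X is complete (no missing values) and has no constant columns.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column so no single variable dominates the PCA.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA via SVD of the standardized data; U * s holds the component scores.
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U * s
    # Excess kurtosis of each component's scores (about 0 for Gaussian data).
    centered = scores - scores.mean(axis=0)
    m2 = (centered ** 2).mean(axis=0)
    m4 = (centered ** 4).mean(axis=0)
    kurt = np.full(m2.shape, -np.inf)
    nonzero = m2 > 1e-12
    kurt[nonzero] = m4[nonzero] / m2[nonzero] ** 2 - 3.0
    heavy = kurt > kurt_threshold
    # Robust z-scores (median and MAD) on the heavy-tailed components only.
    sel = scores[:, heavy]
    med = np.median(sel, axis=0)
    mad = 1.4826 * np.median(np.abs(sel - med), axis=0)
    robust_z = np.abs(sel - med) / mad
    return (robust_z > score_threshold).any(axis=1)


# Usage on simulated data with one planted data-entry error in row 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[0, 2] = 100.0
print(np.flatnonzero(kurtosis_pca_flags(X)))
```

Kurtosis is used here as a screening statistic because a few extreme values inflate the fourth moment of a component far more than its variance, so components that carry outliers stand out even when they explain little variance. Questionnaire data with systematic missingness would need additional handling before a sketch like this applies.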