Systematic feature selection improves accuracy of methylation-based forensic age estimation in Han Chinese males

Bibliographic details
Published in: Forensic Science International: Genetics 2018-07, Vol.35, p.38-45
Main authors: Feng, Lei; Peng, Fuduan; Li, Shanfei; Jiang, Li; Sun, Hui; Ji, Anquan; Zeng, Changqing; Li, Caixia; Liu, Fan
Format: Article
Language: English
Online access: Full text
Description
Summary:
• A pioneering study on forensic age estimation focusing on 528 Han Chinese males.
• Systematic feature selection identified 9 CpG sites as the optimal set for methylation-based forensic age estimation.
• An age prediction model with improved prediction accuracy (MAD ∼ 3 years and R² ∼ 0.90).
• Z-score transformation reduces the batch effects for methylation data generated from the EpiTYPER and pyrosequencing platforms.
• A freely available online age estimator with the flexibility to deal with missing data.

Estimating individual age from biomarkers may provide key information facilitating forensic investigations. Recent progress has shown DNA methylation at age-associated CpG sites to be the most informative biomarker for estimating the age of an unknown donor. Optimal feature selection plays a critical role in determining the performance of the final prediction model. In this study, we investigated methylation levels at 153 age-associated CpG sites from 21 previously reported genomic regions using the EpiTYPER system for their power to predict individual age in 390 Han Chinese males ranging from 15 to 75 years of age. We conducted a systematic feature selection using a stepwise backward multiple linear regression analysis as well as an exhaustive search algorithm. Both approaches identified the same subset of 9 CpG sites, which in linear combination provided the optimal model fit, with a mean absolute deviation (MAD) of 2.89 years of age and an explained variance (R²) of 0.92. The final model was validated in two independent Han Chinese male samples (validation set 1, N = 65, MAD = 2.49, R² = 0.95, and validation set 2, N = 62, MAD = 3.36, R² = 0.89). Competing models such as support vector machines and artificial neural networks did not outperform the linear model to any noticeable degree.

Validation set 1 was additionally analyzed using pyrosequencing technology for cross-platform validation and was termed validation set 3. Directly applying our model, in which the methylation levels were detected by the EpiTYPER system, to the pyrosequencing data gave, however, less accurate results in terms of MAD (validation set 3, N = 65 Han Chinese males, MAD = 4.20, R² = 0.93), suggesting the presence of a batch effect between different data generation platforms. This batch effect could be partially overcome by a z-score transformation (MAD = 2.76, R² = 0.93). Overall, our systematic feature selection identified 9 CpG sites as the optimal ...
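
The two quantities the summary relies on, the z-score transformation and the mean absolute deviation (MAD), can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, all numbers are invented toy values, and the abstract does not say which samples or which standard deviation the standardization uses, so the per-platform population standard deviation below is an assumption.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize methylation values for one CpG site within one platform."""
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        raise ValueError("z-score is undefined when all values are identical")
    return [(v - mu) / sd for v in values]


def mean_absolute_deviation(predicted_ages, true_ages):
    """Average absolute difference between predicted and true ages, in years."""
    if len(predicted_ages) != len(true_ages):
        raise ValueError("inputs must have the same length")
    return mean(abs(p - t) for p, t in zip(predicted_ages, true_ages))


# Invented toy values: the same four samples measured on two hypothetical
# platforms, where platform B reads a scaled and shifted version of platform A.
platform_a = [0.20, 0.40, 0.60, 0.80]
platform_b = [0.5 * v + 0.10 for v in platform_a]

z_a = zscore(platform_a)
z_b = zscore(platform_b)

# After standardization the two platforms agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(z_a, z_b))

# Invented predicted and true ages, only to show how MAD is computed.
mad = mean_absolute_deviation([30.0, 42.0, 55.0], [28.0, 45.0, 55.0])
print(f"max z-score gap: {max_gap:.2e}, MAD: {mad:.2f} years")
```

The toy platforms differ only by a linear shift and scale, which is the case a z-score removes exactly. Real platform differences need not be linear, which is consistent with the summary's statement that the batch effect was only partially overcome.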
ISSN: 1872-4973, 1878-0326
DOI: 10.1016/j.fsigen.2018.03.009