Supporting data for "VariantSpark: A Distributed Implementation of Random Forest Tailored for Ultra High Dimensional Genomic Data"
Main authors: | , , , , , , , , , |
---|---|
Format: | Dataset |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Many traits and diseases are thought to be driven by more than one gene (polygenic). Polygenic Risk Scores (PRS) therefore extend Genome-Wide Association Studies (GWAS) by taking multiple genes into account when building risk models. However, PRS consider only the additive effects of individual genes, not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions has been found in small datasets, large datasets have not yet been processed because of the high computational complexity of searching for epistatic interactions. We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to population-scale whole-genome datasets with a hundred million genomic variants and a hundred thousand samples. Compared to traditional monogenic GWAS, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and is the only method able to scale to ultra-high-dimensional genomic data in a manageable time. |
---|---|
DOI: | 10.5524/100759 |