SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics have also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability to understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which is crucial for genome-wide association studies. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNPs. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely on haploid sequences. We release our code and dataset at https://github.com/HLTCHKUST/snp2vec.
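For intuition about the haploid-versus-SNP distinction raised in the abstract, the sketch below shows one possible way to expose diploid genotype calls to a sequence model: overlay them on a haploid reference and collapse heterozygous sites into IUPAC ambiguity codes. This is an illustrative assumption only; the function name, the genotype format, and the IUPAC-overlay encoding are not taken from the SNP2Vec paper.

```python
# Illustrative sketch: encode diploid SNP calls on top of a haploid reference
# by using IUPAC ambiguity codes for heterozygous sites. This is NOT the
# encoding used by SNP2Vec; it only illustrates why haploid-only models
# cannot represent such variation directly.

# IUPAC ambiguity codes for unordered base pairs.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S",
}

def diploid_to_sequence(reference: str, genotypes: dict) -> str:
    """Overlay diploid genotype calls onto a haploid reference sequence.

    reference: haploid reference bases, e.g. "ACGTACGT".
    genotypes: position -> (allele_1, allele_2), e.g. {4: ("A", "G")}.
    Homozygous sites become the single allele; heterozygous sites become
    an IUPAC ambiguity code so one token still carries both alleles.
    """
    seq = list(reference)
    for pos, (a1, a2) in genotypes.items():
        seq[pos] = a1 if a1 == a2 else IUPAC[frozenset((a1, a2))]
    return "".join(seq)

if __name__ == "__main__":
    ref = "ACGTACGT"
    # Homozygous alternate allele at position 1, heterozygous SNP at position 4.
    calls = {1: ("G", "G"), 4: ("A", "G")}
    print(diploid_to_sequence(ref, calls))  # -> "AGGTRCGT"
```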
DOI: 10.48550/arxiv.2204.06699