Scaling the discrete-time Wright–Fisher model to biobank-scale datasets

Bibliographic details
Published in:	Genetics (Austin) 2023-11, Vol.225 (3)
Main authors:	Spence, Jeffrey P; Zeng, Tony; Mostafavi, Hakhamanesh; Pritchard, Jonathan K
Format:	Article
Language:	eng
Subjects:	Algorithms; Alleles; Approximation; Biobanks; Biological Specimen Banks; Computation; Diffusion; Gene Frequency; Genetic Drift; Genetics; Genetics, Population; Investigation; Models, Genetic; Population genetics; Probability; Selection, Genetic; Transition probabilities
Online access:	Full text

Description
Summary:	The discrete-time Wright–Fisher (DTWF) model and its diffusion limit are central to population genetics. These models describe the forward-in-time evolution of allele frequencies in a population resulting from genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large samples or in the presence of strong selection. Existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here, we present a scalable algorithm that approximates the DTWF model with provably bounded error. Our approach relies on two key observations about the DTWF model. The first is that transition probabilities under the model are approximately sparse. The second is that transition distributions for similar starting allele frequencies are extremely close as distributions. Together, these observations enable approximate matrix–vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the tens of millions, paving the way for rigorous biobank-scale inference. Finally, we use our results to estimate the impact of larger samples on estimating selection coefficients for loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
	The discrete-time Wright–Fisher model is central to population genetics, but computing likelihoods under this model is computationally difficult. Here, Spence et al. present a new, provably fast and accurate algorithm to compute these likelihoods. They show that recurrent mutation substantially affects observed genetic variation in human cohorts and investigate how increasing sample sizes will improve estimation of selection coefficients. The authors find that increasing sample sizes beyond existing cohorts only provides additional information for the most extremely selected variants.
ISSN:	0016-6731, 1943-2631
DOI:	10.1093/genetics/iyad168
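
Illustration (not part of the catalog record or the article): the summary's first observation, that DTWF transition probabilities are approximately sparse, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' algorithm. The function names, the haploid selection formula, the symmetric mutation rate `mu`, and the window parameter `width` are all hypothetical choices made for this example. It also uses only the sparsity observation, so it is faster than the dense product but not linear-time; the second observation (merging similar starting frequencies) is not implemented here.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability of k successes."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def next_freq(i, n, s, mu):
    """Expected allele frequency in the next generation given i of n copies.

    Assumption for this sketch: haploid selection with coefficient s > -1,
    followed by symmetric mutation at rate mu.
    """
    x = i / n
    x = x * (1.0 + s) / (1.0 + s * x)
    return x * (1.0 - mu) + (1.0 - x) * mu


def dtwf_step_dense(v, n, s=0.0, mu=0.0):
    """One generation of v -> v P using every entry of P (quadratic time)."""
    out = [0.0] * (n + 1)
    for i, vi in enumerate(v):
        if vi == 0.0:
            continue
        p = next_freq(i, n, s, mu)
        for j in range(n + 1):
            out[j] += vi * math.exp(binom_logpmf(j, n, p))
    return out


def dtwf_step_sparse(v, n, s=0.0, mu=0.0, width=8.0):
    """Approximate v -> v P, keeping only entries near each row's mean.

    Each row of P is Binomial(n, p), which concentrates within a few
    standard deviations of n * p, so entries outside the window are dropped.
    """
    out = [0.0] * (n + 1)
    for i, vi in enumerate(v):
        if vi == 0.0:
            continue
        p = next_freq(i, n, s, mu)
        mean = n * p
        sd = math.sqrt(n * p * (1.0 - p))
        half = width * sd + width  # extra padding for skewed rows near 0 or n
        lo = max(0, math.floor(mean - half))
        hi = min(n, math.ceil(mean + half))
        for j in range(lo, hi + 1):
            out[j] += vi * math.exp(binom_logpmf(j, n, p))
    return out


if __name__ == "__main__":
    n = 200
    v = [1.0 / (n + 1)] * (n + 1)  # uniform starting distribution
    dense = dtwf_step_dense(v, n, s=0.01, mu=1e-3)
    sparse = dtwf_step_sparse(v, n, s=0.01, mu=1e-3)
    print(max(abs(a - b) for a, b in zip(dense, sparse)))
```

Because each row's window grows like the square root of the population size, the truncated product costs roughly n^1.5 operations rather than n^2. Reaching the linear time described in the summary would additionally require exploiting the closeness of transition distributions for nearby starting frequencies.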