HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves

Detailed description

Bibliographic details
Published in:	PeerJ. Computer science 2020-01, Vol.6, p.e243-e243, Article 243
Main authors:	Phillips, Jarrett D.; French, Steven H.; Hanner, Robert H.; Gillis, Daniel J.
Format:	Article
Language:	English
Subjects:	Accumulation; Algorithm; Algorithms; Bar codes; Biodiversity; Bioinformatics; Computational Biology; Computer Science; Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science, Theory & Methods; Computer simulation; Data Science; Data systems; Deoxyribonucleic acid; DNA; DNA barcoding; Extrapolation; Frequency distribution; Gene sequencing; Genetic diversity; Haplotypes; Iterative method; Novels; Optimization Theory and Computation; Parameter estimation; Robustness; Sample size; Sampling; Sampling sufficiency; Science & Technology; Scientific Computing and Simulation; Species; Technology
Online access:	Full text

Description
Abstract:	Assessing levels of standing genetic variation within species requires robust sampling for the purpose of accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Moreover, such estimates are needed because most species are currently represented by only one or a few sequences in existing databases, which can safely be assumed to be undersampled. Unfortunately, sample sizes of 5-10 specimens per species typically seen in DNA barcoding studies are often insufficient to adequately capture within-species genetic diversity. Here, we introduce a novel iterative extrapolation simulation algorithm of haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator), that can be employed to calculate likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be gleaned. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases like the Barcode of Life Data Systems (BOLD) or GenBank for any genomic marker of interest. While our algorithm is globally convergent, runtime is heavily dependent on initial sample sizes and skewness of the corresponding haplotype frequency distribution.
ISSN:	2376-5992
DOI:	10.7717/peerj-cs.243
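
Illustrative sketch: The abstract describes an iterative extrapolation over simulated haplotype accumulation curves. The Python sketch below is one loose way to picture that idea and is not the HACSim implementation (HACSim itself is an R package). Everything in it is an assumption made for illustration: the function names, the proportion threshold `p`, sampling with replacement from the frequency distribution, and the update rule that scales the sample size by the ratio of total to mean observed haplotypes.

```python
import math
import random


def mean_haplotypes_observed(n, probs, perms=200, rng=None):
    """Average number of distinct haplotypes seen across `perms` random
    draws of n specimens (hypothetical helper, sampling with replacement)."""
    rng = rng or random.Random(0)
    labels = range(len(probs))
    total = 0
    for _ in range(perms):
        draws = rng.choices(labels, weights=probs, k=n)
        total += len(set(draws))
    return total / perms


def estimate_sample_size(n_init, probs, p=0.95, perms=200, max_iter=100, seed=0):
    """Grow the sample size until the mean fraction of haplotypes observed
    reaches p. The update rule here is an assumed, illustrative choice."""
    rng = random.Random(seed)
    h_star = len(probs)  # total number of haplotypes assumed to exist
    n = n_init
    h_mean = 0.0
    for _ in range(max_iter):
        h_mean = mean_haplotypes_observed(n, probs, perms, rng)
        if h_mean / h_star >= p:
            break
        # Extrapolate: scale n by how far the curve is from saturation,
        # and always move forward by at least one specimen.
        n = max(n + 1, math.ceil(n * h_star / h_mean))
    return n, h_mean


if __name__ == "__main__":
    uniform = [0.2] * 5                       # five equally common haplotypes
    skewed = [0.80, 0.10, 0.05, 0.03, 0.02]   # one dominant haplotype
    print("uniform:", estimate_sample_size(5, uniform))
    print("skewed: ", estimate_sample_size(5, skewed))
```

Running the sketch with the two example distributions shows the qualitative point made in the abstract: a skewed frequency distribution needs more iterations and a larger sample than a uniform one before the curve saturates.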