High throughput nonparametric probability density estimation

Bibliographic details
Published in:	PloS one 2018-05, Vol.13 (5), p.e0196937-e0196937
Main authors:	Farmer, Jenny; Jacobs, Donald
Format:	Article
Language:	English
Subjects:	Analysis; Bayesian analysis; Bioinformatics; Computational biology; Criteria; Density; Diagnostic systems; Distribution functions; Entropy; Estimates; Fluctuations; Lagrange multiplier; Mathematical functions; Mathematical research; Mathematics; Maximum entropy; Maximum entropy method; Methods; Models, Theoretical; Nonparametric statistics; Physical Sciences; Probability; Probability density function; Probability density functions; Probability distribution; Probability distribution functions; Probability distributions; Random variables; Research and Analysis Methods; Scoring; Singularities; Social Sciences; Statistical analysis; Statistical inference; Statistical methods; Statistical models; Statistics; Technology application
Online access:	Full text

Description
Abstract:	In high throughput applications, such as those found in bioinformatics and finance, it is important to determine accurate probability distribution functions despite only minimal information about data characteristics, and without using human subjectivity. Such an automated process for univariate data is implemented to achieve this goal by merging the maximum entropy method with single order statistics and maximum likelihood. The only required properties of the random variables are that they are continuous and that they are, or can be approximated as, independent and identically distributed. A quasi-log-likelihood function based on single order statistics for sampled uniform random data is used to empirically construct a sample-size-invariant universal scoring function. Then a probability density estimate is determined by iteratively improving trial cumulative distribution functions, where better estimates are quantified by the scoring function that identifies atypical fluctuations. This criterion resists under- and overfitting the data, as an alternative to employing the Bayesian or Akaike information criterion. Multiple estimates for the probability density reflect uncertainties due to statistical fluctuations in random samples. Scaled quantile residual plots are also introduced as an effective diagnostic to visualize the quality of the estimated probability densities. Benchmark tests show that estimates for the probability density function (PDF) converge to the true PDF as sample size increases on particularly difficult test probability densities that include cases with discontinuities, multi-resolution scales, heavy tails, and singularities. These results indicate the method has general applicability for high throughput statistical inference.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0196937
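
Illustration:	The abstract describes scoring a trial cumulative distribution function (CDF) through single order statistics of uniform data. The Python sketch below is one possible reading of that idea, not the authors' implementation. It relies on a standard fact: for n independent uniform values on (0, 1), the k-th smallest follows a Beta(k, n + 1 - k) distribution with mean k / (n + 1). The function names, the averaging over n, the sqrt(n + 2) residual scaling, and the exponential test data are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def quasi_log_likelihood(u_sorted):
    """Average log-density of sorted values in (0, 1), each scored under
    its own single order statistic marginal Beta(k, n + 1 - k).
    Treating the order statistics as independent is what makes this
    "quasi"; the averaging over n is an assumption of this sketch."""
    n = len(u_sorted)
    total = 0.0
    for k, u in enumerate(u_sorted, start=1):
        a, b = k, n + 1 - k
        # log of the Beta(a, b) normalization constant
        log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        total += log_norm + (a - 1) * math.log(u) + (b - 1) * math.log1p(-u)
    return total / n


def scaled_quantile_residuals(u_sorted):
    """Deviation of each sorted value from its expected position
    k / (n + 1), multiplied by sqrt(n + 2) (assumed scaling) so the
    spread does not shrink as the sample grows."""
    n = len(u_sorted)
    scale = math.sqrt(n + 2)
    return [scale * (u - k / (n + 1)) for k, u in enumerate(u_sorted, start=1)]


def score_trial_cdf(sample, cdf, eps=1e-12):
    """Map the sample through a trial CDF, sort, and score the result."""
    u = sorted(min(max(cdf(x), eps), 1.0 - eps) for x in sample)
    return quasi_log_likelihood(u), scaled_quantile_residuals(u)


# Hypothetical test data: exponential sample with rate 1.
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(500)]

# Trial CDF with the correct rate versus one with a wrong rate.
good_score, good_res = score_trial_cdf(sample, lambda x: 1.0 - math.exp(-x))
bad_score, bad_res = score_trial_cdf(sample, lambda x: 1.0 - math.exp(-3.0 * x))

print("quasi-log-likelihood, correct CDF:", good_score)
print("quasi-log-likelihood, wrong CDF:  ", bad_score)
print("max |residual|, correct CDF:", max(abs(r) for r in good_res))
print("max |residual|, wrong CDF:  ", max(abs(r) for r in bad_res))
```

A badly wrong trial CDF produces a much lower score and much larger residuals than the correct one. Because the variance of the k-th uniform order statistic is mu(1 - mu) / (n + 2) with mu = k / (n + 1), the scaled residuals have a spread that is independent of n, which is what makes plotting them against mu a usable diagnostic across sample sizes. Note that this sketch only detects underfitting; the abstract states that the scoring function flags atypical fluctuations in general, which would also cover scores that are suspiciously high.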