Methods for Single Cell and Longevity Genomics

Bibliographic details
Author: Townes, F. William
Format: Dissertation
Language: English
Description
Summary:
Chapter 1: Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show that UMI counts follow multinomial sampling with no zero-inflation. Current normalization procedures, such as taking the log of counts per million, and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.

Chapter 2: For scRNA-Seq data lacking UMIs, we propose quasi-UMIs: quantile normalization of read counts to a compound Poisson distribution empirically derived from UMI datasets. In an assessment using datasets for which both UMIs and read counts were available, quasi-UMI counts were closer to UMI counts than the output of competing normalization methods such as census counts.

Chapter 3: Aging is a complex process with poorly understood genetic mechanisms. Recent studies have sought to classify genes as pro-longevity or anti-longevity using a variety of machine learning algorithms. However, assessments based on held-out test data are lacking. Further, it is not clear which types of features are best for improving classification accuracy and precision. Leveraging gene annotations for two model organisms from the GenAge database, we use gene ontology and publicly available gene expression datasets as features to systematically compare five popular classification algorithms. Elastic net regularized logistic regression (GLM-Net) performs well. Using GLM-Net, we make predictions for pro- and anti-longevity genes among those not found in GenAge.
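
The deviance-based feature selection mentioned for Chapter 1 can be illustrated with a minimal sketch. The abstract does not give the formula, so the code below makes an assumption: each gene is scored by its binomial deviance (a per-gene approximation to the multinomial) against a null model in which the gene has the same relative abundance in every cell. The function names `binomial_deviance` and `top_genes_by_deviance` are hypothetical and are not taken from the dissertation.

```python
import numpy as np


def binomial_deviance(counts):
    """Per-gene binomial deviance under a constant-expression null model.

    counts: array of shape (cells, genes) with non-negative UMI counts.
    Assumption (not from the abstract): the null model gives every gene
    the same relative abundance pi_j in all cells.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)             # total UMIs per cell
    pi = counts.sum(axis=0, keepdims=True) / n.sum()  # pooled abundance per gene
    mu = n * pi                                       # expected counts under the null

    def y_log_ratio(y, m):
        # y * log(y / m), using the convention 0 * log(0) = 0
        out = np.zeros_like(y)
        pos = y > 0
        out[pos] = y[pos] * np.log(y[pos] / m[pos])
        return out

    observed_term = y_log_ratio(counts, mu)
    remainder_term = y_log_ratio(n - counts, n - mu)
    return 2.0 * (observed_term + remainder_term).sum(axis=0)


def top_genes_by_deviance(counts, k):
    """Indices of the k genes with the largest deviance, in decreasing order."""
    deviance = binomial_deviance(counts)
    return np.argsort(-deviance)[:k]


# Toy data: 4 cells x 3 genes. Gene 0 is always 10% of each cell's total,
# so it carries no information under the null model.
toy = np.array([
    [10, 50, 40],
    [20, 0, 180],
    [10, 0, 90],
    [20, 100, 80],
])
print(binomial_deviance(toy))
print(top_genes_by_deviance(toy, 2))
```

In this sketch a gene whose share of each cell's total is constant scores near zero, while genes whose share varies across cells score higher and are ranked first.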