Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants

Bibliographic Details
Published in:	Scientific reports 2017-06, Vol.7 (1), p.2959-12, Article 2959
Main authors:	Schubach, Max; Re, Matteo; Robinson, Peter N.; Valentini, Giorgio
Format:	Article
Language:	English
Subjects:	45; 631/114/1305; 631/114/2413; 631/114/2785; Algorithms; Artificial intelligence; Disease; Genetic diversity; Genetic Predisposition to Disease; Genetic Variation; Genome-Wide Association Study; Genomes; Humanities and Social Sciences; Humans; Learning algorithms; Machine Learning; Models, Genetic; multidisciplinary; Mutation; Regulatory sequences; Reproducibility of Results; RNA, Untranslated; Science; Science (multidisciplinary); Software
Online access:	Full text

Description
Abstract:	Disease and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, which results in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian diseases and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms, and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
ISSN:	2045-2322
DOI:	10.1038/s41598-017-03011-5
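
Illustrative sketch:	The abstract describes combining resampling with an ensemble of learners to cope with class imbalance. The Python below is a minimal sketch of that general idea only, and is not the authors' method or software: it assumes the majority (negative) class is split into disjoint partitions, each partition is undersampled to the size of the minority (positive) class, and one simple model is trained per balanced subset. The nearest-centroid base learner, the toy two-dimensional data, and all function names (`train_balanced_ensemble`, `score`, `make_toy_data`) are assumptions made for illustration.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_balanced_ensemble(positives, negatives, n_partitions, seed=0):
    """Train one stand-in model per partition of the majority class.

    Assumption: each "model" is just a pair of class centroids; a real
    system would plug in a stronger base learner here.
    """
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    # Split the majority class into disjoint partitions.
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
    models = []
    for part in partitions:
        # Undersample so each model sees as many negatives as positives.
        k = min(len(part), len(positives))
        sample = rng.sample(part, k)
        models.append((centroid(positives), centroid(sample)))
    return models


def score(models, x):
    """Fraction of ensemble members that place x nearer the positive centroid."""
    votes = [1.0 if sq_dist(x, pos) < sq_dist(x, neg) else 0.0
             for pos, neg in models]
    return mean(votes)


def make_toy_data(seed=0):
    """Hypothetical imbalanced data: 10 positives versus 1000 negatives."""
    rng = random.Random(seed)
    positives = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(10)]
    negatives = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(1000)]
    return positives, negatives


if __name__ == "__main__":
    pos, neg = make_toy_data()
    ensemble = train_balanced_ensemble(pos, neg, n_partitions=10)
    print(score(ensemble, [3.0, 3.0]), score(ensemble, [0.0, 0.0]))
```

Averaging over several balanced subsets lets every majority-class example contribute to some ensemble member while no single member is trained on a heavily skewed class ratio, which is the motivation for pairing resampling with an ensemble in this sketch.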