What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) technique...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Human genetics 2022-09, Vol.141 (9), p.1515-1528
Hauptverfasser: Musolf, Anthony M., Holzinger, Emily R., Malley, James D., Bailey-Wilson, Joan E.
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for the deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests are described. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
ISSN:0340-6717
1432-1203
DOI:10.1007/s00439-021-02402-z