Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach
Published in: | Expert Systems with Applications, 2019-12, Vol. 137, p. 392-404 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | • A new multi-test decision tree classifier tailored to gene expression data. • An evolutionary approach for reducing underfitting in decision tree inducers. • Gene-cluster-based tests outperform univariate tests in splitting nodes.
Underfitting and overfitting in machine learning are often associated with the bias-variance trade-off. Underfitting manifests most clearly in tree-based inducers when they are used to classify gene expression data. To improve the generalization ability of decision trees, we introduce an evolutionary multi-test tree approach tailored to this application domain. The general idea is to apply, in each splitting rule, gene clusters of varying size that consist of functionally related genes. This is achieved by using a few simple tests that mimic each other’s predictions, together with built-in information about the discriminatory power of genes. The tendencies to underfit and overfit are limited by a multi-objective fitness function that minimizes tree error, split divergence and attribute costs. The evolutionary search for multi-tests in internal nodes and for the overall tree structure is performed simultaneously.
This novel approach, called Evolutionary Multi-Test Tree (EMTTree), may bring far-reaching benefits to molecular biology, including biomarker discovery, identification of new gene-gene interactions and high-quality prediction. Extensive experiments on 35 publicly available gene expression datasets show that the approach significantly improves the accuracy and stability of decision trees. Importantly, EMTTree does not substantially increase the overall complexity of the tree, so the patterns in the predictive structures remain comprehensible. |
---|---|
ISSN: | 0957-4174 1873-6793 |
DOI: | 10.1016/j.eswa.2019.07.019 |
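
The summary describes splitting rules built from a few simple tests and a fitness function that weighs tree error, split divergence and attribute costs. The following is a minimal Python sketch of how such a rule and such a fitness value might look, not the authors' implementation. It assumes that each simple test is a threshold on a single gene, that the tests are combined by majority vote, and that the three objectives are combined as a weighted sum. The names `multi_test_goes_left` and `fitness` and the weights `alpha`, `beta` and `gamma` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a simple (univariate) test is a pair (gene_index, threshold).
# It votes "left" when the sample's expression of that gene is below the threshold.
Test = Tuple[int, float]


def multi_test_goes_left(sample: Sequence[float], tests: List[Test]) -> bool:
    """Route a sample by majority vote over the simple tests (assumed rule)."""
    left_votes = sum(1 for gene, threshold in tests if sample[gene] < threshold)
    return 2 * left_votes > len(tests)


def fitness(error: float, divergence: float, cost: float,
            alpha: float = 1.0, beta: float = 0.1, gamma: float = 0.1) -> float:
    """Hypothetical weighted-sum fitness over the three objectives; lower is better."""
    return alpha * error + beta * divergence + gamma * cost


# Three tests on three genes that would form one cluster in this sketch.
tests = [(0, 5.0), (1, 2.0), (2, 7.5)]
sample = [4.2, 2.6, 6.9]
goes_left = multi_test_goes_left(sample, tests)  # two of the three tests vote "left"
```

Using an odd number of tests avoids ties in the vote. The weights only illustrate that error would typically dominate the other two objectives; the summary does not state how the objectives are actually combined.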