Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

Bibliographic details
Published in:	IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017-11, Vol. 39 (11), pp. 2142-2153
Main authors:	Painsky, Amichai; Rosset, Saharon
Format:	Article
Language:	English
Subjects:	Analytical models; Buildings; Classification and regression trees; Computational modeling; Computer simulation; Data analysis; Data management; gradient boosting; Input variables; Mathematical models; Partitioning; Performance enhancement; Performance prediction; Predictive models; Production methods; random forests; Recursive methods; Regression tree analysis; Splitting; Variables; Vegetation

Description
Abstract:	Recursive partitioning methods producing tree-like models are a long-standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree-building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. We propose a framework for splitting that uses leave-one-out (LOO) cross-validation (CV) to select the splitting variable, then performs a regular split (in our case, following CART's approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our splitting approach significantly improves the performance of both single tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which under reasonable assumptions does not substantially increase the overall computational complexity compared to CART for two-class classification.
ISSN:	0162-8828 1939-3539 2160-9292
DOI:	10.1109/TPAMI.2016.2636831
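
Illustration:	The idea in the abstract (score each candidate variable by leave-one-out error, then split the winner in the usual CART way) can be sketched roughly as follows. This is not the authors' algorithm: it is a naive version that refits the split once per held-out row, uses squared error on a numeric response rather than the two-class setting the paper analyzes, and makes no attempt at the efficient LOO computation the abstract mentions. The function names (`best_split`, `loo_error`, `select_variable`), the fallback to the training mean for a category unseen in the training rows, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def best_split(x, y):
    """CART-style binary split of a categorical variable under squared error:
    order categories by mean response, then scan the K-1 cut points."""
    cats = np.unique(x)
    if len(cats) < 2:
        return None
    means = np.array([y[x == c].mean() for c in cats])
    order = cats[np.argsort(means)]
    best = None
    for k in range(1, len(order)):
        left = np.isin(x, order[:k])
        l_mean, r_mean = y[left].mean(), y[~left].mean()
        sse = ((y[left] - l_mean) ** 2).sum() + ((y[~left] - r_mean) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, set(order[:k].tolist()), l_mean, r_mean)
    return best

def loo_error(x, y):
    """Naive leave-one-out squared error of the best split on x."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        x_tr, y_tr = x[keep], y[keep]
        split = best_split(x_tr, y_tr)
        # Assumption: if the held-out category never appears in the training
        # rows (or no split exists), predict the overall training mean.
        if split is None or not np.any(x_tr == x[i]):
            pred = y_tr.mean()
        else:
            _, left_cats, l_mean, r_mean = split
            pred = l_mean if x[i] in left_cats else r_mean
        total += (y[i] - pred) ** 2
    return total

def select_variable(columns, y):
    """Pick the variable whose split has the lowest LOO error."""
    return min(columns, key=lambda name: loo_error(columns[name], y))

# Toy data: one informative binary variable and one uninformative
# variable with a distinct category per row (e.g. a row identifier).
rng = np.random.default_rng(0)
n = 40
signal = np.repeat([0, 1], n // 2)
y = 2.0 * signal + rng.normal(scale=0.3, size=n)
columns = {"signal": signal, "row_id": np.arange(n)}

chosen = select_variable(columns, y)
print(chosen)
```

In this toy setup the many-category `row_id` variable can always match or beat `signal` on in-sample squared error, because its categories can be grouped freely, yet its held-out predictions carry no information, so the LOO criterion prefers `signal`. Once a variable is chosen, the actual split is still the ordinary in-sample one returned by `best_split`.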