Data classification with binary response through the Boosting algorithm and logistic regression

Bibliographic details
Published in:	Expert systems with applications 2017-03, Vol.69, p.62-73
Main authors:	de Menezes, Fortunato S., Liska, Gilberto R., Cirillo, Marcelo A., Vivanco, Mário J.F.
Format:	Article
Language:	English
Subjects:	AIC; Algorithms; BIC; Boosting algorithm; Classification; Computer simulation; Data classification; Discrimination; Information criteria; Logistic regression; Maximum likelihood estimates; Monte Carlo Simulation; Regression; Selection of models
Online access:	Full text

Description
Abstract:
• Review of the AIC and BIC information criteria focused on binary data classification.
• Usual data classification is presented with its drawbacks (i.e., low performance).
• Boosting algorithm showed enhanced results supported by MC simulation.
• Hosmer–Lemeshow test sets the training (test) partition for classification.
• CHD classification is performed with Boosting, showing its high performance.

The task of classifying is natural to humans, but there are situations in which a person is not best suited to perform this function, which creates the need for automatic methods of classification. Traditional methods, such as logistic regression, are commonly used in this type of situation, but they lack robustness and accuracy. These methods do not work very well when the data [...] or when there is noise in the data, situations that are common in expert and intelligent systems. Due to the importance and the increasing complexity of problems of this type, there is a need for methods that provide greater accuracy and interpretability of the results. Among these methods is Boosting, which operates sequentially by applying a classification algorithm to reweighted versions of the training data set. It was recently shown that Boosting may also be viewed as a method for functional estimation. The purpose of the present study was to compare the logistic regression model estimated by maximum likelihood (LRMML) with the logistic regression model estimated using the Boosting algorithm, specifically the Binomial Boosting algorithm (LRMBB), and to select the model with the better fit and discrimination capacity for the presence (absence) of a given property (in this case, binary classification). To illustrate this situation, the example used was to classify the presence (absence) of coronary heart disease (CHD) as a function of various biological variables collected from patients.
The simulation results indicate that the LRMBB model is more appropriate than the LRMML model for fitting data sets with several covariates and noisy data. The following sections report lower values of the information criteria AIC and BIC for the LRMBB model, and the Hosmer–Lemeshow test shows no evidence of a poor fit for the LRMBB model. The LRMBB model also presented higher AUC, sensitivity, specificity and accuracy, and lower false positive and false negative rates, making it a model with bette
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2016.08.014
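
Illustrative sketch (not part of the catalog record or the article): the abstract describes Boosting as a sequential procedure that can also be viewed as functional estimation, and compares models by AIC, BIC, sensitivity, specificity and accuracy. The Python code below is a minimal sketch of that idea under stated assumptions, not the authors' implementation. It assumes componentwise least-squares base learners fitted to the gradient of the binomial log-likelihood, roughly centered covariates, a fixed step length `nu`, and a caller-supplied parameter count `k` for AIC and BIC. All function names are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binomial_boost(X, y, n_steps=200, nu=0.1):
    """Componentwise gradient boosting for the binomial log-likelihood.

    X: list of rows (covariates assumed roughly centered), y: 0/1 labels
    with both classes present. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    base_rate = sum(y) / n
    intercept = math.log(base_rate / (1.0 - base_rate))  # log-odds offset
    beta = [0.0] * p
    f = [intercept] * n  # current additive predictor on the logit scale
    for _ in range(n_steps):
        # Negative gradient of the log-loss with respect to f: y - p.
        r = [y[i] - sigmoid(f[i]) for i in range(n)]
        best_j, best_b, best_sse = None, 0.0, float("inf")
        for j in range(p):
            sxx = sum(X[i][j] ** 2 for i in range(n))
            if sxx == 0.0:
                continue
            # Least-squares slope of the residuals on covariate j alone.
            b = sum(X[i][j] * r[i] for i in range(n)) / sxx
            sse = sum((r[i] - b * X[i][j]) ** 2 for i in range(n))
            if sse < best_sse:
                best_j, best_b, best_sse = j, b, sse
        if best_j is None:
            break
        # Update only the best-fitting covariate, shrunk by nu.
        beta[best_j] += nu * best_b
        for i in range(n):
            f[i] += nu * best_b * X[i][best_j]
    return intercept, beta


def predict_proba(X, intercept, beta):
    return [sigmoid(intercept + sum(b * x for b, x in zip(beta, row)))
            for row in X]


def log_likelihood(y, p, eps=1e-12):
    # Bernoulli log-likelihood; probabilities are clipped to avoid log(0).
    return sum(yi * math.log(max(pi, eps))
               + (1 - yi) * math.log(max(1.0 - pi, eps))
               for yi, pi in zip(y, p))


def aic(loglik, k):
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    return k * math.log(n) - 2 * loglik


def classification_rates(y, p, threshold=0.5):
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= threshold)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < threshold)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < threshold)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= threshold)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / len(y),
    }
```

In this functional-gradient form the reweighting of observations is implicit: cases the current model predicts poorly have larger residuals and therefore more influence on the next base learner. How the effective number of parameters `k` should be counted for a boosted model is left open here and would need to follow the article's own definition.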