Investigating the Role of Clustering in Construction-Accident Severity Prediction Using a Heterogeneous and Imbalanced Data Set

Bibliographic details
Published in: Journal of Construction Engineering and Management, 2023-02, Vol. 149 (2)
Main authors: Salarian, Ali Akbar; Etemadfard, Hossein; Rahimzadegan, Ali; Ghalehnovi, Mansour
Format: Article
Language: English
Description
Summary: Despite remarkable advances in the construction industry, it is still among the most hazardous industries, and its accidents occur at different severity levels. Construction accident data sets are available for analysis, but they suffer from heterogeneity and class imbalance. Heterogeneity results from the many complexities and uncertainties in construction projects, and it leads to poor predictive performance of machine learning algorithms. Class imbalance arises because accidents of different severities occur with unequal frequency, which biases prediction results. This study aimed to assess the impact of clustering on construction accident analysis when a data set is heterogeneous and imbalanced, and to take a step toward making incidents more predictable. Accidents were predicted following four data preparation approaches: unmodified, balanced, clustered, and clustered + balanced. The k-means clustering algorithm was adopted to split the data into homogeneous clusters. The synthetic minority oversampling technique (SMOTE) and k-means SMOTE (KMSMOTE) were used to overcome the class imbalance issue. Five supervised machine learning algorithms were employed for prediction: classification and regression tree (CART), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and artificial neural network (ANN). The results indicated that clustering significantly improved the predictive performance of the algorithms. Clustering combined with oversampling was the most appropriate approach for analyzing accidents, providing more accurate and reliable predictions. The improvements resulting from this approach were about 33%, 23%, and 33% in average precision, recall, and F1-score, respectively. Moreover, the ensemble learning classifiers used, RF and XGB, outperformed the other models. Ultimately, this research assists safety professionals in predicting outcomes more accurately and in undertaking more appropriate safety measures.
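
The "clustered + balanced" preparation described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that oversampling is applied separately inside each k-means cluster, uses a plain-Python k-means and a basic SMOTE in place of library code, and runs on a made-up toy data set. The function names (`kmeans`, `smote`, `cluster_then_balance`) and all parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def smote(X, y, k_neighbors=3, seed=0):
    """Basic SMOTE: grow every class to the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, count in counts.items():
        samples = [x for x, lab in zip(X, y) if lab == cls]
        for _ in range(target - count):
            i = rng.randrange(len(samples))
            base = samples[i]
            # Nearest same-class neighbours of the base sample.
            others = [s for j, s in enumerate(samples) if j != i]
            others.sort(key=lambda s: math.dist(base, s))
            neighbours = others[:k_neighbors]
            if neighbours:
                other = rng.choice(neighbours)
                gap = rng.random()
                # Synthetic point on the segment between base and neighbour.
                new = tuple(b + gap * (o - b) for b, o in zip(base, other))
            else:
                new = base  # a single-sample class can only be duplicated
            X_out.append(new)
            y_out.append(cls)
    return X_out, y_out


def cluster_then_balance(X, y, k=2, seed=0):
    """Split the data with k-means, then oversample inside each cluster."""
    labels = kmeans(X, k, seed=seed)
    result = {}
    for c in range(k):
        Xc = [x for x, lab in zip(X, labels) if lab == c]
        yc = [t for t, lab in zip(y, labels) if lab == c]
        if Xc:  # skip clusters that ended up empty
            result[c] = smote(Xc, yc, seed=seed)
    return result


# Toy data (invented for illustration): two feature groups, each imbalanced.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.3, 5.3), (5.2, 5.4)]
y = ["minor", "minor", "minor", "minor", "severe",
     "severe", "severe", "severe", "minor", "minor"]

balanced = cluster_then_balance(X, y, k=2)
for cluster_id, (Xc, yc) in balanced.items():
    print(cluster_id, dict(Counter(yc)))
```

In a full pipeline, one classifier per cluster would then be trained on the balanced data. Established implementations of k-means, SMOTE, and k-means SMOTE exist in scikit-learn and imbalanced-learn and would normally be preferred over this hand-written version.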
ISSN: 0733-9364, 1943-7862
DOI: 10.1061/(ASCE)CO.1943-7862.0002406