Bayesian decision support for coding occupational injury data
Published in: | Journal of safety research 2016-06, Vol.57, p.71-82 |
---|---|
Format: | Article |
Language: | English |
Online access: | Full text |
Summary: | Studies on autocoding injury data have found that machine learning algorithms perform well for categories that occur frequently but often struggle with rare categories. Therefore, manual coding, although resource-intensive, cannot be eliminated. We propose a Bayesian decision support system to autocode a large portion of the data, filter cases for manual review, and assist human coders by presenting them with the top k prediction choices and a confusion matrix of predictions from Bayesian models.
We studied the prediction performance of Single-Word (SW) and Two-Word-Sequence (TW) Naïve Bayes models on a sample of data from the 2011 Survey of Occupational Injuries and Illnesses (SOII). We used agreement between the SW and TW predictions, together with various prediction strength thresholds, to decide which cases to autocode and which to filter for manual review. We also studied the sensitivity of the top k predictions of the SW model, TW model, and SW–TW combination, and then compared the accuracy of the codes manually assigned to the SOII data with that of the proposed system.
The accuracy of the proposed system, assuming well-trained coders reviewing a subset of only 26% of cases flagged for review, was estimated to be comparable (86.5%) to the accuracy of the original coding of the data set (range: 73%–86.8%). Overall, the TW model had higher sensitivity than the SW model, and the accuracy of the prediction results increased when the two models agreed, and for higher prediction strength thresholds. The sensitivity of the top five predictions was 93%.
The proposed system seems promising for coding injury data as it offers comparable accuracy and less manual coding.
Accurate and timely coded occupational injury data is useful for surveillance as well as prevention activities that aim to make workplaces safer.
• A semi-automated approach using Bayesian models is proposed for coding injury data.
• Accuracy of proposed approach assuming expert coders was comparable to original manual coding.
• Agreement between different models and prediction strength threshold improve accuracy.
• Top 5 predictions from Naïve Bayes model yield very good accuracy.
• Confusion matrix is useful for identifying misclassifications of rare categories. |
---|---|
ISSN: | 0022-4375 1879-1247 |
DOI: | 10.1016/j.jsr.2016.03.001 |
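
The autocode-or-review rule described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the add-one smoothing, the helper names (`NaiveBayes`, `route`, `top_k_sensitivity`), the 0.8 threshold, the choice to show coders the TW model's top k codes, and the six toy narratives with their two codes are all assumptions made for illustration and are not taken from the SOII data.

```python
import math
from collections import Counter, defaultdict


def single_words(text):
    """SW features: each word of the narrative."""
    return text.lower().split()


def two_word_sequences(text):
    """TW features: each pair of adjacent words."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed choice)."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, narratives, codes):
        for text, code in zip(narratives, codes):
            self.class_counts[code] += 1
            for feature in self.featurize(text):
                self.feature_counts[code][feature] += 1
                self.vocab.add(feature)
        return self

    def ranked(self, text):
        """Return [(code, posterior)] from most to least probable."""
        total = sum(self.class_counts.values())
        # Features never seen in training are ignored.
        features = [f for f in self.featurize(text) if f in self.vocab]
        log_scores = {}
        for code, n in self.class_counts.items():
            denom = sum(self.feature_counts[code].values()) + len(self.vocab)
            log_scores[code] = math.log(n / total) + sum(
                math.log((self.feature_counts[code][f] + 1) / denom) for f in features
            )
        # Normalise in log space so the scores become posterior probabilities.
        peak = max(log_scores.values())
        weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return sorted(((c, w / norm) for c, w in weights.items()), key=lambda cw: -cw[1])


def route(text, sw, tw, threshold=0.8, k=5):
    """Autocode when SW and TW agree and both are confident; otherwise flag for review."""
    (sw_code, sw_p), (tw_code, tw_p) = sw.ranked(text)[0], tw.ranked(text)[0]
    if sw_code == tw_code and min(sw_p, tw_p) >= threshold:
        return {"action": "autocode", "code": sw_code}
    # Assumption: the human coder is shown the TW model's top-k candidate codes.
    return {"action": "manual_review", "choices": [c for c, _ in tw.ranked(text)[:k]]}


def top_k_sensitivity(model, narratives, codes, k=5):
    """Share of cases whose true code appears among the model's top-k predictions."""
    hits = sum(
        code in [c for c, _ in model.ranked(text)[:k]]
        for text, code in zip(narratives, codes)
    )
    return hits / len(codes)


# Invented toy narratives and codes, for illustration only.
narratives = [
    "worker fell from ladder",
    "employee fell off ladder on site",
    "slipped and fell from ladder",
    "struck by falling box",
    "worker struck by forklift",
    "hit and struck by falling pipe",
]
codes = ["fall", "fall", "fall", "struck_by", "struck_by", "struck_by"]

sw = NaiveBayes(single_words).fit(narratives, codes)
tw = NaiveBayes(two_word_sequences).fit(narratives, codes)

print(route("fell from ladder", sw, tw))
print(route("worker hit", sw, tw))
```

Raising `threshold` moves more cases into manual review in exchange for higher accuracy among the autocoded ones, which mirrors the trade-off the summary reports for higher prediction strength thresholds. A confusion matrix for the review screen could be tallied from the same `ranked` output by counting (true code, predicted code) pairs.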