Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes

Bibliographic Details
Published in:	Journal of the American Heart Association 2023-07, Vol.12 (13), p.e030046-e030046
Main Authors:	Guo, Yuting; Al-Garadi, Mohammed A; Book, Wendy M; Ivey, Lindsey C; Rodriguez, Fred H, 3rd; Raskind-Hood, Cheryl L; Robichaux, Chad; Sarker, Abeed
Format:	Article
Language:	English
Subjects:	congenital heart disease; Electronic Health Records; Electronics; Fontan; Humans; International Classification of Diseases; Machine Learning; Natural Language Processing; Original Research; single ventricle
Online Access:	Full text

Description
Summary:	Background: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing-based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code-based classification. Methods and Results: We included free-text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non-Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer-based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code-based classification on 20% of the held-out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79-0.83), 0.95 (95% CI, 0.92-0.97), and 0.89 (95% CI, 0.88-0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (
ISSN:	2047-9980
DOI:	10.1161/JAHA.123.030046
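
Illustrative note: the summary mentions a sliding window strategy for notes that exceed RoBERTa's length limit, and evaluation by F1 score on the positive (Fontan) class. The Python sketch below shows one way such a pipeline could be arranged. It is not the authors' implementation: the window size, the stride, the max-score aggregation rule, the 0.5 threshold, and the names `sliding_windows`, `classify_document`, `score_window`, and `f1_positive` are all assumptions made for illustration, and `score_window` stands in for any model that scores a single window.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str], size: int = 512, stride: int = 256) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most `size` tokens."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the text
        start += stride
    return windows


def classify_document(
    tokens: Sequence[str],
    score_window: Callable[[List[str]], float],
    size: int = 512,
    stride: int = 256,
    threshold: float = 0.5,
) -> bool:
    """Label a document positive if any window scores at or above `threshold`.

    Max-score aggregation is an assumption; averaging or voting are alternatives.
    """
    scores = [score_window(w) for w in sliding_windows(tokens, size, stride)]
    return max(scores) >= threshold


def f1_positive(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy usage: a keyword scorer stands in for a trained window-level model.
toy_tokens = ["x"] * 9 + ["fontan"]
toy_scorer = lambda window: 1.0 if "fontan" in window else 0.0
toy_label = classify_document(toy_tokens, toy_scorer, size=4, stride=2)
```

Overlapping windows (stride smaller than size) reduce the chance that a relevant phrase is split across a window boundary, at the cost of scoring more windows per note.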