Natural language processing pipeline to extract prostate cancer-related information from clinical notes


Bibliographic details
Published in: European radiology 2024-12, Vol.34 (12), p.7878-7891
Main authors: Nakai, Hirotsugu, Suman, Garima, Adamo, Daniel A., Navin, Patrick J., Bookwalter, Candice A., LeGout, Jordan D., Chen, Frank K., Wellnitz, Clinton V., Silva, Alvin C., Thomas, John V., Kawashima, Akira, Fan, Jungwei W., Froemming, Adam T., Lomas, Derek J., Humphreys, Mitchell R., Dora, Chandler, Korfiatis, Panagiotis, Takahashi, Naoki
Format: Article
Language: English
Description
Summary: Objectives To develop an automated pipeline for extracting prostate cancer-related information from clinical notes. Materials and methods This retrospective study included 23,225 patients who underwent prostate MRI between 2017 and 2022. Cancer risk factors (family history of cancer and digital rectal exam findings), pre-MRI prostate pathology, and treatment history of prostate cancer were extracted from free-text clinical notes in English as binary or multi-class classification tasks. Any sentence containing pre-defined keywords was extracted from clinical notes within one year before the MRI. After manually creating sentence-level datasets with ground truth, Bidirectional Encoder Representations from Transformers (BERT)-based sentence-level models were fine-tuned using the extracted sentence as input and the category as output. The patient-level output was determined by compilation of multiple sentence-level outputs using tree-based models. Sentence-level classification performance was evaluated using the area under the receiver operating characteristic curve (AUC) on 15% of the sentence-level dataset (sentence-level test set). The patient-level classification performance was evaluated on the patient-level test set created by radiologists by reviewing the clinical notes of 603 patients. Accuracy and sensitivity were compared between the pipeline and radiologists. Results Sentence-level AUCs were ≥ 0.94. The pipeline showed higher patient-level sensitivity for extracting cancer risk factors (e.g., family history of prostate cancer, 96.5% vs. 77.9%, p
ISSN: 1432-1084, 0938-7994
DOI: 10.1007/s00330-024-10812-6
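
The first step described in the summary (pulling any sentence that contains a pre-defined keyword from notes dated within one year before the MRI) can be illustrated with a minimal sketch. This is not the authors' code: the keyword list, the `Note` type, the naive sentence splitter, the 365-day window, and the decision to include the MRI date itself are all assumptions made for illustration, since the record does not specify them.

```python
import re
from datetime import date, timedelta
from typing import List, NamedTuple, Sequence

# Hypothetical keywords; the study's actual pre-defined list is not given in the summary.
KEYWORDS = ("family history", "digital rectal exam", "biopsy", "prostatectomy")


class Note(NamedTuple):
    """Hypothetical container for one free-text clinical note."""
    note_date: date
    text: str


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # Real clinical text (abbreviations, lists) would need a more robust splitter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_candidate_sentences(
    notes: Sequence[Note],
    mri_date: date,
    keywords: Sequence[str] = KEYWORDS,
    window_days: int = 365,
) -> List[str]:
    """Return sentences containing any keyword from notes in the pre-MRI window."""
    # Assumption: "within one year before the MRI" is taken as the 365 days
    # up to and including the MRI date.
    window_start = mri_date - timedelta(days=window_days)
    patterns = [
        re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE) for k in keywords
    ]
    hits = []
    for note in notes:
        if not (window_start <= note.note_date <= mri_date):
            continue  # note falls outside the look-back window
        for sentence in split_sentences(note.text):
            if any(p.search(sentence) for p in patterns):
                hits.append(sentence)
    return hits
```

In the workflow the summary describes, sentences selected this way would then be labeled manually and passed to the fine-tuned BERT-based sentence-level classifiers; that stage is not sketched here.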