Comparison of natural language processing algorithms in assessing the importance of head computed tomography reports written in Japanese

Bibliographic Details
Published in: Japanese Journal of Radiology 2024-07, Vol. 42 (7), p. 697-708
Authors: Wataya, Tomohiro; Miura, Azusa; Sakisuka, Takahisa; Fujiwara, Masahiro; Tanaka, Hisashi; Hiraoka, Yu; Sato, Junya; Tomiyama, Miyuki; Nishigaki, Daiki; Kita, Kosuke; Suzuki, Yuki; Kido, Shoji; Tomiyama, Noriyuki
Format: Article
Language: English
Online access: Full text
Description
Abstract:
Purpose: To propose a five-point scale for radiology report importance, called the Report Importance Category (RIC), and to compare the performance of natural language processing (NLP) algorithms in assessing RIC using head computed tomography (CT) reports written in Japanese.
Materials and methods: 3728 Japanese reports of head CT examinations performed at Osaka University Hospital in 2020 were included. RIC (category 0: no findings, category 1: minor findings, category 2: routine follow-up, category 3: careful follow-up, and category 4: examination or therapy) was established based not only on patient severity but also on the novelty of the information. Manual RIC assessment of the reports was performed by consensus of two of four neuroradiologists. The performance of four NLP models for classifying RIC was compared using fivefold cross-validation: logistic regression, bidirectional long short-term memory (BiLSTM), general Bidirectional Encoder Representations from Transformers (general BERT), and domain-specific BERT (BERT for the medical domain).
Results: The proportions of the RICs in the whole data set were 15.0%, 26.7%, 44.2%, 7.7%, and 6.4%, respectively. Domain-specific BERT showed the highest accuracy (0.8434 ± 0.0063) in assessing RIC and significantly higher AUC in categories 1 (0.9813 ± 0.0011), 2 (0.9492 ± 0.0045), 3 (0.9637 ± 0.0050), and 4 (0.9548 ± 0.0074) than the other models (p …
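The evaluation protocol described in the abstract (stratified fivefold cross-validation of a five-class classifier, scored by accuracy and per-category one-vs-rest AUC) can be illustrated with a short Python sketch. This is not the authors' code: the logistic-regression baseline stands in for all four compared models, scikit-learn is an assumed toolkit, and a tiny synthetic corpus replaces the non-public Osaka University Hospital reports; the BiLSTM and BERT variants would be trained and scored on the same folds.

```python
# Minimal sketch (not the authors' implementation): stratified fivefold
# cross-validation of a 5-class RIC classifier, reporting accuracy and
# per-category one-vs-rest AUC. The corpus below is a synthetic stand-in.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Placeholder reports and RIC labels (0-4); the real inputs would be the
# Japanese head CT reports and the neuroradiologists' consensus labels.
rng = np.random.default_rng(0)
vocab = ["出血", "梗塞", "脳萎縮", "異常なし", "腫瘤", "経過観察"]
reports = [" ".join(rng.choice(vocab, size=10)) for _ in range(500)]
labels = rng.integers(0, 5, size=500)

model = make_pipeline(
    TfidfVectorizer(),                   # bag-of-words baseline features
    LogisticRegression(max_iter=1000),   # multinomial 5-class classifier
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accs, aucs = [], []
for train_idx, test_idx in cv.split(reports, labels):
    X_train = [reports[i] for i in train_idx]
    X_test = [reports[i] for i in test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]

    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)  # columns follow classifier.classes_
    pred = model.predict(X_test)

    accs.append(accuracy_score(y_test, pred))
    classes = model.named_steps["logisticregression"].classes_
    aucs.append([roc_auc_score(y_test == c, proba[:, j])
                 for j, c in enumerate(classes)])

# Mean ± SD over the five folds, mirroring how the abstract reports results.
print(f"accuracy: {np.mean(accs):.4f} ± {np.std(accs):.4f}")
for c, col in zip(classes, np.array(aucs).T):
    print(f"category {c} AUC: {col.mean():.4f} ± {col.std():.4f}")
```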
ISSN: 1867-1071, 1867-108X
DOI: 10.1007/s11604-024-01549-9