Domain specific word embeddings for natural language processing in radiology


Bibliographic details
Published in: Journal of biomedical informatics 2021-01, Vol.113, p.103665-103665, Article 103665
Main authors: Chen, Timothy L., Emerling, Max, Chaudhari, Gunvant R., Chillakuru, Yeshwant R., Seo, Youngho, Vu, Thienkhai H., Sohn, Jae Ho
Format: Article
Language: English
Online access: Full text
Description
Summary:
• Radiopaedia can be used as a domain-specific corpus in radiology NLP tasks.
• Domain-specific embeddings offer comparable performance on analogy completion.
• Domain-specific embeddings did significantly better on multi-label classification.
• The source code, embeddings, and analogy dataset are publicly released.

There has been increasing interest in machine-learning-based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora because no radiology-specific corpus was available. We examined the potential of Radiopaedia to serve as a general radiology corpus for producing radiology-specific word embeddings that could enhance performance on an NLP task on radiological text. Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using the GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar's test with continuity correction, the Benjamini-Hochberg correction, and a 5×2 cross-validation paired two-tailed t-test were used to assess statistical significance. For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p
ISSN:1532-0464
1532-0480
DOI:10.1016/j.jbi.2020.103665
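
The summary names two multi-label metrics, exact match accuracy and Hamming loss, without defining them. The sketch below shows their standard definitions in plain Python. It is not the authors' released code: the function names and the toy label matrices are hypothetical and chosen only for illustration, and labels are assumed to be encoded as 0/1 indicator vectors with one position per label.

```python
from typing import Sequence


def exact_match_accuracy(y_true: Sequence[Sequence[int]],
                         y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose entire predicted label vector equals the true one."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label positions that are predicted incorrectly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = 0
    wrong = 0
    for t, p in zip(y_true, y_pred):
        if len(t) != len(p):
            raise ValueError("each sample must have the same number of labels")
        total += len(t)
        wrong += sum(1 for a, b in zip(t, p) if a != b)
    return wrong / total


# Hypothetical toy data: 3 articles, 4 possible labels each (1 = label applies).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 0]]

print(exact_match_accuracy(y_true, y_pred))  # 1 of 3 articles fully correct
print(hamming_loss(y_true, y_pred))          # 3 of 12 label positions wrong
```

Exact match accuracy is the stricter of the two, since one wrong label makes the whole article count as an error, whereas Hamming loss gives partial credit per label. That difference is presumably why the study reports both.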