Domain specific word embeddings for natural language processing in radiology
Published in: | Journal of biomedical informatics 2021-01, Vol.113, p.103665-103665, Article 103665 |
---|---|
Main authors: | , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: |
•Radiopaedia can be used as a domain-specific corpus in radiology NLP tasks.
•Domain specific embeddings offer comparable performance on analogy completion.
•Domain specific embeddings did significantly better on multi-label classification.
•The source code, embeddings, and analogy dataset are publicly released.
There has been increasing interest in machine learning based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora due to lack of a radiology-specific corpus.
We examined the potential of Radiopaedia to serve as a general radiology corpus for producing radiology-specific word embeddings that could enhance performance on an NLP task on radiological text.
Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using the GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar’s test with continuity correction and the Benjamini-Hochberg correction, together with a 5×2 cross-validation paired two-tailed t-test, were used to assess statistical significance.
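The evaluation measures named above can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the toy label vectors, and the disagreement counts are all hypothetical, and the p-value uses the standard identity that a chi-square variable with one degree of freedom has survival function erfc(sqrt(x/2)).

```python
import math

def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))

def mcnemar_continuity(b, c):
    """McNemar's test with continuity correction.

    b: samples model A labels correctly and model B does not.
    c: samples model B labels correctly and model A does not.
    Returns (chi-square statistic, two-sided p-value).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, nothing to test
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival, 1 df
    return stat, p_value

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy data: 4 articles, 3 possible labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]
accuracy = exact_match_accuracy(y_true, y_pred)
loss = hamming_loss(y_true, y_pred)
stat, p_value = mcnemar_continuity(b=10, c=2)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

Exact match accuracy is the stricter of the two measures, since a single wrong label makes the whole article count as a miss, whereas Hamming loss gives partial credit per label. That is one reason to report both for a multi-label task.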
For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p |
---|---|
ISSN: | 1532-0464 1532-0480 |
DOI: | 10.1016/j.jbi.2020.103665 |