Detecting automatically the layout of clinical documents to enhance the performances of downstream natural language processing
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Objective: Develop and validate an algorithm for analyzing the layout of PDF
clinical documents to improve the performance of downstream natural language
processing tasks. Materials and Methods: We designed an algorithm to process
clinical PDF documents and extract only clinically relevant text. The algorithm
consists of several steps: initial text extraction using a PDF parser, followed
by classification into categories such as body text, left notes, and footers
using a Transformer deep neural network architecture, and finally an
aggregation step to compile the lines of a given label in the text. We
evaluated the technical performance of the body text extraction algorithm by
applying it to a random sample of documents that were annotated. Medical
performance was evaluated by examining the extraction of medical concepts of
interest from the text in their respective sections. Finally, we tested an
end-to-end system on a medical use case of automatic detection of acute
infection described in the hospital report. Results: Our algorithm achieved
per-line precision, recall, and F1 score of 98.4, 97.0, and 97.7, respectively,
for body line extraction. When the output of the advanced body extraction
algorithm was used, the per-document precision, recall, and F1 score of the
acute infection detection algorithm were 82.54 (95% CI 72.86-91.60), 85.24
(95% CI 76.61-93.70), and 83.87 (95% CI 76.92-90.08), respectively.
Conclusion: We have
developed and validated a system for extracting body text from clinical
documents in PDF format by identifying their layout. We demonstrated that this
preprocessing improves performance on a common downstream task, i.e., the
extraction of medical concepts in their respective sections, which supports the
value of this method for a clinical use case. |
---|---|
DOI: | 10.48550/arxiv.2305.13817 |
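
The summary describes an aggregation step that compiles the lines of a given label, and a per-line evaluation of body line extraction. The sketch below illustrates both ideas on toy data. It is not the authors' implementation: the function names (`aggregate_lines`, `per_line_scores`), the label strings, and the example lines are all hypothetical, and the classifier that would produce the predicted labels is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical label name; the summary mentions categories such as
# body text, left notes, and footers, but not their exact identifiers.
BODY = "body"


def aggregate_lines(lines: List[Tuple[str, str]], label: str = BODY) -> str:
    """Join, in reading order, the text of every line predicted as `label`.

    `lines` is a list of (text, predicted_label) pairs, assumed to be
    already sorted in reading order by the PDF parsing step.
    """
    return "\n".join(text for text, predicted in lines if predicted == label)


def per_line_scores(
    gold: List[str], predicted: List[str], label: str = BODY
) -> Dict[str, float]:
    """Per-line precision, recall, and F1 for a single label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = f1_score(precision, recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy document: invented lines with invented predicted labels.
toy_lines = [
    ("Hospital X - Cardiology", "header"),
    ("Patient admitted with fever.", BODY),
    ("Dr. A, Dr. B", BODY),  # a left note misclassified as body text
    ("Blood cultures were positive.", BODY),
    ("Page 1/2", "footer"),
]
toy_gold = ["header", BODY, "left_note", BODY, "footer"]
toy_pred = [label for _, label in toy_lines]

body_text = aggregate_lines(toy_lines)
scores = per_line_scores(toy_gold, toy_pred)
```

As a consistency check on the reported figures, `f1_score(0.984, 0.970)` rounds to 0.977, which matches the per-line F1 of 97.7 stated in the summary.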