Defining a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts

Detailed description

Bibliographic details
Published in: Research on Biomedical Engineering 2020-09, Vol.36 (3), p.267-276
Main authors: de Oliveira, Lucas Ferro Antunes; e Oliveira, Lucas Emanuel Silva; Gumiel, Yohan Bonescki; Carvalho, Deborah Ribeiro; Moro, Claudia Maria Cabral
Format: Article
Language: English
Online access: Full text
Description
Summary: Purpose: Natural language processing (NLP) techniques are essential for unlocking patients' data from electronic health records. An important NLP task is recognizing morphosyntactic information in texts, a process called part-of-speech (POS) tagging. Neural network architectures are currently the state-of-the-art method, but few studies have applied this approach to Brazilian Portuguese clinical texts. The objective of this study is to define a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts. Methods: We reviewed multiple neural network-based POS-tagging algorithms and selected the Flair tool for its exceptional performance in the journalistic domain, because no algorithm exists specifically for Portuguese clinical texts. We ran a normalization process on available corpora from multiple domains (two journalistic, one biomedical, one clinical, and a new corpus combining all of these). Flair was trained on each corpus, yielding five models, which were evaluated on all domains. Results: The clinical model achieved 92.39% accuracy (previous clinical POS-tagging work reached 91.5%); the biomedical model achieved 97.9% accuracy. All models were assessed on their own test sets. Conclusion: We developed a new state-of-the-art modeling environment for POS tagging of Brazilian Portuguese clinical texts and achieved results comparable to other state-of-the-art studies in journalistic contexts.
ISSN:2446-4732
2446-4740
DOI:10.1007/s42600-020-00067-7
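
Note on the accuracy figures: this record does not define how the 92.39% and 97.9% values in the summary were computed. For POS tagging, accuracy is usually the share of tokens whose predicted tag matches the gold-standard tag. The Python sketch below illustrates that measure under this assumption. The function name token_accuracy and the toy tag sequences are hypothetical; they are not taken from the article or from the Flair library.

```python
from typing import Sequence


def token_accuracy(
    gold: Sequence[Sequence[str]],
    pred: Sequence[Sequence[str]],
) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Assumption: accuracy is counted per token over the whole test set,
    not averaged per sentence.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")

    correct = 0
    total = 0
    for gold_sentence, pred_sentence in zip(gold, pred):
        if len(gold_sentence) != len(pred_sentence):
            raise ValueError("each sentence pair must have the same number of tokens")
        for gold_tag, pred_tag in zip(gold_sentence, pred_sentence):
            correct += int(gold_tag == pred_tag)
            total += 1

    if total == 0:
        raise ValueError("cannot compute accuracy on an empty test set")
    return correct / total


# Hypothetical toy data: two sentences, five tokens, one tagging error.
gold_tags = [["DET", "NOUN", "VERB"], ["NOUN", "ADJ"]]
pred_tags = [["DET", "NOUN", "NOUN"], ["NOUN", "ADJ"]]

accuracy = token_accuracy(gold_tags, pred_tags)
print(f"{accuracy:.2%}")  # 4 of 5 tokens correct -> 80.00%
```

Applying such a function to each model on each domain's test set would give the kind of cross-domain comparison the summary describes, although the authors' actual evaluation code may differ.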