Clinical Text Reports to Stratify Patients Affected with Myeloid Neoplasms Using Natural Language Processing
Published in: | Blood 2023-11, Vol.142 (Supplement 1), p.122-122 |
---|---|
Format: | Article |
Language: | eng |
Online access: | Full text |
Summary: | Background: The availability of multimodal patient data, such as demographic, clinical, imaging, treatment, quality-of-life, outcome and wearables data, as well as genome sequencing, has paved the way for the development of multimodal clinical solutions that introduce personalized or precision medicine. The clinical report is an information layer that contains relevant information about the disease in addition to the patient's point of view. Natural language processing (NLP) is a branch of artificial intelligence (AI), and its pre-trained language models are the key technology for extracting value from this data layer.
Aims: This project was conducted by the GenoMed4all and Synthema EU consortia, with the aims to: 1) build an AI language model specific to the hematology domain; 2) use NLP technology to extract relevant information from clinical reports and perform unsupervised stratification of patients, in order to 3) demonstrate that the clinical report provides earlier access to data on the disease's clinical phenotype and biology, and provides important information for patient stratification and prediction of clinical outcomes.
Methods: To translate text sentences into numerical embeddings, we implemented the bidirectional encoder representations from transformers (BERT) framework. To learn text representations and correlations within the data, we performed domain adaptation by fine-tuning a pre-trained model on hematological clinical reports of patients with myeloproliferative neoplasms (MPN), myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML). Patient stratification was performed by HDBSCAN clustering on the text embeddings encoded by the domain-adapted BERT model (HematoBERT). Cluster validation was performed by assessing patients' diagnosis and survival probability. Finally, we compared domain-tuned HematoBERT with pre-trained non-contextualized models.
Results: We implemented HematoBERT based on the bert-base-multilingual-uncased version of BERT. Training data were hematological text reports of 1,328 patients. During fine-tuning, texts were tokenized and 15% of the tokens were randomly replaced with mask tokens, which the model was trained to predict. We performed stratification using clinical reports from a validation cohort of 360 patients. We identified 7 clusters, each defined by words of similar meaning that were grouped into a specific topic. We extracted the most important words and concepts for each cluster (topic) and we summarized them into effective descriptions for each |
---|---|
ISSN: | 0006-4971 1528-0020 |
DOI: | 10.1182/blood-2023-188292 |
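
The fine-tuning step described in the summary (randomly replacing 15% of the tokens with mask tokens and training the model to predict them) can be illustrated with a minimal sketch. This is not the authors' code: the function name `mask_tokens`, the `[MASK]` string, the whitespace tokenization and the example sentence are all assumptions made for illustration, and a real BERT pipeline would operate on WordPiece token IDs rather than on words.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string, for illustration only


def mask_tokens(tokens, mask_fraction=0.15, seed=None):
    """Replace a random fraction of tokens with MASK_TOKEN.

    Returns (masked, labels). labels[i] holds the original token at each
    masked position and None elsewhere, so a model can be trained to
    predict only the masked positions.
    """
    rng = random.Random(seed)
    # Mask at least one token when the input is non-empty.
    n_to_mask = max(1, round(len(tokens) * mask_fraction)) if tokens else 0
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None
              for i, tok in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    # Hypothetical report fragment, split on whitespace (an assumption).
    report = "bone marrow biopsy shows hypercellular marrow with increased blasts"
    masked, labels = mask_tokens(report.split(), seed=0)
    print(masked)
    print(labels)
```

Keeping the original tokens in a parallel `labels` list mirrors the usual masked-language-modeling setup, where the prediction loss is computed only at the masked positions.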