Assessment of Artificial Intelligence Language Models and Information Retrieval Strategies for QA in Hematology
Published in: | Blood 2023-11, Vol.142 (Supplement 1), p.7175-7175 |
---|---|
Main authors: | , , , , , , , , , , , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Abstract: | Introduction
Large language models (LLMs) have gained popularity due to their natural language generation and interpretation capabilities. Integrating these models in medicine enables multiple tasks like summarizing medical histories, synthesizing literature, and suggesting diagnoses. Models like ChatGPT, GPT-4, and Med-PaLM2 (Singhal et al., 2023) have demonstrated their proficiency by achieving high scores in medical tests like the United States Medical Licensing Examination (USMLE) (Kung et al., 2023). However, LLMs may sometimes be inaccurate, providing unverified and erroneous information.
In this study, we investigate the potential uses of LLMs in hematology, assessing their knowledge through hematology questions from the USMLE. Additionally, we propose augmenting LLMs with retrieval capabilities for medical guidelines in order to eliminate incorrect information. By extracting relevant information from specified medical documents, this approach holds the potential to streamline decision-making processes.
Methods
For comparative purposes, all experiments were conducted using both the GPT-3.5-turbo and GPT-4 models.
In a first step, we evaluated the general knowledge and performance of the LLMs in the field of hematology by testing them on a collected dataset of 127 question-answer pairs from the hematology section of the USMLE, covering various aspects of the field.
In a second step, we evaluated the proposed information retrieval framework using a set of 120 multiple-choice questions focused specifically on the 4th revision of the World Health Organization classification of myeloid neoplasms and acute leukemia (hereafter WHO 2017). By testing the framework on this domain-specific dataset, we aimed to assess its ability to extract specific clinical context and relevant information from complex clinical guidelines. Each question from the WHO 2017 guideline dataset was evaluated using two techniques. First, the questions were assessed using a zero-shot approach (the question and its answer options are posed directly to the model) to assess the LLM's capability to respond based on its own knowledge. Second, we employed our proposed information retrieval approach, enabling the system to search the external documents (the WHO 2017 guideline) for extracts relevant and similar to each question. Subsequently, the system provided answers based on |
---|---|
ISSN: | 0006-4971 1528-0020 |
DOI: | 10.1182/blood-2023-178528 |
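
The two prompting techniques described in the Methods (zero-shot versus retrieval over guideline extracts) can be illustrated with a minimal sketch. The abstract does not state how extracts were ranked or how prompts were worded, so everything here is an assumption: the bag-of-words cosine similarity stands in for whatever similarity search the authors used, the function names (`retrieve`, `zero_shot_prompt`, `retrieval_prompt`) are hypothetical, and the passages and question are placeholders rather than text from the WHO 2017 guideline. Sending the prompt to GPT-3.5-turbo or GPT-4 is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question (assumed ranking)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def zero_shot_prompt(question, options):
    """Pose the question and its options directly, with no extra context."""
    lines = [question] + [f"{label}. {text}" for label, text in options.items()]
    return "\n".join(lines) + "\nAnswer with a single letter."


def retrieval_prompt(question, options, passages, k=2):
    """Prepend the top-k retrieved extracts to the zero-shot prompt."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    header = "Use only the guideline extracts below.\n"
    return header + context + "\n\n" + zero_shot_prompt(question, options)


# Placeholder data for illustration only; not taken from any guideline.
passages = [
    "Placeholder extract: diagnostic criteria for entity alpha require marker one.",
    "Placeholder extract: entity beta is defined by marker two.",
    "Placeholder extract: general remarks on classification structure.",
]
question = "Which marker defines entity beta?"
options = {"A": "marker one", "B": "marker two"}

print(zero_shot_prompt(question, options))
print(retrieval_prompt(question, options, passages, k=1))
```

In a setup like this, the only difference between the two conditions is the context block prepended to the prompt, which keeps any accuracy comparison between zero-shot and retrieval attributable to the retrieved extracts.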