Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Bibliographic details
Published in:	Journal of medical Internet research 2024-01, Vol.26 (4), p.e52113
Main authors:	Herrmann-Werner, Anne, Festl-Wietek, Teresa, Holderried, Friederike, Herschbach, Lea, Griewatz, Jan, Masters, Ken, Zipfel, Stephan, Mahling, Moritz
Format:	Article
Language:	English
Subjects:	Answers; Anxiety disorders; Application programming interface; Blooms taxonomy; Chatbots; Classification; Cognition & reasoning; Cognitive ability; Data analysis; Education, Medical; Educational objectives; Hallucinations; Health care reform; Heart attacks; Humans; Language; Learning; Medical education; Medical schools; Medical students; Medicine; Methods; Multiple choice; Original Paper; Post traumatic stress disorder; Psychosomatic Medicine; Psychotherapy; Qualitative research; Research Design; Tests
Online access:	Full text

Description
Abstract:	Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P
ISSN:	1438-8871, 1439-4456
DOI:	10.2196/52113