Evaluating AI Proficiency in Nuclear Cardiology: Large Language Models take on the Board Preparation Exam

Bibliographic Details
Published in:	Journal of nuclear cardiology 2024-11, p.102089, Article 102089
Main Authors:	Builoff, Valerie, Shanbhag, Aakash, Miller, Robert JH, Dey, Damini, Liang, Joanna X., Flood, Kathleen, Bourque, Jamieson M., Chareonthaitawee, Panithaya, Phillips, Lawrence M., Slomka, Piotr J.
Format:	Article
Language:	English
Subjects:	cardiovascular imaging questions; GPT; large language models; Nuclear cardiology board exam
Online Access:	Full text

Description
Summary:	Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs - GPT-4, GPT-4 Turbo, GPT-4omni (GPT-4o) (OpenAI), and Gemini (Google Inc.) - in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination. We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar’s test compared correct response proportions. GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4% - 58.0%), 40.5% (39.9% - 42.9%), 60.7% (59.9% - 61.3%) and 63.1% (62.5% - 64.3%) of questions, respectively. GPT-4o significantly outperformed other models (p=0.007 vs. GPT-4 Turbo, p
ISSN:	1071-3581, 1532-6551
DOI:	10.1016/j.nuclcard.2024.102089
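
Note on the statistical comparison: the summary states that McNemar's test was used to compare correct response proportions between models. The sketch below illustrates the general idea with the exact (binomial) form of the test on paired per-question outcomes. It is an illustration only: the record does not say which variant of the test the authors used or how the 30 repeated runs were combined, and the function name `mcnemar_exact` and the ten example outcomes are hypothetical, not study data.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    a_correct, b_correct: equal-length sequences of truthy/falsy values,
    one entry per question, for two models answering the same questions.
    Returns (b, c, p_value), where b and c are the discordant-pair counts.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("paired sequences must have equal length")

    # Only discordant pairs (exactly one model correct) carry information.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference

    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes for ten questions (1 = correct, 0 = incorrect).
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(mcnemar_exact(model_a, model_b))
```

The test ignores questions that both models answered correctly or both answered incorrectly, which is why it suits paired designs such as several models answering the same question set.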