Evaluating AI Proficiency in Nuclear Cardiology: Large Language Models take on the Board Preparation Exam

Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs - GPT-4, GPT-4 Turbo, GPT-4omni (GPT-4o) (OpenAI), and Gemini (Google Inc.) - in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination. We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions. GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.9%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (p=0.007 vs. GPT-4 Turbo; p<0.001 vs. GPT-4 and Gemini). GPT-4o also excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (p<0.001, p<0.001, and p=0.001), while Gemini performed worse on image-based questions (p<0.001 for all). GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
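The study compares models with McNemar's test on paired correct/incorrect responses to the same questions. A minimal stdlib-only sketch of that test follows; the discordant counts used here are illustrative, not the study's data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) two-sided McNemar p-value for paired binary outcomes.

    b = questions model A answered correctly and model B answered incorrectly
    c = questions model A answered incorrectly and model B answered correctly
    Under H0 (equal accuracy), each discordant pair is a fair coin flip.
    """
    n = b + c
    k = min(b, c)
    # two-sided tail probability of Binomial(n, 0.5), capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Hypothetical example: 30 questions where one model was right and the
# other wrong, versus 12 in the opposite direction.
print(mcnemar_exact(30, 12))   # small p-value: accuracies likely differ
print(mcnemar_exact(10, 10))   # symmetric discordance: p = 1.0
```

Only the discordant pairs (questions the two models answer differently) inform the test; questions both models get right or both get wrong cancel out, which is why the test suits paired exam responses.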

Bibliographic details
Published in: Journal of Nuclear Cardiology, 2024-11, p.102089, Article 102089
Main authors: Builoff, Valerie; Shanbhag, Aakash; Miller, Robert JH; Dey, Damini; Liang, Joanna X.; Flood, Kathleen; Bourque, Jamieson M.; Chareonthaitawee, Panithaya; Phillips, Lawrence M.; Slomka, Piotr J.
Format: Article
Language: English
Subjects: cardiovascular imaging questions; GPT; large language models; nuclear cardiology board exam
Online access: Full text
DOI: 10.1016/j.nuclcard.2024.102089
ISSN: 1071-3581
EISSN: 1532-6551
PMID: 39617127
Publisher: Elsevier Inc, United States
Rights: Copyright © 2024. Published by Elsevier Inc.