Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases

Purpose The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performances...

Bibliographic details
Published in: Japanese journal of radiology 2024-12, Vol.42 (12), p.1399-1402
Main authors: Kurokawa, Ryo; Ohizumi, Yuji; Kanzawa, Jun; Kurokawa, Mariko; Sonoda, Yuki; Nakamura, Yuta; Kiguchi, Takao; Gonoi, Wataru; Abe, Osamu
Format: Article
Language: eng
Online access: Full text
container_end_page 1402
container_issue 12
container_start_page 1399
container_title Japanese journal of radiology
container_volume 42
creator Kurokawa, Ryo
Ohizumi, Yuji
Kanzawa, Jun
Kurokawa, Mariko
Sonoda, Yuki
Nakamura, Yuta
Kiguchi, Takao
Gonoi, Wataru
Abe, Osamu
description Purpose The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performances in response to the Radiology’s Diagnosis Please quiz questions. Materials and methods In this study, the AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology’s “Diagnosis Please” cases, which included cases 1 to 322, published from 1998 to 2023. The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone. (2) Condition 2: submitter-provided clinical history and imaging findings (text). (3) Condition 3: clinical history (text) and key images (PNG file). We applied McNemar’s test to evaluate differences in the correct response rates for the overall accuracy under Conditions 1, 2, and 3 for each model and between the models. Results The correct diagnosis rates were 58/322 (18.0%) and 69/322 (21.4%), 201/322 (62.4%) and 209/322 (64.9%), and 80/322 (24.8%) and 97/322 (30.1%) for Conditions 1, 2, and 3 for Claude 3 Opus and Claude 3.5 Sonnet, respectively. The models provided the correct answer as a differential diagnosis in up to 26/322 (8.1%) for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences were observed in the correct response rates among all combinations of Conditions 1, 2, and 3 for each model (p < 0.01). Claude 3.5 Sonnet outperformed in all conditions, but a statistically significant difference was observed only in the comparison for Condition 3 (30.1% vs. 24.8%, p = 0.028). Conclusion Two AI models demonstrated a significantly improved diagnostic performance when inputting both key images and clinical history. The models’ ability to identify important differential diagnoses under these conditions was also confirmed.
doi_str_mv 10.1007/s11604-024-01634-z
format Article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_11588754</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3132713868</sourcerecordid><originalsourceid>FETCH-LOGICAL-c422t-242d206d86bebbbf550171ebb569023b1a66258507f7db464f03769356fc6b883</originalsourceid><addsrcrecordid>eNp9Uctu1DAUtRCIPuAHWCBLbNik9Su2s0JoeEqVigpI7CwnsVOXxA52gjRdzSd0iwQ_N19StzMMhQUL6175nnvOuToAPMHoCCMkjhPGHLECkfwwp6y4vAf2seSiwEh-ub_rBd4DByldIMQZZewh2KMVqjiTdB9cvXK68yFNroGjiTbEQfvGJBgsXPR6bg2k8HScE9S-3f0clfBj8N5M0MYwwFFPzvgJnrs0hbi8hX41S-gG3WUq5-GZbl3oQ7dcr34kuF793Mq6BD_0RiezXv2CTa7pEXhgdZ_M4209BJ_fvP60eFecnL59v3h5UjSMkKkgjLQE8Vby2tR1bcsSYYFzW_IKEVpjzTkpZYmEFW3NOLOICl7RktuG11LSQ_BiwzvO9WDaJvuPuldjzKbjUgXt1N8T785VF74rjEspRckyw_MtQwzfZpMmNbjUmL7X3oQ5KYpkVpSyIhn67B_oRZijz_cpiikRmEp-Y4lsUE0MKUVjd24wUjeJq03iKieubhNXl3np6d07diu_I84AugGkPPKdiX-0_0N7DfwbuvE</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3132713868</pqid></control><display><type>article</type><title>Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases</title><source>MEDLINE</source><source>SpringerLink Journals - AutoHoldings</source><creator>Kurokawa, Ryo ; Ohizumi, Yuji ; Kanzawa, Jun ; Kurokawa, Mariko ; Sonoda, Yuki ; Nakamura, Yuta ; Kiguchi, Takao ; Gonoi, Wataru ; Abe, Osamu</creator><creatorcontrib>Kurokawa, Ryo ; Ohizumi, Yuji ; Kanzawa, Jun ; Kurokawa, Mariko ; Sonoda, Yuki ; Nakamura, Yuta ; Kiguchi, Takao ; Gonoi, Wataru ; Abe, Osamu</creatorcontrib><description>Purpose The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. 
We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performances in response to the Radiology’s Diagnosis Please quiz questions. Materials and methods In this study, the AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology’s “Diagnosis Please” cases, which included cases 1 to 322, published from 1998 to 2023. The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone. (2) Condition 2: submitter-provided clinical history and imaging findings (text). (3) Condition 3: clinical history (text) and key images (PNG file). We applied McNemar’s test to evaluate differences in the correct response rates for the overall accuracy under Conditions 1, 2, and 3 for each model and between the models. Results The correct diagnosis rates were 58/322 (18.0%) and 69/322 (21.4%), 201/322 (62.4%) and 209/322 (64.9%), and 80/322 (24.8%) and 97/322 (30.1%) for Conditions 1, 2, and 3 for Claude 3 Opus and Claude 3.5 Sonnet, respectively. The models provided the correct answer as a differential diagnosis in up to 26/322 (8.1%) for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences were observed in the correct response rates among all combinations of Conditions 1, 2, and 3 for each model ( p  &lt; 0.01). Claude 3.5 Sonnet outperformed in all conditions, but a statistically significant difference was observed only in the comparison for Condition 3 (30.1% vs. 24.8%, p  = 0.028). Conclusion Two AI models demonstrated a significantly improved diagnostic performance when inputting both key images and clinical history. 
The models’ ability to identify important differential diagnoses under these conditions was also confirmed.</description><identifier>ISSN: 1867-1071</identifier><identifier>ISSN: 1867-108X</identifier><identifier>EISSN: 1867-108X</identifier><identifier>DOI: 10.1007/s11604-024-01634-z</identifier><identifier>PMID: 39096483</identifier><language>eng</language><publisher>Singapore: Springer Nature Singapore</publisher><subject>Artificial Intelligence ; Diagnosis ; Diagnosis, Differential ; Diagnostic systems ; Differential diagnosis ; Humans ; Imaging ; Medical History Taking - methods ; Medical imaging ; Medicine ; Medicine &amp; Public Health ; Nuclear Medicine ; Original ; Original Article ; Questions ; Radiology ; Radiotherapy ; Response rates ; Statistical analysis ; Ultrasonography - methods</subject><ispartof>Japanese journal of radiology, 2024-12, Vol.42 (12), p.1399-1402</ispartof><rights>The Author(s) 2024</rights><rights>2024. The Author(s).</rights><rights>The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>The Author(s) 2024 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c422t-242d206d86bebbbf550171ebb569023b1a66258507f7db464f03769356fc6b883</cites><orcidid>0000-0002-5018-4683</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s11604-024-01634-z$$EPDF$$P50$$Gspringer$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s11604-024-01634-z$$EHTML$$P50$$Gspringer$$Hfree_for_read</linktohtml><link.rule.ids>230,314,780,784,885,27924,27925,41488,42557,51319</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/39096483$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Kurokawa, Ryo</creatorcontrib><creatorcontrib>Ohizumi, Yuji</creatorcontrib><creatorcontrib>Kanzawa, Jun</creatorcontrib><creatorcontrib>Kurokawa, Mariko</creatorcontrib><creatorcontrib>Sonoda, Yuki</creatorcontrib><creatorcontrib>Nakamura, Yuta</creatorcontrib><creatorcontrib>Kiguchi, Takao</creatorcontrib><creatorcontrib>Gonoi, Wataru</creatorcontrib><creatorcontrib>Abe, Osamu</creatorcontrib><title>Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases</title><title>Japanese journal of radiology</title><addtitle>Jpn J Radiol</addtitle><addtitle>Jpn J Radiol</addtitle><description>Purpose The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. 
We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performances in response to the Radiology’s Diagnosis Please quiz questions. Materials and methods In this study, the AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology’s “Diagnosis Please” cases, which included cases 1 to 322, published from 1998 to 2023. The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone. (2) Condition 2: submitter-provided clinical history and imaging findings (text). (3) Condition 3: clinical history (text) and key images (PNG file). We applied McNemar’s test to evaluate differences in the correct response rates for the overall accuracy under Conditions 1, 2, and 3 for each model and between the models. Results The correct diagnosis rates were 58/322 (18.0%) and 69/322 (21.4%), 201/322 (62.4%) and 209/322 (64.9%), and 80/322 (24.8%) and 97/322 (30.1%) for Conditions 1, 2, and 3 for Claude 3 Opus and Claude 3.5 Sonnet, respectively. The models provided the correct answer as a differential diagnosis in up to 26/322 (8.1%) for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences were observed in the correct response rates among all combinations of Conditions 1, 2, and 3 for each model ( p  &lt; 0.01). Claude 3.5 Sonnet outperformed in all conditions, but a statistically significant difference was observed only in the comparison for Condition 3 (30.1% vs. 24.8%, p  = 0.028). Conclusion Two AI models demonstrated a significantly improved diagnostic performance when inputting both key images and clinical history. 
The models’ ability to identify important differential diagnoses under these conditions was also confirmed.</description><subject>Artificial Intelligence</subject><subject>Diagnosis</subject><subject>Diagnosis, Differential</subject><subject>Diagnostic systems</subject><subject>Differential diagnosis</subject><subject>Humans</subject><subject>Imaging</subject><subject>Medical History Taking - methods</subject><subject>Medical imaging</subject><subject>Medicine</subject><subject>Medicine &amp; Public Health</subject><subject>Nuclear Medicine</subject><subject>Original</subject><subject>Original Article</subject><subject>Questions</subject><subject>Radiology</subject><subject>Radiotherapy</subject><subject>Response rates</subject><subject>Statistical analysis</subject><subject>Ultrasonography - methods</subject><issn>1867-1071</issn><issn>1867-108X</issn><issn>1867-108X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>C6C</sourceid><sourceid>EIF</sourceid><recordid>eNp9Uctu1DAUtRCIPuAHWCBLbNik9Su2s0JoeEqVigpI7CwnsVOXxA52gjRdzSd0iwQ_N19StzMMhQUL6175nnvOuToAPMHoCCMkjhPGHLECkfwwp6y4vAf2seSiwEh-ub_rBd4DByldIMQZZewh2KMVqjiTdB9cvXK68yFNroGjiTbEQfvGJBgsXPR6bg2k8HScE9S-3f0clfBj8N5M0MYwwFFPzvgJnrs0hbi8hX41S-gG3WUq5-GZbl3oQ7dcr34kuF793Mq6BD_0RiezXv2CTa7pEXhgdZ_M4209BJ_fvP60eFecnL59v3h5UjSMkKkgjLQE8Vby2tR1bcsSYYFzW_IKEVpjzTkpZYmEFW3NOLOICl7RktuG11LSQ_BiwzvO9WDaJvuPuldjzKbjUgXt1N8T785VF74rjEspRckyw_MtQwzfZpMmNbjUmL7X3oQ5KYpkVpSyIhn67B_oRZijz_cpiikRmEp-Y4lsUE0MKUVjd24wUjeJq03iKieubhNXl3np6d07diu_I84AugGkPPKdiX-0_0N7DfwbuvE</recordid><startdate>20241201</startdate><enddate>20241201</enddate><creator>Kurokawa, Ryo</creator><creator>Ohizumi, Yuji</creator><creator>Kanzawa, Jun</creator><creator>Kurokawa, Mariko</creator><creator>Sonoda, Yuki</creator><creator>Nakamura, Yuta</creator><creator>Kiguchi, Takao</creator><creator>Gonoi, Wataru</creator><creator>Abe, Osamu</creator><general>Springer Nature 
Singapore</general><general>Springer Nature B.V</general><scope>C6C</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QO</scope><scope>7TK</scope><scope>7U7</scope><scope>8FD</scope><scope>C1K</scope><scope>FR3</scope><scope>K9.</scope><scope>NAPCQ</scope><scope>P64</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-5018-4683</orcidid></search><sort><creationdate>20241201</creationdate><title>Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases</title><author>Kurokawa, Ryo ; Ohizumi, Yuji ; Kanzawa, Jun ; Kurokawa, Mariko ; Sonoda, Yuki ; Nakamura, Yuta ; Kiguchi, Takao ; Gonoi, Wataru ; Abe, Osamu</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c422t-242d206d86bebbbf550171ebb569023b1a66258507f7db464f03769356fc6b883</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Artificial Intelligence</topic><topic>Diagnosis</topic><topic>Diagnosis, Differential</topic><topic>Diagnostic systems</topic><topic>Differential diagnosis</topic><topic>Humans</topic><topic>Imaging</topic><topic>Medical History Taking - methods</topic><topic>Medical imaging</topic><topic>Medicine</topic><topic>Medicine &amp; Public Health</topic><topic>Nuclear Medicine</topic><topic>Original</topic><topic>Original Article</topic><topic>Questions</topic><topic>Radiology</topic><topic>Radiotherapy</topic><topic>Response rates</topic><topic>Statistical analysis</topic><topic>Ultrasonography - methods</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kurokawa, Ryo</creatorcontrib><creatorcontrib>Ohizumi, Yuji</creatorcontrib><creatorcontrib>Kanzawa, Jun</creatorcontrib><creatorcontrib>Kurokawa, 
Mariko</creatorcontrib><creatorcontrib>Sonoda, Yuki</creatorcontrib><creatorcontrib>Nakamura, Yuta</creatorcontrib><creatorcontrib>Kiguchi, Takao</creatorcontrib><creatorcontrib>Gonoi, Wataru</creatorcontrib><creatorcontrib>Abe, Osamu</creatorcontrib><collection>Springer Nature OA Free Journals</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Biotechnology Research Abstracts</collection><collection>Neurosciences Abstracts</collection><collection>Toxicology Abstracts</collection><collection>Technology Research Database</collection><collection>Environmental Sciences and Pollution Management</collection><collection>Engineering Research Database</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Nursing &amp; Allied Health Premium</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Japanese journal of radiology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kurokawa, Ryo</au><au>Ohizumi, Yuji</au><au>Kanzawa, Jun</au><au>Kurokawa, Mariko</au><au>Sonoda, Yuki</au><au>Nakamura, Yuta</au><au>Kiguchi, Takao</au><au>Gonoi, Wataru</au><au>Abe, Osamu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases</atitle><jtitle>Japanese journal of radiology</jtitle><stitle>Jpn J Radiol</stitle><addtitle>Jpn J 
Radiol</addtitle><date>2024-12-01</date><risdate>2024</risdate><volume>42</volume><issue>12</issue><spage>1399</spage><epage>1402</epage><pages>1399-1402</pages><issn>1867-1071</issn><issn>1867-108X</issn><eissn>1867-108X</eissn><abstract>Purpose The diagnostic performance of large language artificial intelligence (AI) models when utilizing radiological images has yet to be investigated. We employed Claude 3 Opus (released on March 4, 2024) and Claude 3.5 Sonnet (released on June 21, 2024) to investigate their diagnostic performances in response to the Radiology’s Diagnosis Please quiz questions. Materials and methods In this study, the AI models were tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology’s “Diagnosis Please” cases, which included cases 1 to 322, published from 1998 to 2023. The analyses were performed under the following conditions: (1) Condition 1: submitter-provided clinical history (text) alone. (2) Condition 2: submitter-provided clinical history and imaging findings (text). (3) Condition 3: clinical history (text) and key images (PNG file). We applied McNemar’s test to evaluate differences in the correct response rates for the overall accuracy under Conditions 1, 2, and 3 for each model and between the models. Results The correct diagnosis rates were 58/322 (18.0%) and 69/322 (21.4%), 201/322 (62.4%) and 209/322 (64.9%), and 80/322 (24.8%) and 97/322 (30.1%) for Conditions 1, 2, and 3 for Claude 3 Opus and Claude 3.5 Sonnet, respectively. The models provided the correct answer as a differential diagnosis in up to 26/322 (8.1%) for Opus and 23/322 (7.1%) for Sonnet. Statistically significant differences were observed in the correct response rates among all combinations of Conditions 1, 2, and 3 for each model ( p  &lt; 0.01). Claude 3.5 Sonnet outperformed in all conditions, but a statistically significant difference was observed only in the comparison for Condition 3 (30.1% vs. 
24.8%, p  = 0.028). Conclusion Two AI models demonstrated a significantly improved diagnostic performance when inputting both key images and clinical history. The models’ ability to identify important differential diagnoses under these conditions was also confirmed.</abstract><cop>Singapore</cop><pub>Springer Nature Singapore</pub><pmid>39096483</pmid><doi>10.1007/s11604-024-01634-z</doi><tpages>4</tpages><orcidid>https://orcid.org/0000-0002-5018-4683</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1867-1071
ispartof Japanese journal of radiology, 2024-12, Vol.42 (12), p.1399-1402
issn 1867-1071
1867-108X
1867-108X
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_11588754
source MEDLINE; SpringerLink Journals - AutoHoldings
subjects Artificial Intelligence
Diagnosis
Diagnosis, Differential
Diagnostic systems
Differential diagnosis
Humans
Imaging
Medical History Taking - methods
Medical imaging
Medicine
Medicine & Public Health
Nuclear Medicine
Original
Original Article
Questions
Radiology
Radiotherapy
Response rates
Statistical analysis
Ultrasonography - methods
title Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T01%3A40%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Diagnostic%20performances%20of%20Claude%203%20Opus%20and%20Claude%203.5%20Sonnet%20from%20patient%20history%20and%20key%20images%20in%20Radiology%E2%80%99s%20%E2%80%9CDiagnosis%20Please%E2%80%9D%20cases&rft.jtitle=Japanese%20journal%20of%20radiology&rft.au=Kurokawa,%20Ryo&rft.date=2024-12-01&rft.volume=42&rft.issue=12&rft.spage=1399&rft.epage=1402&rft.pages=1399-1402&rft.issn=1867-1071&rft.eissn=1867-108X&rft_id=info:doi/10.1007/s11604-024-01634-z&rft_dat=%3Cproquest_pubme%3E3132713868%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3132713868&rft_id=info:pmid/39096483&rfr_iscdi=true
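
Note on the reported figures: the percentages in the description field can be re-derived from the stated counts out of 322 cases. The sketch below does that and adds a small exact McNemar helper for paired comparisons. The helper is an illustration only: the abstract does not say which variant of McNemar's test the authors used, and it does not report the discordant-pair counts, so the function names (rate, mcnemar_exact) and any inputs passed to mcnemar_exact are hypothetical and not taken from the article.

```python
from math import comb

TOTAL_CASES = 322  # number of quiz cases stated in the abstract

# Correct primary diagnoses per condition as (Claude 3 Opus, Claude 3.5 Sonnet),
# copied from the Results sentence of the abstract.
CORRECT = {
    "Condition 1": (58, 69),
    "Condition 2": (201, 209),
    "Condition 3": (80, 97),
}


def rate(correct: int, total: int = TOTAL_CASES) -> float:
    """Correct-response rate as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact (binomial) McNemar p-value.

    b and c are the two discordant-pair counts (model A right / model B wrong,
    and the reverse). Assumption: the exact variant is shown for simplicity;
    the article may have used a different variant.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of k or fewer successes under Binomial(n, 0.5), doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    for condition, (opus, sonnet) in CORRECT.items():
        print(f"{condition}: Opus {rate(opus)}%, Sonnet {rate(sonnet)}%")
```

The p-values quoted in the abstract cannot be reproduced from this record alone, because McNemar's test depends on how many individual cases one model answered correctly and the other did not, not on the marginal totals.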