Chatbot Reliability in Managing Thoracic Surgical Clinical Scenarios

Full Description

Chatbots are increasingly used in medicine, and concerns have been raised regarding their accuracy. This study assessed the performance of 4 different chatbots in managing thoracic surgical clinical scenarios. Topic domains were identified, and clinical scenarios were developed within each domain. Each scenario included 3 stems using Key Feature methods related to diagnosis, evaluation, and treatment. Twelve scenarios were presented to ChatGPT-4 (OpenAI), Bard (recently renamed Gemini; Google), Perplexity (Perplexity AI), and Claude 2 (Anthropic) in 3 separate runs. Up to 1 point was awarded for each stem, yielding a potential of 3 points per scenario. Critical failures were identified before scoring; when one occurred, the stem score and the overall scenario score were adjusted to 0. We arbitrarily established a mean adjusted score of ≥2 points per scenario as a passing grade and a critical fail rate of ≥30% as a failure to pass. Bot performance varied considerably within each run, and overall performance was a fail on all runs (critical mean scenario fail rates of 83%, 71%, and 71%). The bots trended toward "learning" from the first to the second run, but without improvement in overall raw (1.24 ± 0.47 vs 1.63 ± 0.76 vs 1.51 ± 0.60; P = .29) or adjusted (0.44 ± 0.54 vs 0.80 ± 0.94 vs 0.76 ± 0.81; P = .48) scenario scores across all runs. Chatbot performance in managing clinical scenarios was insufficient to provide reliable assistance. This is a cautionary note against reliance on the current accuracy of chatbots in complex thoracic surgery medical decision making.
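The scoring and pass/fail arithmetic described above can be made concrete with a short sketch. The following Python fragment is only an illustration of that scheme under stated assumptions; the Scenario structure and function names are hypothetical, not the authors' scoring instrument.

    from dataclasses import dataclass
    from statistics import mean

    # Hypothetical representation of one scored clinical scenario.
    @dataclass
    class Scenario:
        stem_scores: list[float]   # up to 1 point per stem, 3 stems per scenario
        critical_failure: bool     # critical failures are identified before scoring

    def raw_score(s: Scenario) -> float:
        # Raw score: sum of the stem scores, at most 3 points per scenario.
        return sum(s.stem_scores)

    def adjusted_score(s: Scenario) -> float:
        # A critical failure zeroes the overall scenario score.
        return 0.0 if s.critical_failure else raw_score(s)

    def run_passes(run: list[Scenario]) -> bool:
        # Passing grade: mean adjusted score of >= 2 points per scenario,
        # and a critical fail rate below the 30% threshold.
        mean_adjusted = mean(adjusted_score(s) for s in run)
        critical_rate = sum(s.critical_failure for s in run) / len(run)
        return mean_adjusted >= 2.0 and critical_rate < 0.30

    # Example: a 12-scenario run with 9 critical failures (75% fail rate)
    # fails on both criteria, much like the failing runs reported above.
    run = [Scenario([1.0, 0.5, 1.0], critical_failure=(i < 9)) for i in range(12)]
    print(run_passes(run))  # False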
Bibliographic Details
Published in: The Annals of Thoracic Surgery, 2024-07, Vol. 118 (1), p. 275-281
Main authors: Platz, Joseph J., Bryan, Darren S., Naunheim, Keith S., Ferguson, Mark K.
Format: Article
Language: English
Online access: Full text
DOI: 10.1016/j.athoracsur.2024.03.023
ISSN: 0003-4975
EISSN: 1552-6259
PMID: 38574939
Publisher: Elsevier Inc, Netherlands
Rights: Copyright © 2024 The Society of Thoracic Surgeons. Published by Elsevier Inc. All rights reserved.
Source: Elsevier ScienceDirect Journals
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T06%3A23%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Chatbot%20Reliability%20in%20Managing%20Thoracic%20Surgical%20Clinical%20Scenarios&rft.jtitle=The%20Annals%20of%20thoracic%20surgery&rft.au=Platz,%20Joseph%20J.&rft.date=2024-07-01&rft.volume=118&rft.issue=1&rft.spage=275&rft.epage=281&rft.pages=275-281&rft.issn=0003-4975&rft.eissn=1552-6259&rft_id=info:doi/10.1016/j.athoracsur.2024.03.023&rft_dat=%3Cproquest_cross%3E3034240848%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3034240848&rft_id=info:pmid/38574939&rft_els_id=S0003497524002534&rfr_iscdi=true