Chatbot Reliability in Managing Thoracic Surgical Clinical Scenarios
Saved in:
Published in: | The Annals of thoracic surgery 2024-07, Vol.118 (1), p.275-281 |
---|---|
Main Authors: | Platz, Joseph J., Bryan, Darren S., Naunheim, Keith S., Ferguson, Mark K. |
Format: | Article |
Language: | eng |
Online Access: | Full text |
container_end_page | 281 |
---|---|
container_issue | 1 |
container_start_page | 275 |
container_title | The Annals of thoracic surgery |
container_volume | 118 |
creator | Platz, Joseph J. Bryan, Darren S. Naunheim, Keith S. Ferguson, Mark K. |
description | Chatbot use in medicine is growing, and concerns have been raised regarding their accuracy. This study assessed the performance of 4 different chatbots in managing thoracic surgical clinical scenarios.
Topic domains were identified and clinical scenarios were developed within each domain. Each scenario included 3 stems using Key Feature methods related to diagnosis, evaluation, and treatment. Twelve scenarios were presented to ChatGPT-4 (OpenAI), Bard (recently renamed Gemini; Google), Perplexity (Perplexity AI), and Claude 2 (Anthropic) in 3 separate runs. Up to 1 point was awarded for each stem, yielding a potential of 3 points per scenario. Critical failures were identified before scoring; if they occurred, the stem and overall scenario scores were adjusted to 0. We arbitrarily established a threshold of ≥2 points mean adjusted score per scenario as a passing grade and established a critical fail rate of ≥30% as failure to pass.
The bot performances varied considerably within each run, and their overall performance was a fail on all runs (critical mean scenario fails of 83%, 71%, and 71%). The bots trended toward “learning” from the first to the second run, but without improvement in overall raw (1.24 ± 0.47 vs 1.63 ± 0.76 vs 1.51 ± 0.60; P = .29) and adjusted (0.44 ± 0.54 vs 0.80 ± 0.94 vs 0.76 ± 0.81; P = .48) scenario scores after all runs.
Chatbot performance in managing clinical scenarios was insufficient to provide reliable assistance. This is a cautionary note against reliance on the current accuracy of chatbots in complex thoracic surgery medical decision making. |
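The scoring rules in the description above (3 stems per scenario worth up to 1 point each, critical failures zeroing both the stem and the scenario's adjusted score, a passing grade of mean adjusted score ≥2, and a critical fail rate ≥30% counting as failure to pass) can be sketched as a short Python snippet. The function names and example data here are hypothetical illustrations of the stated rules, not the authors' actual code or data.

```python
# Sketch of the scoring scheme described in the abstract (illustrative only).

def score_scenario(stem_scores, critical_failures):
    """Return (raw, adjusted) scores for one 3-stem scenario.

    Each stem earns up to 1 point (raw range 0..3). Any critical failure
    zeroes the scenario's adjusted score, as the study describes.
    """
    assert len(stem_scores) == 3
    raw = sum(stem_scores)
    adjusted = 0.0 if any(critical_failures) else raw
    return raw, adjusted

def run_passes(scenario_results, pass_score=2.0, max_critical_rate=0.30):
    """Apply the paper's two thresholds across a run of scenarios:
    mean adjusted score must be >= 2 AND the critical-failure rate
    must stay below 30%."""
    adjusted = [score_scenario(s, c)[1] for s, c in scenario_results]
    criticals = [any(c) for _, c in scenario_results]
    mean_adjusted = sum(adjusted) / len(adjusted)
    critical_rate = sum(criticals) / len(criticals)
    return mean_adjusted >= pass_score and critical_rate < max_critical_rate

# Example run of 2 scenarios; the second has a critical failure in stem 2.
results = [
    ([1.0, 0.5, 1.0], [False, False, False]),  # raw 2.5, adjusted 2.5
    ([1.0, 1.0, 0.5], [False, True, False]),   # critical -> adjusted 0.0
]
print(run_passes(results))  # prints False: mean adjusted 1.25 < 2, critical rate 50% >= 30%
```

Note how a single critical failure is doubly punitive under this scheme: it both drags the mean adjusted score down and counts toward the 30% critical-fail ceiling, which is consistent with the high fail rates the study reports.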
doi_str_mv | 10.1016/j.athoracsur.2024.03.023 |
format | Article |
eissn | 1552-6259 |
orcidid | 0000-0001-9653-1698 |
pmid | 38574939 |
publisher | Netherlands: Elsevier Inc |
rights | © 2024 The Society of Thoracic Surgeons. Published by Elsevier Inc. All rights reserved. |
fulltext | fulltext |
identifier | ISSN: 0003-4975 |
ispartof | The Annals of thoracic surgery, 2024-07, Vol.118 (1), p.275-281 |
issn | 0003-4975 1552-6259 |
language | eng |
recordid | cdi_proquest_miscellaneous_3034240848 |
source | Elsevier ScienceDirect Journals |
title | Chatbot Reliability in Managing Thoracic Surgical Clinical Scenarios |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-02T06%3A23%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Chatbot%20Reliability%20in%20Managing%20Thoracic%20Surgical%20Clinical%20Scenarios&rft.jtitle=The%20Annals%20of%20thoracic%20surgery&rft.au=Platz,%20Joseph%20J.&rft.date=2024-07-01&rft.volume=118&rft.issue=1&rft.spage=275&rft.epage=281&rft.pages=275-281&rft.issn=0003-4975&rft.eissn=1552-6259&rft_id=info:doi/10.1016/j.athoracsur.2024.03.023&rft_dat=%3Cproquest_cross%3E3034240848%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3034240848&rft_id=info:pmid/38574939&rft_els_id=S0003497524002534&rfr_iscdi=true |