Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery

Tumor boards are a cornerstone of modern cancer treatment. Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL) head and neck surgery is gaining increasing attention. However, concerns over data protection and t...

Bibliographic details
Published in: European archives of oto-rhino-laryngology 2025-01
Main authors: Buhr, Christoph Raphael, Ernst, Benjamin Philipp, Blaikie, Andrew, Smith, Harry, Kelsey, Tom, Matthias, Christoph, Fleischmann, Maximilian, Jungmann, Florian, Alt, Jürgen, Brandts, Christian, Kämmerer, Peer W, Foersch, Sebastian, Kuhn, Sebastian, Eckrich, Jonas
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue
container_start_page
container_title European archives of oto-rhino-laryngology
container_volume
creator Buhr, Christoph Raphael
Ernst, Benjamin Philipp
Blaikie, Andrew
Smith, Harry
Kelsey, Tom
Matthias, Christoph
Fleischmann, Maximilian
Jungmann, Florian
Alt, Jürgen
Brandts, Christian
Kämmerer, Peer W
Foersch, Sebastian
Kuhn, Sebastian
Eckrich, Jonas
description Tumor boards are a cornerstone of modern cancer treatment. Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL), head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared standard human multidisciplinary tumor board (MDT) recommendations against those of a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3), the latter addressing data protection concerns. Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and the MDT's recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations. ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 demonstrated 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. ChatGPT-4o identified all first-line therapy options considered by the MDT in 64% of cases (16/25) and Llama 3 in 60% of cases (15/25), though with varying priority. ChatGPT-4o presented all the MDT's first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4-6) for ChatGPT-4o and 4.3 (IQR: 3-5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially enhance the MDT's decisions. This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL, head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. However, at present, LLMs should augment rather than replace human decision-making.
doi_str_mv 10.1007/s00405-024-09153-3
format Article
addtitle Eur Arch Otorhinolaryngol
collection PubMed
CrossRef
MEDLINE - Academic
cop Germany
date 2025-01-10
genre article
lds50 peer_reviewed
orcidid https://orcid.org/0000-0002-9551-2310
pmid 39792200
rights 2024. The Author(s).
backlink https://www.ncbi.nlm.nih.gov/pubmed/39792200 (View this record in MEDLINE/PubMed)
fulltext fulltext
identifier ISSN: 0937-4477
ispartof European archives of oto-rhino-laryngology, 2025-01
issn 0937-4477
1434-4726
1434-4726
language eng
recordid cdi_proquest_miscellaneous_3153922648
source SpringerLink Journals - AutoHoldings
title Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T19%3A08%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Assessment%20of%20decision-making%20with%20locally%20run%20and%20web-based%20large%20language%20models%20versus%20human%20board%20recommendations%20in%20otorhinolaryngology,%20head%20and%20neck%20surgery&rft.jtitle=European%20archives%20of%20oto-rhino-laryngology&rft.au=Buhr,%20Christoph%20Raphael&rft.date=2025-01-10&rft.issn=0937-4477&rft.eissn=1434-4726&rft_id=info:doi/10.1007/s00405-024-09153-3&rft_dat=%3Cproquest_cross%3E3153922648%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3153922648&rft_id=info:pmid/39792200&rfr_iscdi=true
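
The percentages quoted in the description field can be re-derived from the case counts it gives. The following is a minimal sketch of that arithmetic, not part of the study: the labels and the helper name `rounded_percent` are illustrative, and the breakdown of the 200 assessments as 4 raters x 25 cases x 2 models is an assumption inferred from the abstract, which does not state it explicitly. Rounding is done half-up on exact fractions, because 33/200 is exactly 16.5% and Python's built-in `round` would round that to 16.

```python
import math
from fractions import Fraction

# (label, count, total, percentage as reported in the abstract)
# Labels are shorthand chosen for this sketch, not terms from the paper.
REPORTED = [
    ("ChatGPT-4o curative/palliative concordance", 21, 25, 84),
    ("Llama 3 curative/palliative concordance", 23, 25, 92),
    ("ChatGPT-4o identified all first-line options", 16, 25, 64),
    ("Llama 3 identified all first-line options", 15, 25, 60),
    ("ChatGPT-4o presented all first-line therapies", 13, 25, 52),
    ("Llama 3 homologous treatment strategy", 12, 25, 48),
    ("MDT first-line therapy as top recommendation", 7, 25, 28),
    ("Assessments judged potentially enhancing", 33, 200, 17),
]


def rounded_percent(count: int, total: int) -> int:
    """Return count/total as a whole percentage, rounding halves up."""
    exact = Fraction(count * 100, total)
    return math.floor(exact + Fraction(1, 2))


# Assumption: 200 assessments = 4 MDT raters x 25 cases x 2 models.
RATERS, CASES, MODELS = 4, 25, 2
total_assessments = RATERS * CASES * MODELS

# True where the recomputed percentage equals the reported one.
matches = {
    label: rounded_percent(count, total) == reported
    for label, count, total, reported in REPORTED
}

if __name__ == "__main__":
    for label, count, total, reported in REPORTED:
        print(f"{label}: {count}/{total} = {rounded_percent(count, total)}% "
              f"(reported {reported}%)")
    print(f"Total assessments under the stated assumption: {total_assessments}")
```

Working with `Fraction` keeps the 16.5% case exact, so the half-up rule decides the result rather than floating-point representation.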