Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery

Bibliographic details
Published in:	European archives of oto-rhino-laryngology 2025-01
Main authors:	Buhr, Christoph Raphael ; Ernst, Benjamin Philipp ; Blaikie, Andrew ; Smith, Harry ; Kelsey, Tom ; Matthias, Christoph ; Fleischmann, Maximilian ; Jungmann, Florian ; Alt, Jürgen ; Brandts, Christian ; Kämmerer, Peer W ; Foersch, Sebastian ; Kuhn, Sebastian ; Eckrich, Jonas
Format:	Article
Language:	eng
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	European archives of oto-rhino-laryngology
container_volume
creator	Buhr, Christoph Raphael ; Ernst, Benjamin Philipp ; Blaikie, Andrew ; Smith, Harry ; Kelsey, Tom ; Matthias, Christoph ; Fleischmann, Maximilian ; Jungmann, Florian ; Alt, Jürgen ; Brandts, Christian ; Kämmerer, Peer W ; Foersch, Sebastian ; Kuhn, Sebastian ; Eckrich, Jonas
description	Tumor boards are a cornerstone of modern cancer treatment. Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL) head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared standard human multidisciplinary tumor board recommendations (MDT) against a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3) addressing data protection concerns. Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and MDT's recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations. ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 demonstrated 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. In 64% of cases (16/25) ChatGPT-4o and in 60% of cases (15/25) Llama 3 identified all first-line therapy options considered by the MDT, though with varying priority. ChatGPT-4o presented all the MDT's first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4-6) for ChatGPT-4o and 4.3 (IQR: 3-5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially enhance the MDT's decisions. This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. However, at present, LLMs should augment rather than replace human decision-making.
doi_str_mv	10.1007/s00405-024-09153-3
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_3153922648</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3153922648</sourcerecordid><originalsourceid>FETCH-LOGICAL-c990-989a5d595518ec491365aa9729f4e33b9248bad5361a391aef20aaaf658d43c03</originalsourceid><addsrcrecordid>eNo9kU9v1TAMwCPExN7-fAEOKEcOBJImaZvjNA2YNInL7pHbuH1lbTLilul9Hr7owt7gYvtg_yz7x9h7JT8rKZsvJKWRVsjKCOmU1UK_YTtltBGmqeq3bCedboQxTXPKzoh-SimtcfodO9WucVUl5Y79uSJCogXjytPAA_YTTSmKBR6mOPKnad3zOfUwzweet8ghBv6EneiAMPAZ8oglxnGDUiwp4Ez8N2baiO-3BSLvEuTAM_ZpKUsCrIVOfIo8rSnvp5gK4xDHNKfx8InvEcLLjoj9A6et4PPhgp0MMBNevuZzdv_15v76u7j78e32-upO9M5J4VoHNlhnrWqxN07p2gK4pnKDQa07V5m2g2B1rUA7BThUEgCG2rbB6F7qc_bxiH3M6deGtPploh7nch6mjbwuLy5fq01bWqtja58TUcbBP-ZpKYd4Jf1fN_7oxhc3_sWN12Xowyt_6xYM_0f-ydDPpOONsw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3153922648</pqid></control><display><type>article</type><title>Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery</title><source>SpringerLink Journals - AutoHoldings</source><creator>Buhr, Christoph Raphael ; Ernst, Benjamin Philipp ; Blaikie, Andrew ; Smith, Harry ; Kelsey, Tom ; Matthias, Christoph ; Fleischmann, Maximilian ; Jungmann, Florian ; Alt, Jürgen ; Brandts, Christian ; Kämmerer, Peer W ; Foersch, Sebastian ; Kuhn, Sebastian ; Eckrich, Jonas</creator><creatorcontrib>Buhr, Christoph Raphael ; Ernst, Benjamin Philipp ; Blaikie, Andrew ; Smith, Harry ; Kelsey, Tom ; Matthias, Christoph ; Fleischmann, Maximilian ; Jungmann, Florian ; Alt, Jürgen ; Brandts, Christian ; Kämmerer, Peer W ; Foersch, Sebastian ; Kuhn, Sebastian ; Eckrich, Jonas</creatorcontrib><description>Tumor boards are a cornerstone of modern cancer treatment. 
Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL) head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared standard human multidisciplinary tumor board recommendations (MDT) against a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3) addressing data protection concerns. Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and MDT's recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations. ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 demonstrated 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. In 64% of cases (16/25) ChatGPT-4o and in 60% of cases (15/25) Llama 3 identified all first-line therapy options considered by the MDT, though with varying priority. ChatGPT-4o presented all the MDT's first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). 
The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4-6) for ChatGPT-4o and 4.3 (IQR: 3-5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially enhance the MDT's decisions. This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. However, at present, LLMs should augment rather than replace human decision-making.</description><identifier>ISSN: 0937-4477</identifier><identifier>ISSN: 1434-4726</identifier><identifier>EISSN: 1434-4726</identifier><identifier>DOI: 10.1007/s00405-024-09153-3</identifier><identifier>PMID: 39792200</identifier><language>eng</language><publisher>Germany</publisher><ispartof>European archives of oto-rhino-laryngology, 2025-01</ispartof><rights>2024. The Author(s).</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c990-989a5d595518ec491365aa9729f4e33b9248bad5361a391aef20aaaf658d43c03</cites><orcidid>0000-0002-9551-2310</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/39792200$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Buhr, Christoph Raphael</creatorcontrib><creatorcontrib>Ernst, Benjamin Philipp</creatorcontrib><creatorcontrib>Blaikie, Andrew</creatorcontrib><creatorcontrib>Smith, Harry</creatorcontrib><creatorcontrib>Kelsey, Tom</creatorcontrib><creatorcontrib>Matthias, Christoph</creatorcontrib><creatorcontrib>Fleischmann, Maximilian</creatorcontrib><creatorcontrib>Jungmann, Florian</creatorcontrib><creatorcontrib>Alt, 
Jürgen</creatorcontrib><creatorcontrib>Brandts, Christian</creatorcontrib><creatorcontrib>Kämmerer, Peer W</creatorcontrib><creatorcontrib>Foersch, Sebastian</creatorcontrib><creatorcontrib>Kuhn, Sebastian</creatorcontrib><creatorcontrib>Eckrich, Jonas</creatorcontrib><title>Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery</title><title>European archives of oto-rhino-laryngology</title><addtitle>Eur Arch Otorhinolaryngol</addtitle><description>Tumor boards are a cornerstone of modern cancer treatment. Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL) head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared standard human multidisciplinary tumor board recommendations (MDT) against a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3) addressing data protection concerns. Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and MDT's recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations. 
ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 demonstrated 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. In 64% of cases (16/25) ChatGPT-4o and in 60% of cases (15/25) Llama 3 identified all first-line therapy options considered by the MDT, though with varying priority. ChatGPT-4o presented all the MDT's first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4-6) for ChatGPT-4o and 4.3 (IQR: 3-5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially enhance the MDT's decisions. This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. 
However, at present, LLMs should augment rather than replace human decision-making.</description><issn>0937-4477</issn><issn>1434-4726</issn><issn>1434-4726</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2025</creationdate><recordtype>article</recordtype><recordid>eNo9kU9v1TAMwCPExN7-fAEOKEcOBJImaZvjNA2YNInL7pHbuH1lbTLilul9Hr7owt7gYvtg_yz7x9h7JT8rKZsvJKWRVsjKCOmU1UK_YTtltBGmqeq3bCedboQxTXPKzoh-SimtcfodO9WucVUl5Y79uSJCogXjytPAA_YTTSmKBR6mOPKnad3zOfUwzweet8ghBv6EneiAMPAZ8oglxnGDUiwp4Ez8N2baiO-3BSLvEuTAM_ZpKUsCrIVOfIo8rSnvp5gK4xDHNKfx8InvEcLLjoj9A6et4PPhgp0MMBNevuZzdv_15v76u7j78e32-upO9M5J4VoHNlhnrWqxN07p2gK4pnKDQa07V5m2g2B1rUA7BThUEgCG2rbB6F7qc_bxiH3M6deGtPploh7nch6mjbwuLy5fq01bWqtja58TUcbBP-ZpKYd4Jf1fN_7oxhc3_sWN12Xowyt_6xYM_0f-ydDPpOONsw</recordid><startdate>20250110</startdate><enddate>20250110</enddate><creator>Buhr, Christoph Raphael</creator><creator>Ernst, Benjamin Philipp</creator><creator>Blaikie, Andrew</creator><creator>Smith, Harry</creator><creator>Kelsey, Tom</creator><creator>Matthias, Christoph</creator><creator>Fleischmann, Maximilian</creator><creator>Jungmann, Florian</creator><creator>Alt, Jürgen</creator><creator>Brandts, Christian</creator><creator>Kämmerer, Peer W</creator><creator>Foersch, Sebastian</creator><creator>Kuhn, Sebastian</creator><creator>Eckrich, Jonas</creator><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-9551-2310</orcidid></search><sort><creationdate>20250110</creationdate><title>Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery</title><author>Buhr, Christoph Raphael ; Ernst, Benjamin Philipp ; Blaikie, Andrew ; Smith, Harry ; Kelsey, Tom ; Matthias, Christoph ; Fleischmann, Maximilian ; Jungmann, Florian ; Alt, Jürgen ; Brandts, Christian ; Kämmerer, Peer W ; Foersch, Sebastian ; Kuhn, Sebastian ; Eckrich, 
Jonas</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c990-989a5d595518ec491365aa9729f4e33b9248bad5361a391aef20aaaf658d43c03</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2025</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Buhr, Christoph Raphael</creatorcontrib><creatorcontrib>Ernst, Benjamin Philipp</creatorcontrib><creatorcontrib>Blaikie, Andrew</creatorcontrib><creatorcontrib>Smith, Harry</creatorcontrib><creatorcontrib>Kelsey, Tom</creatorcontrib><creatorcontrib>Matthias, Christoph</creatorcontrib><creatorcontrib>Fleischmann, Maximilian</creatorcontrib><creatorcontrib>Jungmann, Florian</creatorcontrib><creatorcontrib>Alt, Jürgen</creatorcontrib><creatorcontrib>Brandts, Christian</creatorcontrib><creatorcontrib>Kämmerer, Peer W</creatorcontrib><creatorcontrib>Foersch, Sebastian</creatorcontrib><creatorcontrib>Kuhn, Sebastian</creatorcontrib><creatorcontrib>Eckrich, Jonas</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>European archives of oto-rhino-laryngology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Buhr, Christoph Raphael</au><au>Ernst, Benjamin Philipp</au><au>Blaikie, Andrew</au><au>Smith, Harry</au><au>Kelsey, Tom</au><au>Matthias, Christoph</au><au>Fleischmann, Maximilian</au><au>Jungmann, Florian</au><au>Alt, Jürgen</au><au>Brandts, Christian</au><au>Kämmerer, Peer W</au><au>Foersch, Sebastian</au><au>Kuhn, Sebastian</au><au>Eckrich, Jonas</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery</atitle><jtitle>European archives of 
oto-rhino-laryngology</jtitle><addtitle>Eur Arch Otorhinolaryngol</addtitle><date>2025-01-10</date><risdate>2025</risdate><issn>0937-4477</issn><issn>1434-4726</issn><eissn>1434-4726</eissn><abstract>Tumor boards are a cornerstone of modern cancer treatment. Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL) head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared standard human multidisciplinary tumor board recommendations (MDT) against a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3) addressing data protection concerns. Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and MDT's recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations. ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 demonstrated 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. In 64% of cases (16/25) ChatGPT-4o and in 60% of cases (15/25) Llama 3 identified all first-line therapy options considered by the MDT, though with varying priority. 
ChatGPT-4o presented all the MDT's first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4-6) for ChatGPT-4o and 4.3 (IQR: 3-5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially enhance the MDT's decisions. This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. However, at present, LLMs should augment rather than replace human decision-making.</abstract><cop>Germany</cop><pmid>39792200</pmid><doi>10.1007/s00405-024-09153-3</doi><orcidid>https://orcid.org/0000-0002-9551-2310</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 0937-4477
ispartof	European archives of oto-rhino-laryngology, 2025-01
issn	0937-4477 1434-4726 1434-4726
language	eng
recordid	cdi_proquest_miscellaneous_3153922648
source	SpringerLink Journals - AutoHoldings
title	Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T19%3A08%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Assessment%20of%20decision-making%20with%20locally%20run%20and%20web-based%20large%20language%20models%20versus%20human%20board%20recommendations%20in%20otorhinolaryngology,%20head%20and%20neck%20surgery&rft.jtitle=European%20archives%20of%20oto-rhino-laryngology&rft.au=Buhr,%20Christoph%20Raphael&rft.date=2025-01-10&rft.issn=0937-4477&rft.eissn=1434-4726&rft_id=info:doi/10.1007/s00405-024-09153-3&rft_dat=%3Cproquest_cross%3E3153922648%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3153922648&rft_id=info:pmid/39792200&rfr_iscdi=true