Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis

Bibliographic Details
Published in: Academic Radiology, 2024-09
Authors: Can, Elif, Uller, Wibke, Vogt, Katharina, Doppler, Michael C., Busch, Felix, Bayerl, Nadine, Ellmann, Stephan, Kader, Avan, Elkilany, Aboelyazid, Makowski, Marcus R., Bressem, Keno K., Adams, Lisa C.
Format: Article
Language: English
Keywords:
Online access: Full text
Description
Abstract: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7b and Mistral-8×7b), in simplifying 109 interventional radiology reports. Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis. Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus on any of the metrics evaluated (all Bonferroni-corrected p-values: p = 1), while both outperformed the other assessed models across five qualitative metrics (p …
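The four readability scores named in the abstract are standard surface-level formulas. As a minimal illustrative sketch (not the study's actual evaluation code), the open-source Python library textstat implements all four, so a simplified report could be scored as follows; the sample text is a made-up placeholder:

# Sketch only: scoring one simplified report with textstat.
# The input text is a hypothetical placeholder, not study data.
import textstat

simplified_report = (
    "The doctor used a thin tube to open the narrowed blood vessel "
    "in your leg. The procedure went well and you can go home today."
)

print("FRE :", textstat.flesch_reading_ease(simplified_report))          # higher = easier to read
print("FKGL:", textstat.flesch_kincaid_grade(simplified_report))         # U.S. school grade level
print("SMOG:", textstat.smog_index(simplified_report))                   # years of education needed
print("DCRS:", textstat.dale_chall_readability_score(simplified_report)) # based on a familiar-word list

The statistical procedure described (paired t-tests with Bonferroni-corrected p-values) can likewise be sketched with scipy; the ratings below are invented placeholders, not the study's data. Note that the Bonferroni correction multiplies each raw p-value by the number of comparisons and caps the result at 1, which is why non-significant comparisons can be reported as exactly p = 1:

# Sketch only: paired t-test between two models' per-report Likert
# scores, with a Bonferroni correction. All numbers are made up.
from scipy.stats import ttest_rel

model_a_scores = [5, 4, 5, 5, 4, 5, 4, 5]     # hypothetical ratings, model A
model_b_scores = [3, 4, 3, 2, 4, 3, 3, 4]     # hypothetical ratings, model B

t_stat, p_raw = ttest_rel(model_a_scores, model_b_scores)

n_comparisons = 15                            # e.g. all pairwise comparisons of 6 models
p_corrected = min(1.0, p_raw * n_comparisons) # Bonferroni: multiply, cap at 1.0

print(f"t = {t_stat:.2f}, raw p = {p_raw:.4g}, Bonferroni-corrected p = {p_corrected:.4g}")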
ISSN: 1076-6332
1878-4046
DOI: 10.1016/j.acra.2024.09.041