A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity
Published in: | Journal of Clinical Medicine 2024-10, Vol.13 (21), p.6512 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | This study evaluates the ability of six popular chatbots (ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity) to provide reliable answers to questions concerning keratoconus.
Chatbot responses were assessed using the mDISCERN (range: 15-75) and Global Quality Score (GQS) (range: 1-5) metrics. Readability was evaluated using nine validated readability assessments. We also assessed the quality and accountability of the websites from which the questions originated.
We analyzed 20 websites, 65% "Private practice or independent user" and 35% "Official patient education materials". The mean JAMA benchmark score was 1.40 ± 0.91 (0-4 points), indicating low accountability. Reliability, measured using mDISCERN, ranged from 42.91 ± 3.15 (ChatGPT-3.5) to 46.95 ± 3.53 (Copilot). The most frequent question was "What is keratoconus?" with 70% of websites providing relevant information. This received the highest mDISCERN score (49.30 ± 4.91) and a relatively high GQS score (3.40 ± 0.56) with an Automated Readability Level Calculator score of 13.17 ± 2.13. Moderate positive correlations were determined between the website numbers and both mDISCERN (r = 0.265, p = 0.25) and GQS (r = 0.453, p = 0.05) scores. The quality of information, assessed using the GQS, ranged from 3.02 ± 0.55 (ChatGPT-3.5) to 3.31 ± 0.64 (Gemini) (p = 0.34). The differences between the texts were statistically significant. Gemini emerged as the easiest to read, while ChatGPT-3.5 and Perplexity were the most difficult. Based on mDISCERN scores, Gemini and Copilot exhibited the highest percentage of responses in the "good" range (51-62 points). For the GQS, the Gemini model exhibited the highest percentage of responses in the "good" quality range with 40% of its responses scoring 4-5.
While all chatbots performed well, Gemini and Copilot showed better reliability and quality. However, their readability often exceeded recommended levels. Continuous improvements are essential to match information with patients' health literacy for effective use in ophthalmology. |
---|---|
ISSN: | 2077-0383 |
DOI: | 10.3390/jcm13216512 |