Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model

Bibliographic Details
Published in: International Urology and Nephrology, 2024-08, Vol. 56 (8), p. 2589-2595
Main Authors: Chiarelli, Giuseppe, Stephens, Alex, Finati, Marco, Cirulli, Giuseppe Ottone, Beatrici, Edoardo, Filipas, Dejan K., Arora, Sohrab, Tinsley, Shane, Bhandari, Mahendra, Carrieri, Giuseppe, Trinh, Quoc-Dien, Briganti, Alberto, Montorsi, Francesco, Lughezzani, Giovanni, Buffi, Nicolò, Rogers, Craig, Abdollah, Firas
Format: Article
Language: English
Subjects:
Online Access: Full text
Summary: Purpose: We aimed to assess the appropriateness of ChatGPT in answering questions about prostate cancer (PCa) screening, comparing GPT-3.5 and GPT-4.

Methods: A committee of five reviewers designed 30 questions related to PCa screening, categorized into three difficulty levels. The questions were posed identically to both GPT versions three times, varying the prompts. Each reviewer assigned a score for accuracy, clarity, and conciseness. Readability was assessed with the Flesch-Kincaid Grade (FKG) and the Flesch Reading Ease (FRE). Mean scores were extracted and compared using the Wilcoxon test, and readability across the three prompts was compared by ANOVA.

Results: For GPT-3.5, the mean (SD) scores for accuracy, clarity, and conciseness were 1.5 (0.59), 1.7 (0.45), and 1.7 (0.49) for easy questions; 1.3 (0.67), 1.6 (0.69), and 1.3 (0.65) for medium; and 1.3 (0.62), 1.6 (0.56), and 1.4 (0.56) for hard. For GPT-4, they were 2.0 (0), 2.0 (0), and 2.0 (0.14) for easy questions; 1.7 (0.66), 1.8 (0.61), and 1.7 (0.64) for medium; and 2.0 (0.24), 1.8 (0.37), and 1.9 (0.27) for hard. GPT-4 outperformed GPT-3.5 on all three qualities at every difficulty level. The mean FKG was 12.8 (1.75) for GPT-3.5 answers and 10.8 (1.72) for GPT-4; the mean FRE was 37.3 (9.65) for GPT-3.5 and 47.6 (9.88) for GPT-4. The second prompt achieved better results in terms of clarity (all p …
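For context, the two readability metrics named in the abstract are the standard Flesch formulas, computed from average sentence length and average syllables per word. The coefficients below are the standard published ones, not values taken from the article itself:

```latex
% Flesch Reading Ease: higher scores mean easier text
% (GPT-4's mean of 47.6 vs. GPT-3.5's 37.3 thus indicates more readable answers).
\mathrm{FRE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}

% Flesch-Kincaid Grade: approximates the U.S. school grade level required
% (GPT-4's mean of 10.8 vs. GPT-3.5's 12.8 indicates a lower required grade).
\mathrm{FKG} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
```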
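Below is a minimal sketch of how the reported comparisons could be run, assuming per-question scores are available as paired arrays. All data and variable names are hypothetical; only the choice of tests (Wilcoxon for paired model scores, one-way ANOVA for readability across the three prompts) follows the abstract:

```python
# Illustrative sketch of the statistical comparisons described in the
# abstract. All data below are synthetic; the study's real scores are
# not reproduced here.
import numpy as np
from scipy.stats import wilcoxon, f_oneway

rng = np.random.default_rng(0)

# Hypothetical accuracy scores (0-2 scale) for the same 30 questions,
# rated under GPT-3.5 and GPT-4 -- paired by question.
gpt35_accuracy = rng.integers(0, 3, size=30).astype(float)
gpt4_accuracy = np.clip(gpt35_accuracy + rng.integers(0, 2, size=30), 0, 2)

# Paired comparison between the two models (Wilcoxon signed-rank test,
# as used in the study for the reviewer scores).
stat, p_model = wilcoxon(gpt35_accuracy, gpt4_accuracy)
print(f"Wilcoxon: statistic={stat:.1f}, p={p_model:.4f}")

# Hypothetical Flesch Reading Ease values for answers generated under
# the three prompt variants, compared by one-way ANOVA as in the study.
fre_by_prompt = [rng.normal(loc, 10, size=30) for loc in (40, 48, 42)]
f_stat, p_anova = f_oneway(*fre_by_prompt)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")
```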
ISSN: 1573-2584, 0301-1623
DOI: 10.1007/s11255-024-04009-5