Surgeons vs ChatGPT: Assessment and Feedback Performance Based on Real Surgical Scenarios

Bibliographic Details
Published in: Journal of Surgical Education, 2024-07, Vol. 81 (7), p. 960-966
Authors: Jarry Trujillo, Cristián; Vela Ulloa, Javier; Escalona Vivas, Gabriel; Grasset Escobar, Eugenio; Villagrán Gutiérrez, Ignacio; Achurra Tirado, Pablo; Varas Cohen, Julián
Format: Article
Language: English
Online Access: Full text
Description
Summary:
• Artificial Intelligence tools are being integrated into surgical education.
• Research on education emphasizes the relevance of feedback in skill acquisition.
• ChatGPT could effectively identify errors in written surgical scenarios.
• ChatGPT's feedback outputs were comparable to those of surgeons.
• Findings were corroborated by an experienced surgeon and a feedback expert.

Artificial intelligence tools are being progressively integrated into medicine and surgical education. Large language models, such as ChatGPT, could provide relevant feedback aimed at improving surgical skills. The purpose of this study is to assess ChatGPT's ability to provide feedback based on surgical scenarios. Surgical situations were transformed into texts using a neutral narrative. The texts were evaluated by ChatGPT 4.0 and three surgeons (A, B, C) after a brief instruction was delivered: identify errors and provide feedback accordingly. Surgical residents were then provided with each of the situations and the feedback obtained during the first stage, as written by each surgeon and by ChatGPT, and were asked to assess the utility of the feedback (FCUR) and its quality (FQ). As a control measurement, an Education Expert (EE) and a Clinical Expert (CE) were asked to assess FCUR and FQ. Regarding residents' evaluations, outputs provided by ChatGPT were considered useful 96.43% of the time, comparable to the results obtained by surgeons B and C. In the assessment of FQ, ChatGPT and all surgeons received similar scores. Regarding the EE's assessment, ChatGPT obtained a significantly higher FQ score than surgeons A and B (p = 0.019; p = 0.033), with a median score of 8 vs. 7 and 7.5, respectively, and no difference with respect to surgeon C (score of 8; p = 0.2). Regarding the CE's assessment, surgeon B obtained the highest FQ score, while ChatGPT received scores comparable to those of surgeons A and C. When participants were asked to identify the source of the feedback, residents, the CE, and the EE perceived ChatGPT's outputs as human-provided in 33.9%, 28.5%, and 14.3% of cases, respectively. When given brief written surgical situations, ChatGPT was able to identify errors at a detection rate comparable to that of experienced surgeons and to generate feedback considered useful for skill improvement in a surgical context, performing as well as surgical instructors across assessments made by general surgery residents, an experienced surgeon, and a non-surgeon feedback expert.
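The record does not include the study's prompts or tooling, but the setup it describes (a brief instruction asking ChatGPT 4.0 to identify errors in a written surgical scenario and provide feedback) can be approximated programmatically. Below is a minimal sketch using the OpenAI Python client; the model name, instruction wording, and example scenario are assumptions for illustration, not the authors' materials.

# Minimal sketch (not the authors' code): approximating the study's prompting
# setup with the OpenAI Python client. The model name, prompt wording, and
# example scenario are assumptions; the paper only states that ChatGPT 4.0
# received a brief instruction to identify errors and provide feedback.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical neutral-narrative scenario, analogous to the study's texts
scenario = (
    "During a laparoscopic cholecystectomy, the trainee divides a tubular "
    "structure before obtaining the critical view of safety."
)

# Assumed phrasing of the study's brief instruction
instruction = (
    "You are a surgical instructor. Read the following description of a "
    "surgical situation, identify any errors committed, and provide "
    "feedback aimed at improving the trainee's surgical skills."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study used ChatGPT 4.0 via the chat interface
    messages=[
        {"role": "system", "content": instruction},
        {"role": "user", "content": scenario},
    ],
)
print(response.choices[0].message.content)  # error identification + feedback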
ISSN: 1931-7204, 1878-7452
DOI: 10.1016/j.jsurg.2024.03.012