Artificial intelligence in foot and ankle pathology: Can large language models replace us?
Published in: Journal of the Foot & Ankle (Online), 2024-05, Vol. 18 (1), p. 52-58
Format: Article
Language: English
Online access: Full text
Abstract: Objective: To determine whether large language models (LLMs) provide better or comparable information relative to an expert trained in foot and ankle pathology across various aspects of daily practice (definition of a pathology, treatment of a pathology, and general questions). Methods: Three experts and two artificial intelligence (AI) models, ChatGPT (GPT-4) and Google Bard, answered 15 specialty-related questions, divided equally among definitions, treatments, and general queries. After coding, the responses were redistributed and evaluated by five additional experts, who assessed aspects such as clarity, factual accuracy, and usefulness to patients. A Likert scale was used to score each question, allowing the evaluators to rate their agreement with the information provided. Results: On the Likert scale, each question could score between 5 and 25 points, for a maximum of 375 points per responder (75 per evaluator). Expert 2 led with 69.86%, followed by Expert 1 at 68.53%, ChatGPT at 64.80%, Expert 3 at 58.40%, and Google Bard at 54.93%. Significant differences emerged among responders, particularly involving Google Bard, and rankings varied across the definition and treatment sections, highlighting GPT-4's variability by section. Conclusion: GPT-4 often performed comparably to, or even better than, the experts, particularly in the definition and general-question sections. However, both LLMs lagged notably in the treatment section. These results underscore the potential of LLMs as valuable tools in orthopedics while highlighting their limitations, emphasizing the irreplaceable role of expert judgment in intricate medical contexts. Evidence Level: III, observational, analytical.
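The scoring arithmetic in the abstract can be reconstructed as a minimal sketch, assuming five evaluators each rating every question on a 1-5 Likert scale (the abstract states each question scores between 5 and 25 points, with a 375-point maximum). The `percent_to_points` helper and the raw-point conversions are illustrative additions, not figures reported in the study.

```python
# Scoring scheme implied by the abstract: 5 evaluators x 1-5 Likert
# scale per question, over 15 questions per responder.
N_QUESTIONS = 15
N_EVALUATORS = 5
LIKERT_MIN, LIKERT_MAX = 1, 5

min_per_question = N_EVALUATORS * LIKERT_MIN   # 5 points
max_per_question = N_EVALUATORS * LIKERT_MAX   # 25 points
max_total = N_QUESTIONS * max_per_question     # 375 points per responder

def percent_to_points(pct: float) -> float:
    """Convert a reported percentage back to approximate raw Likert points."""
    return pct / 100 * max_total

# Percentages reported in the abstract (share of the 375-point maximum).
results = {
    "Expert 2": 69.86,
    "Expert 1": 68.53,
    "ChatGPT (GPT-4)": 64.80,
    "Expert 3": 58.40,
    "Google Bard": 54.93,
}

for name, pct in results.items():
    print(f"{name}: {pct:.2f}% ~ {percent_to_points(pct):.0f}/{max_total} points")
```

For example, Expert 2's reported 69.86% corresponds to roughly 262 of the 375 available points under these assumptions.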
ISSN: 2675-2980
DOI: 10.30795/jfootankle.2024.v18.1757