Automated Patient Note Grading: Examining Scoring Reliability and Feasibility

Bibliographic Details
Published in: Academic Medicine 2023-11, Vol. 98 (11S), p. S90-S97
Authors: Bond, William F; Zhou, Jianing; Bhat, Suma; Park, Yoon Soo; Ebert-Allen, Rebecca A; Ruger, Rebecca L; Yudkowsky, Rachel
Format: Article
Language: English
Online access: Full text

Abstract: Scoring post-encounter patient notes (PNs) yields significant insights into student performance, but the resource intensity of scoring limits its use. Recent advances in natural language processing (NLP) and machine learning (ML) allow the application of automated short answer grading (ASAG) to this task. This retrospective study evaluated the psychometric characteristics and reliability of an ASAG system for PNs and secondarily considered factors contributing to implementation, including feasibility and the case-specific phrase annotation required to tune the system for a new case. PNs from standardized patient (SP) cases within a graduation competency exam were used to train the ASAG system, applying a feed-forward neural network algorithm for scoring. Using faculty phrase-level annotation, 10 PNs per case were required to tune the ASAG system. After tuning, ASAG item-level ratings for 20 notes were compared across ASAG-faculty (4 cases, 80 pairings) and ASAG-non-faculty (2 cases, 40 pairings) rater pairs. Psychometric characteristics were examined using item analysis and Cronbach's alpha. Inter-rater reliability (IRR) was examined using kappa. ASAG scores demonstrated sufficient variability to differentiate learner PN performance and high IRR between machine and human ratings. Across all items, the mean ASAG-faculty kappa was 0.83 (SE ± 0.02); the ASAG-non-faculty kappa was likewise 0.83 (SE ± 0.02). ASAG scoring demonstrated high item discrimination. Internal consistency reliability at the case level ranged from a Cronbach's alpha of 0.65 to 0.77. The faculty time cost to train and supervise non-faculty raters for 4 cases was approximately $1,856; the faculty cost to tune the ASAG system was approximately $928. NLP-based automated scoring of PNs demonstrated a high degree of reliability and psychometric confidence for use as learner feedback. The small number of phrase-level annotations required to tune the system to a new case enhances feasibility. ASAG-enabled PN scoring has broad implications for improving feedback in case-based learning contexts in medical education.
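
The abstract names kappa for machine-human inter-rater reliability and Cronbach's alpha for case-level internal consistency. As an illustrative sketch only (not the authors' code), the Python below shows how those two statistics could be computed for item-level rating pairs; the ratings, the 20-note sample, and the 6 x 4 case matrix are all invented for the example.

# Illustrative sketch, not the published implementation: Cohen's kappa
# for ASAG-human agreement and Cronbach's alpha for internal consistency.
# All rating data below is hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical item-level ratings (1 = key phrase credited, 0 = not)
# from one faculty rater and the ASAG system across 20 notes.
faculty = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1])
asag = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1])
print(f"Cohen's kappa: {cohen_kappa_score(faculty, asag):.2f}")

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (notes x items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)          # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of note totals
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical case-level matrix: 6 notes x 4 checklist items.
case_scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
])
print(f"Cronbach's alpha: {cronbach_alpha(case_scores):.2f}")

In the study itself, kappa was computed per item across 80 ASAG-faculty and 40 ASAG-non-faculty pairings; the sketch mirrors that per-item pairing at toy scale.
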
ISSN: 1040-2446; 1938-808X
DOI: 10.1097/ACM.0000000000005357