Differential Item Functioning Analysis of United States Medical Licensing Examination Step 1 Items


Published in: Academic Medicine 2022-05, Vol. 97 (5), p. 718-722
Authors: Rubright, Jonathan D.; Jodoin, Michael; Woodward, Stephanie; Barone, Michael A.
Format: Article
Language: English
Online access: Full text
Abstract: Previous studies have examined and identified demographic group score differences on United States Medical Licensing Examination (USMLE) Step examinations. It is necessary to explore potential etiologies of such differences to ensure fairness of examination use. Although score differences are largely explained by preceding academic variables, one potential concern is that item-level bias may be associated with remaining group score differences. The purpose of this 2019-2020 study was to statistically identify and qualitatively review USMLE Step 1 exam questions (items) using differential item functioning (DIF) methodology. Logistic regression DIF was used to identify and classify the effect size of DIF on Step 1 items meeting minimum sample size criteria. After using DIF to flag items statistically, subject matter expert (SME) review was used to identify potential reasons why items may have performed differently between racial and gender groups, including characteristics such as content, format, wording, context, or stimulus materials. USMLE SMEs reviewed items to identify the group difference they believed was present, if any; articulate a rationale behind the group difference; and determine whether that rationale would be considered construct relevant or construct irrelevant. All identified DIF rationales were relevant to the constructs being assessed and therefore did not reflect item bias. Where SME-generated rationales aligned with statistical differences (flags), they favored self-identified women on items tagged to women's health content categories and were judged to be construct relevant. This study did not find evidence to support the hypothesis that group-level performance differences beyond those explained by prior academic performance variables are driven by item-level bias.
Health professions examination programs have an obligation to assess for group differences, and when present, investigate to what extent, if any, measurement bias plays a role.
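The statistical screen named in the abstract, logistic regression DIF, is conventionally run as a likelihood-ratio comparison of three nested models per item: item response predicted by ability alone, by ability plus group (uniform DIF), and by ability, group, and their interaction (nonuniform DIF). A minimal sketch of that procedure on synthetic data follows; it is illustrative only (no USMLE data is used, and the 0.5-logit DIF effect, sample size, and variable names are assumptions, not values from the study):

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Newton-Raphson logistic regression; returns the maximized log-likelihood."""
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Small ridge keeps the Hessian invertible
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def dif_logistic(ability, group, response):
    """Likelihood-ratio DIF tests from three nested logistic models:
    M1: response ~ ability                        (baseline)
    M2: response ~ ability + group                (uniform DIF, 1 df)
    M3: response ~ ability + group + interaction  (nonuniform DIF, 1 df)
    """
    ll1 = fit_logistic(ability[:, None], response)
    ll2 = fit_logistic(np.column_stack([ability, group]), response)
    ll3 = fit_logistic(np.column_stack([ability, group, ability * group]), response)
    return {
        "chi2_uniform": 2.0 * (ll2 - ll1),     # compare to chi2(1) critical value 3.84
        "chi2_nonuniform": 2.0 * (ll3 - ll2),  # chi2(1)
        "chi2_total": 2.0 * (ll3 - ll1),       # chi2(2) critical value 5.99
    }

# Synthetic example: an item with a uniform DIF of 0.5 logits favoring group 1
rng = np.random.default_rng(0)
n = 2000
ability = rng.standard_normal(n)
group = rng.integers(0, 2, n).astype(float)
p_true = 1.0 / (1.0 + np.exp(-(ability + 0.5 * group)))
response = (rng.random(n) < p_true).astype(float)
res = dif_logistic(ability, group, response)
```

In operational programs, items whose chi-square exceeds the critical value are typically classified further by an effect-size measure (e.g., change in pseudo-R-squared between models) before being sent to expert review, mirroring the flag-then-review workflow the abstract describes.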
ISSN: 1040-2446; 1938-808X
DOI: 10.1097/ACM.0000000000004567