How Survey Scoring Decisions Can Influence Your Study's Results: A Trip Through the IRT Looking Glass

Bibliographic Details
Published in: Psychological Methods, 2024-10, Vol. 29(5), p. 1003-1024
Main authors: Soland, James; Kuhfeld, Megan; Edwards, Kelly
Format: Article
Language: English
Online access: Full text
Description
Summary: Though much effort is often put into designing psychological studies, the measurement model and scoring approach employed are often an afterthought, especially when short survey scales are used (Flake & Fried, 2020). One possible reason that measurement gets downplayed is that there is generally little understanding of how calibration/scoring approaches could impact common estimands of interest, including treatment effect estimates, beyond random noise due to measurement error. Another possible reason is that the process of scoring is complicated, involving selecting a suitable measurement model, calibrating its parameters, and then deciding how to generate a score, all steps that occur before the score is even used to examine the desired psychological phenomenon. In this study, we provide three motivating examples where surveys are used to understand individuals' underlying social-emotional and/or personality constructs to demonstrate the potential consequences of measurement/scoring decisions. These examples also allow us to walk through the different measurement decision stages and, hopefully, begin to demystify them. As we show in our analyses, the decisions researchers make about how to calibrate and score the survey used have consequences that are often overlooked, with likely implications both for conclusions drawn from individual psychological studies and for replications of studies.

Translational Abstract: Considerable effort is often put into designing psychological studies, with great attention paid to various aspects of research design. However, when surveys are used to measure the outcome of interest, the approach used to score the survey is usually an afterthought. Measurement may be given short shrift because researchers wrongly assume that random noise due to measurement error is the main worry, or that nonrandom error will simply wash out between control and treatment groups. Alternatively, ignoring measurement may occur not because researchers make those assumptions, but because scoring models are complicated and related decisions labyrinthine. Whatever the reason, in most studies, people simply add up all the item responses to produce a sum score. When a sum score is not used, researchers usually assume that a garden-variety item response theory (IRT) model is the main alternative. In this study, we show that, not only are there likely better scoring options available for common study designs like randomized control trials and growth/developmental studies …
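The summary's central contrast is between adding up item responses and scoring with an IRT model. The following is a minimal sketch of that contrast, not code from the article: it assumes a two-parameter logistic (2PL) model with dichotomous items and an expected a posteriori (EAP) scoring rule, and the item parameters a and b are hypothetical stand-ins for values a real calibration step would estimate.

```python
# Minimal sketch: sum score vs. IRT-based (2PL/EAP) score.
# Hypothetical item parameters stand in for a real calibration step.
import numpy as np

a = np.array([1.2, 0.8, 1.5, 1.0, 0.6])   # discriminations (hypothetical)
b = np.array([-1.0, 0.0, 0.5, 1.0, 1.5])  # difficulties (hypothetical)

def p_endorse(theta, a, b):
    """2PL item response function: P(item endorsed | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_score(responses, a, b, n_quad=61):
    """EAP estimate of theta under a N(0, 1) prior, via quadrature."""
    theta = np.linspace(-4.0, 4.0, n_quad)
    prior = np.exp(-0.5 * theta**2)               # unnormalized N(0, 1)
    p = p_endorse(theta[:, None], a, b)           # (n_quad, n_items)
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    post = prior * like
    return float(np.sum(theta * post) / np.sum(post))

# Two respondents with the same sum score but different response patterns:
r1 = np.array([1, 1, 1, 0, 0])  # endorsed the easier items
r2 = np.array([0, 0, 1, 1, 1])  # endorsed the harder items
for r in (r1, r2):
    print("sum =", r.sum(), " EAP =", round(eap_score(r, a, b), 3))
```

Because the sum score treats all items interchangeably while the EAP score weights responses by the calibrated item parameters, the two respondents above tie on the sum score but receive different IRT scores; differences of this kind are one channel through which calibration/scoring choices can shift downstream estimates such as treatment effects.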
ISSN: 1082-989X
EISSN: 1939-1463
DOI: 10.1037/met0000506